Note: this page shows the Feature-Based Change Log for a release
These features were completed when this image was assembled
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
The observable functionality that the user now has as a result of receiving this feature. Complete during New status.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. Initial completion during Refinement status.
Which other projects and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
https://docs.google.com/document/d/1m6OYdz696vg1v8591v0Ao0_r_iqgsWjjM2UjcR_tIrM/
As a developer, I want to be able to test my serverless function after it's been deployed.
Please add a spike to see if there are dependencies.
Developers can use the kn func invoke CLI to accomplish this. According to Naina, there is an API, but it's in Go.
As a user, I want to invoke a Serverless function from the developer console. This action should be available as a page and as a modal.
This will be similar to the web terminal proxy, except that no auth headers will be passed to the underlying service.
We need something similar to:
POST /proxy/in-cluster
{
  endpoint: string   # Or just service: string ?? tbd.
  headers: Record<string, string | string[]>
  body: string
  timeout: number
}
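For illustration, a hypothetical request payload for this endpoint, written as YAML for readability; the endpoint, header, and body values below are made up, and the final field names are still TBD as noted above:

endpoint: http://my-function.my-namespace.svc.cluster.local   # hypothetical in-cluster service URL
headers:
  Content-Type:
    - application/json
body: '{"message": "Hello, function"}'
timeout: 30000   # assumed to be milliseconds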
As a user, I want to invoke a Serverless function from the developer console. This action should be available as a page and as a modal.
This story depends on ODC-7273, ODC-7274, and ODC-7288. It should bring the backend proxy and the frontend together and finalize the work.
As a user, I want to invoke a Serverless function from the developer console. This action should be available as a page and as a modal.
This story is to evaluate a good UI for this and check this with our PM (Serena) and the Serverless team (Naina and Lance).
Information the form should show:
Cluster administrators need an in-product experience to discover and install new Red Hat offerings that can add high value to developer workflows.
Requirements | Notes | IS MVP |
---|---|---|
Discover new offerings in Home Dashboard | | Y |
Access details outlining value of offerings | | Y |
Access step-by-step guide to install offering | | N |
Allow developers to easily find and use newly installed offerings | | Y |
Support air-gapped clusters | | Y |
< What are we making, for who, and why/what problem are we solving?>
Discovering solutions that are not available for installation on cluster
No known dependencies
Background, and strategic fit
None
Quick Starts
Developers using Dev Console need to be made aware of the RH developer tooling available to them.
Provide awareness to developers using Dev Console of the RH developer tooling that is available to them, including:
Consider enhancing the +Add page and/or the Guided tour
Provide a Quick Start for installing the Cryostat Operator
To increase usage of our RH portfolio
Add the IDE extensions below to the create Serverless function form:
This issue is to handle the PR comment - https://github.com/openshift/console-operator/pull/770#pullrequestreview-1501727662 for the issue https://issues.redhat.com/browse/ODC-7292
Update Terminal step of the Guided Tour to indicate that odo CLI is accessible - https://developers.redhat.com/products/odo/overview
We are deprecating DeploymentConfig in favor of Deployment in OpenShift because Deployment is the recommended way to deploy applications. Deployment is a more flexible and powerful resource that allows you to control the deployment of your applications more precisely. DeploymentConfig is a legacy resource that is no longer necessary. We will continue to support DeploymentConfig for a period of time, but we encourage you to migrate to Deployment as soon as possible.
Here are some of the benefits of using Deployment over DeploymentConfig:
We hope that you will migrate to Deployment as soon as possible. If you have any questions, please contact us.
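As a rough migration sketch (names and image are placeholders, not taken from product documentation), a DeploymentConfig's pod template and replica count typically carry over to a Deployment unchanged:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                  # placeholder name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: quay.io/example/my-app:latest   # placeholder image

DeploymentConfig-specific behaviour such as image change triggers or custom deployment strategies needs to be mapped separately; for example, image change triggers can often be replaced with the image.openshift.io/triggers annotation.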
Given the nature of this component (embedded into a shared api server and controller manager), this will likely require adding logic within those shared components to not enable specific bits of function when the build or DeploymentConfig capability is disabled, and watching the enabled capability set so that the components enable the functionality when necessary.
I would not expect us to split the components out of their existing location as part of this, though that is theoretically an option.
None
Make the list of enabled/disabled controllers in OAS reflect enabled/disabled capabilities.
Acceptance criteria:
QE:
At the moment, HyperShift relies on an older etcd operator (i.e., the CoreOS etcd operator). However, this operator is basic and does not support HA as required.
Introduce a reliable component to operate Etcd that:
For an initial MVP of service delivery adoption of Hypershift we need to enable support for manual cluster migration.
Additional information: https://docs.google.com/presentation/d/1JDfd34jvj_4VvVn1bNieSXRejbFqAs_g8G-5rBqTtxw/edit?usp=sharing
Following on from https://issues.redhat.com/browse/HOSTEDCP-444 we need to add the steps to enable migration of the Node/CAPI resources to enable workloads to continue running during controlplane migration.
This will be a manual process where controlplane downtime will occur.
This must satisfy the criteria for a successful migration:
We need to validate and document this manually for starters.
Eventually this should be automated in the upcoming e2e test.
We could even have a job running conformance tests over a migrated cluster
As an OpenShift on vSphere administrator, I want to specify static IP assignments to my VMs.
As an OpenShift on vSphere administrator, I want to completely avoid using a DHCP server for the VMs of my OpenShift cluster.
Customers want the convenience of IPI deployments for vSphere without having to use DHCP. As in bare metal, where METAL-1 added this capability, some of the reasons are the security implications of DHCP (customers report that, for example, depending on configuration it allows any device to get on the network). At the same time, IPI deployments only require our OpenShift installation software, while with UPI they would need automation software that, in secure environments, they would have to certify along with OpenShift.
Bare metal related work:
CoreOS Afterburn:
https://github.com/coreos/afterburn/blob/main/src/providers/vmware/amd64.rs#L28
https://github.com/openshift/installer/blob/master/upi/vsphere/vm/main.tf#L34
As an OpenShift on vSphere administrator, I want to specify static IP assignments to my VMs.
As an OpenShift on vSphere administrator, I want to completely avoid using a DHCP server for the VMs of my OpenShift cluster.
Customers want the convenience of IPI deployments for vSphere without having to use DHCP. As in bare metal, where METAL-1 added this capability, some of the reasons are the security implications of DHCP (customers report that, for example, depending on configuration it allows any device to get on the network). At the same time, IPI deployments only require our OpenShift installation software, while with UPI they would need automation software that, in secure environments, they would have to certify along with OpenShift.
Bare metal related work:
CoreOS Afterburn:
https://github.com/coreos/afterburn/blob/main/src/providers/vmware/amd64.rs#L28
https://github.com/openshift/installer/blob/master/upi/vsphere/vm/main.tf#L34
USER STORY:
As an OpenShift administrator, I want to apply an IP configuration so that I can adhere to my organizations security guidelines.
DESCRIPTION:
The vSphere machine controller needs to be modified to convert nmstate to `guestinfo.afterburn.initrd.network-kargs` upon cloning the template for a new machine. An example of this is here: https://github.com/openshift/machine-api-operator/pull/1079
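For illustration only (addresses, hostname, and NIC name are invented), the value written to that guestinfo property is a dracut-style network kernel argument string, along the lines of:

guestinfo.afterburn.initrd.network-kargs: "ip=192.168.10.20::192.168.10.1:255.255.255.0:control-plane-0:ens192:none nameserver=192.168.10.1"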
Required:
Nice to have:
ACCEPTANCE CRITERIA:
ENGINEERING DETAILS:
Authentication-operator ignores noproxy settings defined in the cluster-wide proxy.
Expected outcome: When noproxy is set, Authentication operator should initialize connections through ingress instead of the cluster-wide proxy.
Currently in OpenShift we do not support adding 3rd party agents and other software to cluster nodes. While rpm-ostree supports adding packages, we have no way today to do that in a sane, scalable way across machineconfigpools and clusters. Some customers may not be able to meet their IT policies due to this.
In addition to third party content, some customers may want to use the layering process as a point to inject configuration. The build process allows for simple copying of config files and the ability to run arbitrary scripts to set user config files (e.g. through an Ansible playbook). This should be a supported use case, except where it conflicts with OpenShift (for example, the MCO must continue to manage Cri-O and Kubelet configs).
As part of enabling OCP CoreOS Layering for third party components, we will need to allow for package installation to /opt. Many OEMs and ISVs install to /opt and it would be difficult for them to make the change only for RHCOS. Meanwhile changing their RHEL target to a different target would also be problematic as their customers are expecting these tools to install in a certain way. Not having to worry about this path will provide the best ecosystem partner and customer experience.
Add an e2e test in our CI to override the kernel
Possibly repurpose https://github.com/openshift/os/tree/master/tests/layering
Add support for custom security groups to be attached to control plane and compute nodes at installation time.
Allow the user to provide existing security groups to be attached to the control plane and compute node instances at installation time.
The user will be able to provide a list of existing security groups to the install config manifest that will be used as additional custom security groups to be attached to the control plane and compute node instances at installation time.
The installer won't be responsible for creating any custom security groups; these must be created by the user before the installation starts.
We do have users/customers with specific requirements on adding additional network rules to every instance created in AWS. For OpenShift these additional rules need to be added on day-2 manually as the Installer doesn't provide the ability to add custom security groups to be attached to any instance at install time.
MachineSets already support adding a list of existing custom security groups, so this could already be automated at install time by manually editing each MachineSet manifest before starting the installation; but even in that case the Installer doesn't allow the user to provide this information so that the list of security groups can be added to the MachineSet manifests.
Documentation will be required to explain how this information needs to be provided to the install config manifest as any other supported field.
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
additionalSecurityGroupIDs:
  description: AdditionalSecurityGroupIDs contains IDs of additional security groups for machines, where each ID is presented in the format sg-xxxx.
  items:
    type: string
  type: array
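Assuming the install-config exposes this field on the AWS machine-pool platform section (a hedged sketch; the security group IDs are placeholders and must reference groups the user created beforehand), the user input could look like:

compute:
  - name: worker
    platform:
      aws:
        additionalSecurityGroupIDs:
          - sg-0123456789abcdef0   # pre-existing, user-created
controlPlane:
  name: master
  platform:
    aws:
      additionalSecurityGroupIDs:
        - sg-0fedcba9876543210   # pre-existing, user-created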
This requires/does not require a design proposal.
Scaling of pods in OpenShift depends highly on the customer workload and hardware setup. Some workloads on certain hardware might not scale beyond 100 pods, and others might scale to 1000 pods.
As an OpenShift admin, I want to monitor metrics that indicate why I am not able to scale my pods. Think of a pressure gauge that tells the customer when it is green (can scale) and when it is red (cannot scale).
As an OpenShift support engineer, if a customer calls in with a complaint about pod scaling, I should be able to check some metrics and inform them why they are not able to scale.
Metrics, alerts, and a dashboard
Be able to integrate these metrics and alerts into a monitoring dashboard
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
We need to have an operator inject dashboard jsonnet. E.g. the etcd team injects their dashboard jsonnet using their operator, in the form of a config map.
We will need a similar approach for the node dashboard.
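A hedged sketch of the pattern (the ConfigMap name, namespace, label, and dashboard content below are assumptions for illustration): the operator renders the dashboard jsonnet to JSON and ships it in a ConfigMap that the console picks up.

apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-node-resources      # hypothetical name
  namespace: openshift-config-managed         # assumed namespace for console dashboards
  labels:
    console.openshift.io/dashboard: "true"    # assumed label the console uses to discover dashboards
data:
  # The value is the rendered output of the dashboard jsonnet
  node-dashboard.json: |
    {"title": "Node Dashboard", "panels": []}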
Create a GCP cloud-specific spec.resourceTags entry in the Infrastructure CRD. This should create and update tags (or labels in GCP) on any OpenShift cloud resource that we create and manage. The behaviour should also tag existing resources that do not have the tags yet, and once the tags in the Infrastructure CRD are changed all the resources should be updated accordingly.
Tag deletes continue to be out of scope, as the customer can still have custom tags applied to the resources that we do not want to delete.
Due to the ongoing intree/out of tree split on the cloud and CSI providers, this should not apply to clusters with intree providers (!= "external").
Once confident we have all components updated, we should introduce an end2end test that makes sure we never create resources that are untagged.
Goals
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
List any affected packages or components.
This epic covers the work to apply user-defined labels to GCP resources created for an OpenShift cluster, available as Tech Preview.
The user should be able to define GCP labels to be applied to the resources created during cluster creation by the installer and by other operators which manage the specific resources. The user will be able to define the required tags/labels in install-config.yaml while preparing the user inputs for cluster creation. These will then be made available in the status sub-resource of the Infrastructure custom resource, which cannot be edited but is available for user reference and will be used by the in-cluster operators for labeling when the resources are created.
Updating/deleting of labels added during cluster creation or adding new labels as Day-2 operation is out of scope of this epic.
List any affected packages or components.
Reference - https://issues.redhat.com/browse/RFE-2017
The enhancement proposed for GCP labels support in OCP requires the install-config CRD to be updated to include gcp userLabels for the user to configure, which will be referred to by the installer to apply the list of labels on each resource it creates, and will also be made available in the Infrastructure CR created.
Below is the snippet of the change required in the CRD
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: installconfigs.install.openshift.io
spec:
  versions:
    - name: v1
      schema:
        openAPIV3Schema:
          properties:
            platform:
              properties:
                gcp:
                  properties:
                    userLabels:
                      additionalProperties:
                        type: string
                      description: UserLabels additional keys and values that the installer will add as labels to all resources that it creates. Resources created by the cluster itself may not include these labels.
                      type: object
This change is required for testing the changes of the feature, and should ideally get merged first.
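For illustration, with the schema above the user input in install-config.yaml would be a simple string map (keys and values below are placeholders):

platform:
  gcp:
    userLabels:
      cost-center: "1234"
      environment: dev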
Acceptance Criteria
The enhancement proposed for GCP labels and tags support in OCP requires making use of the latest APIs made available in the terraform provider for Google, and the provider dependency must be updated to use the same.
Acceptance Criteria
The enhancement proposed for GCP tags support in OCP requires cluster-image-registry-operator to add the gcp userTags, available in the status sub-resource of the Infrastructure CR, to the gcp storage resource it creates.
cluster-image-registry-operator uses the method createStorageAccount() to create the storage resource, which should be updated to add tags after resource creation.
Acceptance Criteria
cluster-config-operator makes the Infrastructure CRD available for the installer. It is included in its container image from the openshift/api package, and that package needs to be updated to have the latest CRD.
The installer creates the below list of gcp resources during the create-cluster phase, and these resources should have the user-defined labels and the default OCP label kubernetes-io-cluster-<cluster_id>:owned applied.
Resources List
Resource | Terraform API |
---|---|
VM Instance | google_compute_instance |
Image | google_compute_image |
Address | google_compute_address(beta) |
ForwardingRule | google_compute_forwarding_rule(beta) |
Zones | google_dns_managed_zone |
Storage Bucket | google_storage_bucket |
Acceptance Criteria:
The enhancement proposed for GCP labels support in OCP requires cluster-image-registry-operator to add the gcp userLabels, available in the status sub-resource of the Infrastructure CR, to the gcp storage resource it creates.
cluster-image-registry-operator uses the method createStorageAccount() to create the storage resource, which should be updated to add labels.
Acceptance Criteria
The enhancement proposed for GCP labels support in OCP requires machine-api-provider-gcp to add the gcp userLabels, available in the status sub-resource of the Infrastructure CR, to the gcp virtual machine resources and the sub-resources it creates.
Acceptance Criteria
The installer generates the Infrastructure CR in the manifests-creation step of the cluster creation process, based on the user-provided input recorded in install-config.yaml. While generating the Infrastructure CR, platformStatus.gcp.resourceLabels should be updated with the user-provided labels (installconfig.platform.gcp.userLabels).
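A hedged sketch of the resulting Infrastructure status fragment (the exact field layout may differ; label keys and values are placeholders, and resourceLabels is shown here as a list of key/value pairs):

apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  name: cluster
status:
  platformStatus:
    type: GCP
    gcp:
      resourceLabels:
        - key: cost-center
          value: "1234"
        - key: environment
          value: dev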
Acceptance Criteria
Much like core OpenShift operators, a standardized flow exists for OLM-managed operators to interact with the cluster in a specific way to leverage AWS STS authorization when using AWS APIs, as opposed to insecure static, long-lived credentials. OLM-managed operators can implement integration with the CloudCredentialOperator in a well-defined way to support this flow.
Enable customers to easily leverage OpenShift's capabilities around AWS STS with layered products, for an increased security posture. Enable OLM-managed operators to implement support for this in a well-defined pattern.
See Operators & STS slide deck.
The CloudCredentialOperator already provides a powerful API for OpenShift's core cluster operators to request credentials and acquire them via short-lived tokens. This capability should be expanded to OLM-managed operators, specifically to Red Hat layered products that interact with AWS APIs. The process today ranges from cumbersome to non-existent depending on the operator in question, and is seen as an adoption blocker of OpenShift on AWS.
This is particularly important for ROSA customers. Customers are expected to be asked to pre-create the required IAM roles outside of OpenShift, which is deemed acceptable.
This Section: High-Level description of the Market Problem ie: Executive Summary
This Section: Articulates and defines the value proposition from a users point of view
This Section: Effect is the expected outcome within the market. There are two dimensions of outcomes; growth or retention. This represents part of the “why” statement for a feature.
As an engineer, I want the capability to implement CI test cases that run at different intervals (daily, weekly) so as to ensure that downstream operators that depend on certain capabilities are not negatively impacted when systems that CCO interacts with change behavior.
Acceptance Criteria:
Create a stubbed out e2e test path in CCO and matching e2e calling code in release such that there exists a path to tests that verify working in an AWS STS workflow.
oc-mirror is a GA product as of OpenShift 4.11.
The goal of this feature is to address any future customer requests for new features or capabilities in oc-mirror.
In the 4.12 release, a new feature was introduced to oc-mirror allowing it to use OCI FBC catalogs as a starting point for mirroring operators.
As an oc-mirror user, I would like the OCI FBC feature to be stable
so that I can use it in a production ready environment
and to make the new feature and all existing features of oc-mirror seamless
This feature is ring-fenced in the oc-mirror repository; it uses the following flags to achieve this, so as not to cause any breaking changes in the current oc-mirror functionality.
The OCI FBC (file-based catalog) format has been delivered for Tech Preview in 4.12
Tech Enablement slides can be found here https://docs.google.com/presentation/d/1jossypQureBHGUyD-dezHM4JQoTWPYwiVCM3NlANxn0/edit#slide=id.g175a240206d_0_7
Design doc is in https://docs.google.com/document/d/1-TESqErOjxxWVPCbhQUfnT3XezG2898fEREuhGena5Q/edit#heading=h.r57m6kfc2cwt (also contains latest design discussions around the stories of this epic)
Link to previous working epic https://issues.redhat.com/browse/CFE-538
Contacts for the OCI FBC feature
The OpenShift Assisted Installer is a user-friendly OpenShift installation solution for the various platforms, but focused on bare metal. This very useful functionality should be made available for the IBM zSystem platform.
Use of the OpenShift Assisted Installer to install OpenShift on an IBM zSystem
Using the OpenShift Assisted Installer to install OpenShift on an IBM zSystem
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. Initial completion during Refinement status.
Which other projects and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
As a multi-arch development engineer, I would like to ensure that the Assisted Installer workflow is fully functional and supported for z/VM deployments.
Acceptance Criteria
Description of the problem:
Using FCP (multipath) devices for a zVM node
parmline:
rd.neednet=1 console=ttysclp0 coreos.live.rootfs_url=http://172.23.236.156:8080/assisted-installer/rootfs.img ip=10.14.6.8::10.14.6.1:255.255.255.0:master-0:encbdd0:none nameserver=10.14.6.1 ip=[fd00::8]::[fd00::1]:64::encbdd0:none nameserver=[fd00::1] zfcp.allow_lun_scan=0 rd.znet=qeth,0.0.bdd0,0.0.bdd1,0.0.bdd2,layer2=1 rd.zfcp=0.0.8007,0x500507630400d1e3,0x4000401e00000000 rd.zfcp=0.0.8107,0x50050763040851e3,0x4000401e00000000 random.trust_cpu=on rd.luks.options=discard ignition.firstboot ignition.platform.id=metal console=tty1 console=ttyS1,115200n8
shows a disk limitation error in the UI.
<see attached image>
How reproducible:
Attach two FCP devices to a zVM node. Create a cluster and boot the zVM node into the discovery service. The Host discovery panel shows an error for the discovered host.
Steps to reproduce:
1. Attach two FCP devices to the zVM.
2. Create a new cluster using the AI UI and configure the discovery image.
3. Boot the zVM node.
4. Wait until the node shows up on the Host discovery panel.
5. FCP devices are not recognized as a valid option.
Actual results:
FCP devices can't be used as an installable disk
Expected results:
FCP device can be used for installation (multipath must be activated after installation:
https://docs.openshift.com/container-platform/4.13/post_installation_configuration/ibmz-post-install.html#enabling-multipathing-fcp-luns_post-install-configure-additional-devices-ibmz)
Discovered a regression on staging where the default is set to minimal ISO, preventing installation of OCP 4.13 for the s390x architecture.
See the following older bugs, which I guess address the same issue
Description of the problem:
DASD devices are not recognized correctly if attached to and used for a zVM node.
<see attached screenshot>
How reproducible:
Attach two DASD devices to a zVM node. Create a cluster and boot the zVM node into the discovery service. The Host discovery panel shows an error for the discovered host.
Steps to reproduce:
1. Attach two DASD devices to the zVM.
2. Create a new cluster using the AI UI and configure the discovery image.
3. Boot the zVM node.
4. Wait until the node shows up on the Host discovery panel.
5. DASD devices are not recognized as a valid option.
Actual results:
DASD devices can't be used as an installable disk
Expected results:
DASD devices can be used for installation. The user can choose which device AI will install to.
As an OpenShift infrastructure owner I need to deploy OCP on OpenStack with the installer-provisioned infrastructure workflow and configure my own load balancers
Customers want to use their own load balancers, and IPI comes with built-in LBs based on keepalived and haproxy.
vsphere has done the work already via https://issues.redhat.com/browse/SPLAT-409
As an OpenShift infrastructure owner I need to deploy OCP on OpenStack with the installer-provisioned infrastructure workflow and configure my own load balancers
Customers want to use their own load balancers, and IPI comes with built-in LBs based on keepalived and haproxy.
vsphere has done the work already via https://issues.redhat.com/browse/SPLAT-409
Notes: https://github.com/EmilienM/ansible-role-routed-lb is an example of an LB that will be used for CI and can be used by QE and customers.
As an OpenShift infrastructure owner I need to deploy OCP on OpenStack with the installer-provisioned infrastructure workflow and configure my own load balancers
Customers want to use their own load balancers, and IPI comes with built-in LBs based on keepalived and haproxy.
vsphere has done the work already via https://issues.redhat.com/browse/SPLAT-409
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
<!--
Please make sure to fill all story details here with enough information so
that it can be properly sized and is immediately actionable. Our Definition
of Ready for user stories is detailed in the link below:
https://docs.google.com/document/d/1Ps9hWl6ymuLOAhX_-usLmZIP4pQ8PWO15tMksh0Lb_A/
As much as possible, make sure this story represents a small chunk of work
that could be delivered within a sprint. If not, consider the possibility
of splitting it or turning it into an epic with smaller related stories.
Before submitting it, please make sure to remove all comments like this one.
-->
USER STORY:
<!--
One sentence describing this story from an end-user perspective.
-->
As a [type of user], I want [an action] so that [a benefit/a value].
DESCRIPTION:
<!--
Provide as many details as possible, so that any team member can pick it up
and start to work on it immediately without having to reach out to you.
-->
Required:
...
Nice to have:
...
ACCEPTANCE CRITERIA:
<!--
Describe the goals that need to be achieved so that this story can be
considered complete. Note this will also help QE to write their acceptance
tests.
-->
ENGINEERING DETAILS:
<!--
Any additional information that might be useful for engineers: related
repositories or pull requests, related email threads, GitHub issues or
other online discussions, how to set up any required accounts and/or
environments if applicable, and so on.
-->
Testing is one of the main pillars of production-grade software. It helps validate and flag issues early on, before the code is shipped into productive landscapes. Code changes, no matter how small they are, might lead to bugs and outages. The best way to validate against bugs is to write proper tests, and to run those tests we need a foundation for a test infrastructure. Finally, to close the circle, automation of these tests and their corresponding builds helps reduce errors and saves a lot of time.
Note: Sync with the Developer productivity teams might be required to understand infra requirements especially for our first HyperShift infrastructure backend, AWS.
Context:
This is a placeholder epic to capture all the e2e scenarios that we want to test in CI in the long term. Anything which is a TODO here should at minimum be validated by QE as it is developed.
DoD:
Every supported scenario is e2e CI tested.
Scenarios:
DoD:
Refactor the E2E tests following the new pattern with 1 HostedCluster and targeted NodePools:
Goal
Productize agent-installer-utils container from https://github.com/openshift/agent-installer-utils
Feature Description
In order to ship the network reconfiguration it would be useful to move the agent-tui to its own image instead of sharing the agent-installer-node-agent one.
Goal
Productize agent-installer-utils container from https://github.com/openshift/agent-installer-utils
Feature Description
In order to ship the network reconfiguration it would be useful to move the agent-tui to its own image instead of sharing the agent-installer-node-agent one.
Currently the `agent create image` command takes care of extracting the agent-tui binary (and required libs) from the `assisted-installer-agent` image (shipped in the release as `agent-installer-node-agent`).
Once the agent-tui is available from the `agent-installer-utils` image instead, it will be necessary to update the installer code accordingly (see https://github.com/openshift/installer/blob/56e85bee78490c18aaf33994e073cbc16181f66d/pkg/asset/agent/image/agentimage.go#L81)
agent-tui is currently built and shipped using the assisted-installer-agent repo. Since it will be moved into its own repository (agent-installer-utils), it's necessary to clean up the previous code.
Allow users to interactively adjust the network configuration for a host after booting the agent ISO.
Configure network after host boots
The user has Static IPs, VLANs, and/or bonds to configure, but has no idea of the device names of the NICs. They don't enter any network config in agent-config.yaml. Instead they configure each host's network via the text console after it boots into the image.
Currently the agent-tui always displays the additional checks (nslookup/ping/HTTP GET), even when the primary check (pull image) passes. This may cause some confusion for the user, due to the fact that the additional checks do not prevent the agent-tui from completing successfully; they are just informative, to allow better troubleshooting of the issue (so they are not needed in the positive case).
The additional checks should then be shown only when the primary check fails for any reason.
When the UI is active in the console, event messages that are generated will distort the interface and make it difficult for the user to view the configuration and select options. An example is shown in the attached screenshot.
When the agent-tui is shown during the initial host boot, if the pull release image check fails then an additional checks box is shown along with a details text view.
The content of the details view gets continuously updated with the details of the failed check, but the user cannot move the focus over the details box (using the arrow/tab keys), and thus cannot scroll its content (using the up/down arrow keys).
The openshift-install agent create image command will need to fetch the agent-tui executable so that it can be embedded within the agent ISO. For this reason the agent-tui must be available in the release payload, so that it can be retrieved even when the command is invoked in a disconnected environment.
Full support of North-South (cluster egress-ingress) IPsec that shares an encryption back-end with the current East-West implementation, allows for IPsec offload to capable SmartNICs, can be enabled and disabled at runtime, and allows for FIPS compliance (including install-time configuration and disabling of runtime configuration).
This is a clone of issue OCPBUGS-17380. The following is the description of the original issue:
—
Description of problem:
Enable IPSec pre/post install on OVN IC cluster $ oc patch networks.operator.openshift.io cluster --type=merge -p '{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"ipsecConfig":{ }}}}}' network.operator.openshift.io/cluster patched ovn-ipsec containers complaining: ovs-monitor-ipsec | ERR | Failed to import certificate into NSS. b'certutil: unable to open "/etc/openvswitch/keys/ipsec-cacert.pem" for reading (-5950, 2).\n' $ oc rsh ovn-ipsec-d7rx9 Defaulted container "ovn-ipsec" out of: ovn-ipsec, ovn-keys (init) sh-5.1# certutil -L -d /var/lib/ipsec/nss Certificate Nickname Trust Attributes SSL,S/MIME,JAR/XPIovs_certkey_db961f9a-7de4-4f1d-a2fb-a8306d4079c5 u,u,u sh-5.1# cat /var/log/openvswitch/libreswan.log Aug 4 15:12:46.808394: Initializing NSS using read-write database "sql:/var/lib/ipsec/nss" Aug 4 15:12:46.837350: FIPS Mode: NO Aug 4 15:12:46.837370: NSS crypto library initialized Aug 4 15:12:46.837387: FIPS mode disabled for pluto daemon Aug 4 15:12:46.837390: FIPS HMAC integrity support [disabled] Aug 4 15:12:46.837541: libcap-ng support [enabled] Aug 4 15:12:46.837550: Linux audit support [enabled] Aug 4 15:12:46.837576: Linux audit activated Aug 4 15:12:46.837580: Starting Pluto (Libreswan Version 4.9 IKEv2 IKEv1 XFRM XFRMI esp-hw-offload FORK PTHREAD_SETSCHEDPRIO GCC_EXCEPTIONS NSS (IPsec profile) (NSS-KDF) DNSSEC SYSTEMD_WATCHDOG LABELED_IPSEC (SELINUX) SECCOMP LIBCAP_NG LINUX_AUDIT AUTH_PAM NETWORKMANAGER CURL(non-NSS) LDAP(non-NSS)) pid:147 Aug 4 15:12:46.837583: core dump dir: /run/pluto Aug 4 15:12:46.837585: secrets file: /etc/ipsec.secrets Aug 4 15:12:46.837587: leak-detective enabled Aug 4 15:12:46.837589: NSS crypto [enabled] Aug 4 15:12:46.837591: XAUTH PAM support [enabled] Aug 4 15:12:46.837604: initializing libevent in pthreads mode: headers: 2.1.12-stable (2010c00); library: 2.1.12-stable (2010c00) Aug 4 15:12:46.837664: NAT-Traversal support [enabled] Aug 4 15:12:46.837803: Encryption algorithms: Aug 4 15:12:46.837814: AES_CCM_16 {256,192,*128} IKEv1: ESP IKEv2: ESP FIPS aes_ccm, aes_ccm_c Aug 4 15:12:46.837820: AES_CCM_12 {256,192,*128} IKEv1: ESP IKEv2: ESP FIPS aes_ccm_b Aug 4 15:12:46.837826: AES_CCM_8 {256,192,*128} IKEv1: ESP IKEv2: ESP FIPS aes_ccm_a Aug 4 15:12:46.837831: 3DES_CBC [*192] IKEv1: IKE ESP IKEv2: IKE ESP FIPS NSS(CBC) 3des Aug 4 15:12:46.837837: CAMELLIA_CTR {256,192,*128} IKEv1: ESP IKEv2: ESP Aug 4 15:12:46.837843: CAMELLIA_CBC {256,192,*128} IKEv1: IKE ESP IKEv2: IKE ESP NSS(CBC) camellia Aug 4 15:12:46.837849: AES_GCM_16 {256,192,*128} IKEv1: ESP IKEv2: IKE ESP FIPS NSS(GCM) aes_gcm, aes_gcm_c Aug 4 15:12:46.837855: AES_GCM_12 {256,192,*128} IKEv1: ESP IKEv2: IKE ESP FIPS NSS(GCM) aes_gcm_b Aug 4 15:12:46.837861: AES_GCM_8 {256,192,*128} IKEv1: ESP IKEv2: IKE ESP FIPS NSS(GCM) aes_gcm_a Aug 4 15:12:46.837867: AES_CTR {256,192,*128} IKEv1: IKE ESP IKEv2: IKE ESP FIPS NSS(CTR) aesctr Aug 4 15:12:46.837872: AES_CBC {256,192,*128} IKEv1: IKE ESP IKEv2: IKE ESP FIPS NSS(CBC) aes Aug 4 15:12:46.837878: NULL_AUTH_AES_GMAC {256,192,*128} IKEv1: ESP IKEv2: ESP FIPS aes_gmac Aug 4 15:12:46.837883: NULL [] IKEv1: ESP IKEv2: ESP Aug 4 15:12:46.837889: CHACHA20_POLY1305 [*256] IKEv1: IKEv2: IKE ESP NSS(AEAD) chacha20poly1305 Aug 4 15:12:46.837892: Hash algorithms: Aug 4 15:12:46.837896: MD5 IKEv1: IKE IKEv2: NSS Aug 4 15:12:46.837901: SHA1 IKEv1: IKE IKEv2: IKE FIPS NSS sha Aug 4 15:12:46.837906: SHA2_256 IKEv1: IKE IKEv2: IKE FIPS NSS sha2, sha256 Aug 4 15:12:46.837910: SHA2_384 IKEv1: IKE IKEv2: IKE FIPS NSS sha384 Aug 4 15:12:46.837915: SHA2_512 
IKEv1: IKE IKEv2: IKE FIPS NSS sha512 Aug 4 15:12:46.837919: IDENTITY IKEv1: IKEv2: FIPS Aug 4 15:12:46.837922: PRF algorithms: Aug 4 15:12:46.837927: HMAC_MD5 IKEv1: IKE IKEv2: IKE native(HMAC) md5 Aug 4 15:12:46.837931: HMAC_SHA1 IKEv1: IKE IKEv2: IKE FIPS NSS sha, sha1 Aug 4 15:12:46.837936: HMAC_SHA2_256 IKEv1: IKE IKEv2: IKE FIPS NSS sha2, sha256, sha2_256 Aug 4 15:12:46.837950: HMAC_SHA2_384 IKEv1: IKE IKEv2: IKE FIPS NSS sha384, sha2_384 Aug 4 15:12:46.837955: HMAC_SHA2_512 IKEv1: IKE IKEv2: IKE FIPS NSS sha512, sha2_512 Aug 4 15:12:46.837959: AES_XCBC IKEv1: IKEv2: IKE native(XCBC) aes128_xcbc Aug 4 15:12:46.837962: Integrity algorithms: Aug 4 15:12:46.837966: HMAC_MD5_96 IKEv1: IKE ESP AH IKEv2: IKE ESP AH native(HMAC) md5, hmac_md5 Aug 4 15:12:46.837984: HMAC_SHA1_96 IKEv1: IKE ESP AH IKEv2: IKE ESP AH FIPS NSS sha, sha1, sha1_96, hmac_sha1 Aug 4 15:12:46.837995: HMAC_SHA2_512_256 IKEv1: IKE ESP AH IKEv2: IKE ESP AH FIPS NSS sha512, sha2_512, sha2_512_256, hmac_sha2_512 Aug 4 15:12:46.837999: HMAC_SHA2_384_192 IKEv1: IKE ESP AH IKEv2: IKE ESP AH FIPS NSS sha384, sha2_384, sha2_384_192, hmac_sha2_384 Aug 4 15:12:46.838005: HMAC_SHA2_256_128 IKEv1: IKE ESP AH IKEv2: IKE ESP AH FIPS NSS sha2, sha256, sha2_256, sha2_256_128, hmac_sha2_256 Aug 4 15:12:46.838008: HMAC_SHA2_256_TRUNCBUG IKEv1: ESP AH IKEv2: AH Aug 4 15:12:46.838014: AES_XCBC_96 IKEv1: ESP AH IKEv2: IKE ESP AH native(XCBC) aes_xcbc, aes128_xcbc, aes128_xcbc_96 Aug 4 15:12:46.838018: AES_CMAC_96 IKEv1: ESP AH IKEv2: ESP AH FIPS aes_cmac Aug 4 15:12:46.838023: NONE IKEv1: ESP IKEv2: IKE ESP FIPS null Aug 4 15:12:46.838026: DH algorithms: Aug 4 15:12:46.838031: NONE IKEv1: IKEv2: IKE ESP AH FIPS NSS(MODP) null, dh0 Aug 4 15:12:46.838035: MODP1536 IKEv1: IKE ESP AH IKEv2: IKE ESP AH NSS(MODP) dh5 Aug 4 15:12:46.838039: MODP2048 IKEv1: IKE ESP AH IKEv2: IKE ESP AH FIPS NSS(MODP) dh14 Aug 4 15:12:46.838044: MODP3072 IKEv1: IKE ESP AH IKEv2: IKE ESP AH FIPS NSS(MODP) dh15 Aug 4 15:12:46.838048: MODP4096 IKEv1: IKE ESP AH IKEv2: IKE ESP AH FIPS NSS(MODP) dh16 Aug 4 15:12:46.838053: MODP6144 IKEv1: IKE ESP AH IKEv2: IKE ESP AH FIPS NSS(MODP) dh17 Aug 4 15:12:46.838057: MODP8192 IKEv1: IKE ESP AH IKEv2: IKE ESP AH FIPS NSS(MODP) dh18 Aug 4 15:12:46.838061: DH19 IKEv1: IKE IKEv2: IKE ESP AH FIPS NSS(ECP) ecp_256, ecp256 Aug 4 15:12:46.838066: DH20 IKEv1: IKE IKEv2: IKE ESP AH FIPS NSS(ECP) ecp_384, ecp384 Aug 4 15:12:46.838070: DH21 IKEv1: IKE IKEv2: IKE ESP AH FIPS NSS(ECP) ecp_521, ecp521 Aug 4 15:12:46.838074: DH31 IKEv1: IKE IKEv2: IKE ESP AH NSS(ECP) curve25519 Aug 4 15:12:46.838077: IPCOMP algorithms: Aug 4 15:12:46.838081: DEFLATE IKEv1: ESP AH IKEv2: ESP AH FIPS Aug 4 15:12:46.838085: LZS IKEv1: IKEv2: ESP AH FIPS Aug 4 15:12:46.838089: LZJH IKEv1: IKEv2: ESP AH FIPS Aug 4 15:12:46.838093: testing CAMELLIA_CBC: Aug 4 15:12:46.838096: Camellia: 16 bytes with 128-bit key Aug 4 15:12:46.838162: Camellia: 16 bytes with 128-bit key Aug 4 15:12:46.838201: Camellia: 16 bytes with 256-bit key Aug 4 15:12:46.838243: Camellia: 16 bytes with 256-bit key Aug 4 15:12:46.838280: testing AES_GCM_16: Aug 4 15:12:46.838284: empty string Aug 4 15:12:46.838319: one block Aug 4 15:12:46.838352: two blocks Aug 4 15:12:46.838385: two blocks with associated data Aug 4 15:12:46.838424: testing AES_CTR: Aug 4 15:12:46.838428: Encrypting 16 octets using AES-CTR with 128-bit key Aug 4 15:12:46.838464: Encrypting 32 octets using AES-CTR with 128-bit key Aug 4 15:12:46.838502: Encrypting 36 octets using AES-CTR with 128-bit key Aug 4 15:12:46.838541: 
Encrypting 16 octets using AES-CTR with 192-bit key Aug 4 15:12:46.838576: Encrypting 32 octets using AES-CTR with 192-bit key Aug 4 15:12:46.838613: Encrypting 36 octets using AES-CTR with 192-bit key Aug 4 15:12:46.838651: Encrypting 16 octets using AES-CTR with 256-bit key Aug 4 15:12:46.838687: Encrypting 32 octets using AES-CTR with 256-bit key Aug 4 15:12:46.838724: Encrypting 36 octets using AES-CTR with 256-bit key Aug 4 15:12:46.838763: testing AES_CBC: Aug 4 15:12:46.838766: Encrypting 16 bytes (1 block) using AES-CBC with 128-bit key Aug 4 15:12:46.838801: Encrypting 32 bytes (2 blocks) using AES-CBC with 128-bit key Aug 4 15:12:46.838841: Encrypting 48 bytes (3 blocks) using AES-CBC with 128-bit key Aug 4 15:12:46.838881: Encrypting 64 bytes (4 blocks) using AES-CBC with 128-bit key Aug 4 15:12:46.838928: testing AES_XCBC: Aug 4 15:12:46.838932: RFC 3566 Test Case 1: AES-XCBC-MAC-96 with 0-byte input Aug 4 15:12:46.839126: RFC 3566 Test Case 2: AES-XCBC-MAC-96 with 3-byte input Aug 4 15:12:46.839291: RFC 3566 Test Case 3: AES-XCBC-MAC-96 with 16-byte input Aug 4 15:12:46.839444: RFC 3566 Test Case 4: AES-XCBC-MAC-96 with 20-byte input Aug 4 15:12:46.839600: RFC 3566 Test Case 5: AES-XCBC-MAC-96 with 32-byte input Aug 4 15:12:46.839756: RFC 3566 Test Case 6: AES-XCBC-MAC-96 with 34-byte input Aug 4 15:12:46.839937: RFC 3566 Test Case 7: AES-XCBC-MAC-96 with 1000-byte input Aug 4 15:12:46.840373: RFC 4434 Test Case AES-XCBC-PRF-128 with 20-byte input (key length 16) Aug 4 15:12:46.840529: RFC 4434 Test Case AES-XCBC-PRF-128 with 20-byte input (key length 10) Aug 4 15:12:46.840698: RFC 4434 Test Case AES-XCBC-PRF-128 with 20-byte input (key length 18) Aug 4 15:12:46.840990: testing HMAC_MD5: Aug 4 15:12:46.840997: RFC 2104: MD5_HMAC test 1 Aug 4 15:12:46.841200: RFC 2104: MD5_HMAC test 2 Aug 4 15:12:46.841390: RFC 2104: MD5_HMAC test 3 Aug 4 15:12:46.841582: testing HMAC_SHA1: Aug 4 15:12:46.841585: CAVP: IKEv2 key derivation with HMAC-SHA1 Aug 4 15:12:46.842055: 8 CPU cores online Aug 4 15:12:46.842062: starting up 7 helper threads Aug 4 15:12:46.842128: started thread for helper 0 Aug 4 15:12:46.842174: helper(1) seccomp security disabled for crypto helper 1 Aug 4 15:12:46.842188: started thread for helper 1 Aug 4 15:12:46.842219: helper(2) seccomp security disabled for crypto helper 2 Aug 4 15:12:46.842236: started thread for helper 2 Aug 4 15:12:46.842258: helper(3) seccomp security disabled for crypto helper 3 Aug 4 15:12:46.842269: started thread for helper 3 Aug 4 15:12:46.842296: helper(4) seccomp security disabled for crypto helper 4 Aug 4 15:12:46.842311: started thread for helper 4 Aug 4 15:12:46.842323: helper(5) seccomp security disabled for crypto helper 5 Aug 4 15:12:46.842346: started thread for helper 5 Aug 4 15:12:46.842369: helper(6) seccomp security disabled for crypto helper 6 Aug 4 15:12:46.842376: started thread for helper 6 Aug 4 15:12:46.842390: using Linux xfrm kernel support code on #1 SMP PREEMPT_DYNAMIC Thu Jul 20 09:11:28 EDT 2023 Aug 4 15:12:46.842393: helper(7) seccomp security disabled for crypto helper 7 Aug 4 15:12:46.842707: selinux support is NOT enabled. 
Aug 4 15:12:46.842728: systemd watchdog not enabled - not sending watchdog keepalives
Aug 4 15:12:46.843813: seccomp security disabled
Aug 4 15:12:46.848083: listening for IKE messages
Aug 4 15:12:46.848252: Kernel supports NIC esp-hw-offload
Aug 4 15:12:46.848534: adding UDP interface ovn-k8s-mp0 10.129.0.2:500
Aug 4 15:12:46.848624: adding UDP interface ovn-k8s-mp0 10.129.0.2:4500
Aug 4 15:12:46.848654: adding UDP interface br-ex 169.254.169.2:500
Aug 4 15:12:46.848681: adding UDP interface br-ex 169.254.169.2:4500
Aug 4 15:12:46.848713: adding UDP interface br-ex 10.0.0.8:500
Aug 4 15:12:46.848740: adding UDP interface br-ex 10.0.0.8:4500
Aug 4 15:12:46.848767: adding UDP interface lo 127.0.0.1:500
Aug 4 15:12:46.848793: adding UDP interface lo 127.0.0.1:4500
Aug 4 15:12:46.848824: adding UDP interface lo [::1]:500
Aug 4 15:12:46.848853: adding UDP interface lo [::1]:4500
Aug 4 15:12:46.851160: loading secrets from "/etc/ipsec.secrets"
Aug 4 15:12:46.851214: no secrets filename matched "/etc/ipsec.d/*.secrets"
Aug 4 15:12:47.053369: loading secrets from "/etc/ipsec.secrets"
sh-4.4# tcpdump -i any esp
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked v1), capture size 262144 bytes
^C
0 packets captured
sh-5.1# ovn-nbctl --no-leader-only get nb_global . ipsec
false
Version-Release number of selected component (if applicable):
openshift/cluster-network-operator#1874
How reproducible:
Always
Steps to Reproduce:
1. Install an OVN cluster and enable IPsec at runtime.
Actual results:
no esp packets seen across the nodes
Expected results:
esp traffic should be seen across the nodes
Additional info:
oc-mirror is a GA product as of OpenShift 4.11.
The goal of this feature is to address any future customer requests for new features or capabilities in oc-mirror.
Overview
This epic is a simple tracker epic for the proposed work and analysis for 4.14 delivery
As an oc-mirror user, I would like mirrored operator catalogs to have valid caches that reflect the contents of the catalog (configs folder), based on the filtering done in the ImageSetConfig for that catalog
so that the catalog image starts efficiently in a cluster.
Tasks:
opm serve /configs --cache-dir /tmp/cache --cache-only
Acceptance criteria:
Description of problem:
The customer was able to limit the nested repository path with "oc adm catalog mirror" by using the argument "--max-components", but there is no alternative solution with the "oc-mirror" binary, while we are suggesting to use the "oc-mirror" binary for mirroring.
For example, mirroring will work if we mirror like below:
oc mirror --config=./imageset-config.yaml docker://registry.gitlab.com/xxx/yyy
Mirroring will fail with 401 unauthorized if we add one more nested path, like below:
oc mirror --config=./imageset-config.yaml docker://registry.gitlab.com/xxx/yyy/zzz
Version-Release number of selected component (if applicable):
How reproducible:
We can reproduce the issue by using a registry that does not support deep nested repository paths
Steps to Reproduce:
1. Create an imageset to mirror any operator:
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
storageConfig:
  local:
    path: ./oc-mirror-metadata
mirror:
  operators:
    - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.12
      packages:
        - name: local-storage-operator
          channels:
            - name: stable
2. Do the mirroring to a registry that does not support a deep nested repository path. Here it is GitLab, which does not support nesting beyond 3 levels deep:
oc mirror --config=./imageset-config.yaml docker://registry.gitlab.com/xxx/yyy/zzz
This mirroring will fail with a 401 unauthorized error.
3. If we try to mirror the same imageset removing one path level, it will work without any issues, like below:
oc mirror --config=./imageset-config.yaml docker://registry.gitlab.com/xxx/yyy
Actual results:
Expected results:
Need an alternative to the "--max-components" option to limit the nested path depth in "oc-mirror"
Additional info:
Achieve feature parity for recently introduced functionality for all modes of operation
Currently there are gaps in functionality within oc mirror that we would like addressed.
1. Support oci: references within mirror.operators[].catalog in an ImageSetConfiguration when running in all modes of operation with the full functionality provided by oc mirror.
Currently oci: references such as the following are allowed only in limited circumstances:
mirror:
  operators:
    - catalog: oci:///tmp/oci/ocp11840
    - catalog: icr.io/cpopen/ibm-operator-catalog
Currently supported scenarios
In this mode of operation the images are fetched from the oci: reference rather than being pulled from a source docker image repository. These catalogs are processed through similar (yet different) mechanisms compared to docker image references. The end result in this scenario is that the catalog is potentially modified and images (i.e. catalog, bundle, related images, etc.) are pushed to their final docker image repository. This provides the full capabilities offered by oc mirror (e.g. catalog "filtering", image pruning, metadata manipulation to keep track of what has been mirrored, etc.)
Desired scenarios
In the following scenarios we would like oci: references to be processed in a similar way to how docker references are handled (as close as possible anyway given the different APIs involved). Ultimately we want oci: catalog references to provide the full set of functionality currently available for catalogs provided as a docker image reference. In other words we want full feature parity (e.g. catalog "filtering", image pruning, metadata manipulation to keep track of what has been mirrored, etc.)
In this mode of operation the images are fetched from the oci: reference rather than being pulled from a docker image repository. These catalogs are processed through similar yet different mechanisms compared to docker image references. The end result of this scenario is that all mappings and catalogs are packaged into tar archives (i.e. the "imageset").
In this mode of operation the tar archives (i.e. the "imageset") are processed via the "publish mechanism" which means unpacking the tar archives, processing the metadata, pruning images, rebuilding catalogs, and pushing images to their destination. In theory if the mirror-to-disk scenario is handled properly, then this mode should "just work".
Below the line is the original RFE requesting the OCI feature; it is only provided for reference.
Overview
Design, code, and implementation of the mirrorToDisk functionality
... so that I can use that along with the OCI FBC feature
Goal:
As a cluster administrator, I want OpenShift to include a recent HAProxy version, so that I have the latest available performance and security fixes.
Description:
We should strive to follow upstream HAProxy releases by bumping the HAProxy version that we ship in OpenShift with every 4.y release, so that OpenShift benefits from upstream performance and security fixes, and so that we avoid large version-number jumps when an urgent fix necessitates bumping to the latest HAProxy release. This bump should happen as early as possible in the OpenShift release cycle, so as to maximize soak time.
For OpenShift 4.13, this means bumping to 2.6.
As a cluster administrator,
I want OpenShift to include a recent HAProxy version,
so that I have the latest available performance and security fixes.
We should strive to follow upstream HAProxy releases by bumping the HAProxy version that we ship in OpenShift with every 4.y release, so that OpenShift benefits from upstream performance and security fixes, and so that we avoid large version-number jumps when an urgent fix necessitates bumping to the latest HAProxy release. This bump should happen as early as possible in the OpenShift release cycle, so as to maximize soak time.
For OpenShift 4.14, this means bumping to 2.6.
Bump the HAProxy version in dist-git so that OCP 4.13 ships HAProxy 2.6.13, with this patch added on top: https://git.haproxy.org/?p=haproxy-2.6.git;a=commit;h=2b0aafdc92f691bc4b987300c9001a7cc3fb8d08. The patch fixes the segfault that was being tracked as OCPBUGS-13232.
This patch is in HAProxy 2.6.14, so we can stop carrying the patch once we bump to HAProxy 2.6.14 or newer in a subsequent OCP release.
Tang-enforced, network-bound disk encryption has been available in OpenShift for some time, but all intended Tang-endpoints contributing unique key material to the process must be reachable during RHEL CoreOS provisioning in order to complete deployment.
If a user wants to require 3 of 6 Tang servers be reachable, then all 6 must be reachable during the provisioning process. This might not be possible due to maintenance, an outage, or simply network policy during deployment.
Enabling offline provisioning for first boot will help all of these scenarios.
The user can now provision a cluster with some or none of the Tang servers being reachable on first boot. Second boot, of course, will be subject to the Tang requirements being configured.
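For context, a hedged Butane-style sketch of the kind of configuration in question, requiring any 3 of 6 Tang servers at unlock time; the URLs and thumbprints are placeholders, and this snippet only illustrates the threshold semantics discussed above:

variant: openshift
version: 4.14.0
metadata:
  name: worker-tang-luks        # placeholder name
  labels:
    machineconfiguration.openshift.io/role: worker
boot_device:
  luks:
    threshold: 3                # any 3 of the listed Tang servers must respond
    tang:
      - url: http://tang1.example.com:7500
        thumbprint: REPLACE_WITH_THUMBPRINT
      - url: http://tang2.example.com:7500
        thumbprint: REPLACE_WITH_THUMBPRINT
      - url: http://tang3.example.com:7500
        thumbprint: REPLACE_WITH_THUMBPRINT
      # ...plus the remaining Tang servers, for 6 total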
Done when:
This requires the messy/complex work of grepping through for prior references to Ignition and updating golang types that reference other versions.
Assumption that existing tests are sufficient to catch discrepancies.
Goal
Allow the OpenShift installer to point to an existing OVA image stored in vSphere, replacing the current method that uploads the OVA template every time an OpenShift cluster is installed.
Why is this important?
This is an improvement that makes the installation more efficient by not having to upload an OVA from wherever openshift-install is running every time a cluster is installed, saving time and bandwidth. For example, if an administrator is installing over a VPN, then the OVA is uploaded through it to the target environment every time an OpenShift cluster is installed. Having a centralised OVA ready to use to install new clusters, without uploading it from where the installer is run, makes the administration process more efficient.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
This work will require updates to the core OpenShift API repository to add the new platform type, and then a distribution of this change to all components that use the platform type information. For components that partners might replace, per-component action will need to be taken, with the project team's guidance, to ensure that the component properly handles the "External" platform. These changes will look slightly different for each component.
To integrate these changes more easily into OpenShift, it is possible to take a multi-phase approach which could be spread over a release boundary (eg phase 1 is done in 4.X, phase 2 is done in 4.X+1).
OCPBU-5: Phase 1
OCPBU-510: Phase 2
OCPBU-329: Phase.Next
Phase 1
Phase 2
Phase 3
As a Red Hat Partner installing OpenShift using the External platform type, I would like to install my own Cloud Controller Manager (CCM). Having a field in the Infrastructure configuration object to signal that I will install my own CCM, and that Kubernetes should be configured to expect an external CCM, will allow me to run my own CCM on new OpenShift deployments.
This work has been defined in the External platform enhancement, and had previously been part of openshift/api. The CCM API pieces were removed for the 4.13 release of OpenShift to ensure that we did not ship unused portions of the API.
In addition to the API changes, library-go will need an update to the IsCloudProviderExternal function to detect if the External platform is selected and if the CCM should be enabled for external mode.
We will also need to check the ObserveCloudVolumePlugin function to ensure that it is not affected by the external changes and that it continues to use the external volume plugin.
After updating openshift/library-go, it will need to be re-vendored into the MCO, KCMO, and CCCMO (although this is not as critical as the other 2).
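For reference, a hedged sketch of the Infrastructure object shape being discussed; the field names and their spec/status placement are assumptions based on the External platform enhancement, and the platform name is a placeholder:

apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  name: cluster
spec:
  platformSpec:
    type: External
    external:
      platformName: partner-cloud           # set at install time, informational
status:
  platformStatus:
    type: External
    external:
      cloudControllerManager:
        state: External                     # signals that an externally managed CCM will be installed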
As a user I want to use the openshift installer to create clusters of platform type External so that I can use openshift more effectively on a partner provider platform.
To fully support the External platform type for partners and users, it will be useful to be able to have the installer understand when it sees the external platform type in the install-config.yaml, and then to properly populate the resulting infrastructure config object with the external platform type and platform name.
As defined in https://github.com/openshift/api/blob/master/config/v1/types_infrastructure.go#L241 , the external platform type allows the user to specify a name for the platform. This card is about updating the installer so that a user can provide both the external type and a platform name that will be expressed in the infrastructure manifest.
Aside from this information, the installer should continue with a normal platform "None" installation.
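A hedged sketch of the corresponding install-config.yaml stanza (the cluster name, domain, and platform name are placeholders):

apiVersion: v1
baseDomain: example.com
metadata:
  name: partner-cluster
platform:
  external:
    platformName: partner-cloud   # carried into the generated Infrastructure manifest
pullSecret: '...'
sshKey: '...'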
In the initial delivery of CoreOS Layering, it is required that administrators provide their own build environment to customize RHCOS images. That could be a traditional RHEL environment or potentially an enterprising administrator with some knowledge of OCP Builds could set theirs up on-cluster.
The primary virtue of an on-cluster build path is to continue using the cluster to manage the cluster. No external dependency, batteries-included.
In the context of the Machine Config Operator (MCO) in Red Hat OpenShift, on-cluster builds refer to the process of building an OS image directly on the OpenShift cluster, rather than building them outside the cluster (such as on a local machine or continuous integration (CI) pipeline) and then making a configuration change so that the cluster uses them. By doing this, we enable cluster administrators to have more control over the contents and configuration of their clusters’ OS image through a familiar interface (MachineConfigs and in the future, Dockerfiles).
This is the "consumption" side of the security – rpm-ostree needs to be able to retrieve images from the internal registry seamlessly.
This will involve setting up (or using some existing) pull secrets, and then getting them to the proper location on disk so that rpm-ostree can use them to pull images.
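As a rough sketch only (the path and delivery mechanism are assumptions, not the final design), the pull secret could be landed where rpm-ostree can read registry credentials with something like:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-ostree-pull-secret          # hypothetical name
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - path: /etc/ostree/auth.json           # assumed location rpm-ostree checks for registry auth
        mode: 384                             # 0600 in octal
        contents:
          source: data:text/plain;charset=utf-8;base64,<base64-encoded dockerconfigjson>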
At the layering sync meeting on Thursday, August 10th, it was decided that for this to be considered ready for Dev / Tech Preview, cluster admins need a way to inject custom Dockerfiles into their on-cluster builds.
(Commentary: It was also decided 4 months ago that this was not an MVP requirement in https://docs.google.com/document/d/1QSsq0mCgOSUoKZ2TpCWjzrQpKfMUL9thUFBMaPxYSLY/edit#heading=h.jqagm7kwv0lg. And quite frankly, this requirement should have been known at that point in time as opposed to the week before tech preview.)
The first phase of the layering effort involved creating a BuildController, whose job is to start and manage builds using the OpenShift Build API. We can use the work done to create the BuildController as the basis for our MVP. However, what we need from BuildController right now is less than BuildController currently provides. With that in mind, we need to remove certain parts of BuildController to create a more streamlined and simpler implementation ideal for an MVP.
Done when a version of BuildController is landed which does the following things:
The second phase of the layering effort involved creating a BuildController, whose job is to start and manage builds of OS images. While it should be able to perform those functions on its own, getting the built OS image onto each of the cluster nodes involves modifying other parts of the MCO to be layering-aware. To that end, there are three pieces involved, some of which will require modification:
Right now, the render controller listens for incoming MachineConfig changes. It generates the rendered config, which is composed of all of the MachineConfigs for a given MachineConfigPool. Once rendered, the Render Controller updates the MachineConfigPool to point to the new config. As far as I'm aware at the moment, this portion of the MCO will likely not need any modification.
The Node Controller listens for MachineConfigPool config changes. Whenever it identifies that a change has occurred, it applies the machineconfiguration.openshift.io/desiredConfig annotation to all the nodes in the targeted MachineConfigPool which causes the Machine Config Daemon (MCD) to apply the new configs. With this new layering mechanism, we'll need to add the additional annotation of machineconfiguration.openshift.io/desiredOSimage which will contain the fully-qualified pullspec for the new OS image (referenced by the image SHA256 sum). To be clear, we will not be replacing the desiredConfig annotation with the desiredOSimage annotation; both will still be used. This will allow Config Drift Monitor to continue to function the way it does with no modification required.
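For example, a node in a layered pool could end up carrying annotations along these lines (the rendered config name and pullspec below are purely illustrative):

apiVersion: v1
kind: Node
metadata:
  name: worker-0
  annotations:
    machineconfiguration.openshift.io/desiredConfig: rendered-worker-abc123   # illustrative rendered config name
    machineconfiguration.openshift.io/desiredOSimage: image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/os-image@sha256:0000000000000000000000000000000000000000000000000000000000000000   # illustrative digested pullspec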
Right now, the MCD listens to Node objects for changes to the machineconfiguration.openshift.io/desiredConfig annotation. With the new desiredOSimage annotation being present, the MCD will need to skip the parts of the update loop which write files and systemd units to disk. Instead, it will skip directly to the rpm-ostree application phase (after making sure the correct pull secrets are in place, etc.).
Done When:
To speed development for on-cluster builds and avoid a lot of complex code paths, the decision was made to put all functionality related to building OS images and managing internal registries into a separate binary within the MCO.
Eventually, this binary will be responsible for running the productionized BuildController and know how to respond to Machine OS Builder API objects. However, until the productionized BuildController and opt-in portions are ready, the first pass of this binary will be much simpler: For now, it can connect to the API server and print a "Hello World".
Done When:
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
Fully automated installation creating subnets in AWS Local Zones when the zone names are added to the edge compute pool on install-config.yaml.
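A minimal sketch of the edge compute pool in install-config.yaml; the zone name below is just an example Local Zone and the replica count is illustrative:

compute:
- name: edge
  replicas: 1
  platform:
    aws:
      zones:
      - us-east-1-nyc-1a   # example AWS Local Zone name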
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
This is part of the overall multi-release Composable OpenShift effort (OCPPLAN-9638), which is being delivered in multiple phases:
Phase 1 (OpenShift 4.11): OCPPLAN-7589 Provide a way with CVO to allow disabling and enabling of operators
Phase 2 (OpenShift 4.12): OCPPLAN-7589 Provide a way with CVO to allow disabling and enabling of operators
Phase 3 (OpenShift 4.13): OCPBU-117
Phase 4 (OpenShift 4.14): OCPSTRAT-36 (formerly OCPBU-236)
Phase 5 (OpenShift 4.15): OCPSTRAT-421 (formerly OCPBU-519)
Phase 6 (OpenShift 4.16): OCPSTRAT-731
Questions to be addressed:
Per https://github.com/openshift/enhancements/pull/922 we need `oc adm release new` to parse the resource manifests for `capability` annotations and generate a yaml file that lists the valid capability names, to embed in the release image.
This file can be used by the installer to error or warn when the install config lists capabilities for enable/disable that are not valid capability names.
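For reference, a hedged sketch of what the capability annotation looks like on a payload manifest (the Deployment shown is just an example carrier); the exact format of the generated capability-list file is defined by the enhancement and is not reproduced here:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-baremetal-operator
  namespace: openshift-machine-api
  annotations:
    capability.openshift.io/name: baremetal   # oc adm release new collects these names into the release image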
Note: Moved a couple of cards from OTA-554 to this epic, as these cards are lower priority for the 4.13 release and we could not mark them done.
oc adm release extract --included ..., or some such, that only works when no release pullspec is given: oc connects to the cluster to ask after the current release image (as it does today when you leave off a pullspec), but also collects FeatureGates, the cluster profile, and all that sort of stuff, so it can write only the manifests it expects the CVO to be attempting to reconcile.
This would be narrowly useful for ccoctl (see CCO-178 and CCO-186), because with this extract option, ccoctl wouldn't need to try to reproduce "which of these CredentialsRequests manifests does the cluster actually want filled?" locally.
It also seems like it would be useful for anyone trying to get a better feel for what the CVO is up to in their cluster, for the same reason that it reduces distracting manifests that don't apply.
The downside is that if we screw up the inclusion logic, we could have oc diverging from the CVO, and end up increasing confusion instead of decreasing confusion. If we move the inclusion logic to library-go, that reduces the risk a bit, but there's always the possibility that users are using an oc that is older or newer than the cluster's CVO. Some way to have oc warn when the option is used but the version differs from the current CVO version would be useful, but possibly complicated to implement, unless we take shortcuts like assuming that the currently running CVO has a version matched to the ClusterVersion's status.desired target.
Definition of done (more details in the OTA-692 spike comment):
here is a sketch of code which W. Trevor King suggested
While working on OTA-559, my oc#1237 broke JSON output, and needed a follow-up fix. To avoid destabilizing folks who consume the dev-tip oc, we should grow CI presubmits to exercise critical oc adm release ... pathways, to avoid that kind of accidental breakage.
So it's easier to make adjustments without having to copy/paste code between branches.
It is already possible to run a cluster with no instantiated image registry, but the image registry operator itself always runs. This is an unnecessary use of resources for clusters that don't need/want a registry. Making it possible to disable this will reduce the resource footprint as well as bug risks for clusters that don't need it, such as SNO and OKE.
To enable the MCO to replace the node-ca, the registry operator needs to provide its own CAs in isolation.
Currently, the registry provides its own CAs via the "image-registry-certificates" configmap. This configmap is a merge of the service ca, storage ca, and additionalTrustedCA (from images.config.openshift.io/cluster).
Because the MCO already has access to additionalTrustedCA, the new secret does not need to contain it.
ACCEPTANCE CRITERIA
TBD
Update ETCD datastore encryption to use AES-GCM instead of AES-CBC
2. What is the nature and description of the request?
The current ETCD datastore encryption solution uses the aes-cbc cipher. This cipher is now considered "weak" and is susceptible to padding oracle attacks. Upstream recommends using the AES-GCM cipher. AES-GCM will require automation to rotate secrets for every 200k writes.
The cipher used is hard coded.
3. Why is this needed? (List the business requirements here).
Security conscious customers will not accept the presence and use of weak ciphers in an OpenShift cluster. Continuing to use the AES-CBC cipher will create friction in sales and, for existing customers, may result in OpenShift being blocked from being deployed in production.
4. List any affected packages or components.
Epic Goal*
What is our purpose in implementing this? What new capability will be available to customers?
The Kube APIserver is used to set the encryption of data stored in etcd. See https://docs.openshift.com/container-platform/4.11/security/encrypting-etcd.html
Today with OpenShift 4.11 or earlier, only aescbc is allowed as the encryption field type.
RFE-3095 is asking that aesgcm (which is an updated and more recent type) be supported. Furthermore, RFE-3338 is asking for more customizability, which brings us to how we have implemented cipher customization with tlsSecurityProfile. See https://docs.openshift.com/container-platform/4.11/security/tls-security-profiles.html
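A minimal sketch of what opting into the new cipher looks like via the cluster APIServer config, assuming aesgcm is accepted as an encryption type as described in this feature:

apiVersion: config.openshift.io/v1
kind: APIServer
metadata:
  name: cluster
spec:
  encryption:
    type: aesgcm   # today only aescbc (and identity) are accepted here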
Why is this important? (mandatory)
AES-CBC is considered as a weak cipher
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
The new aesgcm encryption provider was added in 4.13 as techpreview, but as part of https://issues.redhat.com/browse/API-1509, the feature needs to be GA in OCP 4.13.
AES-GCM encryption was enabled in cluster-openshift-apiserver-operator and cluster-openshift-authentication-operator, but not in the cluster-kube-apiserver-operator. When trying to enable aesgcm encryption in the apiserver config, the kas-operator will produce an error saying that the aesgcm provider is not supported.
The new aesgcm encryption provider was added in 4.13 as techpreview, but as part of https://issues.redhat.com/browse/API-1509, the feature needs to be GA in OCP 4.13.
The new aesgcm encryption provider was added in 4.13 as techpreview, but as part of https://issues.redhat.com/browse/API-1509, the feature needs to be GA in OCP 4.13.
The new aesgcm encryption provider was added in 4.13 as techpreview, but as part of https://issues.redhat.com/browse/API-1509, the feature needs to be GA in OCP 4.13.
Support platform type External to allow installing with the agent-based installer on OCI, with a focus on https://www.oracle.com/cloud/cloud-at-customer/dedicated-region/faq/ for disconnected, on-prem deployments.
OCPSTRAT-510 OpenShift on Oracle Cloud Infrastructure (OCI) with VMs
Support platform type External to allow installing with the agent-based installer on OCI, with a focus on https://www.oracle.com/cloud/cloud-at-customer/dedicated-region/faq/ for disconnected, on-prem deployments.
As a user, I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
As a user of the agent-based installer, I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
As a user of the agent-based installer, I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Support OpenShift installation in AWS Shared VPC [1] scenario where AWS infrastructure resources (at least the Private Hosted Zone) belong to an account separate from the cluster installation target account.
As a user, I need to use a Shared VPC [1] when installing OpenShift on AWS into an existing VPC. This will at least require the use of a preexisting Route 53 hosted zone, because as a "participant" user of the shared VPC I am not allowed to automatically create Route 53 private zones.
The Installer is able to successfully deploy OpenShift on AWS with a Shared VPC [1], and the cluster is able to successfully pass osde2e testing. This will include at least the scenario where the private hosted zone belongs to a different account (Account A) than the cluster resources (Account B).
[1] https://docs.aws.amazon.com/vpc/latest/userguide/vpc-sharing.html
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
I want
so that I can
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
Enhancement PR: https://github.com/openshift/enhancements/pull/1397
API PR: https://github.com/openshift/api/pull/1460
Ingress Operator PR: https://github.com/openshift/cluster-ingress-operator/pull/928
Feature Goal: Support OpenShift installation in AWS Shared VPC scenario where AWS infrastructure resources (at least the Private Hosted Zone) belong to an account separate from the cluster installation target account.
The ingress operator is responsible for creating DNS records in AWS Route53 for cluster ingress. Prior to the implementation of this epic, the ingress operator doesn't have the capability to add DNS records into an existing Route 53 hosted zone in the shared VPC.
As described in the WIP PR https://github.com/openshift/cluster-ingress-operator/pull/928, the ingress operator will consume a new API field that contains the IAM Role ARN for configuring DNS records in the private hosted zone. If this field is present, then the ingress operator will use this account to create all private hosted zone records. The API fields will be described in the Enhancement PR.
The ingress operator code will accomplish this by defining a new provider implementation that wraps two other DNS providers, using one of them to publish records to the public zone and the other to publish records to the private zone.
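As a rough illustration only, the cluster DNS config could carry the Route 53 role along these lines; the exact field name and location are defined in the enhancement and API PRs above, so treat the names here as placeholders:

apiVersion: config.openshift.io/v1
kind: DNS
metadata:
  name: cluster
spec:
  platform:
    type: AWS
    aws:
      privateZoneIAMRole: arn:aws:iam::123456789012:role/shared-vpc-private-zone-role   # placeholder ARN and field name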
See NE-1299
See NE-1299
Develop the implementation for supporting AWS Shared VPC pre-existing Route53 as it is described in the enhancement: https://github.com/openshift/enhancements/pull/1397
During oc login with a token, pasting the token on the command line with the oc login --token command is insecure. The token is logged in bash history, and appears in ps output when run at precisely the time the oc login command runs. Moreover, the token gets logged and is searchable by any sysadmin.
Customers/Users would like either the "--web" option, or a command that prompts for a token. There should be no way to pass a secret on the command line with the --token option.
For environments where no web browser is available, a "--ask-token" option should be provided that prompts for a token instead of passing it on the command line.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. Initial completion during Refinement status.
Which other projects and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Epic Goal*
During oc login with a token, pasting the token on the command line with the oc login --token command is insecure. The token is logged in bash history, and appears in ps output when run at precisely the time the oc login command runs. Moreover, the token gets logged and is searchable by any sysadmin.
Customers/Users would like either the "--web" option, or a command that prompts for a token. There should be no way to pass a secret on the command line with the --token option.
For environments where no web browser is available, a "--ask-token" option should be provided that prompts for a token instead of passing it on the command line.
Why is this important? (mandatory)
Pasting the token on command line with oc login --token command is insecure
Scenarios (mandatory)
Customers/Users would like the "--web" option. There should be no way to pass a secret on the command line with the --token option.
For environments where no web browser is available, a "--ask-token" option should be provided that prompts for a token instead of passing it on the command line.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
In order to secure token usage during oc login, we need to add the capability to oc to login using the OAuth2 Authorization Code Grant Flow through a browser. This will be possible by providing a command line option to oc:
oc login --web
Add e2e tests in the OSIN library for redirect URI validation without ports on non-loopback links.
In order for the OAuth2 Authorization Code Grant Flow to work in oc browser login, we need a new OAuthClient that can obtain tokens through PKCE (https://datatracker.ietf.org/doc/html/rfc7636), as the existing clients do not have this capability. The new client will be called openshift-cli-client and will have the loopback addresses as valid Redirect URIs.
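A hedged sketch of what the new client object could look like; the exact redirect URIs and grant method are assumptions based on the loopback requirement described here:

apiVersion: oauth.openshift.io/v1
kind: OAuthClient
metadata:
  name: openshift-cli-client
grantMethod: auto
redirectURIs:
- http://127.0.0.1/callback   # loopback; any port is accepted per the OSIN change described below
- http://[::1]/callback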
In order for the OAuth2 Authorization Code Grant Flow to work in oc browser login, the OSIN server must ignore any port used in the Redirect URIs of the flow when the URIs are the loopback addresses. This has already been added to OSIN; we need to update the oauth-server to use the latest version of OSIN in order to make use of this capability.
Review the OVN Interconnect proposal, figure out the work that needs to be done in ovn-kubernetes to be able to move to this new OVN architecture.
OVN IC will be the model used in Hypershift.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
For interconnect upgrades, i.e. when moving from OCP 4.13 to OCP 4.14 where IC is enabled, we do a two-phase rollout of the ovnkube-master and ovnkube-node pods in the openshift-ovn-kubernetes namespace. This is to ensure we have minimum disruption, since major architectural components are being brought from the control plane down to the data plane.
Since it is a two-phase rollout with each phase taking approximately 10 minutes, we effectively double the time it takes for the OVNK component to upgrade, which means increasing the timeout thresholds on AWS.
See https://redhat-internal.slack.com/archives/C050MC61LVA/p1689768779938889 for some more details.
See sample runs:
I have noticed this happening once on GCP:
This has not happened on Azure, which has a 95-minute allowance. So this card tracks the work to increase the timers on AWS/GCP. This was brought up in the TRT team sync that happened yesterday (July 19th, 2023), and Scott Dodson has agreed to approve this under the condition that we bring the values back down to the current ones in release 4.15.
SDN team is confident the time will drop back to normal for future upgrades going from 4.14 -> 4.15 and so on. This will be tracked via https://issues.redhat.com/browse/OTA-999
Work with the https://issues.redhat.com/browse/SDN-3654 card to get data from the scale team as needed and continue to improve the numbers.
In the non-IC world we have a centralised DB, so running a trace is easy. In the IC world we would need the local DBs from every node to run a full pod-to-pod trace; otherwise we can only run half traces with the DB from one side.
Goal of this card:
Users want to create EFA instance MachineSets in the same AWS placement group to get the best network performance within that placement group.
The Scope of this Epic is only to support placement groups. Customers will create them.
The customer ask is that placement groups don't need to be created by the OpenShift Container Platform
OpenShift Container Platform only needs to be able to consume them and assign machines out of a machineset to a specific Placement Group.
Users want to create EFA instance MachineSets in the same AWS placement group to get the best network performance within that placement group.
Note: This Epic was previously connected to https://issues.redhat.com/browse/OCPPLAN-8106 and has been updated to OCPBU-327.
Scope
The Scope of this Epic is only to support placement groups. Customers will create them.
The customer ask is that placement groups don't need to be created by the OpenShift Container Platform
OpenShift Container Platform only needs to be able to consume them and assign machines out of a machineset to a specific Placement Group.
In CAPI, the AWS provider supports the user supplying the name of a pre-existing placement group, which will then be used to create the instances.
https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/4273
We need to add the same field to our API and then pass the information through in the same way, to allow users to leverage placement groups.
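A minimal sketch of how the field could surface in a Machine API providerSpec; the field name mirrors the CAPA field and should be treated as an assumption until the API change merges:

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: efa-workers                     # hypothetical name (selector and template labels omitted for brevity)
  namespace: openshift-machine-api
spec:
  replicas: 2
  template:
    spec:
      providerSpec:
        value:
          apiVersion: machine.openshift.io/v1beta1
          kind: AWSMachineProviderConfig
          instanceType: c5n.18xlarge        # an EFA-capable instance type
          placementGroupName: my-efa-group  # pre-existing placement group created by the customer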
The console operator should build up a set of the cluster nodes' OS types and supply it to the console, so that the console renders only operators that can be installed on the cluster.
This will be needed when we support different OS types on the cluster.
We need to scan through the compute nodes and build a set of supported OSes from them. Each node on the cluster has a label for its operating system, e.g. kubernetes.io/os=linux.
AC:
1. Proposed title of this feature request
Add a scroll bar for the resource list in the Uninstall Operator pop-up window
2. What is the nature and description of the request?
To make it easy for users to check the list of all resources
3. Why does the customer need this? (List the business requirements here)
For customers, one operator may have multiple resources; a scroll bar would make it easy for them to check them all in the Uninstall Operator pop-up window.
4. List any affected packages or components.
The console operator should build up a set of the cluster nodes' OS types and supply it to the console, so that the console renders only operators that can be installed on the cluster.
This will be needed when we support different OS types on the cluster.
We need to scan through the compute nodes and build a set of supported OSes from them. Each node on the cluster has a label for its operating system, e.g. kubernetes.io/os=linux.
AC:
Goal: OperatorHub/OLM users get a more intuitive UX around discovering and selecting Operator versions to install.
Problem statement: Today it's not possible to install an older version of an Operator unless the user knows the exact CSV semantic version. This is not, however, exposed through any API. `packageserver` as of today only shows the latest version per channel.
Why is this important: There are many reasons why a user would choose not to install the latest version, whether it's lack of testing or known problems. It should be easy for a user to discover what versions of an Operator OLM has in its catalogs and update graphs, and to expose this information in a consumable way to the user.
Acceptance Criteria:
Out of scope:
Related info
UX designs: http://openshift.github.io/openshift-origin-design/designs/administrator/olm/select-install-operator-version/
linked OLM jira: https://issues.redhat.com/browse/OPRUN-1399
where you can see the downstream PR: https://github.com/openshift/operator-framework-olm/pull/437/files
specifically: https://github.com/awgreene/operator-framework-olm/blob/f430b2fdea8bedd177550c95ec[…]r/pkg/package-server/apis/operators/v1/packagemanifest_types.go i.e., you can get a list of available versions in PackageChannel stanza from the packagemanifest API
You can reach out to OLM lead Alex Greene for any question regarding this too, thanks
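A hedged sketch of the kind of data the packagemanifest API exposes after the change referenced above; the entries field name follows the downstream PR as best understood, and the values are illustrative:

apiVersion: packages.operators.coreos.com/v1
kind: PackageManifest
metadata:
  name: etcd
status:
  channels:
  - name: stable
    currentCSV: etcdoperator.v0.9.4
    entries:                        # per-channel version list surfaced by this work
    - name: etcdoperator.v0.9.4
      version: 0.9.4
    - name: etcdoperator.v0.9.2
      version: 0.9.2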
Key Objective
Providing our customers with a single simplified User Experience (Hybrid Cloud Console) that is extensible, can run locally or in the cloud, and is capable of managing the fleet as well as deep-diving into a single cluster.
Why do customers want this?
Why do we want this?
Phase 2 Goal: Productization of the united Console
We need a way to show metrics for workloads running on spoke clusters. This depends on ACM-876, which lets the console discover the monitoring endpoints.
Open Issues:
We will depend on ACM to create a route on each spoke cluster for the prometheus tenancy service, which is required for metrics for normal users.
Openshift console backend should proxy managed cluster monitoring requests through the MCE cluster proxy addon to prometheus services on the managed cluster. This depends on https://issues.redhat.com/browse/ACM-1188
Initiative: Improve etcd disaster recovery experience (part1)
The current etcd backup and recovery process is described in our docs https://docs.openshift.com/container-platform/4.12/backup_and_restore/control_plane_backup_and_restore/backing-up-etcd.html
The current process leaves up to the cluster-admin to figure out a way to do consistent backups following the documented procedure.
This feature is part of a progressive delivery to improve the cluster-admin experience for backup and restore of etcd clusters to a healthy state.
Given that we have a controller that processes one-time etcd backup requests via the "operator.openshift.io/v1alpha1 EtcdBackup" CR, we need another controller that processes the "config.openshift.io/v1alpha1 Backup" CR so we can have periodic backups according to the schedule in the CR spec.
See https://github.com/openshift/api/pull/1482 for the APIs
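A hedged sketch of the periodic Backup CR this controller would reconcile; the field names are taken from the API PR above as best understood and should be treated as provisional:

apiVersion: config.openshift.io/v1alpha1
kind: Backup
metadata:
  name: default
spec:
  etcd:
    schedule: "0 */6 * * *"          # cron schedule for periodic backups
    timeZone: UTC
    retentionPolicy:
      retentionType: RetentionNumber
      retentionNumber:
        maxNumberOfBackups: 5
    pvcName: etcd-backup-pvc         # PVC where backups are saved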
The workflow for this controller should roughly be:
Along with this controller we would also need to provide the workload or Go command for the pod that is created periodically by the CronJob. This command, e.g. "create-etcdbackup-cr", effectively creates a new `operator.openshift.io/v1alpha1 EtcdBackup` CR via the following workflow:
Lastly to fulfill the retention policy (None, number of backups saved, or total size of backups), we can employ the following workflow:
Lastly to fulfill the retention policy (None, number of backups saved, or total size of backups), we can employ the following workflow:
See the parent story for more context.
As the first part to this story we need a controller with the following workflow:
Since we also want to preserve a history of successful and failed backup attempts for the periodic config, the CronJob should utilize cronjob history limits to preserve successful and failed jobs.
https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#jobs-history-limits
To begin with we can set this to a reasonable default of 5 successful and 10 failed jobs.
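A minimal sketch of how the CronJob could carry those history limits; the name, image, and command are placeholders for whatever workload ultimately creates the EtcdBackup CR:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: periodic-etcd-backup           # hypothetical name
  namespace: openshift-etcd
spec:
  schedule: "0 */6 * * *"
  successfulJobsHistoryLimit: 5
  failedJobsHistoryLimit: 10
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: create-etcdbackup-cr
            image: <cluster-etcd-operator image>                            # placeholder
            command: ["cluster-etcd-operator", "create-etcdbackup-cr"]      # placeholder command, per the story above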
For testing the automated backups feature we will require an e2e test that validates the backups by ensuring the restore procedure works for a quorum loss disaster recovery scenario.
See the following doc for more background:
https://docs.google.com/document/d/1NkdOwo53mkNBCktV5tkUnbM4vi7bG4fO5rwMR0wGSw8/edit?usp=sharing
This story targets milestones 2, 3 and 4 of the restore test, to ensure that the test has the ability to perform a backup and then restore from that backup in a disaster recovery scenario.
While the automated backups API is still in progress, the test will rely on the existing backup script to trigger a backup. Later on when we have a functional backup API behind a feature gate, the test can switch over to using that API to trigger backups.
We're starting with a basic crash-looping member restore first. The quorum loss scenario will be done in ETCD-423.
We should add some basic backup e2e tests into our operator:
The e2e workflow should be TechPreview enabled already.
For testing the automated backups feature we will require an e2e test that validates the backups by ensuring the restore procedure works for a quorum loss disaster recovery scenario.
See the following doc for more background:
https://docs.google.com/document/d/1NkdOwo53mkNBCktV5tkUnbM4vi7bG4fO5rwMR0wGSw8/edit?usp=sharing
This story targets the first milestone of the restore test: ensuring we have a platform-agnostic way to SSH into all masters in a test cluster so that we can perform the necessary backup, restore and validation workflows.
The suggested approach is to create a static pod that can do those SSH checks and actions from within the cluster, but other alternatives can also be explored as part of this story.
To fulfill one time backup requests there needs to be a new controller that reconciles an EtcdBackup CustomResource (CR) object and executes and saves a one time backup of the etcd cluster.
Similar to the upgradebackupcontroller the controller would be triggered to create a backup pod/job which would save the backup to the PersistentVolume specified by the spec of the EtcdBackup CR object.
The controller would also need to honor the retention policy specified by the EtcdBackup spec and update the status accordingly.
See the following enhancement and API PRs for more details and potential updates to the API and workflow for the one time backup:
https://github.com/openshift/enhancements/pull/1370
https://github.com/openshift/api/pull/1482
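A hedged sketch of the one-time EtcdBackup CR described above; the shape follows the enhancement and API PRs as best understood, so treat the field names as provisional:

apiVersion: operator.openshift.io/v1alpha1
kind: EtcdBackup
metadata:
  name: backup-example
spec:
  pvcName: etcd-backup-pvc   # PersistentVolumeClaim that receives the one-time backup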
< Who benefits from this feature, and how? What is the difference between today's current state and a world with this feature? >
Requirements | Notes | IS MVP |
< What are we making, for who, and why/what problem are we solving?>
<Defines what is not included in this story>
< Link or at least explain any known dependencies. >
Background, and strategic fit
< What does the person writing code, testing, documenting need to know? >
< Are there assumptions being made regarding prerequisites and dependencies?>
< Are there assumptions about hardware, software or people resources?>
< Are there specific customer environments that need to be considered (such as working with existing h/w and software)?>
< What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)? >
< Does this feature have doc impact? Possible values are: New Content, Updates to existing content, Release Note, or No Doc Impact?>
< Are there assumptions being made regarding prerequisites and dependencies?>
< Are there assumptions about hardware, software or people resources?>
< If the feature is ordered with other work, state the impact of this feature on the other work>
<links>
Currently the pipeline builder in the dev console directly queries Tekton Hub APIs for searching tasks. As the upstream community and Red Hat are moving to Artifact Hub, we need to query the Artifact Hub API for searching tasks.
Hitting the Artifacthub.io search endpoint sometimes fails due to a CORS error, and the version API endpoint always fails due to a CORS error. So we need a proxy to hit the Artifact Hub endpoint to get the data.
Search endpoint: https://artifacthub.io/docs/api/#/Packages/searchPackages
Version endpoint: https://artifacthub.io/docs/api/#/Packages/getTektonTaskVersionDetails
eg: https://artifacthub.io/api/v1/packages/tekton-task/tekton-catalog-tasks/git-clone/0.9.0
Feature Overview (aka. Goal Summary):
This feature will allow an x86 control plane to operate with compute nodes of type Arm in a HyperShift environment.
Goals (aka. expected user outcomes):
Enable an x86 control plane to operate with an Arm data-plane in a HyperShift environment.
Requirements (aka. Acceptance Criteria):
Customer Considerations:
Customers who require a mix of x86 control plane and Arm data-plane for their HyperShift environment will benefit from this feature.
Documentation Considerations:
Interoperability Considerations:
This feature should not impact other OpenShift layered products and versions in the portfolio.
As a user, I would like to deploy a hypershift cluster to an x86 managed cluster with an arm nodepool.
Starting point:
https://github.com/openshift/hypershift/blob/main/hypershift-operator/controllers/nodepool/nodepool_controller.go
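A hedged sketch of an Arm NodePool attached to an x86 hosted cluster; the arch field, release image, and AWS instance type below are assumptions based on the multi-arch NodePool work, not a confirmed API:

apiVersion: hypershift.openshift.io/v1beta1
kind: NodePool
metadata:
  name: arm-nodepool                  # hypothetical name
  namespace: clusters
spec:
  clusterName: example-hosted-cluster
  replicas: 2
  arch: arm64                         # data-plane architecture differs from the x86 control plane
  release:
    image: quay.io/openshift-release-dev/ocp-release:<multi-arch payload>   # placeholder
  platform:
    type: AWS
    aws:
      instanceType: m6g.large         # Graviton (arm64) instance type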
AC:
Goal
Numerous partners are asking for ways to pre-image servers in some central location before shipping them to an edge site where they can be configured as an OpenShift cluster: OpenShift-based Appliance.
A number of these cases are a good fit for a solution based on writing an image equivalent to the agent ISO, but without the cluster configuration, to disk at the central location and then configuring and running the installation when the servers reach their final location. (Notably, some others are not a good fit, and will require OpenShift to be fully installed, using the Agent-based installer or another, at the central location.)
While each partner will require a different image, usually incorporating some of their own software to drive the process as well, some basic building blocks of the image pipeline will be widely shared across partners.
Extended documentation
Building Blocks for Agent-based Installer Partner Solutions
Interactive Workflow work (OCPBU-132)
This work must "avoid conflict with the requirements for any future interactive workflow (see Interactive Agent Installer), and build towards it where the requirements coincide. This includes a graphical user interface (future assisted installer consistency).
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
Add a new installer subcommand, openshift-install agent create config-image.
This should create a small ISO (i.e. not a CoreOS boot image) containing just the configuration files from the automation flow:
The contents of the disk could be in any format, but should be optimised to make it simple for the service in AGENT-562 to read.
Implement a systemd service in the unconfigured agent ISO (AGENT-558) that watches for disks to be mounted, then searches them for agent installer configuration. If such configuration is found, then copy it to the relevant places in the running system.
The rendezvousIP must be copied last, as the presence of this is what will trigger the services to start (AGENT-556).
To the extent possible, the service should be agnostic as to the method by which the config disk was mounted (e.g. virtual media, USB stick, floppy disk, &c.). It may be possible to get systemd to trigger on volume mount, avoiding the need to poll anything.
The configuration drive must contain:
it may optionally contain:
The ClusterImageSet manifest must match the one already present in the image for the config to be accepted.
Support pd-balanced disk types for GCP deployments
OpenShift installer and Machine API should support creation and management of computing resources with disk type "pd-balanced"
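A minimal sketch of requesting pd-balanced disks in the install-config.yaml machine pools; the disk sizes shown are just examples:

controlPlane:
  name: master
  platform:
    gcp:
      osDisk:
        diskType: pd-balanced
        diskSizeGB: 128
compute:
- name: worker
  platform:
    gcp:
      osDisk:
        diskType: pd-balanced
        diskSizeGB: 128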
Why does the customer need this?
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
Many enterprises have strict security policies where all software must be pulled from a trusted or private source. For these scenarios, the RHCOS image used to bootstrap the cluster usually comes from shared public locations that some companies don't accept as a trusted source.
Questions to be addressed:
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
Description of problem:
ARO needs to copy RHCOS image blobs to their own Azure Marketplace offering since, as a first-party Azure service, they must not request anything from outside of Azure and must consume RHCOS VM images from a trusted source (marketplace). To meet the requirements, the ARO team currently does the following as part of the release process:
1. Mirror container images from quay.io to Azure Container Registry to avoid leaving Azure boundaries.
2. Copy the VM image from the blob in someone else's Azure subscription into the blob on the subscription the ARO team manages, and then publish a VM image on Azure Marketplace (publisher: azureopenshift, offer: aro4. See az vm image list --publisher azureopenshift --all). We do not bill for these images.
The usage of Marketplace images in the installer was already implemented as part of CORS-1823. This single line [1] needs to be refactored to enable ARO from the installer code perspective: on ARO we don't need to set type to AzureImageTypeMarketplaceWithPlan. However, in OCPPLAN-7556 and the related CORS-1823 it was mentioned that using Marketplace images is out of scope for nodes other than compute. For ARO we need to be able to use marketplace images for all nodes.
[1] https://github.com/openshift/installer/blob/f912534f12491721e3874e2bf64f7fa8d44aa7f5/pkg/asset/machines/azure/machines.go#L107
Version-Release number of selected component (if applicable):
4.13
How reproducible:
Always
Steps to Reproduce:
1. Set the RHCOS image from Azure Marketplace in the install-config.
2. Deploy a cluster.
Actual results:
Only compute nodes use the Marketplace image.
Expected results:
All nodes created by the Installer use RHCOS image coming from Azure Marketplace.
Additional info:
A user is able to specify a custom location in the Installer manifest for the RHCOS image to be used for bootstrap and cluster Nodes. This is similar to the approach we already support for AWS with the compute.platform.aws.amiID option.
https://issues.redhat.com/browse/CORS-1103
Some background on the Licenses field:
https://github.com/openshift/installer/pull/3808#issuecomment-663153787
https://github.com/openshift/installer/pull/4696
So we do not want to allow licenses to be specified when pre-built images are specified (current behaviour); it's up to customers to create a custom image with licenses embedded and supply that to the Installer. Since we don't need to specify licenses for RHCOS images anymore, the Licenses field is useless and should be deprecated.
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
As a user, I want to be able to:
so that I can achieve
A user is able to specify a custom location in the Installer manifest for the RHCOS image to be used for bootstrap and cluster Nodes. This is similar to the approach we already support for AWS with the compute.platform.aws.amiID option.
Epic Goal*
Kubernetes upstream has chosen to allow users to opt-out from CSI volume migration in Kubernetes 1.26 (1.27 PR, 1.26 backport). It is still GA there, but allows opt-out due to non-trivial risk with late CSI driver availability.
We want a similar capability in OCP - a cluster admin should be able to opt-in to CSI migration on vSphere in 4.13. Once they opt-in, they can't opt-out (at least in this epic).
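A hedged sketch of the opt-in, assuming it is exposed through the Storage operator config as a vSphere-specific field:

apiVersion: operator.openshift.io/v1
kind: Storage
metadata:
  name: cluster
spec:
  vsphereStorageDriver: CSIWithMigrationDriver   # opt in to CSI migration; LegacyDeprecatedInTreeDriver keeps the in-tree driver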
Why is this important? (mandatory)
See an internal OCP doc if / how we should allow a similar opt-in/opt-out in OCP.
Scenarios (mandatory)
Upgrade
New install
EUS to EUS (4.12 -> 4.14)
Contributing Teams(and contacts) (mandatory)
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
When CSIMigrationvSphere is disabled, cluster-storage-operator must re-create the in-tree StorageClass.
The vmware-vsphere-csi-driver-operator's StorageClass must not be marked as the default in that case (IMO we already have code for that).
This also means we need to fix the Disable SC e2e test to ignore StorageClasses for the in-tree driver. Otherwise we will reintroduce OCPBUGS-7623.
More details at ARO managed identity scope and impact.
This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature.. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Questions to be addressed:
As a cluster admin, I want the CCM and Node manager to utilize credentials generated by CCO so that the permissions granted to the identity can be scoped with least privilege on clusters utilizing Azure AD Workload Identity.
The Cloud Controller Manager Operator creates a CredentialsRequest as part of CVO manifests which describes credentials that should be created for the CCM and Node manager to utilize. CCM and the Node Manager do not use the credentials created as a product of the CredentialsRequest in existing "passthrough" based Azure clusters or within Azure AD Workload Identity based Azure clusters. CCM and the Node Manager instead use a system-assigned identity which is attached to the Azure cluster VMs.
The system-assigned identity attached to the VMs is granted the "Contributor" role within the cluster's Azure resource group. In order to use the system-assigned identity, a pod must have sufficient privilege to use the host network to contact the Azure instance metadata service (IMDS).
For Azure AD Workload Identity based clusters, administrators must process the CredentialsRequests extracted from the release image which includes the CredentialsRequest from CCCMO manifests. This CredentialsRequest processing results in the creation of a user-assigned managed identity which is not utilized by the cluster. Additionally, the permissions granted to the identity are currently scoped broadly to grant the "Contributor" role within the cluster's Azure resource group. If the CCM and Node Manager were to utilize the identity then we could scope the permissions granted to the identity to be more granular. It may be confusing to administrators to need to create this unused user-assigned managed identity with broad permissions access.
Update Azure Credentials Request manifest of the Cluster Ingress Operator to use new API field for requesting permissions to enable OCPBU-8?
Update Azure Credentials Request manifest of the Cluster Storage Operator to use new API field for requesting permissions
As a [user|developer|<other>] I want [some goal] so that [some reason]
<Describes high level purpose and goal for this story. Answers the questions: Who is impacted, what is it and why do we need it?>
<Describes the context or background related to this story>
As a [user|developer|<other>] I want [some goal] so that [some reason]
<Describes high level purpose and goal for this story. Answers the questions: Who is impacted, what is it and why do we need it?>
<Describes the context or background related to this story>
Add actuator code to satisfy permissions specified in 'Permissions' API field. The implementation should create a new custom role with specified permissions and assign it to the generated user-assigned managed identity along with the predefined roles enumerated in CredReq.RoleBindings. The role we create for the CredentialsRequest should be discoverable so that it can be idempotently updated on re-invocation of ccoctl.
Questions to answer based on lessons learned from custom roles in GCP, assuming that we will create one custom role per identity,
Add a new field (DataPermissions) to the Azure Credentials Request CR, and plumb it into the custom role assigned to the generated user-assigned managed identity's data actions.
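A hedged sketch of an Azure CredentialsRequest using the fields described in these stories; the permission strings, names, and the permissions/dataPermissions field names are illustrative, not the final API:

apiVersion: cloudcredential.openshift.io/v1
kind: CredentialsRequest
metadata:
  name: openshift-example-component        # hypothetical
  namespace: openshift-cloud-credential-operator
spec:
  secretRef:
    name: example-cloud-credentials
    namespace: openshift-example
  providerSpec:
    apiVersion: cloudcredential.openshift.io/v1
    kind: AzureProviderSpec
    roleBindings:
    - role: Contributor                    # predefined role, as today
    permissions:                           # new field: control-plane actions for the generated custom role
    - Microsoft.Network/privateDnsZones/read
    dataPermissions:                       # new field described above: data actions for the custom role
    - Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read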
Update Azure Credentials Request manifest of the Cluster Image Registry Operator to use new API field for requesting permissions
This effort is dependent on the completion of work for CCO-187, and effort in dependent modules is planned to be worked on by the CCO team unless individual repo owners can help. Operators owners/teams will be expected to review merge requests and complete appropriate QE effort for an openshift release.
ACCEPTANCE CRITERIA
This effort is dependent on the completion of work for CCO-187, and effort in dependent modules is planned to be worked on by the CCO team unless individual repo owners can help. Operators owners/teams will be expected to review merge requests and complete appropriate QE effort for an openshift release.
ACCEPTANCE CRITERIA
OPEN QUESTIONS
This effort is dependent on the completion of work for CCO-187, and effort in dependent modules is planned to be worked on by the CCO team unless individual repo owners can help. Operators owners/teams will be expected to review merge requests and complete appropriate QE effort for an openshift release.
ACCEPTANCE CRITERIA
As a cluster admin I want to be able to:
so that I can
Description of criteria:
This effort is dependent on the completion of work for CCO-187, and effort in dependent modules is planned to be worked on by the CCO team unless individual repo owners can help. Operators owners/teams will be expected to review merge requests and complete appropriate QE effort for an openshift release.
This effort is dependent on the completion of work for CCO-187, and effort in dependent modules is planned to be worked on by the CCO team unless individual repo owners can help. Operators owners/teams will be expected to review merge requests and complete appropriate QE effort for an openshift release.
This effort is dependent on the completion of work for CCO-187, and effort in dependent modules is planned to be worked on by the CCO team unless individual repo owners can help. Operators owners/teams will be expected to review merge requests and complete appropriate QE effort for an openshift release.
This effort is dependent on the completion of work for CCO-187, and effort in dependent modules is planned to be worked on by the CCO team unless individual repo owners can help. Operators owners/teams will be expected to review merge requests and complete appropriate QE effort for an openshift release.
This effort is dependent on the completion of work for CCO-187, and effort in dependent modules is planned to be worked on by the CCO team unless individual repo owners can help. Operators owners/teams will be expected to review merge requests and complete appropriate QE effort for an openshift release.
This effort is dependent on the completion of work for CCO-187, and effort in dependent modules is planned to be worked on by the CCO team unless individual repo owners can help. Operators owners/teams will be expected to review merge requests and complete appropriate QE effort for an openshift release.
This effort is dependent on the completion of work for CCO-187, and effort in dependent modules is planned to be worked on by the CCO team unless individual repo owners can help. Operators owners/teams will be expected to review merge requests and complete appropriate QE effort for an openshift release.
Create a config secret in the openshift-cloud-credential-operator namespace which contains the AZURE_TENANT_ID to be used for configuring the Azure AD pod identity webhook deployment.
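A minimal sketch of such a config secret; the secret and key names are assumptions for illustration:

apiVersion: v1
kind: Secret
metadata:
  name: azure-credentials              # hypothetical name
  namespace: openshift-cloud-credential-operator
stringData:
  azure_tenant_id: <tenant-id>         # consumed when configuring the pod identity webhook deployment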
These docs should cover:
See existing documentation for:
This effort is dependent on the completion of work for CCO-187, and effort in dependent modules is planned to be worked on by the CCO team unless individual repo owners can help. Operators owners/teams will be expected to review merge requests and complete appropriate QE effort for an openshift release.
RHEL CoreOS should be updated to RHEL 9.2 sources to take advantage of newer features, hardware support, and performance improvements.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
Questions to be addressed:
Part of setting CPU load balancing on RHEL 9 involves disabling sched_load_balance on cgroups that contain a cpuset that should be exclusive. The PAO may be required to be responsible for this piece
This is the Epic to track the work to add RHCOS 9 in OCP 4.13 and to make OCP use it by default.
CURRENT STATUS: Landed in 4.14 and 4.13
Testing with layering
Another option given an existing e.g. 4.12 cluster is to use layering. First, get a digested pull spec for the current build:
$ skopeo inspect --format "{{.Name}}@{{.Digest}}" -n docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev:4.13-9.2
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b4cc3995d5fc11e3b22140d8f2f91f78834e86a210325cbf0525a62725f8e099
Create a MachineConfig that looks like this:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: worker-override
spec:
  osImageURL: <digested pull spec>
If you want to also override the control plane, create a similar one for the master role.
We don't yet have auto-generated release images. However, if you want one, you can ask cluster bot to e.g. "launch https://github.com/openshift/machine-config-operator/pull/3485" with options you want (e.g. "azure" etc.) or just "build https://github.com/openshift/machine-config-operator/pull/3485" to get a release image.
STATUS: Code is merged for 4.13 and is believed to largely solve the problem.
Description of problem:
Upgrades from OpenShift 4.12 to 4.13 will also upgrade the underlying RHCOS from 8.6 to 9.2. As part of that, the names of the network interfaces may change. For example, `eno1` may be renamed to `eno1np0`. If a host is using NetworkManager configuration files that rely on those names, then the host will fail to connect to the network when it boots after the upgrade. For example, if the host had static IP addresses assigned, it will instead boot using IP addresses assigned via DHCP.
Version-Release number of selected component (if applicable):
4.13
How reproducible:
Always.
Steps to Reproduce:
1. Select hardware (or VMs) that will have different network interface names in RHCOS 8 and RHCOS 9, for example `eno1` in RHCOS 8 and `eno1np0` in RHCOS 9.
2. Install a 4.12 cluster with static network configuration using the `interface-name` field of NetworkManager interface configuration files to match the configuration to the network interface.
3. Upgrade the cluster to 4.13.
Actual results:
The NetworkManager configuration files are ignored because they no longer match the NIC names. Instead the NICs get new IP addresses from DHCP.
Expected results:
The NetworkManager configuration files are updated as part of the upgrade to use the new NIC names.
Additional info:
Note this is a hypothetical scenario. We have detected this potential problem in a slightly different scenario where we install a 4.13 cluster with the assisted installer. During the discovery phase we use RHCOS 8 and generate the NetworkManager configuration files. Then we reboot into RHCOS 9, and the configuration files are ignored due to the change in the NIC names. See MGMT-13970 for more details.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
This work will require updates to the core OpenShift API repository to add the new platform type, and then a distribution of this change to all components that use the platform type information. For components that partners might replace, per-component action will need to be taken, with the project team's guidance, to ensure that the component properly handles the "External" platform. These changes will look slightly different for each component.
To integrate these changes more easily into OpenShift, it is possible to take a multi-phase approach which could be spread over a release boundary (eg phase 1 is done in 4.X, phase 2 is done in 4.X+1).
Phase 1
Phase 2
Phase 3
As described in the external platform enhancement, the cluster-cloud-controller-manager-operator should be modified to react to the external platform type in the same manner as platform none.
As described in the external platform enhancement, the machine-api-operator should be modified to react to the external platform type in the same manner as platform none.
Create an Azure cloud-specific spec.resourceTags entry in the infrastructure CRD. This should create and update tags (or labels in Azure) on any OpenShift cloud resource that we create and manage. It should also tag existing resources that do not yet have the tags, and once the tags in the infrastructure CRD are changed, all the resources should be updated accordingly.
Tag deletes continue to be out of scope, as the customer can still have custom tags applied to the resources that we do not want to delete.
Due to the ongoing in-tree/out-of-tree split on the cloud and CSI providers, this should not apply to clusters with in-tree providers (!= "external").
Once we are confident we have all components updated, we should introduce an end-to-end test that makes sure we never create resources that are untagged.
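A sketch of the intended shape on the infrastructure CR, following the description above; the exact placement of the new resourceTags entry under the Azure platform spec is illustrative, not confirmed by this epic:

apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  name: cluster
spec:
  platformSpec:
    type: Azure
    azure:
      resourceTags:                # illustrative placement of the new spec.resourceTags entry
        - key: environment
          value: production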
Goals
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
List any affected packages or components.
Remove code references marking Azure Tags as TechPreview in the list below.
Create a warning-severity alert to notify the admin that packet loss is occurring due to failed OVS vswitchd lookups. This may occur if vswitchd is CPU constrained and there is also a high volume of lookups.
Use the metric ovs_vswitchd_netlink_overflow, which shows netlink messages dropped by the vswitchd daemon due to buffer overflow in userspace.
For the kernel equivalent, use the metric ovs_vswitchd_dp_flows_lookup_lost. Both metrics usually have the same value but may differ if vswitchd restarts.
Both of these metrics should be aggregated into a single alert that fires if either value has increased recently.
DoD: QE test case, code merged to CNO, metrics document updated ( https://docs.google.com/document/d/1lItYV0tTt5-ivX77izb1KuzN9S8-7YgO9ndlhATaVUg/edit )
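A rough sketch of what such a PrometheusRule could look like; the alert name, namespace, window, and threshold are placeholders, not the final rule that will land in CNO:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ovs-flow-lookup-loss          # illustrative name
  namespace: openshift-ovn-kubernetes  # illustrative namespace
spec:
  groups:
    - name: ovs.rules
      rules:
        - alert: OVSLookupPacketLoss   # placeholder alert name
          expr: |
            increase(ovs_vswitchd_netlink_overflow[5m])
              + increase(ovs_vswitchd_dp_flows_lookup_lost[5m]) > 0
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: Packet loss due to failed ovs-vswitchd flow lookups.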
< Who benefits from this feature, and how? What is the difference between today’s current state and a world with this feature? >
Requirements | Notes | IS MVP |
< What are we making, for who, and why/what problem are we solving?>
<Defines what is not included in this story>
< Link or at least explain any known dependencies. >
Background, and strategic fit
< What does the person writing code, testing, documenting need to know? >
< Are there assumptions being made regarding prerequisites and dependencies?>
< Are there assumptions about hardware, software or people resources?>
< Are there specific customer environments that need to be considered (such as working with existing h/w and software)?>
< What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)? >
< Does this feature have doc impact? Possible values are: New Content, Updates to existing content, Release Note, or No Doc Impact?>
< Are there assumptions being made regarding prerequisites and dependencies?>
< Are there assumptions about hardware, software or people resources?>
< If the feature is ordered with other work, state the impact of this feature on the other work>
<links>
There's no way in the UI for the cluster admin to
Expose the ability for cluster admins to provide customization for all web terminal users through the UI which is available in wtoctl
This is the follow-up story for PR https://github.com/openshift/console/pull/12718. A couple of tests, which depend on YAML, were added as manual tests. Proper automated tests need to be added for those.
Refer to PR https://github.com/openshift/console/pull/12718 for more details.
Update the help texts on the Initialize Terminal page as below:
1. "This Project will be used to initialize your command line terminal" to "Project used to initialize your command line terminal"
2. "Set timeout for the terminal." to "Pod timeout for your command line terminal"
3. "Set custom image for the terminal." to "Custom image used for your command line terminal"
Update the help texts on the Initialize Terminal page as below:
1. "This Project will be used to initialize your command line terminal" to "Project used to initialize your command line terminal"
2. "Set timeout for the terminal." to "Pod timeout for your command line terminal"
3. "Set custom image for the terminal." to "Custom image used for your command line terminal"
Allow cluster admin to provide default image and/or timeout period for all cluster users
Default Timeout - the WEB_TERMINAL_IDLE_TIMEOUT environment variable's value in the web-terminal-exec DevWorkspaceTemplate
Default Image - the .spec.components[].container.image field in the web-terminal-tooling DevWorkspaceTemplate
(a sketch of where these fields live follows this list)
5. Once the user changes this and saves, the same resources above need to be updated (refer to the comment in epic https://issues.redhat.com/browse/ODC-7119 for more details)
6. If the user has read access to the DevWorkspaceTemplate, the Save button should not be enabled; if the user does not have read access to the DevWorkspaceTemplate, the Web Terminal tab does not need to be shown on the Configuration page
7. Add e2e tests
Timeout and Image component should be similar to web terminal components (attached in ticket).
refer comment in epic https://issues.redhat.com/browse/ODC-7119 for more details
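To make the mapping concrete, here is a minimal sketch of the two DevWorkspaceTemplate fragments referenced above, assuming the workspace.devfile.io/v1alpha2 API; the names, image, and timeout value are illustrative, and only the env var name and field path come from this story:

apiVersion: workspace.devfile.io/v1alpha2
kind: DevWorkspaceTemplate
metadata:
  name: web-terminal-exec
spec:
  components:
    - name: web-terminal-exec
      container:
        env:
          - name: WEB_TERMINAL_IDLE_TIMEOUT   # default timeout read/written by the console
            value: "15m"                      # illustrative value
---
apiVersion: workspace.devfile.io/v1alpha2
kind: DevWorkspaceTemplate
metadata:
  name: web-terminal-tooling
spec:
  components:
    - name: web-terminal-tooling
      container:
        image: registry.example.com/web-terminal-tooling:latest   # default image (.spec.components[].container.image); illustrative pull spec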
HyperShift is being consumed by multiple providers. As a result, the need for documentation increases, especially around infrastructure/hardware/resource requirements, networking, etc.
Before the GA of Hosted Control Planes, we need to know/document:
The above questions are answered for all platforms we support, i.e., we need to answer for
Add support for NAT Gateways in Azure when deploying OpenShift on this cloud to manage the outbound network traffic, and make this the default option for new deployments.
When deploying OpenShift on Azure, the Installer will configure NAT Gateways as the default method to handle outbound network traffic, so we can prevent the existing SNAT port exhaustion issues related to the outboundType configured by default.
The installer will use the NAT Gateway object from Azure to manage the outbound traffic from OpenShift.
The installer will create a NAT Gateway object per AZ in Azure so the solution is HA.
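A hypothetical install-config excerpt to illustrate the intent; the exact outboundType value name the Installer will accept is an assumption here, not confirmed by this feature text:

apiVersion: v1
baseDomain: example.com
metadata:
  name: my-cluster
platform:
  azure:
    region: eastus
    outboundType: NatGateway   # assumed value name; would replace the current load-balancer-based outbound default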
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
Using NAT Gateway for egress traffic is the recommended approach from Microsoft
This is also a common ask from various enterprise customers, as they are hitting SNAT port exhaustion issues with the current solution OpenShift uses for outbound traffic management in Azure.
Which other projects and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
As an administrator, I want to be able to:
so that I can achieve
Description of criteria:
This requires/does not require a design proposal.
This requires/does not require a feature gate.
You can use the oc-mirror OpenShift CLI (oc) plugin to mirror all required OpenShift Container Platform content and other images to your mirror registry by using a single tool. It provides the following features:
This feature tracks bringing the oc-mirror plugin to IBM Power and IBM zSystems architectures.
Bring the oc mirror plugin to IBM Power and IBM zSystem architectures
oc mirror plugin on IBM Power and IBM zSystems should behave exactly like it does on x86 platforms.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. Initial completion during Refinement status.
Which other projects and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
If this Epic is an RFE, please complete the following questions to the best of your ability:
Q1: Proposed title of this RFE
Q2: What is the nature and description of the RFE?
The oc-mirror plugin will be the tool used for mirroring.
Q3: Why does the customer need this? (List the business requirements here)
Install a disconnected cluster without having x86 nodes available to manage the disconnected installation.
Q4: List any affected packages or components
Quay on the platform needs to be available for saving the images.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
To align with the 4.14 release, dependencies need to be updated to 1.27. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.14 release, dependencies need to be updated to 1.27. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.14 release, dependencies need to be updated to 1.27. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.14 release, dependencies need to be updated to 1.27. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.14 release, dependencies need to be updated to 1.27. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.14 release, dependencies need to be updated to 1.27. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.14 release, dependencies need to be updated to 1.27. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.14 release, dependencies need to be updated to 1.27. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.14 release, dependencies need to be updated to 1.27. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.14 release, dependencies need to be updated to 1.27. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.14 release, dependencies need to be updated to 1.27. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.14 release, dependencies need to be updated to 1.27. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.14 release, dependencies need to be updated to 1.27. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.14 release, dependencies need to be updated to 1.27. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.14 release, dependencies need to be updated to 1.27. This should be done by rebasing/updating as appropriate for the repository
Epic Goal*
What is our purpose in implementing this? What new capability will be available to customers?
Why is this important? (mandatory)
What are the benefits to the customer or Red Hat? Does it improve security, performance, supportability, etc? Why is work a priority?
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF). Trying no-feature-freeze in 4.12. We will try to do as much as we can before FF, but we're quite sure something will slip past FF as usual.
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
(Using separate cards for each driver because these updates can be more complicated)
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
(Using separate cards for each driver because these updates can be more complicated)
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
(Using separate cards for each driver because these updates can be more complicated)
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
(Using separate cards for each driver because these updates can be more complicated)
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
(Using separate cards for each driver because these updates can be more complicated)
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
This includes ibm-vpc-node-label-updater!
(Using separate cards for each driver because these updates can be more complicated)
Update all OCP and kubernetes libraries in storage operators to the appropriate version for OCP release.
This includes (but is not limited to):
Operators:
EOL, do not upgrade:
Update all CSI sidecars to the latest upstream release from https://github.com/orgs/kubernetes-csi/repositories
Corresponding downstream repos have `csi-` prefix, e.g. github.com/openshift/csi-external-attacher.
This includes the update of VolumeSnapshot CRDs in cluster-csi-snapshot-controller-operator assets and the client API in go.mod, i.e. copy all snapshot CRDs from upstream to the operator assets + go get -u github.com/kubernetes-csi/external-snapshotter/client/v6 in the operator repo.
Please review the following PR: https://github.com/openshift/machine-config-operator/pull/3598
The PR has been automatically opened by ART (#aos-art) team automation and indicates that the image(s) being used downstream for production builds are not consistent with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to be reopened automatically.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
To align with the 4.13 release, dependencies need to be updated to 1.26. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.13 release, dependencies need to be updated to 1.26. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.13 release, dependencies need to be updated to 1.26. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.13 release, dependencies need to be updated to 1.26. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.13 release, dependencies need to be updated to 1.26. This should be done by rebasing/updating as appropriate for the repository
The Agent-based Installer requires the generated ISO to be booted manually on the target nodes. Support for PXE booting will allow customers to automate their installations via their DHCP/PXE infrastructure.
This feature allows generating installation ISOs ready to add to a customer-provided DHCP/PXE infrastructure.
As an OpenShift installation admin I want to PXE-boot the image generated by the openshift-install agent subcommand
We have customers requesting this booting mechanism to make it easier to automate the booting of the nodes without having to actively place the generated image in a bootable device for each host.
As an OpenShift installation admin I want to PXE-boot the image generated by the openshift-install agent subcommand
We have customers requesting this booting mechanism to make it easier to automate the booting of the nodes without having to actively place the generated image in a bootable device for each host.
As a user of the Agent-based Installer(ABI), I want to be able to perform the customizations via agent-tui in case of PXE booting so that I can modify network settings.
Implementation details:
Create a new baseImage asset that gets inherited by agentImage and agentpxefiles. The baseImage prepares the initrd along with the necessary ignition and the network tui which is now read by agentImage and agentpxefiles.
ARM kernels are compressed with gzip, but most versions of ipxe cannot handle this (it's not clear what happens with raw pxe). See https://github.com/coreos/fedora-coreos-tracker/issues/1019 for more info.
If the platform is aarch64 we'll need to decompress the kernel like we do in https://github.com/openshift/machine-os-images/commit/1ed36d657fa3db55fc649761275c1f89cd7e8abe
The new command {{agent create pxe-files}} reads pxe-base-url from the agent-config.yaml. The field will be optional in the YAML file. If the URL is provided, then the command will generate an iPXE script specific to the given URL.
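For illustration, an agent-config.yaml excerpt with the optional field as named in this story; the apiVersion, other fields, and the final field name in the implementation are assumptions and may differ:

apiVersion: v1alpha1
kind: AgentConfig
metadata:
  name: my-cluster
rendezvousIP: 192.168.111.80
pxe-base-url: http://pxe-server.example.com/artifacts   # optional; if set, the generated iPXE script points at this URL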
Currently, we have the kernel parameters in the iPXE script statically defined from what Assisted Service generates. If the default parameters were to change in RHCOS that would be problematic. Thus, it would be much better if we were to extract them from the ISO.
The kernel parameters in the ISO are defined in EFI/redhat/grub.cfg (UEFI) and /isolinux/isolinux.cfg (legacy boot)
Support deploying multi-node clusters using platform none.
As of Jan 2023 we have almost 5,000 clusters reported using platform none installed on-prem (metal, vmware or other hypervisors with no platform integration) out of a total of about 12,000 reported clusters installed on-prem.
Platform none is desired by users to be able to install clusters across different host platforms (e.g. mixing virtual and physical) where Kubernetes platform integration isn't a requirement.
A goal of the Agent-Based Installer is to help users who currently can only deploy their topologies with UPI to be able to use the agent-based installer and get a simpler user experience while keeping all their flexibility.
Currently there are validation checks for platform None in OptionalInstallConfig that limit the None platform to 1 control plane replica, 0 compute replicas, and the NetworkType to OVNKubernetes.
These validations should be removed so that the None platform can be installed on clusters of any configuration.
Acceptance Criteria:
Add support to the Installer to make the S3 bucket deletion process during cluster bootstrap on AWS optional.
Allow the user to opt-out for deleting the S3 bucket created during the cluster bootstrap on AWS.
The user will be able to opt-out from deleting the S3 bucket created during the cluster bootstrap on AWS via the install-config manifest so the Installer will not try to delete this resource when destroying the bootstrap instance and the S3 bucket.
The current behavior will remain the default while deploying OpenShift on AWS, and both the bootstrap instance and the S3 bucket will be removed unless the user has opted out of this via the install-config manifest.
Some ROSA customers have SCP policies that prevent the deletion of any S3 bucket preventing ROSA adoption for these customers.
Documentation will be required for this feature to explain how to prevent the Installer from removing the S3 bucket, as well as an explanation of the security concerns when doing this, since the Installer will leave sensitive data used to bootstrap the cluster in the S3 bucket.
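A hypothetical install-config excerpt showing where such an opt-out could surface; the field name below is purely illustrative and not defined by this feature text:

apiVersion: v1
baseDomain: example.com
metadata:
  name: my-cluster
platform:
  aws:
    region: us-east-1
    preserveBootstrapIgnition: true   # illustrative field name: skip deleting the bootstrap S3 bucket on bootstrap destroy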
Which other projects and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
As a developer, I want to:
so that I can
Description of criteria:
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Console Support of OpenShift Pipelines Migration to Tekton v1 API
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. Initial completion during Refinement status.
Which other projects and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
The Pipeline API version is upgrading to v1 with the Red Hat Pipelines operator 1.11.0 release.
https://tekton.dev/vault/pipelines-main/migrating-v1beta1-to-v1/
Does this have to be backward compatible?
Will the features be equivalent? Will the UX / tests / documentation have to be updated?
As a user,
Description of problem:
When trying the old pipelines operator with the latest 4.14 build I couldn't see the Pipelines navigation items. The operator provides the Pipeline v1beta1, not v1.
Version-Release number of selected component (if applicable):
4.14 master only after https://github.com/openshift/console/pull/12729 was merged
How reproducible:
Always?
Steps to Reproduce:
Actual results:
Expected results:
Additional info:
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
The observable functionality that the user now has as a result of receiving this feature. Complete during New status.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. Initial completion during Refinement status.
Which other projects and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
For developers of serverless functions, we currently don't provide any samples.
Provide Serverless Function samples in the sample catalog. These would utilize the Builder Image capabilities.
As an operator author, I want to provide additional samples that are tied to an operator version, not an OpenShift release. For that, I want to create a resource to add new samples to the web console.
As an operator author, I want to provide additional samples that are tied to an operator version, not an OpenShift release. For that, I want to create a resource to add new samples to the web console.
As Arm adoption grows, OpenShift on Arm is a key strategic initiative for Red Hat. Key to its success is support for all key cloud providers adopting this technology. Google has announced support for Arm in its GCP offering, and we need to support OpenShift in this configuration.
The ability to have OCP on Arm running in a GCP instance
OCP on Arm running in a GCP instance
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. Initial completion during Refinement status.
Which other projects and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Description:
Update 4.14 documentation to reflect new GCP support on ARM machines.
Updates:
Acceptance criteria:
Description:
In order to add instance types to the OCP documentation, there needs to be a .md file in the OpenShift installer repo that contains the 64-bit ARM machine types that have been tested and are supported on GCP.
Create a PR in the OpenShift installer repo that creates a new .md file that shows the supported instance types
Acceptance criteria:
Azure File CSI supports both the SMB and NFS protocols. Currently we only support SMB, and there is a strong demand from IBM and individual customers to support NFS for POSIX compliance reasons.
Support Azure File CSI with NFS.
The Azure File operator will not automatically create a NFS storage class, we will document how to create one.
There are some concerns about the way Azure File CSI deals with NFS. It does not respect the FSGroup policy supplied in the pod definition. This breaks the Kubernetes convention where a pod should be able to define its own FSGroup policy; instead, Azure File CSI sets a per-driver policy that pods can't override.
We brought this problem up with MSFT, but there is no fix planned in the driver. Given the pressure from the field, we are going to support NFS with an "on root mismatch" default and document this specific behavior in our documentation.
As an OCP on Azure admin, I want my users to be able to consume NFS-based PVs through Azure File CSI.
As an OCP on Azure user, I want to attach NFS-based PVs to my pods.
As an ARO customer I want to consume NFS based PVs.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
Running two drivers, one for NFS and one for SMB to solve the FSGroup issue.
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
This feature is candidate to be backported up to 4.12 if possible.
Document that Azure File CSI NFS is supported, how to create a storage class as well as the FSGroup issue.
It's been decided to support the driver as it is today (see spike STOR-992), knowing it violates the Kubernetes fsGroupChangePolicy standard, where a pod is able to decide what FSGroup policy should be applied. Azure File with NFS applies an FSGroup policy at the driver level and pods cannot override it. We will keep the driver's default (on root mismatch) and document this unconventional behavior. Also, the Azure File CSI operator will not create a storage class for NFS; admins will need to create it manually, and this will be documented.
There is no need for specific development in the driver or the operator; engineering will make sure we have a working CI.
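As an example of the storage class admins would create manually (to be covered in docs), a minimal NFS class for the Azure File CSI driver might look like the sketch below; the parameter names follow the upstream driver and should be verified against the final documentation:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurefile-csi-nfs
provisioner: file.csi.azure.com
parameters:
  protocol: nfs          # selects the NFS protocol instead of the default SMB
reclaimPolicy: Delete
volumeBindingMode: Immediate
allowVolumeExpansion: true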
1. Proposed title of this feature request
Enable privileged containers to view rootfs of other containers
2. What is the nature and description of the request?
The skip_mount_home=true field in /etc/containers/storage.conf causes the mount propagation of container mounts to not be private, which allows privileged containers to access the rootfs of other containers. This is a fix for bug 2065283 (see comment #32 [2]).
This RFE is to enable that field by default in OpenShift, as well as to verify there are no performance regressions when applying it.
3. Why does the customer need this? (List the business requirements here)
Customer's use case:
Our agent runs as a daemonset in k8s clusters and monitors the node.
Running with mount propagation set to HostToContainer allows the agent to access any container file, including containers that start running after agent startup. With this setting, when a new container starts, a new mount is created and added to the host mount namespace and also to the agent container, and thereby the agent can access the container's files.
e.g. the agent is mounted to /host and can access the filesystem of other containers by path
/host/var/lib/containers/storage/overlay/xxxxxxxxxxxxxxxxxxxxxxxxxxxxx/merged/test_file. This approach works in k8s clusters and OpenShift 3, but not in OpenShift 4. How can I make the agent pod get notified about any new mount created on the node and get access to it as well?
The workaround for that was provided in bug 2065283 (see comment #32 [2]).
4. List any affected packages or components.
CRI-O, Node, MCO.
Additional information in this Slack discussion [3].
[1] https://docs.openshift.com/container-platform/4.11/post_installation_configuration/machine-configuration-tasks.html#create-a-containerruntimeconfig_post-install-machine-configuration-tasks
[2] https://bugzilla.redhat.com/show_bug.cgi?id=2065283#c32
[3] https://coreos.slack.com/archives/CK1AE4ZCK/p1670491480185299
Currently, SCCs are part of the OpenShift API and are subject to modifications by customers. This leads to a constant stream of issues:
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. Initial completion during Refinement status.
Which other projects and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Currently, SCCs are part of the OpenShift API and are subject to modifications by customers. This leads to a constant stream of issues:
We need to find and implement schemes to protect core workloads while retaining the API guarantee for modifications of SCCs (unfortunately).
Users of the OpenShift Console leverage a streamlined, visual experience when discovering and installing OLM-managed operators in clusters that run on cloud providers with support for short-lived token authentication enabled. Users are intuitively becoming aware when this is the case and are put on the happy path to configure OLM-managed operators with the necessary information to support AWS STS.
Customers do not need to re-learn how to enable AWS STS authentication support for each and every OLM-managed operator that supports it. The experience is standardized and repeatable, so customers spend less time on initial configuration and more time implementing business value. The process is so easy that OpenShift is perceived as an enabler for an increased security posture.
The OpenShift Console today provides little to no support for configuring OLM-managed operators for short-lived token authentication. Users are generally unaware if their cluster runs on a cloud provider and is set up to use short-lived tokens for its core functionality and users are not aware which operators have support for that by implementing the respective flows defined in OCPBU-559 and OCPBU-560.
Customers may or may not be aware of short-lived token authentication support. They need proper context and pointers to follow-up documentation to explain the general concept and the specific configuration flow the Console supports. It needs to become clear that the Console cannot 100% automate the overall process and some steps need to be run outside of the cluster/Console using cloud-provider-specific tooling.
This epic is tracking the console work needed for STS enablement. As well as documentation needed for enabling operator teams to use this new flow. This does not track Hypershift inclusion of CCO.
Plan is to backport to 4.12
install flow:
As a user of the console, I would like to provide the required fields for tokenized auth at install time (wrapping and providing sane defaults for what I can do manually in the CLI).
The role ARN provided by the user should be added to the service account of the installed operator as an annotation.
Only manual subscription is supported in STS mode - the automatic option should not be the default or should be greyed out entirely.
AC: Add input field to the operator install page, where user can provide the `roleARN` value. This value will be set on the operator's Subscription resource, when installing operator.
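One plausible shape for this, assuming the ARN is carried on the Subscription via its config stanza; the exact field and variable name used by the STS flow are assumptions here, not confirmed by this story:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: example-operator
  namespace: openshift-operators
spec:
  name: example-operator
  channel: stable
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  installPlanApproval: Manual        # only manual approval is supported in STS mode
  config:
    env:
      - name: ROLEARN                # assumed variable name; populated from the console's new roleARN input field
        value: arn:aws:iam::123456789012:role/example-operator-role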
STS - Security Token Service
Cluster is in STS mode when:
AC: Inform user on the Operator Hub item details that the cluster is in the STS mode
As a user of the console I would like to know which operators are safe to install (i.e. support tokenized auth or don't talk to the cloud provider).
AC: Add filter to the Operator Hub for filtering operators which have Short Lived Token Enabled
Upstream K8s deprecated PodSecurityPolicy and replaced it with a new built-in admission controller that enforces the Pod Security Standards (see here for the motivations for deprecation). There is an OpenShift-specific dedicated pod admission system called Security Context Constraints. Our aim is to keep the Security Context Constraints pod admission system while also allowing users to have access to the Kubernetes Pod Security Admission.
With OpenShift 4.11, we turned on the Pod Security Admission with global "privileged" enforcement. Additionally we set the "restricted" profile for warnings and audit. This configuration made it possible for users to opt their namespaces in to Pod Security Admission with the per-namespace labels. We also introduced a new mechanism that automatically synchronizes the Pod Security Admission "warn" and "audit" labels.
With OpenShift 4.15, we intend to move the global configuration to enforce the "restricted" pod security profile globally. With this change, the label synchronization mechanism will also switch into a mode where it synchronizes the "enforce" Pod Security Admission label rather than the "audit" and "warn".
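For context, the per-namespace Pod Security Admission labels being discussed look like this (illustrative namespace; the label keys are the standard upstream ones):

apiVersion: v1
kind: Namespace
metadata:
  name: example-app
  labels:
    pod-security.kubernetes.io/enforce: restricted   # the label the syncer would manage once global enforcement lands
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted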
In OpenShift 4.14, we intend to deliver functionality in code that will help accelerate moving to PSA enforcement. This feature tracks those deliverables.
The observable functionality that the user now has as a result of receiving this feature. Complete during New status.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. Initial completion during Refinement status.
Which other projects and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Epic Goal*
Deliver tools and code that helps toward PSa enforcement
Why is this important? (mandatory)
What are the benefits to the customer or Red Hat? Does it improve security, performance, supportability, etc? Why is work a priority?
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Don't enforce system defaults on a namespace's pod security labels, if it is managed by a user.
If the managedFields (https://kubernetes.io/docs/reference/using-api/server-side-apply/#field-management) indicate that a user changed the pod security labels, we should not enforce system defaults.
A user might not be aware that the label syncer can be turned off and tries to manually change the state of the pod security profiles.
This fight between a user and the label syncer can cause violations.
< Who benefits from this feature, and how? What is the difference between today’s current state and a world with this feature? >
Requirements | Notes | IS MVP |
< What are we making, for who, and why/what problem are we solving?>
<Defines what is not included in this story>
< Link or at least explain any known dependencies. >
Background, and strategic fit
< What does the person writing code, testing, documenting need to know? >
< Are there assumptions being made regarding prerequisites and dependencies?>
< Are there assumptions about hardware, software or people resources?>
< Are there specific customer environments that need to be considered (such as working with existing h/w and software)?>
< What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)? >
< Does this feature have doc impact? Possible values are: New Content, Updates to existing content, Release Note, or No Doc Impact?>
< Are there assumptions being made regarding prerequisites and dependencies?>
< Are there assumptions about hardware, software or people resources?>
< If the feature is ordered with other work, state the impact of this feature on the other work>
<links>
Additional improvements to segment, to enable the proper gathering of user telemetry and analysis
Currently, we have no accurate telemetry of the OpenShift Console usage across all fleet clusters. We should be able to utilize the auth and console telemetry to glean details which will allow us to get a picture of console usage by our customers.
There is no way to properly track specific pages
Change page title for all resource details pages to {resource-name} · {resource} · {tab-name} · OKD
Need to check all the resource pages which have details page and change the title.
Update page title to have non-translated title in {resource-name} · {resource} · {tab-name} · OKD format
All page titles of resource details pages should be added as a non-translated value in the {resource-name} · {resource} · {tab-name} · OKD format inside the <title> component, as an attribute named, for example, data-title-id, and this value should be used in fireUrlChangeEvent to send it as the title for the telemetry page event. Refer to spike https://issues.redhat.com/browse/ODC-7269 for more details.
Refer spike https://issues.redhat.com/browse/ODC-7269 for more details
labelKeyForNodeKind now returns a translated value; before, it used to return the label key. So change the method name from labelKeyForNodeKind to getTitleForNodeKind.
One of the steps in doing a disconnected environment install is to mirror the images to a designated system. This feature enhances oc-mirror to now handle the multi release payload, that is, the payload that contains all the platform images (x86, Arm, IBM Power, IBM Z). This is a key feature towards supporting disconnected installs in a multi-architecture compute, i.e. mixed architecture, cluster environment.
Customers will be able to use oc-mirror to enable the multi payload in a disconnected environment.
Allow oc-mirror to mirror the multi release payload
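A sketch of how this might be expressed in an oc-mirror ImageSetConfiguration; the architectures field and its value are assumptions for illustration, not something fixed by this feature text:

apiVersion: mirror.openshift.io/v1alpha2
kind: ImageSetConfiguration
mirror:
  platform:
    architectures:
      - multi          # assumed value selecting the multi (manifest-list) release payload
    channels:
      - name: stable-4.14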
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. Initial completion during Refinement status.
Which other projects and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
ACCEPTANCE CRITERIA
ImportMode api reference: https://github.com/openshift/api/blob/master/image/v1/types.go#L294
Original issue and discussion: https://coreos.slack.com/archives/CFFJUNP6C/p1664890804998069
ACCEPTANCE CRITERIA
ImportMode api reference: https://github.com/openshift/api/blob/master/image/v1/types.go#L294
With this feature it will be possible to autoscale from zero, that is, have machinesets that create new nodes without any existing current nodes, for use in a mixed architecture cluster configured with multi-architecture compute.
To be able to create a machineset and scale from zero in a mixed architecture cluster environment
Create a machineset and scale from zero in a mixed architecture cluster environment
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. Initial completion during Refinement status.
Which other projects and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Filing a ticket based on this conversation here: https://github.com/openshift/enhancements/pull/1014#discussion_r798674314
Basically the tl;dr here is that we need a way to ensure that machinesets are properly advertising the architecture that the nodes will eventually have. This is needed so the autoscaler can predict the correct pool to scale up/down. This could be accomplished through user driven means like adding node arch labels to machinesets and if we have to do this automatically, we need to do some more research and figure out a way.
For autoscaling nodes in a multi-arch compute cluster, node architecture needs to be taken into account because such a cluster could potentially have nodes of up to 4 different architectures. Labels can be propagated today from the machineset to the node group, but they have to be injected manually.
This story explores whether the autoscaler can use cloud provider APIs to derive the architecture of an instance type and set the label accordingly rather than it needing to be a manual step.
For autoscaling nodes in a multi-arch compute cluster, node architecture needs to be taken into account because such a cluster could potentially have nodes of up to 4 different architectures. Labels can be propagated today from the machineset to the node group, but they have to be injected manually.
This story explores whether the autoscaler can use cloud provider APIs to derive the architecture of an instance type and set the label accordingly rather than it needing to be a manual step.
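As a sketch of the manual step described above, one common upstream convention is to advertise labels on the MachineSet via a scale-from-zero capacity annotation; the annotation key here follows the upstream cluster-autoscaler convention and is an assumption, not something fixed by this story:

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: worker-arm64-us-east-1a
  namespace: openshift-machine-api
  annotations:
    # assumed annotation: tells the autoscaler which arch nodes from this MachineSet
    # will have, even while it currently has zero replicas
    capacity.cluster-autoscaler.kubernetes.io/labels: kubernetes.io/arch=arm64
spec:
  replicas: 0
  # selector and template omitted for brevity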
In 4.13 the vSphere CSI migration is in a hybrid state. Greenfield 4.13 clusters have migration enabled by default while upgraded clusters have it turned off unless explicitly enabled by an administrator (referred to as "opt-in").
This feature tracks the final work items required to enable vSphere CSI migration for all OCP clusters.
More information on the 4.13 vSphere CSI migration is available in the internal FAQ
Finalise vSphere CSI migration for all clusters ensuring that
Regardless of the cluster's state (new or upgraded), which version it is upgrading from, or the status of CSI migration (enabled/disabled), all clusters should have CSI migration enabled.
This feature also includes upgrades checks in 4.12 & 4.13 to ensure that OCP is running on a recommended vSphere version (vSphere 7.0u3L+ or 8.0u2+)
We should make sure that all issues that prevented us from enabling CSI migration by default in 4.13 are resolved. If some of these issues are fixed in vSphere itself we might need to check for a certain vSphere build version before proceeding with the upgrade (from 4.12 or 4.13).
More information on the 4.13 vSphere CSI migration is available in the internal FAQ
Customers who upgraded from 4.12 are unlikely to opt in to migration, so we will have quite a few clusters with migration disabled. Given we will enable it in 4.14 for every cluster, we need to be extra careful that all issues raised are fixed, and set upgrade blockers if needed.
Remove all migration opt-in occurences in the documentation.
We need to make sure that upgraded clusters are running on top of a vsphere version that contains all the required fixes.
Epic Goal*
Remove FeatureSet InTreeVSphereVolumes that we added in 4.13.
Why is this important? (mandatory)
We assume that the CSI Migration will be GA and locked to default in Kubernetes 1.27 / OCP 4.14. Therefore the FeatureSet must be removed.
Scenarios (mandatory)
See https://issues.redhat.com/browse/STOR-1265 for upgrade from 4.13 to 4.14
Dependencies (internal and external) (mandatory)
Same as STOR-1265, just the other way around ("a big revert")
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Description of problem:
The vsphereStorageDriver is deprecated and we should allow cluster admins to remove that field from the Storage object in 4.14. This is the validation rule that prevents removing vsphereStorageDriver: https://github.com/openshift/api/blob/0eef84f63102e9d2dfdb489b18fa22676f2bd0c4/operator/v1/types_storage.go#L42 This was originally put in place to ensure that CSI Migration is not disabled again once it has been enabled. However, in 4.14 there is no way to disable migration, and there is an explicit rule to prevent setting LegacyDeprecatedInTreeDriver. So it should be safe to allow removing the vsphereStorageDriver field in 4.14, as this will not disable migration, and the field will eventually be removed from the API in a future release.
Version-Release number of selected component (if applicable):
4.14.0
How reproducible:
Steps to Reproduce:
1. Set vsphereStorageDriver in the Storage object
2. Try to remove vsphereStorageDriver
Actual results:
* spec: Invalid value: "object": VSphereStorageDriver is required once set
Expected results:
should be allowed
Additional info:
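For reference, a minimal sketch of the field in question on the cluster Storage object (operator.openshift.io/v1); the value shown is the migration-enabled one, as opposed to LegacyDeprecatedInTreeDriver mentioned above:

apiVersion: operator.openshift.io/v1
kind: Storage
metadata:
  name: cluster
spec:
  vsphereStorageDriver: CSIWithMigrationDriver   # once set, the current validation forbids removing this field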
By moving MCO certificate management out of MachineConfigs, certificate rotation can happen at any time, even when pools are paused, and generates no drain or reboot.
Eliminate problems caused by certificate rotations being blocked by paused pools. Keep certificates up to date without disruption to workloads.
Windows MCO has been updated to work with this path.
Having additional MCO metrics is helpful to customers who want to closely monitor the state of their Machines and MachineConfigPools.
Add for each MCP:
- Paused
- Updated
- Updating
- Degraded
- Machinecount
- ReadyMachineCount
- UpdatedMachineCount
- DegradedMachineCount
Creating this to version scope the improvements merged into 4.14. Since those changes were in a story, they need an epic.
Customers would like to have some MachineConfigOperator metrics in Prometheus. For each MCP:
- Paused
- Updated
- Updating
- Degraded
- Machinecount
- ReadyMachineCount
- UpdatedMachineCount
- DegradedMachineCount
These metrics would be really important, as they could show any MachineConfig action (updating, degraded, ...), which could also trigger an alarm with a PrometheusRule. Having a MachineConfig dashboard would also be really useful.
Extend the Workload Partitioning feature to support multi-node clusters.
Customers running RAN workloads on C-RAN Hubs (i.e. multi-node clusters) that want to maximize the cores available to the workloads (DU) should be able to utilize WP to isolate CP processes to reserved cores.
Requirements
A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.
requirement | Notes | isMvp? |
< How will the user interact with this feature? >
< Which users will use this and when will they use it? >
< Is this feature used as part of current user interface? >
< What does the person writing code, testing, documenting need to know? >
< Are there assumptions being made regarding prerequisites and dependencies?>
< Are there assumptions about hardware, software or people resources?>
< Are there specific customer environments that need to be considered (such as working with existing h/w and software)?>
< Are there Upgrade considerations that customers need to account for or that the feature should address on behalf of the customer?>
<Does the Feature introduce data that could be gathered and used for Insights purposes?>
< What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)? >
< What does success look like?>
< Does this feature have doc impact? Possible values are: New Content, Updates to existing content, Release Note, or No Doc Impact>
< If unsure and no Technical Writer is available, please contact Content Strategy. If yes, complete the following.>
< Which other products and versions in our portfolio does this feature impact?>
< What interoperability test scenarios should be factored by the layered product(s)?>
Question | Outcome |
Write a test that exercises known management pods and creates a management pod to verify that it adheres to the CPU Affinity and CPU Shares settings.
Ex:
pgrep kube-apiserver | while read i; do taskset -cp $i; done
DaemonSet and Deployment resource checks seem to flake, need to be resolved.
The check on the TechPreview FeatureSet is no longer needed in the installer; remove the check from the code.
Make validation tests run on all platforms by removing skips.
The original implementation of workload partitioning tried to leverage CRI-O's default behavior to allow full use of CPU sets when no performance profile is supplied by the user, while still being a CPU-partitioned cluster. This works fine for CPU affinity; however, because we don't supply a config and let the default behavior kick in, CRI-O does not alter the CPU shares and gives all pods a CPU share value of 2.
We need to supply a config for CRI-O with an empty string for the CPU set to support both CPU share and CPU affinity behavior when NO performance profile is supplied, so that the `resource.requests`, which get converted to CPU shares, are correctly applied in the default state.
Note, this is not an issue with CPU affinity, that still behaves as expected and when a performance profile is supplied things work as intended as well. The CPU share mismatch is the only issue being identified here.
Create generic validation tests in Origin and Release repo to check that a cluster is correctly configured. E2E tests running in a cpu partitioned cluster should run successfully.
When this image was assembled, these features were not yet completed. Therefore, only the Jira Cards included here are part of this release
The HostedCluster and NodePool specs already have a "pausedUntil" field.
pausedUntil:
  description: 'PausedUntil is a field that can be used to pause reconciliation on a
    resource. Either a date can be provided in RFC3339 format or a boolean. If a date
    is provided: reconciliation is paused on the resource until that date. If the
    boolean true is provided: reconciliation is paused on the resource until the
    field is removed.'
  type: string
This option is currently not exposed in the "hypershift create cluster" command.
In order to support the HCP create/update automation template with ClusterCurator, users should be able to run "hypershift create cluster" with a PausedUntil flag.
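For reference, a minimal sketch of the existing field on a HostedCluster (the NodePool field behaves the same way); values are illustrative.

```yaml
apiVersion: hypershift.openshift.io/v1beta1
kind: HostedCluster
metadata:
  name: example
  namespace: clusters
spec:
  # pause reconciliation until the field is removed (or until an RFC3339 date)
  pausedUntil: "true"
```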
Improve the kubevirt-csi storage plugin features and integration as we make progress towards the GA of a KubeVirt provider for HyperShift.
Infra storage classes made available to guest clusters must support:
Who | What | Reference |
---|---|---|
DEV | Upstream roadmap issue (or individual upstream PRs) | <link to GitHub Issue> |
DEV | Upstream documentation merged | <link to meaningful PR> |
DEV | gap doc updated | <name sheet and cell> |
DEV | Upgrade consideration | <link to upgrade-related test or design doc> |
DEV | CEE/PX summary presentation | label epic with cee-training and add a <link to your support-facing preso> |
QE | Test plans in Polarion | <link or reference to Polarion> |
QE | Automated tests merged | <link or reference to automated tests> |
DOC | Downstream documentation merged | <link to meaningful PR> |
kubevirt-csi should support/apply fsGroup settings in filesystems created by kubevirt-csi
The HyperShift KubeVirt platform only supports guest clusters running 4.14 or greater (due to the kubevirt rhcos image only being delivered in 4.14)
and it also only supports OCP 4.14 and CNV 4.14 for the infra cluster.
Add backend validation on the HostedCluster that validates the parameters are correct before processing the hosted cluster. If these conditions are not met, then report back the error as a condition on the hosted cluster CR
Based on the perf scale team's results, enabling multiqueue when using jumbo frames (MTU >= 9000) can greatly improve throughput, as seen by comparing slides 8 and 10 in this slide deck: https://docs.google.com/presentation/d/1cIm4EcAswVDpuDp-eHVmbB7VodZqQzTYCnx4HCfI9n4/edit#slide=id.g2563dda6aa5_1_68
However, enabling multiqueue with a small MTU causes throughput to crater.
This task involves adding an API option to the KubeVirt platform within the NodePool API, as well as adding a CLI option for enabling multiqueue in the hcp CLI (the new productized CLI).
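A rough sketch of what the NodePool addition might look like; the `networkInterfaceMultiqueue` field name and placement are illustrative and the final API may differ.

```yaml
apiVersion: hypershift.openshift.io/v1beta1
kind: NodePool
metadata:
  name: example-kubevirt
  namespace: clusters
spec:
  platform:
    type: KubeVirt
    kubevirt:
      # illustrative field; only beneficial together with jumbo frames (MTU >= 9000)
      networkInterfaceMultiqueue: Enable
```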
Many customers still predominantly use logs as the main source of data to quickly identify problems. Many issues can also be identified by metrics, but there are some events, such as suspicious IP address activity in security, or runtime system issues such as host errors, where logs are your friend. OpenShift currently only supports defining alerting rules and getting notifications based on metrics. That leaves a big gap in identifying, and being notified about, the previously mentioned events immediately.
As we move the Logging stack towards using Loki (see OBSDA-7), we will be able to use its out-of-the-box capabilities to define alerting rules on logs using LogQL. That approach is very similar to Prometheus' alerting ecosystem and actually gives us the opportunity to reuse Prometheus' Alertmanager to distribute alerts/notifications. For customers, this means they do not need to configure different channels twice, for metrics and logs, but can reuse the same configuration.
For the configuration itself, we need to look into introducing a CRD (similar to the PrometheusRule CRD inside the Prometheus Operator) to allow users with non-admin permissions to configure the rules without changing the central Loki configuration.
Since OpenShift 4.6, application owners can configure alerting rules based on metrics themselves as described in User Workload Monitoring (UWM) enhancement. The rules are defined as PrometheusRule resources and can be based on platform and/or application metrics.
To expand the alerting capabilities on logs as an observability signal, cluster admins and application owners should be able to configure alerting rules as described in the Loki Rules docs and in the Loki Operator Ruler upstream enhancement.
The AlertingRule CRD fulfills the requirement to define alerting rules for Loki, similar to PrometheusRule.
The RulerConfig CRD fulfills the requirement to connect the Loki Ruler component to a list of Prometheus Alertmanager hosts to notify on firing alerts.
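A minimal AlertingRule sketch, assuming the upstream Loki Operator CRD layout (namespace, tenant, and LogQL expression are illustrative; exact fields may differ by version):

```yaml
apiVersion: loki.grafana.com/v1
kind: AlertingRule
metadata:
  name: app-log-alerts
  namespace: my-app
spec:
  tenantID: application
  groups:
    - name: errors
      interval: 1m
      rules:
        - alert: HighErrorRate
          # LogQL expression evaluated by the Loki Ruler
          expr: |
            sum(rate({kubernetes_namespace_name="my-app"} |= "error" [5m])) > 10
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: High rate of error lines in my-app logs
```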
“As a dev user, I want to use the silences as admins do, so I can get the same features”
Given a dev user logged in to the console and using a developer perspective
When the user navigates to the observe section
Then the user can see a silences tab that has the same features as the admin but restricted only to the current selected namespace
This feature aims to enhance observability and user experience for customers of self-managed Hosted Control Planes (HCP) using ACM/MCE by leveraging the existing observability feature stack (e.g., the pluggable dashboard console feature in the OCP console as the MVP in case ACM is not in use). This approach ensures improved monitoring capabilities and aligns with the tenancy model of User Workload Monitoring (UWM). It also strongly encourages an upsell from MCE to ACM to access those features, and provides a best-practice, validated pattern for customers willing to build it on their own (with a lot of effort vs. ACM).
Users, particularly SRE teams (the cluster service provider persona), will gain enhanced visibility into the health and performance of their HCPs through a customizable monitoring dashboard. This dashboard will provide critical metrics and alerts, aiding in proactive management and troubleshooting. Existing observability features in ACM will be expanded to include these capabilities.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
---|---|
Self-managed, managed, or both | Self-managed (but reusable in managed with xCM) |
Classic (standalone cluster) | N/A |
Hosted control planes | Applicable |
Multi node, Compact (three node), or Single node (SNO), or all | N/A |
Connected / Restricted Network | Applicable |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | Applicable |
Operator compatibility | Observability Operator (ObO) |
Backport needed (list applicable versions) | N/A |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | OpenShift Console, dynamic plugin |
Other (please specify) | N/A |
The usage of UWM for HCP metrics on the management cluster has a few drawbacks:
These issues would be resolved by using ObO, which is currently being productized.
Other questions to answer:
This feature should leverage existing functionality when possible to align with other OCP observability efforts (e.g., pluggable dashboard console feature in the OCP console) to provide enhanced observability for HCP users. It should align with the existing UWM tenancy model and address immediate monitoring needs while considering future improvements via the Observability Operator.
Customers opting for full metrics export must be aware of the potential impact on the monitoring stack. Clear documentation and guidelines will be provided to manage configuration and alerts effectively.
Documentation will include setup guides, configuration examples, and troubleshooting tips. It will also link to existing ACM observability documentation for comprehensive coverage.
As a hosted cluster deployer I want to have the HyperShift Operator:
so that:
https://docs.google.com/document/d/1UwHwkL-YtrRJYm-A922IeW3wvKEgCR-epeeeh3CBOGs/edit
configMap example: https://github.com/openshift/console-dashboards-plugin/blob/main/docs/add-datasource.md
tldr: three basic claims, the rest is explanation and one example
While bugs are an important metric, fixing bugs is different than investing in maintainability and debuggability. Investing in fixing bugs will help alleviate immediate problems, but it doesn't improve the ability to address future problems. You (may) get a code base with fewer bugs, but when you add a new feature, it will still be hard to debug problems and interactions. This pushes a code base towards stagnation, where it gets harder and harder to add features.
One alternative is to ask teams to produce ideas for how they would improve future maintainability and debuggability instead of focusing on immediate bugs. This would produce designs that make problem determination, bug resolution, and future feature additions faster over time.
I have a concrete example of one such outcome of focusing on bugs vs. quality. We have resolved many bugs about communication failures with ingress by finding problems with point-to-point network communication. We have fixed the individual bugs, but have not improved the code for future debugging. In so doing, we chase many hard-to-diagnose problems across the stack. The alternative is to create a point-to-point network connectivity capability. This would immediately improve bug resolution and stability (detection) for kuryr, ovs, legacy sdn, network-edge, kube-apiserver, openshift-apiserver, authentication, and console. Bug fixing does not produce the same impact.
We need more investment in our future selves. Saying, "teams should reserve this" doesn't seem to be universally effective. Perhaps an approach that directly asks for designs and impacts and then follows up by placing the items directly in planning and prioritizing against PM feature requests would give teams the confidence to invest in these areas and give broad exposure to systemic problems.
Relevant links:
Epic Template descriptions and documentation.
Write a DNS local endpoint test that uses TCP in the origin repo, just like the previous one done in NE-1068.
Go 1.16 added the new embed directive to Go. This embed directive lets you natively (and trivially) compile your binary with static asset files.
The current go-bindata dependency that's used in both the Ingress and DNS operators for YAML asset compilation could be dropped in exchange for the new go embed functionality. This would reduce our dependency count, remove the need for `bindata.go` (which is version controlled and constantly updated), and make our code easier to read. This switch would also reduce the overall lines of code in our repos.
Note that this may be applicable to OCP 4.8 if and when images are built with go 1.16.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
Use t.Run to run test cases for table-driven unit tests, for consistency.
Refactor Test_desiredLoadBalancerService to match our unit test standards, remove extraneous test cases, and make it more readable/maintainable.
Unit test names should be formatted as Test_<function name>, so that the scope of the function (private or public) is preserved.
Test_desiredHttpErrorCodeConfigMap contains a section with dead code when checking for `expect == nil || actual == nil`. Clean this up.
Also replace Ruby-style #{} syntax for string interpolation with Go string formats.
Hypershift-provisioned clusters, regardless of the cloud provider, support the OLM-managed operator integration outlined in OCPBU-559 and OCPBU-560.
There is no degradation in capability or coverage for OLM-managed operators that support short-lived token authentication on clusters lifecycled via Hypershift.
Currently, Hypershift lacks support for CCO.
Currently, Hypershift will be limited to deploying clusters in which the cluster core operators are leveraging short-lived token authentication exclusively.
If we are successful, no special documentation should be needed for this.
Outcome Overview
Operators on guest clusters can take advantage of the new tokenized authentication workflow that depends on CCO.
Success Criteria
CCO is included in HyperShift and its footprint is minimal while meeting the above outcome.
Expected Results (what, how, when)
Post Completion Review – Actual Results
After completing the work (as determined by the "when" in Expected Results above), list the actual results observed / measured during Post Completion review(s).
Every guest cluster should have a running CCO pod with its kubeconfig attached to it.
Enhancement doc: https://github.com/openshift/enhancements/blob/master/enhancements/cloud-integration/tokenized-auth-enablement-operators-on-cloud.md
CCO currently deploys the pod identity webhook as part of its deployment. As part of the effort to reduce the footprint of CCO, the deployment of this pod should be conditional on the infrastructure.
This epic tracks work related to designing how to include CCO into HyperShift in order for operators on guest clusters to leverage the STS UX defined by this project.
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | Self-managed (though could be managed by partner) |
Classic (standalone cluster) | Classic |
Hosted control planes | Future |
Multi node, Compact (three node), or Single node (SNO), or all | SNO |
Connected / Restricted Network | All – connected and disconnected, air-gapped |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | x86_x64 |
Operator compatibility | TBD |
Backport needed (list applicable versions) | N/A |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | N/A |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
In OCP 4.14, we provided the ability to pass cluster configs to the agent-based installer (AGI) after booting image (AGENT-559).
In OCP 4.15, we published in upstream how you can use the Appliance Image builder utility to build disk images using Agent-based Installer to enable appliance installations — see https://github.com/openshift/appliance/blob/master/docs/user-guide.md. This is “Dev Preview”. The appliance tooling is currently supported and maintained by ecosystem engineering.
In OCP 4.16, this Appliance image builder utility will be bundled and shipped and will be available at registry.redhat.io (we are “productizing” this part). In the near term, we’ll document this via KCS and not official docs (to minimize confusion about documenting a feature that only impacts a small subset of appliance partners).
This appliance tool combines 2 features:
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
Enable installation and lifecycle support of OpenShift 4 on Oracle Cloud Infrastructure (OCI) Bare metal
Use scenarios
Why is this important
Requirement | Notes |
---|---|
OCI Bare Metal Shapes must be certified with RHEL | They must also work with RHCOS (see iSCSI boot notes), as OCI BM standard shapes require RHCOS iSCSI to boot (certified shapes: https://catalog.redhat.com/cloud/detail/249287) |
Successfully passing the OpenShift Provider conformance testing – this should be fairly similar to the results from the OCI VM test results. | Oracle will do these tests. |
Updating Oracle Terraform files | |
Making the Assisted Installer modifications needed to address the CCM changes and surface the necessary configurations. | Support Oracle Cloud in Assisted-Installer CI: |
RFEs:
Any bare metal Shape to be supported with OCP has to be certified with RHEL.
From the certified Shapes, those that have local disks will be supported. This is due to the current lack of support in RHCOS for the iSCSI boot feature. OCPSTRAT-749 is tracking adding this support and removing this restriction in the future.
As of Aug 2023 this excludes at least all the Standard shapes, BM.GPU2.2 and BM.GPU3.8, from the published list at: https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm#baremetalshapes
Please describe what this feature is going to do.
Please describe what conditions must be met in order to mark this feature as "done".
If the answer is "yes", please make sure to check the corresponding option.
Configure a periodic job (twice a week) to run tests on OCI
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
This is part of the overall multi-release Composable OpenShift effort (OCPPLAN-9638), which is being delivered in multiple phases:
Phase 1 (OpenShift 4.11): OCPPLAN-7589 Provide a way with CVO to allow disabling and enabling of operators
Phase 2 (OpenShift 4.12): OCPPLAN-7589 Provide a way with CVO to allow disabling and enabling of operators
Phase 3 (OpenShift 4.13): OCPBU-117
Phase 4 (OpenShift 4.14): OCPSTRAT-36 (formerly OCPBU-236)
Phase 5 (OpenShift 4.15): OCPSTRAT-421 (formerly OCPBU-519)
Phase 6 (OpenShift 4.16): OCPSTRAT-731
Phase 7 (OpenShift 4.17): OCPSTRAT-1308
Questions to be addressed:
Once the MCO team is done moving the node-ca functionality to the MCO (MCO-499), we need to remove the node-ca from CIRO.
ACCEPTANCE CRITERIA
With this feature, MCE will be an additional operator that can be enabled at cluster creation time, both in the AI SaaS and in disconnected installations with the Agent-based installer.
Currently 4 operators have been enabled for the Assisted Service SaaS create cluster flow: Local Storage Operator (LSO), OpenShift Virtualization (CNV), OpenShift Data Foundation (ODF), Logical Volume Manager (LVM)
The Agent-based installer doesn't leverage this framework yet.
When a user performs the creation of a new OpenShift cluster with the Assisted Installer (SaaS) or with the Agent-based installer (disconnected), provide the option to enable the multicluster engine (MCE) operator.
The cluster deployed can add itself to be managed by MCE.
Deploying an on-prem cluster 0 easily is a key operation for the rest of the OpenShift infrastructure.
While MCE/ACM are strategic in the lifecycle management of OpenShift, including the provisioning of all the clusters, deploying the first cluster, where MCE/ACM are hosted along with other tools supporting the rest of the clusters (GitOps, Quay, log centralization, monitoring...), must be easy and have a high success rate.
The Assisted Installer and the Agent-based installer cover this gap and must present the option to enable MCE to keep making progress in this direction.
MCE engineering is responsible for adding the appropriate definition as an olm-operator-plugin.
See https://github.com/openshift/assisted-service/blob/master/docs/dev/olm-operator-plugins.md for more details
This feature will follow up OCPBU-186 (Image mirroring by tags).
OCPBU-186 implemented the new ImageDigestMirrorSet and ImageTagMirrorSet APIs and their rollout through the MCO.
This feature will update the components using ImageContentSourcePolicy to use ImageDigestMirrorSet.
The list of the components: https://docs.google.com/document/d/11FJPpIYAQLj5EcYiJtbi_bNkAcJa2hCLV63WvoDsrcQ/edit?usp=sharing.
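For reference, a minimal ImageDigestMirrorSet that replaces a typical ICSP; repository and mirror names are illustrative.

```yaml
apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata:
  name: example-idms
spec:
  imageDigestMirrors:
    # pull-by-digest references to the source repository are redirected to the mirror
    - source: quay.io/openshift-release-dev/ocp-release
      mirrors:
        - mirror.registry.example.com/ocp-release
```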
Migrate OpenShift Components to use the new Image Digest Mirror Set (IDMS)
This doc lists the OpenShift components that currently use ICSP: https://docs.google.com/document/d/11FJPpIYAQLj5EcYiJtbi_bNkAcJa2hCLV63WvoDsrcQ/edit?usp=sharing
Plan for ImageDigestMirrorSet Rollout :
Epic: https://issues.redhat.com/browse/OCPNODE-521
4.13: Enable ImageDigestMirrorSet, both ICSP and ImageDigestMirrorSet objects are functional
4.14: Update OpenShift components to use IDMS
4.17: Remove support for ICSP within MCO
As an OpenShift developer, I want an --idms-file flag so that I can fetch image info from an alternative mirror when --icsp-file gets deprecated.
As an <OpenShift developer> trying to <mirror images for a disconnected environment using the oc command>, I want <the output to give an example ImageDigestMirrorSet manifest>, because ImageContentSourcePolicy will be replaced by the CRDs implemented in OCPBU-186 (Image mirroring by tags).
The ImageContentSourcePolicy manifest snippet in the command output will be updated to an ImageDigestMirrorSet manifest.
Workloads that use the `oc adm release mirror` command will be impacted.
This feature aims to comprehensively refactor and standardize various components across HCP, ensuring consistency, maintainability, and reliability. The overarching goal is to increase customer satisfaction by increasing speed to market and to save engineering budget by reducing incidents/bugs. This will be achieved by reducing technical debt, improving code quality, and simplifying the developer experience across multiple areas, including CLI consistency, NodePool upgrade mechanisms, networking flows, and more. By addressing these areas holistically, the project aims to create a more sustainable and scalable codebase that is easier to maintain and extend.
Over time, the HyperShift project has grown organically, leading to areas of redundancy, inconsistency, and technical debt. This comprehensive refactor and standardization effort is a response to these challenges, aiming to improve the project's overall health and sustainability. By addressing multiple components in a coordinated way, the goal is to set a solid foundation for future growth and development.
Ensure all relevant project documentation is updated to reflect the refactored components, new abstractions, and standardized workflows.
This overarching feature is designed to unify and streamline the HCP project, delivering a more consistent, maintainable, and reliable platform for developers, operators, and users.
Focus on the general modernization of the codebase, addressing technical debt, and ensuring that the platform is easy to maintain and extend.
DoD:
Delete conversion webhook https://github.com/openshift/hypershift/pull/2267
This needs to be backward compatible for IBM.
Review IBM PRs: * https://github.com/openshift/hypershift/pull/1939
Improve the consistency and reliability of APIs by enforcing immutability and clarifying service publish strategy support.
We need to clarify what's supported.
https://redhat-internal.slack.com/archives/C04EUL1DRHC/p1685371010883569?thread_ts=1685363583.881959&cid=C04EUL1DRHC
Then lock down the API accordingly.
Due to low customer interest in using OpenShift on Alibaba Cloud, we have decided to deprecate and then remove IPI support for Alibaba Cloud.
4.14: Announcement
4.15: Archive code
Add a deprecation warning in the installer code for anyone trying to install Alibaba via IPI.
The deprecation of support for the Alibaba Cloud platform is being postponed by one release, so we need to revert SPLAT-1094.
USER STORY:
As a user of the installer binary, I want to be warned that Alibaba support will be deprecated in 4.15, so that I'm prevented from creating clusters that will soon be unsupported.
DESCRIPTION:
Alibaba support will be decommissioned from both IPI and UPI starting in 4.15. We want to warn users of the 4.14 installer binary who pick 'alibabacloud' in the list of providers.
ACCEPTANCE CRITERIA:
A warning message is displayed after choosing 'alibabacloud'.
ENGINEERING DETAILS:
The storage operators need to be automatically restarted after the certificates are renewed.
From OCP doc "The service CA certificate, which issues the service certificates, is valid for 26 months and is automatically rotated when there is less than 13 months validity left."
Since OCP now offers an 18-month lifecycle per release, the storage operator pods need to be automatically restarted after the certificates are renewed.
The storage operators will be transparently restarted. The customer benefit should be transparent; it avoids manual restarts of the storage operators.
The administrator should not need to restart the storage operators when certificates are renewed.
This should apply to all relevant operators with a consistent experience.
As an administrator I want the storage operators to be automatically restarted when certificates are renewed.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
This feature request is triggered by the new extended OCP lifecycle. We are moving from 12 to 18 months support per release.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
No doc is required
This feature only covers storage, but the same behavior should be applied to every relevant component.
The pod `csi-snapshot-webhook` mounts the secret:
```
$ cat assets/webhook/deployment.yaml
kind: Deployment
metadata:
  name: csi-snapshot-webhook
...
spec:
  template:
    spec:
      containers:
      - ...
        volumeMounts:
        - name: certs
          mountPath: /etc/snapshot-validation-webhook/certs
      volumes:
      - name: certs
        secret:
          secretName: csi-snapshot-webhook-secret
```
Hence, if the secret is updated (e.g. as a result of CA cert update), the Pod must be restarted.
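One common pattern for achieving this across the pods listed in this epic (a sketch only, not necessarily how each operator will implement it) is to stamp a hash of the mounted secret onto the pod template, so that a certificate rotation rolls the Deployment automatically; the annotation key below is hypothetical.

```yaml
spec:
  template:
    metadata:
      annotations:
        # hypothetical annotation: the operator computes a hash over the secret's
        # data and updates it on rotation, which triggers a rolling restart
        operator.openshift.io/dep-secret-hash: "<sha256 of csi-snapshot-webhook-secret data>"
```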
1. The pod `vmware-vsphere-csi-driver-controller` mounts the secret:
$ oc get po -n openshift-cluster-csi-drivers vmware-vsphere-csi-driver-controller-8467ddf4c-5lgd8 -o yaml
...
  containers:
  - name: driver-kube-rbac-proxy
  - name: provisioner-kube-rbac-proxy
  - name: attacher-kube-rbac-proxy
  - name: resizer-kube-rbac-proxy
  - name: snapshotter-kube-rbac-proxy
  - name: syncer-kube-rbac-proxy
    volumeMounts:
    - mountPath: /etc/tls/private
      name: metrics-serving-cert
  volumes:
  - name: metrics-serving-cert
    secret:
      defaultMode: 420
      secretName: vmware-vsphere-csi-driver-controller-metrics-serving-cert
Hence, if the secret is updated (e.g. as a result of CA cert update), the Pod must be restarted.
2. Similarly, the pod `vmware-vsphere-csi-driver-webhook` mounts another secret:
$ oc get po -n openshift-cluster-csi-drivers vmware-vsphere-csi-driver-webhook-c557dbf54-crrxp -o yaml
...
  containers:
  - name: vsphere-webhook
    volumeMounts:
    - mountPath: /etc/webhook/certs
      name: certs
  volumes:
  - name: certs
    secret:
      defaultMode: 420
      secretName: vmware-vsphere-csi-driver-webhook-secret
Again, if the secret is updated (e.g. as a result of CA cert update), the Pod must be restarted.
The pod `shared-resource-csi-driver-node` mounts the secret:
$ cat assets/node.yaml
...
  containers:
  - name: hostpath
    volumeMounts:
    - mountPath: /etc/secrets
      name: shared-resource-csi-driver-node-metrics-serving-cert
  volumes:
  - name: shared-resource-csi-driver-node-metrics-serving-cert
    secret:
      defaultMode: 420
      secretName: shared-resource-csi-driver-node-metrics-serving-cert
Hence, if the secret is updated (e.g. as a result of CA cert update), the Pod must be restarted
The pod `gcp-pd-csi-driver-controller` mounts the secret:
$ oc get po -n openshift-cluster-csi-drivers gcp-pd-csi-driver-controller-5787b9c477-q78qx -o yaml
...
  - name: provisioner-kube-rbac-proxy
    ...
    volumeMounts:
    - mountPath: /etc/tls/private
      name: metrics-serving-cert
  volumes:
  - name: metrics-serving-cert
    secret:
      secretName: gcp-pd-csi-driver-controller-metrics-serving-cert
Hence, if the secret is updated (e.g. as a result of CA cert update), the Pod must be restarted
The pod `openstack-manila-csi-controllerplugin` mounts the secret:
$ cat assets/controller.yaml
...
  containers:
  - name: provisioner-kube-rbac-proxy
    volumeMounts:
    - mountPath: /etc/tls/private
      name: metrics-serving-cert
  volumes:
  - name: metrics-serving-cert
    secret:
      secretName: manila-csi-driver-controller-metrics-serving-cert
Hence, if the secret is updated (e.g. as a result of CA cert update), the Pod must be restarted
The pod `openstack-cinder-csi-driver-controller` mounts the secret:
$ oc get po/openstack-cinder-csi-driver-controller-689b897df8-cx5hl -oyaml | yq .spec.volumes
- emptyDir: {}
  name: socket-dir
- name: secret-cinderplugin
  secret:
    defaultMode: 420
    items:
    - key: clouds.yaml
      path: clouds.yaml
    secretName: openstack-cloud-credentials
- configMap:
    defaultMode: 420
    items:
    - key: cloud.conf
      path: cloud.conf
    name: cloud-conf
  name: config-cinderplugin
- configMap:
    defaultMode: 420
    items:
    - key: ca-bundle.pem
      path: ca-bundle.pem
    name: cloud-provider-config
    optional: true
  name: cacert
- name: metrics-serving-cert
  secret:
    defaultMode: 420
    secretName: openstack-cinder-csi-driver-controller-metrics-serving-cert
- configMap:
    defaultMode: 420
    items:
    - key: ca-bundle.crt
      path: tls-ca-bundle.pem
    name: openstack-cinder-csi-driver-trusted-ca-bundle
  name: non-standard-root-system-trust-ca-bundle
- name: kube-api-access-hz62v
  projected:
    defaultMode: 420
    sources:
    - serviceAccountToken:
        expirationSeconds: 3607
        path: token
    - configMap:
        items:
        - key: ca.crt
          path: ca.crt
        name: kube-root-ca.crt
    - downwardAPI:
        items:
        - fieldRef:
            apiVersion: v1
            fieldPath: metadata.namespace
          path: namespace
    - configMap:
        items:
        - key: service-ca.crt
          path: service-ca.crt
        name: openshift-service-ca.crt
Hence, if the secret is updated (e.g. as a result of CA cert update), the Pod must be restarted
The pod `shared-resource-csi-driver-webhook` mounts the secret:
$ cat assets/webhook/deployment.yaml
kind: Deployment
metadata:
  name: shared-resource-csi-driver-webhook
...
spec:
  template:
    spec:
      containers:
      - ...
        volumeMounts:
        - mountPath: /etc/secrets/shared-resource-csi-driver-webhook-serving-cert/
          name: shared-resource-csi-driver-webhook-serving-cert
      volumes:
      - name: shared-resource-csi-driver-webhook-serving-cert
        secret:
          secretName: shared-resource-csi-driver-webhook-serving-cert
Hence, if the secret is updated (e.g. as a result of CA cert update), the Pod must be restarted.
The pod `alibaba-disk-csi-driver-controller` mounts the secret:
$ cat assets/controller.yaml
...
  containers:
  - name: provisioner-kube-rbac-proxy
    volumeMounts:
    - mountPath: /etc/tls/private
      name: metrics-serving-cert
  volumes:
  - name: metrics-serving-cert
    secret:
      secretName: alibaba-disk-csi-driver-controller-metrics-serving-cert
Hence, if the secret is updated (e.g. as a result of CA cert update), the Pod must be restarted
The pod `aws-ebs-csi-driver-controller` mounts the secret:
$ oc get po -n openshift-cluster-csi-drivers aws-ebs-csi-driver-controller-559f74d7cd-5tk4p -o yaml
...
  - name: driver-kube-rbac-proxy
  - name: provisioner-kube-rbac-proxy
  - name: attacher-kube-rbac-proxy
  - name: resizer-kube-rbac-proxy
  - name: snapshotter-kube-rbac-proxy
    volumeMounts:
    - mountPath: /etc/tls/private
      name: metrics-serving-cert
  volumes:
  - name: metrics-serving-cert
    secret:
      defaultMode: 420
      secretName: aws-ebs-csi-driver-controller-metrics-serving-cert
Hence, if the secret is updated (e.g. as a result of CA cert update), the Pod must be restarted
The pod `ibm-powervs-block-csi-driver-controller` mounts the secret:
$ cat assets/controller.yaml
...
  containers:
  - name: provisioner-kube-rbac-proxy
    volumeMounts:
    - mountPath: /etc/tls/private
      name: metrics-serving-cert
  volumes:
  - name: metrics-serving-cert
    secret:
      secretName: ibm-powervs-block-csi-driver-controller-metrics-serving-cert
Hence, if the secret is updated (e.g. as a result of CA cert update), the Pod must be restarted
The pod `azure-file-csi-driver-controller` mounts the secret:
$ oc get po -n openshift-cluster-csi-drivers azure-file-csi-driver-controller-cf84d5cf5-pzbjn -o yaml
...
  containers:
  - name: driver-kube-rbac-proxy
    volumeMounts:
    - mountPath: /etc/tls/private
      name: metrics-serving-cert
  volumes:
  - ...
    secret:
      defaultMode: 420
      secretName: azure-file-csi-driver-controller-metrics-serving-cert
Hence, if the secret is updated (e.g. as a result of CA cert update), the Pod must be restarted
Track goals/requirements for self-managed GA of Hosted control planes on AWS using the AWS Provider.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Questions to be addressed:
Today, the upstream and more complete documentation of HyperShift lives at https://hypershift-docs.netlify.app/.
However, the product documentation today lives under https://access.redhat.com/login?redirectTo=https%3A%2F%2Faccess.redhat.com%2Fdocumentation%2Fen-us%2Fred_hat_advanced_cluster_management_for_kubernetes%2F2.6%2Fhtml%2Fmulticluster_engine%2Fmulticluster_engine_overview%23hosted-control-planes-intro
The goal of this Epic is to extract important docs and establish parity between what's documented and possible upstream and product documentation.
Multiple consumers have not realized that a newer version of the CPO (spec.release) is not guaranteed to work with an older HO.
This is stated here https://hypershift-docs.netlify.app/reference/versioning-support/
but empirical evidence, like the OCM integration, tells us this is not enough.
We already deploy a CM in the HO namespace with the HC supported versions.
Additionally, we can add an image label with the latest HC version supported by the operator so you can quickly docker inspect...
Console enhancements based on customer RFEs that improve customer user experience.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Questions to be addressed:
Based on https://issues.redhat.com/browse/RFE-3775 we should be extending our proxy package timeout to match the browser's timeout, which is 5 minutes.
AC: Bump the 30-second timeout in the proxy pkg to 5 minutes.
For the console, we would like to have a way for customers to send direct feedback about features like multi cluster.
Acceptance criteria:
Testing instructions:
According to security, it is important to disable publicly available content from the OpenShift Web Console, which is served via `/opt/bridge/bin/bridge --public-dir=/opt/bridge/static --config=/var/console-config` in the console pod (openshift-console namespace).
The folder /opt/bridge/static and its files are publicly available without authentication.
The purpose of this RFE is to disable unauthenticated access to the static assets:
https://console-openshift-console.apps.example.com/static/assets/
https://console-openshift-console.apps.example.com/static/
Follow on to CONSOLE-2976
Based on the API changes for MCP, we need to check for the entry with the `kube-apiserver-to-kubelet-signer` value for the `subject` key in the `status.certExpirys` array. For that entry we will render the `expiry` value, which is in UTC format, as a timestamp.
AC:
We currently implement fuzzy search in the console (project search, search resources page / list view pages). While we don't want to change the current search behavior, we would like to add some exact search capability for users that have similarly named resources where fuzzy search doesn't help narrow down the list of resources in a list view/search page.
RFE: https://issues.redhat.com/browse/RFE-3013
Customer bug: https://issues.redhat.com/browse/OCPBUGS-2603
Acceptance criteria:
all search pages in console implement
Design
Explore help text for search inputs - this should be shown at all times and not hidden in popover
Extend the Installer's capabilities when deploying OCP on a GCP shared VPC (XPN), adding support for BYO hosted zones and removing the SA requirements in the bootstrap process.
While deploying OpenShift to a shared VPC (XPN) in GCP, the user can bring their own DNS zone where to create the required records for the API server and Ingress and no additional SA will be required to bootstrap the cluster.
The user can provide an existing DNS zone when deploying OpenShift to a shared VPC (XPN) in GCP, which will be used to host the required DNS records for the API server and Ingress. At the same time, today's SA requirements will be removed.
While adding support for shared VPC (XPN) deployments in GCP, the BYO hosted zone capability was removed (CORS-2474) due to multiple issues found during QE validation of the feature. At that time there was no evidence from customers/users that this was required for the shared VPC use case, so the capability was removed in order to declare the feature GA.
We now have evidence from this specific use case being required by users.
Documentation about using this capability while deploying OpenShift to a shared VPC will be required.
The GCP bootstrap process creates a service account with the role roles/storage.admin. The role is required so that the service account can create a bucket to hold the bootstrap ignition file contents. As a security request from a customer, the service account created during this process can be removed. This means that not only will the service account, private key, and role not be created, but the bucket containing the bootstrap ignition file contents will also not be created in Terraform.
As a (user persona), I want to be able to:
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
This is a followup to https://issues.redhat.com/browse/OPNET-13. In that epic we implemented limited support for dual stack on VSphere, but due to limitations in upstream Kubernetes we were not able to support all of the use cases we do on baremetal. This epic is to track our work up and downstream to finish the dual stack implementation.
Currently, o/installer only allows primary-v6 installations for baremetal and not for vSphere (https://github.com/openshift/installer/blob/release-4.13/pkg/types/validation/installconfig.go#L241). We need to change it so that such a topology is also allowed on vSphere.
With the changes being behind an alpha feature, we need to make sure the CCM is running with the feature gate enabled.
Relevant manifest for vSphere - https://github.com/openshift/cluster-cloud-controller-manager-operator/blob/master/pkg/cloud/vsphere/assets/cloud-controller-manager-deployment.yaml
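For context, a primary-v6 dual-stack networking stanza in install-config.yaml looks roughly like this (all CIDRs are illustrative):

```yaml
networking:
  networkType: OVNKubernetes
  # IPv6 listed first makes it the primary address family
  machineNetwork:
    - cidr: fd65:a1a8:60ad::/48
    - cidr: 192.168.100.0/24
  clusterNetwork:
    - cidr: fd01::/48
      hostPrefix: 64
    - cidr: 10.128.0.0/14
      hostPrefix: 23
  serviceNetwork:
    - fd02::/112
    - 172.30.0.0/16
```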
The Assisted Installer is used to help streamline and improve the install experience of OpenShift UPI. Given the install footprint of OpenShift on IBM Power and IBM zSystems, we would like to bring the Assisted Installer experience to those platforms and ease the installation experience.
Full support of the Assisted Installer for use by IBM Power and IBM zSystems
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. Initial completion during Refinement status.
Which other projects and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
As a multi-arch development engineer, I would like to evaluate if the assisted installer is a good fit for simplifying UPI deployments on Power and Z.
Acceptance Criteria
Description of the problem:
How reproducible:
100%
Steps to reproduce:
1. install 2 clusters with power and z CPU architectures and check the feature usage dashboard in the elastic
Actual results:
power and z features are not displayed in the feature usage dashboard in the elastic
Expected results:
see the power and z features in the feature usage dashboard in the elastic
After doing more tests on staging for Power, I have found that the cluster-managed network does not work for Power. It uses platform.baremetal to define the API VIP / Ingress VIP, and most of the installations have failed at the last step, finalizing. After more digging, I found that the machine-api operator is not able to start successfully and stays in the "Operator is initializing" state. Here is the list of pods with errors:
openshift-kube-controller-manager installer-5-master-1 0/1 Error 0 25m
openshift-kube-controller-manager installer-6-master-2 0/1 Error 0 17m
openshift-machine-api ironic-proxy-kgm9g 0/1 CreateContainerError 0 32m
openshift-machine-api ironic-proxy-nc2lz 0/1 CreateContainerError 0 8m37s
openshift-machine-api ironic-proxy-pp92t 0/1 CreateContainerError 0 32m
openshift-machine-api metal3-69b945c7ff-45hqn 1/5 CreateContainerError 0 33m
openshift-machine-api metal3-image-customization-7f6c8978cf-lxbj7 0/1 CreateContainerError 0 32m
The messages from the failed pod ironic-proxy-nc2lz:
Normal Pulled 11m kubelet Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4f84fd895186b28af912eea42aba1276dec98c814a79310c833202960cf05407" in 1.29310959s (1.293135461s including waiting)
Warning Failed 11m kubelet Error: container create failed: time="2023-04-06T15:16:19Z" level=error msg="runc create failed: unable to start container process: exec: \"/bin/runironic-proxy\": stat /bin/runironic-proxy: no such file or directory"
Similar errors occur for the other failed pods.
The interesting thing is that some of the installations completed successfully in AI, but these pods are still in an error state.
So I asked the AI team to turn off cluster-managed network support for Power.
We're currently on etcd 3.5.6; since then there has been at least one newer release. This epic description tracks the changes that we need to pay attention to:
Golang 1.17 update
In 3.5.7 etcd was moved to 1.17 to address some vulnerabilities:
https://github.com/etcd-io/etcd/blob/main/CHANGELOG/CHANGELOG-3.5.md#go
We need to update our definitions in the release repo to match this and test what impact it has.
EDIT: now moving onto 1.19 directly: https://github.com/etcd-io/etcd/pull/15337
WAL fix carry
3.5.6 had a nasty WAL bug that was hit by some customers, fixed with https://github.com/etcd-io/etcd/pull/15069
Due to the Golang upgrade we carried that patch through OCPBUGS-5458
When we upgrade we need to ensure the commits are properly handled and ordered with this carry.
IPv6 Formatting
There were some comparison issues with same IPv6 addresses having different formats. This was fixed in https://github.com/etcd-io/etcd/pull/15187 and we need to test what impact this has on our ipv6 based SKUs.
serializable memberlist
This is a carry we have had for some time: https://github.com/openshift/etcd/commit/26d7d842f6fb968e55fa5dbbd21bd6e4ea4ace50
This is now officially fixed (slightly different) with the options pattern in: https://github.com/etcd-io/etcd/pull/15261
We need to drop the carry patch and take the upstream version when rebasing.
Etcd 3.5.8 has been released so we can now rebase openshift/etcd to that version
https://github.com/etcd-io/etcd/releases/tag/v3.5.8
The goal of this initiative is to help boost adoption of OpenShift on ppc64le. This can be further broken down into several key objectives.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. Initial completion during Refinement status.
Which other projects and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Epic Goal
Running doc to describe terminologies and concepts which are specific to Power VS - https://docs.google.com/document/d/1Kgezv21VsixDyYcbfvxZxKNwszRK6GYKBiTTpEUubqw/edit?usp=sharing
Now, with the recent changes to the CSI driver code, multipath has become mandatory on the OCP nodes; hence, the installer needs to generate the additional multipath machine config manifests to make the CSI driver work in the Power VS environment.
https://github.com/openshift/ibm-powervs-block-csi-driver/pull/33
https://github.com/openshift/ibm-powervs-block-csi-driver/pull/36
Manifest information for the multipath configuration - https://docs.openshift.com/container-platform/4.13/post_installation_configuration/machine-configuration-tasks.html#rhcos-enabling-multipath-day-2_post-install-machine-configuration-tasks
As part of adding CPMSO support (https://issues.redhat.com/browse/MULTIARCH-3667), we need to update the MAPI CR to add the ability to update and delete load balancer pool members.
This feature aims to enhance and clarify the functionalities of the Hypershift CLI. It was initially developed as a developer tool, but as its purpose evolved, a mix of supported and unsupported features were included. This has caused confusion for users who attempt to utilize unsupported functionalities. The goal is to clearly define the boundaries of what is possible and what is supported by the product.
Users should be able to effectively and efficiently use the Hypershift CLI with a clear understanding of what features are supported and what are not. This should reduce confusion and complications when utilizing the tool.
Clear differentiation between supported and unsupported functionalities within the Hypershift CLI.
Improved documentation outlining the supported CLI options.
Consistency between the Hypershift CLI and the quickstart guide on the UI.
Security, reliability, performance, maintainability, scalability, and usability must not be compromised while implementing these changes.
A developer uses the hypershift install command and only supported features are executed.
A user attempts to create a cluster using hypershift cluster create, and the command defaults to a compatible release image.
What is the most efficient method for differentiating supported and unsupported features within the Hypershift CLI?
What changes need to be made to the documentation to clearly outline supported CLI options?
Changing the fundamental functionality of the Hypershift CLI.
Adding additional features beyond the scope of addressing the current issues.
The Hypershift CLI started as a developer tool but evolved to include a mix of supported and unsupported features. This has led to confusion among users and potential complications when using the tool. This feature aims to clearly define what is and isn't supported by the product.
Customers should be educated about the changes to the Hypershift CLI and its intended use. Clear communication about supported and unsupported features will help them utilize the tool effectively.
Documentation should be updated to clearly outline supported CLI options. This will be a crucial part of user education and should be easy to understand and follow.
This feature may impact the usage of Hypershift CLI across other projects and versions. A clear understanding of these impacts and planning for necessary interoperability test scenarios should be factored in during development.
As a self-managed HyperShift user I want to have a CLI tool that allows me to:
Definition of done:
As a HyperShift user I want to:
Definition of done:
As a user of HCP CLI, I want to be able to set some platform agnostic default flags when creating a HostedCluster:
so that I can set default values for these flags for my particular use cases.
Description of criteria:
The flags listed in HyperShift Create Cluster CLI that don't seem platform agnostic:
These flags are also out of scope:
This requires/does not require a design proposal.
This requires/does not require a feature gate.
As a self-managed HyperShift user I want to have a CLI tool that allows me to:
Definition of done:
As a HyperShift user I want to:
Definition of done:
As a self-managed HyperShift user I want to have a CLI tool that allows me to:
Definition of done:
As a HyperShift user I want to:
Definition of done:
As a self-managed HyperShift user I want to have a CLI tool that allows me to:
Definition of done:
As a self-managed HyperShift user I want to have a CLI tool that allows me to:
Definition of done:
As a software developer and user of HyperShift CLI, I would like a prototype of how the Makefile can be modified to build different versions of the HyperShift CLI, i.e., dev version vs productized version.
As a HyperShift user I want to:
Definition of done:
As a self-managed HyperShift user I want to have a CLI tool that allows me to:
Definition of done:
Enable release managers/Operator authors to manage Operator releases in the file-based catalog (FBC) based on the existing catalog (in SQLite) and distribute them to multiple OCP versions with ease.
Requirement | Notes | isMvp? |
---|---|---|
A declarative mechanism to automate the catalog update process in file-based catalog (FBC) with newly-published bundle references. | | Yes |
A declarative mechanism to publish Operator releases in file-based catalog (FBC) to multiple OCP releases. | | Yes |
A declarative mechanism to convert file-based catalog (FBC) to the sqlite database format so it can be published to OCP versions without FBC support. | | Yes |
A declarative mechanism to convert an existing catalog from sqlite database to the file-based catalog (FBC) basic template. | | Yes |
A declarative mechanism to convert an existing catalog from sqlite database to the file-based catalog (FBC) semver template when possible, and/or highlight the incomplete sections so users can more easily identify the gaps. | | No |
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | Yes |
Release Technical Enablement | Provide necessary release enablement details and documents. | Yes |
A catalog maintainer frequently needs to make changes to an OLM catalog whenever a new software version is released, promoting an existing version and releasing it to a different channel, or deprecating an existing version. All these often require non-trivial changes to the update graph of an Operator package. The maintainers need a git- and human-friendly maintenance approach that allows reproducing the catalog at all times and is decoupled from the release of their individual software versions.
The original imperative catalog maintenance approach, which relies on `replaces`, `skips`, `skipRange` attributes at the bundle level to define the relationships between versions and the update channels, is perceived as complicated from the Red Hat internal developer community. Hence, the new file-based catalog (FBC) is introduced with a declarative fashion and GitOps-friendly.
Furthermore, the concept so-called “template”, as an abstraction layer of the FBC, is introduced to simplify interacting with FBCs. While the “basic template” serves as a simplified abstraction of an FBC with all the `replaces`, `skips`, `skipRange` attributes supported and configurable at the package level, the “semver template” provides the capability to auto-generate an entire upgrade graph adhering to Semantic Versioning (semver) guidelines and consistent with best practices on channel naming.
Based on the feedback at KubeCon NA 2022, folks were generally excited about the features introduced with FBC and the UX provided by the templates. What is still missing is the tooling to enable the adoption.
Therefore, it is important to allow users to:
to help users adopt this novel file-based catalog approach and deliver value to customers with a faster release cadence and higher confidence.
Previously, bundle deprecation was handled by assigning an `olm.deprecated` property to the olm.bundle object. SQLite DBs had to have all valid upgrade edges supported by olm.bundle information in order to prevent foreign key violations. This property meant that the bundle was to be ignored and never installed.
FBC has a simpler method for achieving the same goal: don't include the bundle. Upgrade edges from it may still be specified, and the bundle will not be installable.
Likely an update to opm code base in the neighborhood of https://github.com/operator-framework/operator-registry/blob/249ae621bb8fa6fc8a8e4a5ae26355577393f127/pkg/sqlite/conversion.go#L80
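A minimal FBC sketch of the simpler approach: the channel entry keeps the upgrade edge while the deprecated bundle is simply not included in the catalog (package and version names are illustrative).

```yaml
---
schema: olm.channel
package: example-operator
name: stable
entries:
  - name: example-operator.v1.1.0
    # example-operator.v1.0.0 is intentionally omitted from the catalog, so it
    # cannot be installed, but upgrades from it are still resolvable
    replaces: example-operator.v1.0.0
```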
A/C:
This feature will track upstream work from the OpenShift Control Plane teams - API, Auth, etcd, Workloads, and Storage.
To continue and develop meaningful contributions to the upstream community including feature delivery, bug fixes, and leadership contributions.
Note: The matchLabelKeys field is a beta-level field and enabled by default in 1.27. You can disable it by disabling the MatchLabelKeysInPodTopologySpread [feature gate](https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/).
Removing from the TP as the feature is enabled by default.
Just a clean up work.
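For reference, a minimal sketch of the upstream field as it appears on a workload's topology spread constraints:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: example
          # spreads only among pods of the current ReplicaSet revision
          matchLabelKeys:
            - pod-template-hash
```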
Upstream K8s deprecated PodSecurityPolicy and replaced it with a new built-in admission controller that enforces the Pod Security Standards (see here for the motivations for the deprecation). There is an OpenShift-specific dedicated pod admission system called Security Context Constraints. Our aim is to keep the Security Context Constraints pod admission system while also allowing users to have access to the Kubernetes Pod Security Admission.
With OpenShift 4.11, we turned on Pod Security Admission with global "privileged" enforcement. Additionally, we set the "restricted" profile for warnings and audit. This configuration made it possible for users to opt their namespaces in to Pod Security Admission with the per-namespace labels. We also introduced a new mechanism that automatically synchronizes the Pod Security Admission "warn" and "audit" labels.
With OpenShift 4.15, we intend to move the global configuration to enforce the "restricted" pod security profile globally. With this change, the label synchronization mechanism will also switch into a mode where it synchronizes the "enforce" Pod Security Admission label rather than the "audit" and "warn".
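For illustration, these are the per-namespace Pod Security Admission labels involved (the namespace name is hypothetical); with the new default, the synchronization mechanism would manage the "enforce" label rather than only "warn" and "audit":
apiVersion: v1
kind: Namespace
metadata:
  name: example-app
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted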
Epic Goal
Get Pod Security admission to be run in "restricted" mode globally by default alongside SCC admission.
Enable installation and lifecycle support of OpenShift 4 on Oracle Cloud Infrastructure (OCI) with VMs
Currently, we don't yet support OpenShift 4 on Oracle Cloud Infrastructure (OCI), and we know from initial attempts that installing OpenShift on OCI requires the use of a QCOW image (the OpenStack QCOW seems to work fine) and involves networking and routing changes, storage issues, potential MTU and registry issues, etc.
TBD based on customer demand.
Why is this important?
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
RFEs:
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Questions to be addressed:
Other
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
Description of the problem:
Currently, the infrastructure object is created as follows:
# oc get infrastructure/cluster -oyaml
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  creationTimestamp: "2023-06-19T13:49:07Z"
  generation: 1
  name: cluster
  resourceVersion: "553"
  uid: 240dc176-566e-4471-b9db-fb25c676ba33
spec:
  cloudConfig:
    name: ""
  platformSpec:
    type: None
status:
  apiServerInternalURI: https://api-int.test-infra-cluster-97ef21c5.assisted-ci.oci-rhelcert.edge-sro.rhecoeng.com:6443
  apiServerURL: https://api.test-infra-cluster-97ef21c5.assisted-ci.oci-rhelcert.edge-sro.rhecoeng.com:6443
  controlPlaneTopology: HighlyAvailable
  cpuPartitioning: None
  etcdDiscoveryDomain: ""
  infrastructureName: test-infra-cluster-97-w6b42
  infrastructureTopology: HighlyAvailable
  platform: None
  platformStatus:
    type: None
instead it should be similar to:
# oc get infrastructure/cluster -oyaml
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  creationTimestamp: "2023-06-19T13:49:07Z"
  generation: 1
  name: cluster
  resourceVersion: "553"
  uid: 240dc176-566e-4471-b9db-fb25c676ba33
spec:
  cloudConfig:
    name: ""
  platformSpec:
    type: External
    external:
      platformName: oci
status:
  apiServerInternalURI: https://api-int.test-infra-cluster-97ef21c5.assisted-ci.oci-rhelcert.edge-sro.rhecoeng.com:6443
  apiServerURL: https://api.test-infra-cluster-97ef21c5.assisted-ci.oci-rhelcert.edge-sro.rhecoeng.com:6443
  controlPlaneTopology: HighlyAvailable
  cpuPartitioning: None
  etcdDiscoveryDomain: ""
  infrastructureName: test-infra-cluster-97-w6b42
  infrastructureTopology: HighlyAvailable
  platform: External
  platformStatus:
    type: External
    external:
      cloudControllerManager:
        state: External
How reproducible:
Steps to reproduce:
1.
2.
3.
Actual results:
Expected results:
We currently rely on a hack to deploy a cluster on the external platform: https://github.com/openshift/assisted-service/pull/5312
The goal of this ticket is to move the definition of the external platform into the install-config once the OpenShift installer is released with support for the external platform: https://github.com/openshift/installer/pull/7217
The taint here: https://github.com/openshift/assisted-installer/pull/629/files#diff-1046cc2d18cf5f82336bbad36a2d28540606e1c6aaa0b5073c545301ef60ffd4R593
should only be removed when the platform is Nutanix or vSphere, because the credentials for these platforms are passed after cluster installation.
By contrast, on Oracle Cloud the instance gets its credentials through the instance metadata, and should be able to label the nodes from the beginning of the installation without any user intervention.
Description of the problem:
The features API tells us that EXTERNAL_PLATFORM_OCI is supported for version 4.14 and the s390x CPU architecture, but the attempt to create the cluster fails with "Can't set oci platform on s390x architecture".
Steps to reproduce:
1. Register cluster with OCI platform and z architecture
There are 2 options to detect if the hosts are running on OCI:
1/ On OCI, the machine will have the following chassis-asset-tag:
# dmidecode --string chassis-asset-tag
OracleCloud.com
In the agent, we can override hostInventory.SystemVendor.Manufacturer when chassis-asset-tag="OracleCloud.com".
2/ Read instance metadata: curl -v -H "Authorization: Bearer Oracle" http://169.254.169.254/opc/v2/instance
It will allow the auto-detection of the platform from the provider in assisted-service, and validate that hosts are running in OCI when installing a cluster with platform=oci
Update is_external API description to something less confusing: https://redhat-internal.slack.com/archives/CUPJTHQ5P/p1687452115222069
Description of the problem:
I've tested a cluster with platform type 'baremetal' and hosts discovered. Then, when I try to change to the Nutanix platform, the BE returns an error.
How reproducible:
100%
Steps to reproduce:
1. Create cluster without platform integration
2. Discover 3 hosts
3. Try to change platform to 'Nutanix'
Actual results:
API returns an error.
Expected results:
We should be able to change the platform type; this change should be agnostic to the discovered hosts.
Based on the feature support matrix, implement the validations in the assisted service
The external platform will be available behind the TechPreviewNoUpgrade feature set; automatically enable this flag in the installer config when the oci platform is selected.
Currently the API call "GET /v2/clusters/{cluster_id}/supported-platforms" returns the hosts' supported platforms regardless of the other cluster parameters.
In order to install the Oracle CCM driver, we need the ability to set the platform to "external" in the install-config.
The platform needs to be added here: https://github.com/openshift/assisted-service/blob/3496d1d2e185343c6a3b1175c810fdfd148229b2/internal/installcfg/installcfg.go#L8
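A minimal sketch of what the resulting install-config stanza could look like, assuming the external platform API from the linked installer PR (the featureSet line reflects the Tech Preview gating mentioned above; values are illustrative):
platform:
  external:
    platformName: oci
    cloudControllerManager: External   # signals that a cloud controller manager (the Oracle CCM) is installed separately
featureSet: TechPreviewNoUpgrade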
Slack thread: https://redhat-internal.slack.com/archives/CUPJTHQ5P/p1678801176091619
The goal of this ticket is to check whether, besides the external platform, the AI can install the CCM, and to document it.
Unify and update hosted control planes storage operators so that they have similar code patterns and can run properly in both standalone OCP and HyperShift's control plane.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. Initial completion during Refinement status.
Which other projects and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Epic Goal*
In 4.12 we tried several approaches to writing an operator that works both in standalone OCP and in HyperShift's control plane running in the management cluster.
These operators need to be changed:
We need to unify the operators to use a similar approach, so the code in our operators looks the same.
In addition, we need to update the operators to:
Why is this important? (mandatory)
It will simplify our operators - we will have the same pattern in all of them.
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
OCP regression tests work, both on standalone OCP and HyperShift.
Drawbacks or Risk (optional)
We could introduce regressions
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
We should refactor the CSO so as to remove duplication of code between HyperShift and standalone deployments.
We are also going to reduce duplication of manifests so that templates can be reused between HyperShift and standalone clusters.
This feature is the placeholder for all epics related to technical debt associated with the Console team.
Outcome Overview
Once all Features and/or Initiatives in this Outcome are complete, what tangible, incremental, and (ideally) measurable movement will be made toward the company's Strategic Goal(s)?
Success Criteria
What is the success criteria for this strategic outcome? Avoid listing Features or Initiatives and instead describe "what must be true" for the outcome to be considered delivered.
Expected Results (what, how, when)
What incremental impact do you expect to create toward the company's Strategic Goals by delivering this outcome? (possible examples: unblocking sales, shifts in product metrics, etc. + provide links to metrics that will be used post-completion for review & pivot decisions). For each expected result, list what you will measure and when you will measure it (ex. provide links to existing information or metrics that will be used post-completion for review and specify when you will review the measurement such as 60 days after the work is complete)
Post Completion Review – Actual Results
After completing the work (as determined by the "when" in Expected Results above), list the actual results observed / measured during Post Completion review(s).
An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but that tend to fall by the wayside.
Added client certificates based on https://github.com/deads2k/openshift-enhancements/blob/master/enhancements/monitoring/client-cert-scraping.md
The custom history.pushPath function in public/components/utils/router.ts does not seem to have any purpose other than being an alias to the standard history.push function.
Bump to the latest Typescript version (or at least 4.5). We'll need to update the TypeScript dependency and any packages that need to be updated.
https://devblogs.microsoft.com/typescript/announcing-typescript-4-1/
TypeScript releases: https://github.com/microsoft/TypeScript/releases
Goal
Guided installation user experience that interacts via prompts for necessary inputs, informs of erroneous/invalid inputs, and provides status and feedback throughout the installation workflow with very few steps, that works for disconnected, on-premises environments.
Installation is performed from a bootable image that doesn't contain cluster details or user details, since these details will be collected during the installation flow after booting the image in the target nodes.
This means that the image is generic and can be used to install an OpenShift cluster in any supported environment.
Why is this important?
Customers/partners desire a guided installation experience to deploy OpenShift with a UI that includes support for disconnected, on-premises environments, and which is as flexible in terms of configuration as UPI.
We have partners that need to provide an installation image that can be used to install new clusters on any location and for any users, since their business is to sell the hardware along with OpenShift, where OpenShift needs to be installable in the destination premises.
Acceptance Criteria
This should provide an experience closely matching the current hosted service (Assisted Installer), with the exception that it is limited to a single cluster, because the host running the service will reboot and become a node in the cluster as part of the deployment process.
Dependencies
(If the former option is selected, the IP address should be displayed so that it can be entered in the other hosts.)
Currently we use templating to set the NodeZero IP address in a number of different configuration files and scripts.
We should move this configuration to a single file (/etc/assisted/rendezvous-host.env) and reference it only from there, e.g. as a systemd environment file.
We also template values like URLs, because it is easier and safer to do this in golang (e.g. to use an IP address that may be either IPv4 or IPv6 in a URL) than in bash. We may need to include all of these variables in the file.
This will enable us to interactively configure the rendezvousIP in a single place.
Block services that depend on knowing the rendezvousIP from starting until the rendezvousIP configuration file created in AGENT-555 exists. This will probably take the form of just looping in node-zero.service until the file is present. The systemd configuration may need adjustments to prevent the service from timing out.
While we are waiting, a message should be displayed on the hardware console indicating what is happening.
Modify the cluster registration code in the assisted-service client (used by create-cluster-and-infraenv.service) to allow creating the cluster given only the following config manifests:
If the following manifests are present, data from them should be used:
Other manifests (ClusterDeployment, AgentClusterInstall) will not be present in an interactive install, and the information therein will be entered via the GUI instead.
A CLI flag or environment variable can be used to select the interactive mode.
The Control Plane MachineSet enables OCP clusters to scale Control plane machines. This epic is about making the Control Plane MachineSet controller work with OpenStack.
The Control Plane MachineSet enables OCP clusters to scale Control plane machines. This epic is about making the Control Plane MachineSet controller work with OpenStack.
The FailureDomain API that was introduced in 4.13 was Tech Preview and is now replaced by an API in openshift/api; it no longer lives in the installer.
Therefore, we want to clean the unsupported API out of the installer so that later we can add the supported API in order to support CPMS on OpenStack.
Add the OpenStack FailureDomain into CPMSO
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2000-graceful-node-shutdown
As an OpenShift developer, I want to have confidence that the graceful restart feature works and stays working in the future through various code changes. To that end, please add at least the following 2 E2E tests:
Track goals/requirements for self-managed GA of Hosted control planes on BM using the agent provider. Mainly make sure:
This Section:
Customers are looking at HyperShift to deploy self-managed clusters on Baremetal. We have positioned the Agent flow as the way to get BM clusters due to its ease of use (it automates many of the rather mundane tasks required to set up BM clusters), and it's planned for GA with MCE 2.3 (in the OCP 4.13 timeframe).
Questions to be addressed:
Group all tasks for CAPI-provider-agent GA readiness
no
Feature origin (who asked for this feature?)
The test waits until all pods in the control plane namespace report ready status, but collect-profiles is a job that sometimes completes before other pods are ready.
Once the collect-profiles pod is completed it terminates and its status moves to ready=false.
From there onwards the test is stuck.
Support Dual-Stack Networking (IPv4 & IPv6) for hosted control planes.
Many of our customers, especially Telco providers, have a need to support IPv6 but can't do so immediately; they would still have legacy IPv4 workloads. To support both stacks, an OpenShift cluster must be capable of allowing communication for both flavors. I.e., an OpenShift cluster running with hosted control planes should be able to allow workloads to access both IP stacks.
As a cluster operator, you have the option to expose external endpoints using one or both address families, in any order that suits your needs. OpenShift does not make any assumptions about the network it operates on. For instance, if you have a small IPv4 address space, you can enable dual-stack on some of your cluster nodes and have the rest running on IPv6, which typically has a more extensive address space available.
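For reference, this is the kind of dual-stack networking section a HostedCluster carries in this scenario (the CIDRs are taken from the bug reports below):
networking:
  clusterNetwork:
  - cidr: 10.132.0.0/14
  - cidr: fd01::/48
  networkType: OVNKubernetes
  serviceNetwork:
  - cidr: 172.31.0.0/16
  - cidr: fd02::/112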
Description of problem:
When deploying a dual-stack HostedCluster the KAS certificate won't be created with the proper SAN. If we look into a regular dual-stack cluster we can see the certificate gets generated as follows:
X509v3 Subject Alternative Name: DNS:kubernetes, DNS:kubernetes.default, DNS:kubernetes.default.svc, DNS:kubernetes.default.svc.cluster.local, DNS:openshift, DNS:openshift.default, DNS:openshift.default.svc, DNS:openshift.default.svc.cluster.local, DNS:172.30.0.1, DNS:fd02::1, IP Address:172.30.0.1, IP Address:FD02:0:0:0:0:0:0:1
whereas in a dual-stack hosted cluster this is the SAN:
X509v3 Subject Alternative Name: DNS:localhost, DNS:kubernetes, DNS:kubernetes.default, DNS:kubernetes.default.svc, DNS:kubernetes.default.svc.cluster.local, DNS:kube-apiserver, DNS:kube-apiserver.clusters-hosted.svc, DNS:kube-apiserver.clusters-hosted.svc.cluster.local, DNS:api.hosted.dual.lab, DNS:api.hosted.hypershift.local, IP Address:127.0.0.1, IP Address:172.31.0.1
As you can see, the IPv6 pod and service IPs are missing from the certificate. This causes issues on some controllers when contacting the KAS, for example:
E0711 16:51:42.536367 1 reflector.go:140] github.com/openshift/router/pkg/router/template/service_lookup.go:33: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://172.31.0.1:443/api/v1/services?limit=500&resourceVersion=0": x509: cannot validate certificate for 172.31.0.1 because it doesn't contain any IP SANs
Version-Release number of selected component (if applicable):
latest
How reproducible:
Always
Steps to Reproduce:
1. Deploy a HC with the networking settings specified and using the image with dual stack patches included quay.io/jparrill/hypershift:OCPBUGS-15331-mix-413v4
Actual results:
KubeApiserver cert gets generated with the wrong SAN config.
Expected results:
KubeApiserver cert gets generated with the correct SAN config.
Additional info:
Description of problem:
Installing a 4.14 self-managed hosted cluster on a dual-stack hub with the "hypershift create cluster agent" command. The logs of the hypershift operator pod show a bunch of these errors:
{"level":"error","ts":"2023-06-08T13:36:26Z","msg":"Reconciler error","controller":"hostedcluster","controllerGroup":"hypershift.openshift.io","controllerKind":"HostedCluster","hostedCluster":{"name":"hosted-0","namespace":"clusters"},"namespace":"clusters","name":"hosted-0","reconcileID":"a0a0f44f-7bbe-499f-95b0-e24b793ee48c","error":"failed to reconcile network policies: failed to reconcile kube-apiserver network policy: NetworkPolicy.extensions \"kas\" is invalid: spec.egress[1].to[0].ipBlock.except[1]: Invalid value: \"fd01::/48\": must be a strict subset of `cidr`","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234"}
The hostedcluster CR is showing the same ReconciliationError. Note that the networking section in the hostedcluster CR created by the "hypershift create cluster agent" command has only IPv4 CIDRs:
networking:
  clusterNetwork:
  - cidr: 10.132.0.0/14
  networkType: OVNKubernetes
  serviceNetwork:
  - cidr: 172.31.0.0/16
while services have IPv6 nodeport addresses.
Version-Release number of selected component (if applicable):
$ oc version Client Version: 4.14.0-0.nightly-2023-06-05-112833 Kustomize Version: v4.5.7 Server Version: 4.14.0-0.nightly-2023-06-05-112833 Kubernetes Version: v1.27.2+cc041e8
How reproducible:
100%
Steps to Reproduce:
1. Install a 4.14 OCP dual-stack BM hub cluster
2. Install MCE 2.4 and the Hypershift operator
3. Install a hosted cluster with the "hypershift create cluster agent" command
Actual results:
hosted cluster CR shows ReconciliationError:
- lastTransitionTime: "2023-06-08T10:55:33Z"
  message: 'failed to reconcile network policies: failed to reconcile kube-apiserver network policy: NetworkPolicy.extensions "kas" is invalid: spec.egress[1].to[0].ipBlock.except[1]: Invalid value: "fd01::/48": must be a strict subset of `cidr`'
  observedGeneration: 2
  reason: ReconciliationError
  status: "False"
  type: ReconciliationSucceeded
Expected results:
ReconciliationSucceeded condition should be True
Additional info:
Logs and CRDs produced by the failed job: https://s3.upshift.redhat.com/DH-PROD-OCP-EDGE-QE-CI/ocp-spoke-assisted-operator-deploy/8044/post-mortem.zip
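To illustrate the Kubernetes validation rule the error above refers to, a minimal sketch (the policy name, namespace, and CIDRs are illustrative): every entry under ipBlock.except must fall inside its ipBlock.cidr, so an IPv6 except entry has to be paired with an IPv6 cidr.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: kas-example
  namespace: clusters-hosted
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 10.132.0.0/14
        except:
        - 10.132.0.0/16     # valid: a strict subset of the IPv4 cidr
    - ipBlock:
        cidr: fd01::/48
        except:
        - fd01:0:0:1::/64   # the IPv6 exception lives under an IPv6 cidr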
Description of problem:
When deploying a dual-stack HostedCluster the worker nodes will not fully join the cluster because the CNI plugin doesn't start. If we check the cluster-network-operator pod we will see the following error:
I0711 13:46:16.012420 1 log.go:198] Failed to validate Network.Spec: hostPrefix 23 is larger than its cidr fd01::/48
It seems that it is validating the IPv4 hostPrefix against the IPv6 pod network. This is how the networking spec for the HC looks:
networking:
  clusterNetwork:
  - cidr: 10.132.0.0/14
  - cidr: fd01::/48
  networkType: OVNKubernetes
  serviceNetwork:
  - cidr: 172.31.0.0/16
  - cidr: fd02::/112
Version-Release number of selected component (if applicable):
latest
How reproducible:
Always
Steps to Reproduce:
1. Deploy a HC with the networking settings specified and using the image with dual stack patches included quay.io/jparrill/hypershift:OCPBUGS-15331-mix-413v2
Actual results:
CNI is not deployed
Expected results:
CNI is deployed
Additional info:
Discussed on slack https://redhat-internal.slack.com/archives/C058TF9K37Z/p1689078655055779
To run a HyperShift management cluster in disconnected mode we need to document which images need to be mirrored and potentially modify the images we use for OLM catalogs.
ICSP mapping only happens for image references with a digest, not a regular tag. We need to address this for images we reference by tag:
CAPI, CAPI provider, OLM catalogs
Currently OLM catalogs placed in the control plane use image references to a tag so that the latest can be pulled when the catalog is restarted. There is a CRON job that restarts the deployment on a regular basis.
The issue with this is that the image cannot be mirrored for offline deployments, nor can it be used in environments (IBM Cloud) where all images running on a management cluster need to be approved beforehand by digest.
As a user of Hosted Control Planes, I would like the HCP Specification API to support both ICSP & IDMS.
IDMS is replacing ICSP in OCP 4.13+. hcp.Spec.ImageContentSources was updated in OCPBUGS-11939 to replace ICSP with IDMS. This needs to be reverted and something new added to support IDMS in addition to ICSP.
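For reference, a hedged sketch of the cluster-scoped IDMS resource that replaces ICSP in OCP 4.13+ (registry names are illustrative):
apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata:
  name: example-mirrors
spec:
  imageDigestMirrors:
  - source: quay.io/openshift-release-dev/ocp-v4.0-art-dev
    mirrors:
    - mirror.registry.example.com/ocp-v4.0-art-dev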
Description of problem:
HostedClusterConfigOperator doesn't check the OperatorHub object in the Hosted Cluster. This means that the default CatalogSources cannot be disabled. If there are failing CatalogSources, operator deployments might be impacted.
Version-Release number of selected component (if applicable):
Any
How reproducible:
Always
Steps to Reproduce:
1. Deploy a HostedCluster
2. Connect to the hostedcluster and patch the operatorhub object: `oc --kubeconfig ./hosted-kubeadmin patch OperatorHub cluster --type json -p '[{"op": "add", "path": "/spec/disableAllDefaultSources", "value": true}]'`
3. CatalogSources objects won't be removed from the openshift-marketplace namespace.
Actual results:
CatalogSources objects are not removed from the openshift-marketplace namespace.
Expected results:
CatalogSources objects are removed from the openshift-marketplace namespace.
Additional info:
This is the code where we can see that the reconcile will create the CatalogSources every time: https://github.com/openshift/hypershift/blob/dba2e9729024ce55f4f2eba8d6ccb8801e78a022/control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go#L1285
As a user of hosted clusters in disconnected environments, I would like RegistryClientImageMetadataProvider to support registry overrides so that registry lookups utilize the registries in the registry overrides rather than what might be listed in the image reference.
Description of problem:
When a user configures HostedCluster.Spec.additionalTrustBundle, some deployments add this trust bundle using a volume. The ignition-server deployment doesn't add this volume.
Version-Release number of selected component (if applicable):
Any
How reproducible:
Always
Steps to Reproduce:
1. Deploy a HostedCluster with additionalTrustBundle
2. Check ignition-server deployment configuration
Actual results:
No trust bundle configured
Expected results:
Trust bundle configured.
Additional info:
There is missing code. Ignition-server-proxy does configure the trust bundle: https://github.com/openshift/hypershift/blob/main/hypershift-operator/controllers/hostedcluster/ignitionserver/ignitionserver.go#L745-L748 Ignition-server does not: https://github.com/openshift/hypershift/blob/main/control-plane-operator/controllers/hostedcontrolplane/ignitionserver/ignitionserver.go#L694
Phase 2 Goal:
for Phase-1, incorporating the assets from different repositories to simplify asset management.
Overarching Goal
Move to using the upstream Cluster API (CAPI) in place of the current implementation of the Machine API for standalone Openshift.
Phase 1 & 2 covers implementing base functionality for CAPI.
Phase 2 also covers migrating MAPI resources to CAPI.
There must be no negative effect on customers/users of the MAPI; this API must continue to be accessible to them, though how it is implemented "under the covers", and whether that implementation leverages CAPI, is open.
The cluster-capi-operator repository contains several CAPI E2E tests for specific providers.
We run these tests on every PR that lands on that repository.
In order to test rebases of the cluster-api providers, we want to run these tests there as well, to prove that rebase PRs are not breaking CAPI functionality.
DoD:
CgroupV2 is GA as of OCP 4.13.
RHEL 9 defaults to v2 and we want to make sure we are in sync.
v1 support in systemd will end by the end of 2023.
What needs to be done
https://docs.google.com/document/d/1i6IEGjaM0-NeMqzm0ZnVm0UcfcVmZqfQRG1m5BjKzbo/edit
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
This issue consists of the following changes.
Add support to the OpenShift Installer to set up the field 'managedBy' on the Azure Resource Group
As a user, I want to be able to provide a new field in the Installer's manifest to configure the `managedBy` tag on the Azure Resource Group.
The Installer will provide a new field via the Install Config manifest to be used to tag the Azure Resource Group.
This is a requirement for the ARO SRE teams for their automation tool to identify these resources.
ARO needs this field set for their automation tool in the background. Doc for more details.
This new additional field will need to be documented as any other field supported via the Install Config manifest
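Purely as an illustration of where such a field could live in the install-config; the field name `resourceGroupManagedBy` and its value below are hypothetical and not confirmed by this feature:
platform:
  azure:
    region: eastus
    resourceGroupManagedBy: /subscriptions/<subscription-id>/resourceGroups/aro-sre-rg   # hypothetical field name and value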
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
As an ARO developer, I want to be able to:
so that
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Catch-all epic for cleanup work around the now non-machineconfig certificate bundles written by the MCO (kubelet, image registry)
Once we remove the certificates from the MachineConfigs, the controllerconfig would be the canonical location for all certificates.
We should increase visibility, potentially by adding a new configmap that all components plus the console/users can read, or by bubbling up status better in the controllerconfig object itself.
MVP aims at refactoring MirrorToDisk and DiskToMirror for OCP releases
As an MVP, this epic covers the work for RFE-3800 (includes RFE-3393 and RFE-3733) for mirroring releases.
The full description / overview of the enclave support is best described here
The design document can be found here
Upcoming epics, such as CFE-942 will complete the RFE work with mirroring operators, additionalImages, etc.
Architecture Overview (diagram)
As a developer, I want to create an implementation based on a local container registry as the backing technology for mirroring to disk, so that:
Add support of NAT Gateways in Azure while deploying OpenShift on this cloud to manage the outbound network traffic and make this the default option for new deployments
While deploying OpenShift on Azure, the Installer will configure NAT Gateways as the default method to handle outbound network traffic, so we can prevent the existing SNAT port exhaustion issues related to the default configured outboundType.
The installer will use the NAT Gateway object from Azure to manage the outbound traffic from OpenShift.
The installer will create a NAT Gateway object per AZ in Azure so the solution is HA.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
Using NAT Gateway for egress traffic is the recommended approach from Microsoft
This is also a common ask from different enterprise customers, as with the current solution used by OpenShift for outbound traffic management in Azure they are hitting SNAT port exhaustion issues.
Which other projects and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
This work depends on the work done in CORS-2564
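A sketch of how this could surface in the install-config, assuming the existing Azure `outboundType` field gains a NAT Gateway option; the `NatGateway` value shown here is an assumption and is not confirmed by this feature text:
platform:
  azure:
    region: eastus
    outboundType: NatGateway   # assumption; the currently documented values are Loadbalancer and UserDefinedRouting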
As a user, I want to be able to:
so that I can achieve
This requires/does not require a design proposal.
This requires/does not require a feature gate.
The MCO should properly report its state in a way that's consistent and able to be understood by customers, troubleshooters, and maintainers alike.
Some customer cases have revealed scenarios where the MCO state reporting is misleading and therefore could be unreliable to base decisions and automation on.
In addition to correcting some incorrect states, the MCO will be enhanced for a more granular view of update rollouts across machines.
The MCO should properly report its state in a way that's consistent and able to be understood by customers, troubleshooters, and maintainers alike.
For this epic, "state" means "what is the MCO doing?" – so the goal here is to try to make sure that it's always known what the MCO is doing.
This includes:
While this probably crosses a little bit into the "status" portion of certain MCO objects, as some state is definitely recorded there, this probably shouldn't turn into a "better status reporting" epic. I'm interpreting "status" to mean "how is it going" so status is maybe a "detail attached to a state".
Exploration here: https://docs.google.com/document/d/1j6Qea98aVP12kzmPbR_3Y-3-meJQBf0_K6HxZOkzbNk/edit?usp=sharing
https://docs.google.com/document/d/17qYml7CETIaDmcEO-6OGQGNO0d7HtfyU7W4OMA6kTeM/edit?usp=sharing
During upgrade tests, the MCO will become temporarily degraded with the following events showing up in the event log:
Dec 13 17:34:58.478 E clusteroperator/machine-config condition/Degraded status/True reason/RequiredPoolsFailed changed: Unable to apply 4.11.0-0.ci-2022-12-13-153933: error during syncRequiredMachineConfigPools: [timed out waiting for the condition, pool master has not progressed to latest configuration: controller version mismatch for rendered-master-3c738a0c86e7fdea3b5305265f2a2cdb expected 92012a837e2ed0ed3c9e61c715579ac82ad0a464 has 768f73110bc6d21c79a2585a1ee678d5d9902ad5: 2 (ready 2) out of 3 nodes are updating to latest configuration rendered-master-61c5ab699262647bf12ea16ea08f5782, retrying]
This seems to be occurring with some frequency as indicated by its prevalence in CI search:
$ curl -s 'https://search.ci.openshift.org/search?search=clusteroperator%2Fmachine-config+condition%2FDegraded+status%2FTrue+reason%2F.*controller+version+mismatch&maxAge=48h&context=1&type=bug%2Bissue%2Bjunit&name=%5E%28periodic%7Crelease%29.*4%5C.1%5B1%2C2%5D.*&excludeName=&maxMatches=1&maxBytes=20971520&groupBy=job' | jq 'keys | length'
399
The MCO should not become degraded during an upgrade unless it cannot proceed with the upgrade. In the case of these failures, I think we're timing out at some point during node reboots as either 1 or 2 of the control plane nodes are ready, with the third being unready. The MCO eventually requeues the syncRequiredMachineConfigPools step and the remaining nodes reboot and the MCO eventually clears the Degraded status.
Indeed, looking at the event breakdown, one can see that control plane nodes take ~21 minutes to roll out their new config with OS upgrades. By comparison, the worker nodes take ~15 minutes.
Meanwhile, the portion of the MCO which performs this sync (the syncRequiredMachineConfigPools function) has a hard-coded timeout of 10 minutes. Additionally, to my understanding, there is an additional 10 minute grace period before the MCO marks itself as degraded. Since the control plane nodes took ~21 minutes to completely reboot and roll out their new configs, we've exceeded the time needed. With this in mind, I propose a path forward:
When the cluster does not have v1 builds, console needs to either provide different ways to build applications or prevent erroneous actions.
Identify the build system in place and prompt user accordingly when building applications.
Console will have to hide any workflows that rely solely on BuildConfigs and Pipelines when they are not installed.
ODC Jira - https://issues.redhat.com/browse/ODC-7352
When the cluster does not have v1 builds, console needs to either provide different ways to build applications or prevent erroneous actions.
Identify the build system in place and prompt user accordingly when building applications.
Without this enhancement, users will encounter issues when trying to create applications on clusters that do not have the default s2i setup.
Console will have to hide any workflows that rely solely on BuildConfigs and Pipelines when they are not installed.
If we detect Shipwright, then we can call that API instead of buildconfigs. We need to understand the timelines for the latter part, and create a separate work item for it.
If both buildconfigs and Shipwright are available, then we should default to Shipwright. This will be part of the separate work item needed to support Shipwright.
Rob Gormley to confirm timelines for when customers will have the option to remove BuildConfigs from their clusters. That will determine whether we take on this work in 4.15 or 4.16.
Description of problem:
Version-Release number of selected component (if applicable):
Tested with https://amd64.ocp.releases.ci.openshift.org/releasestream/4.14.0-0.nightly/release/4.14.0-0.nightly-2023-08-21-033349
How reproducible:
Always with the latest nightly build when the Build and DeploymentConfig capabilities are disabled
Steps to Reproduce:
Create a 4.14 shared cloud and disable the capabilities for Samples, Builds and DeploymentConfigs
capabilities:
  baselineCapabilitySet: None
  additionalEnabledCapabilities:
    - baremetal
    - Console
    - Insights
    - marketplace
    - Storage
    # - openshift-samples
    - CSISnapshot
    - NodeTuning
    - MachineAPI
    # - Build
    # - DeploymentConfig
Actual results:
The following main navigation entries are missing:
(Only Helm, ConfigMap and Secret are shown.)
The add page still shows "Import from Git", which cannot be used to import a resource without the BuildConfig.
Expected results:
All navigation items should be displayed.
The add page should not show "Import from Git" if the BuildConfig CRD isn't installed.
Additional info:
More details at ARO managed identity scope and impact.
This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP requirement gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Questions to be addressed:
Epic to capture the items not blocking for OCPSTRAT-506 (OCPBU-8)
Evaluate if any of the ARO predefined roles in the credentials request manifests of OpenShift cluster operators give elevated permissions. Remove any such predefined role from spec.predefinedRoles field and replace it with required permissions in the new spec.permissions field.
This effort is dependent on the completion of work for CCO-187, and effort in dependent modules is planned to be worked on by the CCO team unless individual repo owners can help. Operator owners/teams will be expected to review merge requests and complete the appropriate QE effort for an OpenShift release.
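A hedged sketch of the intended shape of such a credentials request change; only the spec.predefinedRoles and spec.permissions field names come from this feature, while the nesting, role name, and permission strings below are illustrative:
apiVersion: cloudcredential.openshift.io/v1
kind: CredentialsRequest
metadata:
  name: openshift-example-operator
  namespace: openshift-cloud-credential-operator
spec:
  providerSpec:
    apiVersion: cloudcredential.openshift.io/v1
    kind: AzureProviderSpec
    # before: a broad predefined role (field placement is illustrative)
    # predefinedRoles:
    # - Contributor
    # after: only the permissions the operator actually needs
    permissions:
    - Microsoft.Network/loadBalancers/read
    - Microsoft.Network/loadBalancers/write
  secretRef:
    name: example-operator-credentials
    namespace: openshift-example-operator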
Address technical debt around self-managed HCP deployments, including but not limited to
Description of the Problem:
When we deploy an IPv6/disconnected HostedCluster, we can see that the Ingress Cluster Operator appears degraded, showing this message:
clusteroperator.config.openshift.io/ingress 4.14.0-0.nightly-2023-08-29-102237 True False True 43m The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitive Failures: Canary route checks for the default ingress controller are failing)
We can also reach the canary route from the ingress-operator pod using curl, but the Go code cannot.
2023-08-31T16:23:07.264Z ERROR operator.canary_controller wait/backoff.go:226 error performing canary route check {"error": "error sending canary HTTP request to \"canary-openshift-ingress-canary.apps.hosted.hypershiftbm.lab\": Get \"https://canary-openshift-ingress-canary.apps.hosted.hypershiftbm.lab/\": socks connect tcp 127.0.0.1:8090->canary-openshift-ingress-canary.apps.hosted.hypershiftbm.lab:443: unknown error host unreachable"}
After a debugging session, it looks like DNS resolution from the ingress operator through the SOCKS proxy, which also goes through the Konnectivity component, does not work properly because it delegates the resolution to the hub cluster, which is not the desired behaviour.
Release Shipwright as OpenShift Builds GA
The scope of GA is:
This GA release is intended to make a fully-supported offering of OpenShift Builds driven by the Shipwright framework. This includes both CLI and Operator usage. All Red Hat supported build engines are supported (buildah, s2i), but the priority is to ensure that there are no blocking Buildpacks-related issues and to triage and resolve non-blocking issues.
The softer goal of this GA release is to start to draw users to the Shipwright ecosystem, which should allow them greater flexibility in bringing their CI/CD workloads to the OpenShift platform.
Functionality and roadmap items not specifically related to improving support for Buildpacks.
No known external dependencies.
Background, and strategic fit
The overarching goal for OpenShift Builds is to provide an extensible and modular framework for integrating into development workflows. Interoperability should be considered a priority, and build strategy-specific code should be kept to a minimum or implemented in a manner such that support for other build strategies is not impacted wherever possible.
Shipwright is an upstream community project with its own goals and direction, and while we are involved heavily in the project, we need to ensure buy-in for our initiatives, and/or determine what functionality and features we are “willing” to accept as downstream-only.
No assumptions are made about hardware, software, or people resources.
Customer Considerations
None.
Documentation will heavily rely on the upstream Shipwright documentation. Documentation plan is here.
N/A
GA involves the status of Shipwright's Build, CLI, and Operator projects for the upstream version v0.12.0. More information can be found at https://shipwright.io
OpenShift currently has limited support for Shipwright builds. Additionally, this support is marked as Tech Preview and uses the alpha version of the API.
Provide additional support for Shipwright builds, moving to the beta API and removing the Tech Preview labels.
Supporting layered products
Shipwright integration / OpenShift Builds will go GA with 4.14.
Description of problem:
When the user selects All namespaces in the admin perspective and navigates to Builds > BuildConfigs or Builds > Shipwright Builds (if the operator is installed), the last runs are selected based on their name. But the filter doesn't check that the Build / BuildRun is in the same namespace as the BuildConfig / Build.
Version-Release number of selected component (if applicable):
4.14 after we've merged ODC-7306 / https://github.com/openshift/console/pull/12809
How reproducible:
Always
Steps to Reproduce:
For Shipwright Builds and BuildRuns you need to install the Pipelines operator and the Shipwright operator, create a Shipwright Build resource (to enable the SW operator), as well as Builds and BuildRuns in two different namespaces.
You can find some Shipwright Build samples here: https://github.com/openshift/console/tree/master/frontend/packages/shipwright-plugin/samples
Actual results:
Both BuildConfigs are shown, but both show the same Build as the last run.
Expected results:
Both BuildConfigs should show and link the Build from their own namespace.
Additional info:
This issue exists also in Pipelines, but we track this in another bug to backport that issue.
As a user, I want to see the latest build status in the Build list similar to Pipelines.
SW samples: https://github.com/openshift/console/tree/master/frontend/packages/shipwright-plugin/samples
The shipwright-plugin already contains code to render a status, age, and duration. PTAL: https://github.com/openshift/console/tree/master/frontend/packages/shipwright-plugin/src/components
For Pipelines we later switched from "getting the related PipelineRuns for each row" to a more performant solution that "loads all PipelineRuns" and then filters them on the client side. See https://github.com/openshift/console/pull/12071 - we expect that we should do something similar here.
When multiple rows request the same API (get all PipelineRuns) our useK8sResource hook is smart enough to make just one API call.
To find all OpenShift Builds for one OpenShift BuildConfig they need to be filtered by the label openshift.io/build-config.name=build.metadata.name
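For illustration, this is roughly the label the filter would key on (the Build and BuildConfig names are hypothetical):
apiVersion: build.openshift.io/v1
kind: Build
metadata:
  name: example-app-3
  labels:
    openshift.io/build-config.name: example-app   # matches the owning BuildConfig's metadata.name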
As a user, I want to see similar information at similar places for the 3 different Build types.
As a user, I want to see the latest build status in the Build list similar to Pipelines.
You might be able to improve on this code review: https://github.com/openshift/console/pull/12809#pullrequestreview-1471632918
As a user, I want to see the output image of a Shipwright Build on the list page. Before 4.13 the Developer console showed the Build output (full image string) and the Build status.message.
With 4.14 we show the latest BuildRun name, status, start time, and duration. But the image output is still interesting. See https://redhat-internal.slack.com/archives/C050MAQKD1A/p1688378025053659?thread_ts=1688371150.047769&cid=C050MAQKD1A
For example:
Telecommunications providers continue to deploy OpenShift at the Far Edge. The acceleration of this adoption and the nature of existing Telecommunication infrastructure and processes drive the need to improve OpenShift provisioning speed at the Far Edge site and the simplicity of preparation and deployment of Far Edge clusters, at scale.
A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.
requirement | Notes | isMvp? |
Telecommunications Service Provider Technicians will be rolling out OCP w/ a vDU configuration to new Far Edge sites, at scale. They will be working from a service depot where they will pre-install/pre-image a set of Far Edge servers to be deployed at a later date. When ready for deployment, a technician will take one of these generic-OCP servers to a Far Edge site, enter the site specific information, wait for confirmation that the vDU is in-service/online, and then move on to deploy another server to a different Far Edge site.
Retail employees in brick-and-mortar stores will install SNO servers and it needs to be as simple as possible. The servers will likely be shipped to the retail store, cabled and powered by a retail employee and the site-specific information needs to be provided to the system in the simplest way possible, ideally without any action from the retail employee.
Q: how challenging will it be to support multi-node clusters with this feature?
< What does the person writing code, testing, documenting need to know? >
< Are there assumptions being made regarding prerequisites and dependencies?>
< Are there assumptions about hardware, software or people resources?>
< Are there specific customer environments that need to be considered (such as working with existing h/w and software)?>
< Are there Upgrade considerations that customers need to account for or that the feature should address on behalf of the customer?>
<Does the Feature introduce data that could be gathered and used for Insights purposes?>
< What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)? >
< What does success look like?>
< Does this feature have doc impact? Possible values are: New Content, Updates to existing content, Release Note, or No Doc Impact>
< If unsure and no Technical Writer is available, please contact Content Strategy. If yes, complete the following.>
< Which other products and versions in our portfolio does this feature impact?>
< What interoperability test scenarios should be factored by the layered product(s)?>
Question | Outcome |
Configure the static IP (during the initial "factory" installation) with nmstate (see the example after this list).
Set the machine network to point to the network of this IP
Add node_ip hint according to the machine network. (done automatically when using assisted/ABI)
Remove all current hacks (adding the env overrides to crio and kubelet)
Check whether the network manager pre-up script is still required.
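A minimal nmstate sketch of the kind of static-IP configuration referred to above; the interface name, addresses, and DNS server are illustrative:
interfaces:
- name: eno1
  type: ethernet
  state: up
  ipv4:
    enabled: true
    dhcp: false
    address:
    - ip: 192.168.122.10
      prefix-length: 24
routes:
  config:
  - destination: 0.0.0.0/0
    next-hop-address: 192.168.122.1
    next-hop-interface: eno1
dns-resolver:
  config:
    server:
    - 192.168.122.1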
Context
https://docs.google.com/document/d/1Ywi-llZbOt-YUmqx7I6jWQP_Rss4eM-uoYJwD7Z0fh0/edit
https://github.com/loganmc10/openshift-edge-installer/blob/main/edge/docs/RELOCATABLE.md
The openshift-service-ca service-ca pod takes a few minutes to start when installing SNO.
kubectl get events -n openshift-service-ca --sort-by='.metadata.creationTimestamp' -o custom-columns=FirstSeen:.firstTimestamp,LastSeen:.lastTimestamp,Count:.count,From:.source.component,Type:.type,Reason:.reason,Message:.message
FirstSeen              LastSeen               Count  From                   Type     Reason             Message
2023-01-22T12:25:58Z   2023-01-22T12:25:58Z   1      deployment-controller  Normal   ScalingReplicaSet  Scaled up replica set service-ca-6dc5c758d to 1
2023-01-22T12:26:12Z   2023-01-22T12:27:53Z   9      replicaset-controller  Warning  FailedCreate       Error creating: pods "service-ca-6dc5c758d-" is forbidden: error fetching namespace "openshift-service-ca": unable to find annotation openshift.io/sa.scc.uid-range
2023-01-22T12:27:58Z   2023-01-22T12:27:58Z   1      replicaset-controller  Normal   SuccessfulCreate   Created pod: service-ca-6dc5c758d-k7bsd
2023-01-22T12:27:58Z   2023-01-22T12:27:58Z   1      default-scheduler      Normal   Scheduled          Successfully assigned openshift-service-ca/service-ca-6dc5c758d-k7bsd to master1
It seems that creating the service-ca namespace early allows it to get the openshift.io/sa.scc.uid-range annotation and start running earlier. The service-ca pod is required for other pods (the CVO and all the control plane pods) to start, since it creates the serving-cert.
Description of problem:
The bootkube scripts spend ~1 minute failing to apply manifests while waiting for the openshift-config namespace to get created.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
100%
Steps to Reproduce:
1. Run the POC using the makefile here: https://github.com/eranco74/bootstrap-in-place-poc
2. Observe the bootkube logs (pre-reboot)
Actual results:
Jan 12 17:37:09 master1 cluster-bootstrap[5156]: Failed to create "0000_00_cluster-version-operator_01_adminack_configmap.yaml" configmaps.v1./admin-acks -n openshift-config: namespaces "openshift-config" not found
....
Jan 12 17:38:27 master1 cluster-bootstrap[5156]: "secret-initial-kube-controller-manager-service-account-private-key.yaml": failed to create secrets.v1./initial-service-account-private-key -n openshift-config: namespaces "openshift-config" not found
Here are the logs from another installation showing that it's not 1 or 2 manifests that require this namespace to get created earlier:
Jan 12 17:38:10 master1 bootkube.sh[5121]: "etcd-ca-bundle-configmap.yaml": failed to create configmaps.v1./etcd-ca-bundle -n openshift-config: namespaces "openshift-config" not found
Jan 12 17:38:10 master1 bootkube.sh[5121]: "etcd-client-secret.yaml": failed to create secrets.v1./etcd-client -n openshift-config: namespaces "openshift-config" not found
Jan 12 17:38:10 master1 bootkube.sh[5121]: "etcd-metric-client-secret.yaml": failed to create secrets.v1./etcd-metric-client -n openshift-config: namespaces "openshift-config" not found
Jan 12 17:38:10 master1 bootkube.sh[5121]: "etcd-metric-serving-ca-configmap.yaml": failed to create configmaps.v1./etcd-metric-serving-ca -n openshift-config: namespaces "openshift-config" not found
Jan 12 17:38:10 master1 bootkube.sh[5121]: "etcd-metric-signer-secret.yaml": failed to create secrets.v1./etcd-metric-signer -n openshift-config: namespaces "openshift-config" not found
Jan 12 17:38:10 master1 bootkube.sh[5121]: "etcd-serving-ca-configmap.yaml": failed to create configmaps.v1./etcd-serving-ca -n openshift-config: namespaces "openshift-config" not found
Jan 12 17:38:10 master1 bootkube.sh[5121]: "etcd-signer-secret.yaml": failed to create secrets.v1./etcd-signer -n openshift-config: namespaces "openshift-config" not found
Jan 12 17:38:10 master1 bootkube.sh[5121]: "kube-apiserver-serving-ca-configmap.yaml": failed to create configmaps.v1./initial-kube-apiserver-server-ca -n openshift-config: namespaces "openshift-config" not found
Jan 12 17:38:10 master1 bootkube.sh[5121]: "openshift-config-secret-pull-secret.yaml": failed to create secrets.v1./pull-secret -n openshift-config: namespaces "openshift-config" not found
Jan 12 17:38:10 master1 bootkube.sh[5121]: "openshift-install-manifests.yaml": failed to create configmaps.v1./openshift-install-manifests -n openshift-config: namespaces "openshift-config" not found
Jan 12 17:38:10 master1 bootkube.sh[5121]: "secret-initial-kube-controller-manager-service-account-private-key.yaml": failed to create secrets.v1./initial-service-account-private-key -n openshift-config: namespaces "openshift-config" not found
Expected results:
Expected the resources to be created successfully without having to wait for the namespace to be created.
Additional info:
Description of problem:
When installing SNO with bootstrap in place the cluster-policy-controller hangs for 6 minutes waiting for the lease to be acquired.
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Run the PoC using the makefile here: https://github.com/eranco74/bootstrap-in-place-poc
2. Observe the cluster-policy-controller logs post reboot
Actual results:
I0530 16:01:18.011988 1 leaderelection.go:352] lock is held by leaderelection.k8s.io/unknown and has not yet expired
I0530 16:01:18.012002 1 leaderelection.go:253] failed to acquire lease kube-system/cluster-policy-controller-lock
I0530 16:07:31.176649 1 leaderelection.go:258] successfully acquired lease kube-system/cluster-policy-controller-lock
Expected results:
Expected the bootstrap cluster-policy-controller to release the lease so that the cluster-policy-controller running post-reboot won't have to wait for the lease to expire.
Additional info:
Suggested resolution for bootstrap in place: https://github.com/openshift/installer/pull/7219/files#diff-f12fbadd10845e6dab2999e8a3828ba57176db10240695c62d8d177a077c7161R44-R59
Description of problem:
While trying to figure out why it takes so long to install Single Node OpenShift, I noticed that the kube-controller-manager cluster operator is degraded for ~5 minutes due to:
GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp 172.30.119.108:9091: connect: connection refused
I don't understand how the prometheusClient is successfully initialized, but we get a connection refused once we try to query the rules. Note that if the client initialization fails, the kube-controller-manager won't set GarbageCollectorDegraded to true.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
100%
Steps to Reproduce:
1. Install SNO with bootstrap in place (https://github.com/eranco74/bootstrap-in-place-poc)
2. Monitor the cluster operators status
Actual results:
GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp 172.30.119.108:9091: connect: connection refused
Expected results:
Expected the GarbageCollectorDegraded status to be false
Additional info:
It seems that for the PrometheusClient to be successfully initialised it needs to successfully create a connection, but we get connection refused once we make the query. Note that installing SNO with this patch (https://github.com/eranco74/cluster-kube-controller-manager-operator/commit/26e644503a8f04aa6d116ace6b9eb7b9b9f2f23f) reduces the installation time by 3 minutes.
To give Telco Far Edge customers as much of the product support lifespan as possible, we need to ensure that OCP releases are "telco ready" when the OCP release is GA.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
No documentation required
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
Implement the hwlatdetect test in the openshift-test binary under the openshift/nodes/realtime test suite
https://docs.google.com/document/d/13Db7uChVx-2JXqAMJMexzHbhG3XLNLRy9nZ_7g9WbFU/edit#
* Enable setting node labels on spoke cluster during installation
Modify the scripts in assisted-service/deploy/operator/ztp.
The following environment variables will be added:
MANIFESTS: JSON containing the manifests to be added for day1 flow. The key is the file name, and the value is the content.
NODE_LABELS: Dictionary of dictionaries. The Outer dictionary key is the node name and the value is the node labels (key, value) to be applied.
MACHINE_CONFIG_POOL: Dictionary of strings. The key is the node name and the value is machine config pool name.
SPOKE_WORKER_AGENTS: Number of worker nodes to be added as part of day1 installation. Default 0
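The deploy scripts themselves are shell, but the shapes of these variables are easier to see spelled out; a small sketch of how NODE_LABELS and MACHINE_CONFIG_POOL decode (the example JSON values in the comments are invented):

// Sketch of the data shapes described above; the example JSON values are invented.
package sketch

import (
    "encoding/json"
    "os"
)

type ztpConfig struct {
    // NODE_LABELS: outer key = node name, inner map = labels (key, value) to apply.
    NodeLabels map[string]map[string]string
    // MACHINE_CONFIG_POOL: node name -> machine config pool name.
    MachineConfigPool map[string]string
}

func loadZTPConfig() (ztpConfig, error) {
    var cfg ztpConfig
    // Example: NODE_LABELS='{"worker-0":{"node-role.kubernetes.io/infra":""}}'
    if v := os.Getenv("NODE_LABELS"); v != "" {
        if err := json.Unmarshal([]byte(v), &cfg.NodeLabels); err != nil {
            return cfg, err
        }
    }
    // Example: MACHINE_CONFIG_POOL='{"worker-0":"infra"}'
    if v := os.Getenv("MACHINE_CONFIG_POOL"); v != "" {
        if err := json.Unmarshal([]byte(v), &cfg.MachineConfigPool); err != nil {
            return cfg, err
        }
    }
    return cfg, nil
}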
Reduce the OpenShift platform and associated RH provided components to a single physical core on Intel Sapphire Rapids platform for vDU deployments on SingleNode OpenShift.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
Provide a mechanism to tune the platform to use only one physical core. | Users need to be able to tune different platforms. | YES |
Allow for full zero touch provisioning of a node with the minimal core budget configuration. | Node provisioned with SNO Far Edge provisioning method - i.e. ZTP via RHACM, using DU Profile. | YES |
Platform meets all MVP KPIs | | YES |
Questions to be addressed:
Latest status as of 4.14 freeze:
The MCD no longer uses MachineConfigs to update certs, but rather reads it off our internal resource "controllerconfig" directly. The MachineConfig path still exists but is a no-op (although the MCO still falsely claims an update is pending as a result). The MachineConfig removal work is ready, but waiting for windows-MCO to change their workflow so as to not break them.
--------------------------------
The logic for handling certificate rotation should live outside of the MachineConfig-files path as it stands today. This will allow certs to rotate live, through paused pools, without generating additional churn in rendered configs; most, if not all, certificates do not require drains or reboots of the node.
Context
The MCO has, since the beginning of time, managed certificates. The general flow is a cluster configmap -> MCO -> controllerconfig -> MCC -> renderedconfig -> MCD -> laid down to disk as a regular file.
When we talk about certs, the MCD actually manages 4 (originally 5) certs: see https://docs.google.com/document/d/1ehdOYDY-SvUU9ffdIKlt7XaoMaZ0ioMNZMu31-Mo1l4/edit (this document is a bit outdated)
Of these, the only one we care about is "/etc/kubernetes/kubelet-ca.crt", which is a bundle of 5 (now 7) certs. This will be expanded on below.
Unlike regular files though, certificates rotate automatically at some set cadence. Prior to 4.7, this would cause the MCD to seemingly randomly start an update and reboot nodes, much to the annoyance of customers, so we made it disruptionless.
There was still one more problem: a lot of users pause pools for additional safety (their way of saying "we don't want you to disrupt our workloads"), which still gated the certificate from actually rotating in when it updated. In 4.12 and previous versions, this means that at 80% of the 1 year mark, a new kube-apiserver-to-kubelet-signer cert would be generated. After ~12 hours, this would affect some operations (oc logs, etc.) since the old signer no longer matched the apiserver's new cert. At the one year mark, this would proceed to break the kubelet entirely. To combat this, we added an alert, MachineConfigControllerPausedPoolKubeletCA, to warn users about the effects and expiry, which was acceptable since this should only be an annual occurrence.
Updates for 4.13
In 4.13, we realized that the kubelet-ca cert was being read from a wrong location, which updated the kube-apiserver-to-kubelet-signer mentioned above but not some other certs. This was not a problem since nobody was depending on them, but in 4.13 monitoring was updated to use the right certs, which subsequently caused KubeletDown reports to fire, which David Eads fixed via https://github.com/openshift/machine-config-operator/pull/3458
So now instead of expired certs we have correct certs, which is great, but we also realized that cert rotation will happen much more frequently.
Previously on the system, we had:
admin-kubeconfig-signer, kubelet-signer, kube-apiserver-to-kubelet-signer, kube-control-plane-signer, kubelet-bootstrap-kubeconfig-signer
now with the correct certs, right after install we get: admin-kubeconfig-signer, kube-csr-signer_@1675718562, kubelet-signer, kube-apiserver-to-kubelet-signer, kube-control-plane-signer, kubelet-bootstrap-kubeconfig-signer, openshift-kube-apiserver-operator_node-system-admin-signer@1675718563
The most immediate issue was bootstrap drift, which John solved via https://github.com/openshift/machine-config-operator/pull/3513
But the issue now is that we are updating two certs:
Meaning that every month we would be generating at least 2 new machineconfigs (new one rotating in, old one rotating out) to manage this.
During install, due to how the certs are set up (bootstrap ones expire in 24h) this means you get 5 MCs within 24 hours: bootstrap bundle, incluster bundle, incluster bundle with 1 new, incluster bundle with 2 new, incluster bundle with 2 new 2 old removed
On top of this, previously the cluster chugged along past the expiry with only the warning, but now, when the old certs rotate and the pools are paused, TargetDown and KubeletDown fire after a few hours, which is very bad from a user perspective.
Solutions
Solution1: don't do anything
Nothing should badly break, but the user will get critical alerts after ~1 month if they pause and upgrade to 4.13. Not a great UX
Solution2: revert the monitoring change or mask the alert
A bit late, but potentially doable? Masking the alert will likely mask real issues, though
Solution3: MVP MCD changes (Estimate: 1week)
The MCD update, MCD verification, MCD config drift monitor all ignore the kubelet-ca cert file. The MCD gets a new routine to update the file, reading from a configmap the MCC manages. The MCC still renders the cert but the cert will be updated even if the pool is paused
Solution4: MVP MCC changes (Estimate: a few days)
Have the controller splice in changes even when the pool is paused. John has an MVP here: https://github.com/openshift/machine-config-operator/compare/master...jkyros:machine-config-operator:mco-77-bypass-pause
This is a cleaner solution compared to 3, but will cause the pool to go into updating briefly. If there are other operations causing nodes to be cordoned, etc., we would have to consider overriding that.
Solution5: MCD cert management path (full, Estimate: 1 sprint)
The cert is removed from the rendered-config. The MCC will read it off the controllerconfig and render it into a custom configmap. The MCS will add this additional file when serving content, but it is not part of the rendered-MC otherwise. The MCD will have a new routine to manage the certs live directly.
The bootstrap MCS will also need to have a way to render it into the initial served configuration without it being part of the MachineConfigs (this is especially important for HyperShift). We will have to make sure the inplace updater doesn't break
We may also have to solve config drift problems from bootstrap to incluster, for self-driving and hypershift inplace
We also have to make sure the file isn't deleted upon an update to the new management, so the certs don't disappear for a while, since the MCD would otherwise see the diff and delete the file.
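For orientation only, Solution 5's MCD side boils down to a small routine that watches the configmap the MCC renders and writes the bundle straight to disk, outside any rendered MachineConfig. A very rough sketch; the configmap name and key are hypothetical, and the real implementation lives in the MCO:

// Very rough sketch of an MCD-side routine that keeps kubelet-ca.crt in sync
// with a configmap rendered by the MCC. The configmap name and data key are
// hypothetical placeholders.
package sketch

import (
    "os"
    "time"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/cache"
)

const kubeletCAPath = "/etc/kubernetes/kubelet-ca.crt"

func runCertSync(client kubernetes.Interface, stopCh <-chan struct{}) {
    factory := informers.NewSharedInformerFactoryWithOptions(client, 10*time.Minute,
        informers.WithNamespace("openshift-machine-config-operator"))
    inf := factory.Core().V1().ConfigMaps().Informer()
    write := func(obj interface{}) {
        cm, ok := obj.(*corev1.ConfigMap)
        if !ok || cm.Name != "kubelet-ca-bundle" { // hypothetical configmap name
            return
        }
        if bundle, ok := cm.Data["ca-bundle.crt"]; ok { // hypothetical key
            // Live update: no rendered-config churn, no drain, no reboot.
            _ = os.WriteFile(kubeletCAPath, []byte(bundle), 0o644)
        }
    }
    inf.AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc:    write,
        UpdateFunc: func(_, newObj interface{}) { write(newObj) },
    })
    factory.Start(stopCh)
    factory.WaitForCacheSync(stopCh)
    <-stopCh
}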
DOCS (WIP)
https://docs.google.com/document/d/1qXYV9Hj98QhJSKx_2IWBbU_bxu30YQtbR21mmOtBsIg/edit?usp=sharing
Although we are removing the config from the MachineConfig, Ignition (both in bootstrap and in-cluster) still needs to be generated with the certs so that nodes can join the cluster.
We will need the in-cluster MCS to read from controllerconfig, and the bootstrap MCS (during install time) to be able to remove it from the MachineConfigs, to ensure no drift when the master nodes come up.
Once we finish the new method to manage certs, we should extend it to also manage image registry certs, although that is not required for 4.14
It really hurts to have to ask customers to collect on-disk files for us, and when we do this certificate work there is the possibility we will need to chase more race-condition or rendering mismatch issues, so let's see if we can get collection of mcs-machine-config-content.json (for bootstrap mismatch) and maybe currentconfig (for those pesky validation issues) added to the must-gather.
Description of problem:
After running tests on an SNO with Telco DU profile for a couple of hours kubernetes.io/kubelet-serving CSRs in Pending state start showing up and accumulating in time.
Version-Release number of selected component (if applicable):
4.14.0-ec.3
How reproducible:
So far on 2 different environments
Steps to Reproduce:
1. Deploy SNO with Telco DU profile
2. Run system tests
3. Check CSRs status
Actual results:
oc get csr | grep Pending | wc -l
34
Expected results:
No Pending CSRs
Additional info:
This issue blocks retrieving pod logs. Attaching must-gather and sosreport after manually approving CSRs.
This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were completed when this image was assembled
There are modules shared between the Console application and its dynamic plugins, as configured in
packages/console-dynamic-plugin-sdk/src/shared-modules.ts
For modules configured as "allowFallback: false" (default setting) we should validate the Console provided version range vs. plugin consumed version at webpack build time.
This allows us to detect potential compatibility problems in shared modules (i.e. plugin is built against a different version than what is provided by Console at runtime) when building dynamic plugins.
AC: Add validation for our shared modules of dynamic plugins
We are missing the DeleteModal component in our console-dynamic-plugin-sdk, so we need to copy it when building a dynamic plugin.
AC:
We are missing the AnnotationsModal component and the functions handling its input (e.g. onAnnotationSubmit) in our console-dynamic-plugin-sdk, so we need to copy it when building a dynamic plugin.
AC:
Currently there is no good way for plugins to get the active namespace outside of resource pages. We should expose useActiveNamespace to support this. (useActiveNamespace is only exposed in the internal API.)
This seems important to pair with NamespaceBar since it's unclear how to get the initial namespace from NamespaceBar. This is borderline a bug since it's not clear how to use NamespaceBar without it. We should consider it for 4.12.
AC:
One of the requirements for adopting OpenShift Dynamic Plugin SDK (which is the new version of HAC Core SDK) is to bump the version of react-router to version 6.
For migration from v5 to v6 there is a `react-router-dom-v5-compat` package which should ease the migration process.
AC: Install the `react-router-dom-v5-compat` package into console repo and test for any regressions.
Remove code that was added through the ACM integration from all of the console's codebase repositories
Since the decision was made to stop the ACM integration, we as a team decided that it would be better to remove the unused code in order to avoid any confusion or regressions.
Scour through the console repo and mark all multicluster-related code for removal by adding a "TODO remove multicluster" comment.
AC:
Revert "copiedCSVsDisabled" and "clusters" server flag changes in front and backend code.
AC:
One of the requirements for adopting OpenShift Dynamic Plugin SDK (which is the new version of HAC Core SDK) is to bump the version of react-router to version 6. With Console PR #12861 merged, both Console web application and its dynamic plugins should now be able to start migrating from React Router v5 to v6.
As a team we decided that we are going to split the work per package, but for the core console we will split the work into standalone stories based on the migration strategy.
Console will keep supporting React Router v5 for two releases (end of 4.15) as per CONSOLE-3662.
How to prepare your dynamic plugin for React Router v5 to v6 migration:
[0] bump @openshift-console/dynamic-plugin-sdk-webpack dependency to 0.0.10
* this release adds react-router-dom-v5-compat to Console provided shared modules
[1] (optional but recommended) bump react-router and react-router-dom dependencies to v5 latest
* Console provided shared module version of react-router and react-router-dom is 5.3.4
[2] add react-router-dom-v5-compat dependency
* Console provided shared module version of react-router-dom-v5-compat is 6.11.2
[3] start migrating to React Router v6 APIs
* v5 code is imported from react-router or react-router-dom
[4] (optional but recommended) use appropriate TypeScript typings for react-router and react-router-dom
* Console uses @types/react-router version 5.1.20 and @types/react-router-dom version 5.3.3
One of the steps of the migration strategy is to start using the v6 API for components that are passed to the `CompatRoute` component, based on the [migration strategy guide|https://github.com/remix-run/react-router/discussions/8753]. This route now has both v5 and v6 routing contexts, so we can start migrating component code to v6.
If the component is a class component, you'll need to convert it to a function component first so that you can use hooks.
The following files in frontend/public/components/RBAC contain components that need to use the v6 useNavigate hook, requiring them to be converted from a class component to a functional component:
AC: Listed components in frontend/public/components/RBAC rewritten from class component to functional component.
One of the steps of the migration strategy is to start using the v6 API for components that are passed to the `CompatRoute` component, based on the [migration strategy guide|https://github.com/remix-run/react-router/discussions/8753]. This route now has both v5 and v6 routing contexts, so we can start migrating component code to v6.
If the component is a class component, you'll need to convert it to a function component first so that you can use hooks.
The StorageClassFormWithTranslation component in /frontend/public/components/storage-class-form.tsx needs to use the v6 useNavigate hook, which requires it to be converted from a class component to a functional component.
AC: StorageClassFormWithTranslation component in storage-class-form.tsx is rewritten from class component to functional component.
Splitting off tile-view-page.jsx from CONSOLE-3687 into a separate story.
One of the steps of the migration strategy is to start using the v6 API for components that are passed to the `CompatRoute` component, based on the migration strategy guide. This route now has both v5 and v6 routing contexts, so we can start migrating component code to v6.
If the component is a class component, you'll need to convert it to a function component first so that you can use hooks.
frontend/public/components/utils/tile-view-page.jsx contains a component that needs to use the v6 useNavigate hook, requiring it to be converted from a class component to a functional component.
AC: tile-view-page.jsx rewritten from class component to functional component.
One of the steps of the migration strategy is to start using the v6 API for components that are passed to the `CompatRoute` component, based on the [migration strategy guide|https://github.com/remix-run/react-router/discussions/8753]. This route now has both v5 and v6 routing contexts, so we can start migrating component code to v6.
If the component is a class component, you'll need to convert it to a function component first so that you can use hooks.
The TemplateForm_ component in /frontend/public/components/instantiate-template.tsx needs to use the v6 useNavigate hook, which requires it to be converted from a class component to a functional component.
AC: TemplateForm_ component in instantiate-template.tsx is rewritten from class component to functional component.
One of the steps of the migration strategy is to start using the v6 API for components that are passed to the `CompatRoute` component, based on the [migration strategy guide|https://github.com/remix-run/react-router/discussions/8753]. This route now has both v5 and v6 routing contexts, so we can start migrating component code to v6.
If the component is a class component, you'll need to convert it to a function component first so that you can use hooks.
The following files in frontend/public/components/cluster-settings contain components that need to use the v6 useNavigate hook, requiring them to be converted from a class component to a functional component:
AC: Listed components in frontend/public/components/cluster-settings rewritten from class component to functional component.
One of the steps of the migration strategy is to start using the v6 API for components that are passed to the `CompatRoute` component, based on the [migration strategy guide|https://github.com/remix-run/react-router/discussions/8753]. This route now has both v5 and v6 routing contexts, so we can start migrating component code to v6.
If the component is a class component, you'll need to convert it to a function component first so that you can use hooks.
The EditYAML component in /frontend/public/components/edit-yaml.jsx needs to use the v6 useNavigate hook, which requires it to be converted from a class component to a functional component.
AC: EditYAML component in edit-yaml.jsx is rewritten from class component to functional component.
One of the steps of the migration strategy is to start using the v6 API for components that are passed to the `CompatRoute` component, based on the [migration strategy guide|https://github.com/remix-run/react-router/discussions/8753]. This route now has both v5 and v6 routing contexts, so we can start migrating component code to v6.
If the component is a class component, you'll need to convert it to a function component first so that you can use hooks.
The App component in /frontend/public/components/app.jsx needs to use the v6 useLocation hook, which requires it to be converted from a class component to a functional component.
AC: App component in app.jsx is rewritten from class component to functional component.
One of the steps of the migration strategy is to start using the v6 API for components that are passed to the `CompatRoute` component, based on the [migration strategy guide|https://github.com/remix-run/react-router/discussions/8753]. This route now has both v5 and v6 routing contexts, so we can start migrating component code to v6.
If the component is a class component, you'll need to convert it to a function component first so that you can use hooks.
The CheckBoxes_ component in /frontend/public/components/row-filter.jsx needs to use the v6 useNavigate hook, which requires it to be converted from a class component to a functional component.
AC: CheckBoxes_ component in row-filter.jsx is rewritten from class component to functional component.
AC: CheckBoxes_ component is removed from the codebase.
One of the steps of the migration strategy is to start using the v6 API for components that are passed to the `CompatRoute` component, based on the migration strategy guide. This route now has both v5 and v6 routing contexts, so we can start migrating component code to v6.
If the component is a class component, you'll need to convert it to a function component first so that you can use hooks.
The following files in frontend/public/components/utils contain components that need to use the v6 useNavigate hook, requiring them to be converted from a class component to a functional component:
AC: Listed components in frontend/public/components/utils rewritten from class component to functional component.
One of the steps of the migration strategy is to start using the v6 API for components that are passed to the `CompatRoute` component, based on the [migration strategy guide|https://github.com/remix-run/react-router/discussions/8753]. This route now has both v5 and v6 routing contexts, so we can start migrating component code to v6.
If the component is a class component, you'll need to convert it to a function component first so that you can use hooks.
The following files in frontend/public/components/modals contain components that need to use the v6 useNavigate hook, requiring them to be converted from a class component to a functional component:
AC: Listed components in frontend/public/components/modals rewritten from class component to functional component.
One of the steps of the migration strategy is to start using the v6 API for components that are passed to the `CompatRoute` component, based on the [migration strategy guide|https://github.com/remix-run/react-router/discussions/8753]. This route now has both v5 and v6 routing contexts, so we can start migrating component code to v6.
If the component is a class component, you'll need to convert it to a function component first so that you can use hooks.
The SecretFormWrapper component in /frontend/public/components/secrets/create-secret.tsx needs to use the v6 useNavigate hook, which requires it to be converted from a class component to a functional component.
AC: SecretFormWrapper component in create-secret.tsx is rewritten from class component to functional component.
One of the steps of the migration strategy is to start using the v6 API for components that are passed to the `CompatRoute` component, based on the [migration strategy guide|https://github.com/remix-run/react-router/discussions/8753]. This route now has both v5 and v6 routing contexts, so we can start migrating component code to v6.
If the component is a class component, you'll need to convert it to a function component first so that you can use hooks.
The EventStream component in /frontend/public/components/events.jsx needs to use the v6 useParams hook, which requires it to be converted from a class component to a functional component.
AC: EventStream component in events.jsx is rewritten from class component to functional component.
One of the steps of the migration strategy is to start using the v6 API for components that are passed to the `CompatRoute` component, based on the [migration strategy guide|https://github.com/remix-run/react-router/discussions/8753]. This route now has both v5 and v6 routing contexts, so we can start migrating component code to v6.
If the component is a class component, you'll need to convert it to a function component first so that you can use hooks.
The FireMan component in /frontend/public/components/factory/list-page.tsx needs to use the v6 useParams and useLocation hooks, which requires it to be converted from a class component to a functional component.
AC: FireMan component in list-page.tsx is rewritten from class component to functional component.
Authors: Igal Tsoiref, Riccardo Piccoli, Liat Gamliel
Analysis document: AI Events: RHOSAK vs RH Pipelines
Description of the problem:
Documentation for the ignore validation API should be updated with the correct JSON string arrays:
While it should be:
{ "host-validation-ids": "[\"all\"]", "cluster-validation-ids": "[\"all\"]" }
How reproducible:
Steps to reproduce:
1.
2.
3.
Actual results:
Expected results:
Description of the problem:
In staging, BE 2.17.0 - Ignore validation API has no validation for the values sent. For example:
curl -X 'PUT' 'https://api.stage.openshift.com/api/assisted-install/v2/clusters/be4cdbef-7ea6-48f6-a30a-d1169eeb38fb/ignored-validations' \
  --header "Authorization: Bearer $(ocm token)" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{ "host-validation-ids": "[\"testTest\",\"HasCPUCoresForRole\"]", "cluster-validation-ids": "[]" }'
Stores:
{"cluster-validation-ids":"[]","host-validation-ids":"[\"testTest\",\"HasCPUCoresForRole\"]"}
How reproducible:
100%
Steps to reproduce:
1.
2.
3.
Actual results:
Expected results:
Description of the problem:
In BE 2.16.0 Staging - while a cluster is in the installed or installing state, the ignore validation API still changes the validations; this should be blocked.
How reproducible:
100%
Steps to reproduce:
1. send this call to installed cluster
curl -i -X PUT 'https://api.stage.openshift.com/api/assisted-install/v2/clusters/${cluster_id}/ignored-validations' \
  --header "Authorization: Bearer $(ocm token)" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{"host-validation-ids": "[\"all\"]", "cluster-validation-ids": "[\"all\"]"}'
2. Cluster validation is changed
3.
Actual results:
Expected results:
1. Proposed title of this feature request
Delete worker nodes using GitOps / ACM workflow
2. What is the nature and description of the request?
We use siteConfig to deploy a cluster using the GitOps / ACM workflow. We can also use siteConfig to add worker nodes to an existing cluster. However, today we cannot delete a worker node using the GitOps / ACM workflow. We need to go and manually delete the resources (BMH, nmstateConfig, etc.) and the OpenShift node. We would like to have the node deleted as part of the GitOps workflow.
3. Why does the customer need this? (List the business requirements here)
Worker nodes may need to be replaced for any reason (hardware failures) which may require deletion of a node.
If we are colocating OpenShift and OpenStack control planes on the same infrastructure (using OpenStack director operator to create OpenStack control plane in OCP virtualization), then we also have the use case of assigning baremetal nodes as OpenShift worker nodes or OpenStack compute nodes. Over time we may need to change the role of those baremetal nodes (from worker to compute or from compute to worker). Having the ability to delete worker nodes via GitOps will make it easier to automate that use case.
4. List any affected packages or components.
ACM, GitOps
In order to cleanly remove a node without interrupting existing workloads it should be cordoned and drained before it is powered off.
This should be handled by BMAC and should not interrupt processing of other requests. The best implementation I could find so far is in the kubectl code, but using that directly is a bit problematic as the call waits for all the pods to be stopped or evicted before returning. There is a timeout, but then we have to either give up after one call and remove the node anyway, or track multiple calls to drain across multiple reconciles.
We should come up with a way to drain asynchronously (maybe investigate what CAPI does).
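As sketched below (a hedged example, not the BMAC implementation), one way to approximate asynchronous draining with the kubectl drain helper is to give each attempt a short timeout and let the reconcile loop requeue until the drain reports success:

// Sketch: bounded drain attempt suitable for re-queueing from a reconcile loop.
// Field values are illustrative; cordoning and error handling are simplified.
package sketch

import (
    "context"
    "os"
    "time"

    "k8s.io/client-go/kubernetes"
    "k8s.io/kubectl/pkg/drain"
)

// tryDrain returns (done, err). On a timeout the caller requeues and calls it
// again; evictions that already started continue server-side in the meantime.
func tryDrain(ctx context.Context, client kubernetes.Interface, nodeName string) (bool, error) {
    helper := &drain.Helper{
        Ctx:                 ctx,
        Client:              client,
        Force:               true,
        IgnoreAllDaemonSets: true,
        DeleteEmptyDirData:  true,
        GracePeriodSeconds:  -1,               // use each pod's own grace period
        Timeout:             30 * time.Second, // keep every attempt short so other requests aren't blocked
        Out:                 os.Stdout,
        ErrOut:              os.Stderr,
    }
    if err := drain.RunNodeDrain(helper, nodeName); err != nil {
        return false, err // not fully drained yet; requeue and retry on the next reconcile
    }
    return true, nil
}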
We should allow for users to control removing the spoke node using resources on the hub.
For the ZTP-gitops case, this needs to be the BMH as they are not aware of the agent resource.
The user will add an annotation to the BMH to indicate that they want us to manage the lifecycle of the spoke node based on the BMH. Then, when the BMH is deleted we will clean the host and remove it from the spoke cluster.
Description of the problem:
In staging, UI 2.19.6 - in the new cluster events view, the number of events is shown as "1-10 of NaN" instead of the real number
How reproducible:
100%
Steps to reproduce:
1.
2.
3.
Actual results:
Expected results:
After deprecating the old API and making sure the UI no longer uses it, remove the following endpoint and definitions:
/v2/feature-support-levels
definitions:
  feature-support-levels:
  feature-support-level:
Description of the problem:
How reproducible:
Steps to reproduce:
1.
2.
3.
Actual results:
Expected results:
Description of the problem:
The method returns an empty object when calling GET v2/support-levels/features?openshift_version=X
How reproducible:
Call GET v2/support-levels/features?openshift_version=4.13
Steps to reproduce:
1. Call GET v2/support-levels/features?openshift_version=4.13
2.
3.
Actual results:
{}
Expected results:
{ FEATURE_A: supported, FEATURE_B: supported ... }
Description of the problem:
Returning Bad Request on feature-support validation is colliding with the multi-platform feature.
Whenever the user sets the CPU architecture to P or Z, the platform is changed to multi, causing loss of information and not failing the cluster registration/update.
How reproducible:
Register a cluster with s390x as CPU architecture on OCP version 4.12
Expected results:
Bad Request
Description of the problem:
Currently, installing a ppc64le cluster with Cluster Managed Networking enabled and a Minimal ISO is not supported.
Steps to reproduce:
1. Create ppc64le cluster with UMN enabled
Actual results:
BadRequest
Expected results:
Created successfully
Add an option that will mark that a feature is not available at all
Create a single place in assisted-service (update/register cluster) where we will return Bad Request in case the feature combination is not supported
Description of the problem:
BE 2.17.4 - (using API calls) creating a new cluster, PATCHing it with OLM operators and then creating a new infra-env with P/Z should be blocked, but is allowed
How reproducible:
100%
Steps to reproduce:
1. Create new cluster
curl -X 'POST' \
  'https://api.stage.openshift.com/api/assisted-install/v2/clusters/' \
  --header "Authorization: Bearer $(ocm token)" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{ "name": "s390xsno2413", "high_availability_mode": "Full", "openshift_version": "4.13", "pull_secret": "${pull_secret}", "base_dns_domain": "redhat.com", "cpu_architecture": "s390x", "disk_encryption": { "mode": "tpmv2", "enable_on": "none" }, "tags": "", "user_managed_networking": true }'
2. Patch with OLM operators
curl -i -X 'PATCH' 'https://api.stage.openshift.com/api/assisted-install/v2/clusters/c05ba143-cf22-44ec-b1fd-edad5d8ca5a9' \
  --header "Authorization: Bearer $(ocm token)" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{ "olm_operators":[{"name":"cnv"},{"name":"lso"},{"name":"odf"}] }'
3. Create infra-env
curl -X 'POST' 'https://api.stage.openshift.com/api/assisted-install/v2/infra-envs' \
  --header "Authorization: Bearer $(ocm token)" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{ "name": "tests390xsno_infra-env2", "pull_secret": "${pull_secret}", "cluster_id": "c05ba143-cf22-44ec-b1fd-edad5d8ca5a9", "openshift_version": "4.13", "cpu_architecture": "s390x" }'
Actual results:
Infra-env created
Expected results:
Should be blocked
Make sure we no longer support the OCP 4.8 and 4.9 releases once they reach EOL (on April 27, 2023).
Installation of OCP 4.8 and 4.9 is no longer possible in any of our envs.
As the Assisted Installer documentation is embedded into the relevant OpenShift releases, no documentation changes are required. Once those version docs are marked deprecated / decommissioned, so are the Assisted Installer parts.
Not relevant.
Numbers don't count for much here, as we're following the official policy for OpenShift. If a customer has any real need for OpenShift 4.8 or 4.9, they can go through the Support Exception process for OpenShift.
Regardless, as of today (Mar. 16, 2023) there's still some usage of OCP 4.8 & 4.9 but it's not very significant:
AFAIK UI shouldn't have any special code/configuration for OCP versions, so implementing the relevant pieces in the backend should suffice.
vSphere platform configuration is a bit different on OCP 4.13.
Changes needed:
Yes
DOD:
Prevent creating a vSphere dual-stack cluster using the feature-support mechanism
Manage the effort for adding jobs for release-ocm-2.8 on assisted installer
https://docs.google.com/document/d/1WXRr_-HZkVrwbXBFo4gGhHUDhSO4-VgOPHKod1RMKng
Merge order:
Update the BUNDLE_CHANNELS in the Makefile in assisted-service and run bundle generation.
Description of the problem:
Update the following Day2 procedure - https://github.com/openshift/assisted-service/blob/master/docs/user-guide/day2-master/411-healthy.md -
How reproducible:
Steps to reproduce:
1.
2.
3.
Actual results:
Expected results:
The API for it is https://github.com/openshift/assisted-service/blob/2bbbcb60eea4ea5a782bde995bdec3dd7dfc1f62/swagger.yaml#L5636
Other assets
https://github.com/openshift/installer/blob/master/docs/user/customization.md
Example
Adding day-1 kernel arguments
Marvel
Description of the problem:
It is possible to create a manifest with a file name like:
ee ll ii aa.yml
How reproducible:
Steps to reproduce:
1. Create a cluster
2. Add a manifest with spaces in the file name
3.
Actual results:
It allows adding the manifest
Expected results:
Even if the BE allows it, we should consider disabling the option (see Slack thread) in the UI and BE
Description of the problem:
In the File name field of the Custom manifest form, there should be description pop-up text that tells the user which type of file needs to be added and the max size
How reproducible:
Steps to reproduce:
1. Create a cluster with a manifest
2. Navigate to the custom manifest wizard step
3. Click on add new manifest
Actual results:
File name label has no further description text
Expected results:
I suggest adding the allowed file types and the max size/length
Description of the problem:
V2CreateClusterManifest should block empty manifests
How reproducible:
100%
Steps to reproduce:
1. POST V2CreateClusterManifest manifest with empty content
Actual results:
Succeeds. Then silently breaks bootkube much later.
Expected results:
API call should fail immediately
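The guard amounts to rejecting content that decodes to nothing; a hedged sketch (whether the content arrives base64-encoded, as assumed here, and the exact error plumbing are illustrative):

// Illustrative guard: reject manifests whose decoded content is empty or
// whitespace-only. The base64 assumption and error types are illustrative.
package sketch

import (
    "encoding/base64"
    "errors"
    "strings"
)

func validateManifestContent(encoded string) error {
    decoded, err := base64.StdEncoding.DecodeString(encoded)
    if err != nil {
        return errors.New("manifest content is not valid base64")
    }
    if len(strings.TrimSpace(string(decoded))) == 0 {
        return errors.New("manifest content must not be empty") // fail the API call immediately
    }
    return nil
}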
Description of the problem:
When installing any cluster without custom manifests, the installation summary page shows names of custom manifests.
It is not clear whether those manifests were added by the customer or by the AI.
How reproducible:
100%
Steps to reproduce:
1. Install a cluster without the custom manifest option checked
2. After the installation completes, check the cluster summary
3.
Actual results:
In the summary, several files are mentioned in the Custom manifest section, presented as custom manifests
Expected results:
The user should be informed which of these custom manifests were added from the UI, and which were not
We are looking into allowing users to rename the manifest file name. Currently this is only possible by issuing DELETE and POST requests, which results in a very bad UX.
We need an API to allow users to change the folder, file name or YAML content of an existing custom manifest.
Discussion about that: https://redhat-internal.slack.com/archives/CUPJTHQ5P/p1675776466170339
When the control plane nodes are under pressure or the apiserver is simply not available, no telemetry data is emitted by the monitoring stack, even though monitoring isn't on the master node and shouldn't have to interact with the control plane in order to push metrics.
This is caused by the fact that today telemeter-client evaluates PromQL expressions on Prometheus via an oauth-proxy endpoint that requires talking to the apiserver for authentication.
After discussing with Simon Pasquier, a potential solution to remove the dependency on the apiserver would be to use mTLS communication between telemeter-client and the Prometheus pods.
Today, there are 3 proxies in the Prometheus pods:
The kube-rbac-proxy exposing the /metrics endpoint could be used by telemeter-client since it is already doing so via mTLS.
Note that this approach would require improving telemeter-client since it doesn't support configuring TLS certs/keys.
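That improvement is mostly plumbing a client certificate and CA into the HTTP client used for the queries; a minimal sketch of such a client (the file paths would be hypothetical pod mounts, and this is not the telemeter-client code):

// Minimal mTLS client sketch; file paths are hypothetical mounts.
package sketch

import (
    "crypto/tls"
    "crypto/x509"
    "net/http"
    "os"
    "time"
)

func newMTLSClient(certFile, keyFile, caFile string) (*http.Client, error) {
    cert, err := tls.LoadX509KeyPair(certFile, keyFile)
    if err != nil {
        return nil, err
    }
    caPEM, err := os.ReadFile(caFile)
    if err != nil {
        return nil, err
    }
    pool := x509.NewCertPool()
    pool.AppendCertsFromPEM(caPEM)
    return &http.Client{
        Timeout: 30 * time.Second,
        Transport: &http.Transport{
            TLSClientConfig: &tls.Config{
                Certificates: []tls.Certificate{cert}, // client cert presented to kube-rbac-proxy
                RootCAs:      pool,                    // trust the serving CA; no apiserver round-trip needed
            },
        },
    }, nil
}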
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is the second part of Customizations for Node Exporter, following https://issues.redhat.com/browse/MON-2848
There are the following tasks remaining:
The "mountstats" collector generates 53 high-cardinality metrics by default, we have to refine the story to choose only the necessary metrics to collect.
Cluster Monitoring Operator uses the configmap "cluster-monitoring-config" in the namespace "openshift-monitoring" as its configuration. These new configurations will be added into the section "nodeExporter".
Node Exporter comes with a set of default activated collectors and optional collectors.
To simplify the configuration, we put a config object for each collector that we allow users to activate or deactivate.
If a collector is not present, no change is made to its default on/off status.
Each collector has a field "enabled" as an on/off switch. If "enabled" is set to "false", other fields can be omitted.
The default values for the new options are:
Here is an example of what these options look like in CMO configmap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    nodeExporter:
      maxProcs: 4
      collectors:
        hwmon:
          enabled: true
        mountstats:
          enabled: true
        systemd:
          enabled: true
        ksmd:
          enabled: true
If the config for nodeExporter is omitted, Node Exporter should run with the same arguments concerning collectors as those in CMO v4.12:
--no-collector.wifi
--collector.filesystem.mount-points-exclude=^/(dev|proc|sys|run/k3s/containerd/.+|var/lib/docker/.+|var/lib/kubelet/pods/.+)($|/)
--collector.netclass.ignored-devices=^(veth.*|[a-f0-9]{15}|enP.*|ovn-k8s-mp[0-9]*|br-ex|br-int|br-ext|br[0-9]*|tun[0-9]*)$
--collector.netdev.device-exclude=^(veth.*|[a-f0-9]{15}|enP.*|ovn-k8s-mp[0-9]*|br-ex|br-int|br-ext|br[0-9]*|tun[0-9]*)$
--collector.cpu.info
--collector.textfile.directory=/var/node_exporter/textfile
--no-collector.cpufreq
--no-collector.tcpstat
--collector.netdev
--collector.netclass
--no-collector.buddyinfo
This is a tracker for another feature implemented by the netobserv team, also targeting 4.14: https://issues.redhat.com/browse/OCPBU-478
Tracked in netobserv board with this story: https://issues.redhat.com/browse/NETOBSERV-1021
Pull request: https://github.com/openshift/cluster-monitoring-operator/pull/1963
We will add a section for "ksmd" Collector in "nodeExporter.collectors" section in CMO configmap.
It has a boolean field "enabled", the default value is false.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    nodeExporter:
      collectors:
        # enable a collector which is disabled by default
        ksmd:
          enabled: true
refer to: https://issues.redhat.com/browse/OBSDA-308
We will add a section for "systemd" Collector in "nodeExporter.collectors" section in CMO configmap.
It has a boolean field "enabled", the default value is false.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    nodeExporter:
      collectors:
        # enable a collector which is disabled by default
        systemd:
          enabled: true
To avoid scraping too many metrics from systemd units, the collector should collect metrics on selected units only. We put regex patterns of the units to collect in the list `collectors.systemd.units`.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    nodeExporter:
      collectors:
        # enable a collector which is disabled by default
        systemd:
          enabled: true
          units:
          - iscsi-init.*
          - sshd.service
We will add a section for "mountstats" Collector in "nodeExporter.collectors" section in CMO configmap.
It has a boolean field "enabled", the default value is false.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    nodeExporter:
      collectors:
        # enable a collector which is disabled by default
        mountstats:
          enabled: true
The "mountstats" collector generates many high cardinality metrics, so we will collector only these metrics to avoid data congestion:
1. node_mountstats_nfs_read_bytes_total
2. node_mountstats_nfs_write_bytes_total
3. node_mountstats_nfs_operations_requests_total
Node Exporter has been upgraded to 1.5.0.
The default value of the argument `--runtime.gomaxprocs` is now set to 1, which differs from the old behavior; Node Exporter used to take advantage of multiple processes to accelerate metrics collection.
We are going to add a parameter to set the argument `--runtime.gomaxprocs` and make its default value 0, so that CMO retains the old behavior while allowing users to tune the multiprocess settings of Node Exporter.
The CMO config will have a new section `nodeExporter`, under which there is the parameter `maxProcs`, accepting an integer as the maximum number of processes Node Exporter runs concurrently. Its default value is 0 if omitted.
config.yaml: |
  nodeExporter:
    maxProcs: 1
Proposed title of this feature request
In 4.11 we introduced the alert overrides and alert relabeling feature as Tech Preview. We should graduate this feature to GA.
What is the nature and description of the request?
This feature can address requests and issues we have seen from existing and potential customers. Moving this feature to GA would greatly enable adoption.
Why does the customer need this? (List the business requirements)
See linked issues.
List any affected packages or components.
CMO
https://github.com/openshift/monitoring-plugin
CMO should deploy and enable the monitoring plugin.
We should run at least https://github.com/golangci/golangci-lint.
https://github.com/securego/gosec could be interesting.
We also have an internal team: https://gitlab.cee.redhat.com/covscan/covscan/-/wikis/home. Maybe there are additional scanners we could run.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
https://github.com/openshift/cluster-monitoring-operator/pull/1989
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
To help customers with debugging, we need to be able to include NOO pods and resources in the collected must-gather output.
To collect it, use:
oc adm must-gather
New script to collect netobservability logs and enable it by default
As a developer, I want to have my testing and build tooling managed in a consistent way to reduce the amount of context switching while doing maintenance work.
Currently our approach to managing and updating auxiliary tooling (such as envtest, controller-gen, etc.) is inconsistent. A fine pattern was introduced in the CPMS repo, which relies on the Go toolchain to update, vendor and run this auxiliary tooling.
For CPMS context see:
https://github.com/openshift/cluster-control-plane-machine-set-operator/blob/main/tools/tools.go
https://github.com/openshift/cluster-control-plane-machine-set-operator/blob/main/go.mod#L24
https://github.com/openshift/cluster-control-plane-machine-set-operator/blob/main/Makefile#L19
This epic has 3 main goals
Currently we have no accurate telemetry of usage of the OpenShift Console across all clusters in the fleet. We should be able to utilize the auth and console telemetry to glean details which will allow us to get a picture of console usage by our customers.
Let's do a spike to validate; we may have to update this list after the spike:
We need to verify HOW we define a cluster admin -> listing all namespaces in a cluster? Installing operators? Make sure that we consider OSD cluster admins as well (this should be aligned with how we send people to the dev perspective, in my mind).
Capture additional information via console plugin ( and possibly the auth operator )
Understanding how to capture telemetry via the console operator
We have removed the following ACs for this release:
As RH PM/engineer, we want to understand the usage of the (dev) console, for that, we want to add new Prometheus metrics (how many users have a cluster, etc.) and collect them later (as telemetry data) via cluster-monitoring-operator.
As Red Hat, we want to understand the usage of the (dev) console, for that, we want to add new Prometheus metrics (how many users have a cluster, etc.) and collect them later (as telemetry data) via cluster-monitoring-operator.
Either the console-operator or the cluster-monitoring-operator needs to apply a PrometheusRule to collect the right data and make it available later in Superset DataHat or Tableau.
Description of problem:
With 4.13 we added new metrics to the console (Epic ODC-7171 - Improved telemetry (provide new metrics)) that collect different user and cluster metrics.
The cluster metrics include:
These metrics contain the perspective name or plugin name which was unbounded. Admins could configure any perspective and plugin name, also if the perspective or plugin with that name is not available.
Based on the feedback in https://github.com/openshift/cluster-monitoring-operator/pull/1910 we need to reduce the cardinality and limit the metrics to, for example:
Version-Release number of selected component (if applicable):
4.13.0
How reproducible:
Always
Steps to Reproduce:
On a cluster, you must update the console configuration, configure some perspectives or plugins and check the metrics in Admin > Observe > Metrics:
avg by (name, state) (console_plugins_info)
avg by (name, state) (console_customization_perspectives_info)
On a local machine, you can use this console yaml:
apiVersion: console.openshift.io/v1
kind: ConsoleConfig
plugins:
  logging-view-plugin: https://logging-view-plugin.logging-view-plugin-namespace.svc.cluster.local:9443/
  crane-ui-plugin: https://crane-ui-plugin.crane-ui-plugin-namespace.svc.cluster.local:9443/
  acm: https://acm.acm-namespace.svc.cluster.local:9443/
  mce: https://mce.mce-namespace.svc.cluster.local:9443/
  my-plugin: https://my-plugin.my-plugin-namespace.svc.cluster.local:9443/
customization:
  perspectives:
  - id: admin
    visibility:
      state: Enabled
  - id: dev
    visibility:
      state: AccessReview
      accessReview:
        missing:
        - resource: namespaces
          verb: get
  - id: dev1
    visibility:
      state: AccessReview
      accessReview:
        missing:
        - resource: namespaces
          verb: get
  - id: dev2
    visibility:
      state: AccessReview
      accessReview:
        missing:
        - resource: namespaces
          verb: get
  - id: dev3
    visibility:
      state: AccessReview
      accessReview:
        missing:
        - resource: namespaces
          verb: get
And start the bridge with:
./build-backend.sh
./bin/bridge -config ../config.yaml
After that you can fetch the metrics in a second terminal:
Actual results:
curl -s localhost:9000/metrics | grep ^console_plugins
console_plugins_info{name="acm",state="enabled"} 1
console_plugins_info{name="crane-ui-plugin",state="enabled"} 1
console_plugins_info{name="logging-view-plugin",state="enabled"} 1
console_plugins_info{name="mce",state="enabled"} 1
console_plugins_info{name="my-plugin",state="enabled"} 1
curl -s localhost:9000/metrics | grep ^console_customization
console_customization_perspectives_info{name="dev",state="only-for-developers"} 1
console_customization_perspectives_info{name="dev1",state="only-for-developers"} 1
console_customization_perspectives_info{name="dev2",state="only-for-developers"} 1
console_customization_perspectives_info{name="dev3",state="only-for-developers"} 1
Expected results:
Lower cardinality; that is, results should be grouped somehow.
Additional info:
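One hedged way to get the grouping suggested above is to collapse anything outside a known allow-list into a single label value before setting the gauge, so the series count stays bounded regardless of what admins configure. Sketch only; the allow-list and bucket name are placeholders, not the console's actual lists:

// Sketch: bound the label values of console_plugins_info by bucketing unknown
// plugin names. The allow-list below is a placeholder.
package sketch

import "github.com/prometheus/client_golang/prometheus"

var pluginsInfo = prometheus.NewGaugeVec(prometheus.GaugeOpts{
    Name: "console_plugins_info",
    Help: "Installed console plugins, with unknown names grouped to limit cardinality.",
}, []string{"name", "state"})

var knownPlugins = map[string]bool{ // placeholder allow-list
    "logging-view-plugin": true,
    "acm":                 true,
    "mce":                 true,
}

func init() {
    prometheus.MustRegister(pluginsInfo)
}

func recordPlugin(name, state string) {
    if !knownPlugins[name] {
        name = "other" // every third-party plugin collapses into one series per state
    }
    pluginsInfo.WithLabelValues(name, state).Set(1)
}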
This epic aims to address some of the RFEs associated with the Pipeline user experience.
Improve the overall user experience when working with OpenShift Pipelines
None
Exploration is available in this Miro board
As a user, I want to manage the column available for the TaskRuns list page
With many PipelineRuns based on the same pipeline, it will get confusing if re-runs are named by the pipeline as they will all be named similarly. Losing the distinction between PipelineRuns will cause lots of additional hassles.
As a user, I want to see the webhook link and webhook secret on the Repository details page and the webhook link on the Repository summary page
As a user, I want to see the PipelineRuns present in the current namespace from the Dev perspective
ODC tests are mainly focused on running with kube:admin (cluster-admin privileges), which creates an issue when something gets broken due to an RBAC issue.
To define some basic tests focused on the self-provisioner users which can also be run on CI
Testing with users, as PR changes should not break the UI
ODC E2E tests have flakes which cause failures on CI.
Improve the ODC E2E test flakes by stabilising the tests and improving the speed of test execution.
To improve the health of CI, which will impact PR review effectiveness.
Skip waiting for the authentication operator to start progressing when the secret already exists
For periodic jobs, our tests will append to the existing console tests. But because the value of `waitForAuthOperatorProgressing` changes from true to false at the start of the console tests, and with the same procedure our tests keep waiting for its value to become true (which never happens), the tests do not start.
Upstream repos which contribute to the OLM v0 downstream repo have a 90+ commit delta, with several substantial dependency version bumps.
The interaction between these repos necessitates a coordinated solution, and potentially new upstream contributions to reach dependency equilibrium before bringing downstream.
The goals of this epic are:
We have some existing work in this direction, and this epic is mostly to coordinate across teams. As a result, some existing stories will need some remodeling as we go, and teams should feel free to keep them up to date to reflect the identified work.
The openshift/operator-framework-olm repository is very out of date and needs to be synced from upstream.
Acceptance Criteria:
All upstream necessary commits from:
are merged into the openshift/operator-framework-olm repository.
The Kube APIServer has a sidecar to output audit logs. We need similar sidecars for other APIServers that run on the control plane side. We also need to pass the same audit log policy that we pass to the KAS to these other API servers.
During a PerfScale 80 HC test in stage we found that the OBO prometheus monitoring stack was consuming 50G of memory (enough to cause OOMing on the m5.4xlarge instance it was residing on). Additionally, during this time it would also consume over 10 CPU cores.
Snapshot of the time leading up to (effectively idle) and during the test: https://snapshots.raintank.io/dashboard/snapshot/2K5s0PzaN1U2JE1jrxTPZ5jX0fifBuRC
As a SRE, I want to have the ability to filter metrics exposed from the Management Clusters.
Context:
RHOBS resources allocated to HCP are scarce. Currently, we push every single metric to the RHOBS instance.
However, in https://issues.redhat.com/browse/OSD-13741, we've identified a subset of metrics that are important to SRE.
The ability to only export those metrics to RHOBS will reduce significantly the cost of monitoring as well as increase our ability to scale RHOBS.
As discussed in this Slack thread, most of the CPU and memory consumption of the OBO operator is caused at scraping time.
The idea here is to make sure the hypershift & control-plane-operator operators no longer specify the scrape interval in servicemonitor & podmonitor scrape configs (unless there is a very good reason to do so).
Indeed, when the scrape interval is not specified at scrape config level, the global scrape interval specified at the root of the config is used. This offers the following benefits:
This is part of solution #1 described here.
When quorum breaks and we are able to get a snapshot of one of the etcd members, we need a procedure to restore the etcd cluster for a given HostedCluster.
Documented here: https://docs.google.com/document/d/1sDngZF-DftU8_oHKR70E7EhU_BfyoBBs2vA5WpLV-Cs/edit?usp=sharing
Add the above documentation to the HyperShift repo documentation.
Many internal projects rely on Red Hat's fork of the OAuth2 Proxy project. The fork differs from the main upstream project in that it added an OpenShift authentication backend provider, allowing the OAuth2 Proxy service to use the OpenShift platform as an authentication broker.
Still, unfortunately, it had never been contributed back to the upstream project - this caused both of the projects, the fork and the upstream, to severely diverge. The fork is also extremely outdated and lacks features.
Among such features not present in the forked version is the support for setting up a timeout for requests from the proxy to the upstream service, otherwise controlled using the --upstream-timeout command-line switch in the official OAuth2 Proxy project.
Without the ability to specify the request timeout, the default value of 30 seconds is assumed (coming from Go's libraries), and this is often not enough to serve a response from a busy backend.
Thus, we need to backport this feature from the upstream project.
Backport the Pull Request from the upstream project into the Red Hat's fork.
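For context on what the option does, the mechanism is simply a configurable timeout on the proxy's upstream transport; a minimal sketch of that mechanism (this is not the oauth2-proxy implementation):

// Minimal sketch of the mechanism behind an --upstream-timeout style option:
// a reverse proxy whose transport stops waiting for upstream response headers
// after a configurable duration.
package sketch

import (
    "net/http"
    "net/http/httputil"
    "net/url"
    "time"
)

func newUpstreamProxy(upstream string, timeout time.Duration) (*httputil.ReverseProxy, error) {
    u, err := url.Parse(upstream)
    if err != nil {
        return nil, err
    }
    proxy := httputil.NewSingleHostReverseProxy(u)
    proxy.Transport = &http.Transport{
        ResponseHeaderTimeout: timeout, // how long to wait for the upstream to start responding
    }
    return proxy, nil
}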
Goal: Support OVN-IPsec on IBM Cloud platform.
Why is this important: IBM Cloud is being added as a new OpenShift supported platform, targeting 4.9/4.10 GA.
Dependencies (internal and external):
Prioritized epics + deliverables (in scope / not in scope):
Not in scope:
Estimate (XS, S, M, L, XL, XXL):
Previous Work:
Open questions:
Acceptance criteria:
Epic Done Checklist:
This Epic is here to track the rebase we need to do for kube 1.27, which is already out.
https://docs.google.com/document/d/1h1XsEt1Iug-W9JRheQas7YRsUJ_NQ8ghEMVmOZ4X-0s/edit --> this is the link for rebase help
sig-cli is failing in two different ways:
Failing tests
5 tests fail because of system:authenticated group not having enough permissions on some resources (routes and configmaps).
"[sig-cli] oc basics can create and interact with a list of resources [Suite:openshift/conformance/parallel]"
"[sig-cli] oc basics can show correct whoami result [Suite:openshift/conformance/parallel]"
"[sig-cli] oc can route traffic to services [apigroup:route.openshift.io] [Suite:openshift/conformance/parallel]"
"[sig-cli] oc expose can ensure the expose command is functioning as expected [apigroup:route.openshift.io] [Suite:openshift/conformance/parallel]"
"[sig-network-edge][Feature:Idling] Idling with a single service and ReplicationController should idle the service and ReplicationController properly [Suite:openshift/conformance/parallel]"
There are quite a few tests which depend on API groups that do not exist in MicroShift. We can add the [apigroup] annotation to skip these tests.
[apigroup:oauth.openshift.io]
"[sig-auth][Feature:OAuthServer] OAuthClientWithRedirectURIs must validate request URIs according to oauth-client definition": " [Suite:openshift/conformance/parallel]" "[sig-auth][Feature:OAuthServer] well-known endpoint should be reachable [apigroup:route.openshift.io]": " [Suite:openshift/conformance/parallel]"
[apigroup:operator.openshift.io]
"[sig-network][Feature:MultiNetworkPolicy][Serial] should enforce a network policies on secondary network IPv4": " [Suite:openshift/conformance/serial]" "[sig-network][Feature:MultiNetworkPolicy][Serial] should enforce a network policies on secondary network IPv6": " [Suite:openshift/conformance/serial]" "[sig-storage][Feature:DisableStorageClass][Serial] should not reconcile the StorageClass when StorageClassState is Unmanaged": " [Suite:openshift/conformance/serial]" "[sig-storage][Feature:DisableStorageClass][Serial] should reconcile the StorageClass when StorageClassState is Managed": " [Suite:openshift/conformance/serial]", "[sig-storage][Feature:DisableStorageClass][Serial] should remove the StorageClass when StorageClassState is Removed": " [Suite:openshift/conformance/serial]",
"[sig-auth][Feature:Authentication] TestFrontProxy should succeed [Suite:openshift/conformance/parallel]"
This test is failing because it depends on "aggregator-client" secret, which is not present in MicroShift. We can skip this test.
The goal of this EPIC is to solve several issues related to PDBs that caused problems during OCP upgrades, especially when new apiserver pods (which roll out one by one) were wedged (there was an issue with networking on the new pods due to RHEL upgrades).
slack thread: https://redhat-internal.slack.com/archives/CC3CZCQHM/p1673886138422059
Epic Goal*
This is a tracking issue for the Workloads related work for Microshift 4.13 Improvements. See API-1506 for the whole feature.
followup to https://issues.redhat.com/browse/WRKLDS-487
Refactor route-controller-manager to use NewControllerCommandConfig and ControllerBuilder from library-go. Then update the dependency in MicroShift so we can pass LeaderElection.Disable in the config to disable leader election, as it is not needed in MicroShift.
This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were not completed when this image was assembled
This epic tracks any part of our codebase / solutions we implemented taking shortcuts.
Whenever a shortcut is taken, we should add a story here so we don't forget to improve it in a safer and more maintainable way.
Maintainability and debuggability, and fighting technical debt in general, are critical to keeping velocity and ensuring overall high quality.
https://issues.redhat.com/browse/CNF-796
https://issues.redhat.com/browse/CNF-1479
https://issues.redhat.com/browse/CNF-2134
https://issues.redhat.com/browse/CNF-6745
https://issues.redhat.com/browse/CNF-8036
https://issues.redhat.com/browse/CNF-9566
Description of problem:
According to the API documentation, the policyTypes field is optional: https://docs.openshift.com/container-platform/4.11/rest_api/network_apis/networkpolicy-networking-k8s-io-v1.html#specification If this field is not specified, it will default based on the existence of Ingress or Egress rules.
But if policyTypes is not specified, all traffic is dropped despite what is stated in the rule.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
100%
Steps to Reproduce:
1. Configure SR-IOV (nodepolicy + sriovnetwork)
2. Configure 2 pods
3. Enable MultiNetworkPolicy
4. Apply a MultiNetworkPolicy:
   spec:
     podSelector:
       matchLabels:
         pod: pod1
     ingress:
     - from:
       - ipBlock:
           cidr: 192.168.0.2/32
5. Send traffic between pods (192.168.0.2 => pod=pod1)
Actual results:
traffic dropped
Expected results:
traffic passed
Additional info:
Address miscellaneous technical debt items in order to maintain code quality and maintainability and to improve the user experience.
Role | Contact |
---|---|
PM | Peter Lauterbach |
Documentation Owner | TBD |
Delivery Owner | (See assignee) |
Quality Engineer | (See QA contact) |
Who | What | Reference |
---|---|---|
DEV | Upstream roadmap issue | <link to GitHub Issue> |
DEV | Upstream code and tests merged | <link to meaningful PR or GitHub Issue> |
DEV | Upstream documentation merged | <link to meaningful PR or GitHub Issue> |
DEV | gap doc updated | <name sheet and cell> |
DEV | Upgrade consideration | <link to upgrade-related test or design doc> |
DEV | CEE/PX summary presentation | label epic with cee-training and add a <link to your support-facing preso> |
QE | Test plans in Polarion | N/A details in user stories. |
QE | Automated tests merged | N/A details in user stories. |
DOC | Downstream documentation merged | <link to meaningful PR> |
kubevirt-csi is unable to unpublish a volume in the event that the VM/VMI that the volume was published on unexpectedly disappears. This situation can occur for many reasons: someone could forcibly delete the VM, a replace update could destroy a VM before it can unpublish a volume, a VM node could become unresponsive and the CAPI machine controller will delete it, and other scenarios like this.
When this situation occurs, the PVC within the guest will never get deleted properly. Kubevirt csi will report the following error.
I0531 13:07:51.338413 1 controller.go:264] Detaching DataVolume pvc-4c4d4744-8a04-4df1-964b-d4eac90a93a2 from Node ID fc3ad096-53f0-535d-bbd8-45a3ab3803d1
E0531 13:07:51.349493 1 server.go:124] /csi.v1.Controller/ControllerUnpublishVolume returned with error: rpc error: code = NotFound desc = failed to find VM with domain.firmware.uuid 5cb46a00-2b8b-509b-b32b-39d1bab4e8b5
To resolve this, the kubevirt-csi controller needs to gracefully handle unpublishing a volume when the VM and VMI associated with the volume no longer exists.
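A minimal sketch of the idempotent-detach behaviour described above, assuming a hypothetical infra-cluster client abstraction (names and signatures are illustrative, not the actual kubevirt-csi code):

```
package controller

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// infraClient is a hypothetical abstraction over the infra-cluster client
// that kubevirt-csi uses to hot-unplug a volume from the VM backing a node ID.
type infraClient interface {
	RemoveVolumeFromVM(ctx context.Context, nodeID, volumeName string) error
}

// controllerUnpublish detaches the volume from the node's VM. If the VM/VMI is
// already gone (forced deletion, replace update, machine deletion, ...), the
// volume is effectively detached, so we return success to keep
// ControllerUnpublishVolume idempotent and let the guest PVC be cleaned up.
func controllerUnpublish(ctx context.Context, client infraClient, nodeID, volumeName string) error {
	err := client.RemoveVolumeFromVM(ctx, nodeID, volumeName)
	if err == nil || apierrors.IsNotFound(err) {
		return nil
	}
	return fmt.Errorf("detaching volume %s from node %s: %w", volumeName, nodeID, err)
}
```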
An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but can tend to fall by the wayside.
Console-operator should switch from using bindata to using assets, similar to what cluster-kube-apiserver-operator and other operators are doing, so we don't need to regenerate the bindata when YAML files are changed.
There is also an issue with generating bindata on ARM and other architectures; switching to assets will make that obsolete.
Epic Goal*
Provide a way to tune the etcd latency parameters ETCD_HEARTBEAT_INTERVAL and ETCD_ELECTION_TIMEOUT.
Why is this important? (mandatory)
OCP4 does not have a way to tune etcd parameters like timeouts, heartbeat intervals, etc. Adjusting these parameters indiscriminately may compromise the stability of the control plane. In scenarios where disk IOPS are not ideal (e.g. disk degradation, storage providers in cloud environments), these parameters could be adjusted to improve stability of the control plane while raising the corresponding warning notifications.
In the past:
The current default values on a 4.10 deployment
```
- name: ETCD_ELECTION_TIMEOUT
  value: "1000"
- name: ETCD_ENABLE_PPROF
  value: "true"
- name: ETCD_EXPERIMENTAL_MAX_LEARNERS
  value: "3"
- name: ETCD_EXPERIMENTAL_WARNING_APPLY_DURATION
  value: 200ms
- name: ETCD_EXPERIMENTAL_WATCH_PROGRESS_NOTIFY_INTERVAL
  value: 5s
- name: ETCD_HEARTBEAT_INTERVAL
  value: "100"
```
and these are modified for exceptions of specific cloud providers (https://github.com/openshift/cluster-etcd-operator/blob/master/pkg/etcdenvvar/etcd_env.go#L232-L254).
The guidance for latency among control plane nodes does not translate well to on-premise live scenarios: https://access.redhat.com/articles/3220991
Scenarios (mandatory)
Defining etcd-operator API to provide the cluster-admin the ability to set `ETCD_ELECTION_TIMEOUT` and `ETCD_HEARTBEAT_INTERVAL` within certain range.
Dependencies (internal and external) (mandatory)
No external teams
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
For https://issues.redhat.com/browse/OCPBU-333 we need an enhancement proposal so we can go over the different options in how we want to allow configuration of the etcd heartbeat, leader election and any other latency parameters that might be required for OCPBU-333.
Once we have the API for configuring the heartbeat interval and leader election timeouts from https://github.com/openshift/api/pull/1538 we will need to reconcile the tuning profile set on the API onto the actual etcd deployment.
This would require updating how we set the env vars for both parameters by first reading the operator.openshift.io/v1alpha1 Etcd "cluster" object and mapping the profile value to the required heartbeat and leader election timeout values in:
https://github.com/openshift/cluster-etcd-operator/blob/381ffb81706699cdadd0735a52f9d20379505ef7/pkg/etcdenvvar/etcd_env.go#L208-L254
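A minimal sketch of the kind of mapping this implies, with a hypothetical profile name and illustrative values (the defaults match the 4.10 values shown above; the "slower" values are only an example, not the final API):

```
package etcdenvvar

// latencyEnvForProfile maps a hypothetical tuning profile read from the
// Etcd "cluster" object to ETCD_HEARTBEAT_INTERVAL / ETCD_ELECTION_TIMEOUT
// values (milliseconds, rendered as strings in the pod env).
func latencyEnvForProfile(profile string) (heartbeatInterval, electionTimeout string) {
	switch profile {
	case "Slower":
		// Illustrative values for degraded-disk / high-latency environments.
		return "500", "2500"
	default:
		// Current defaults from the 4.10 deployment shown above.
		return "100", "1000"
	}
}
```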
Place holder epic to track spontaneous task which does not deserve its own epic.
ServicePublishingStrategy entries of type LoadBalancer or Route could specify the same hostname, which will result in one of the services not being published, i.e. no DNS records created.
context: https://redhat-internal.slack.com/archives/C04EUL1DRHC/p1678287502260289
DOD:
Validate ServicePublishingStrategy and report conflicting service hostnames.
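A minimal sketch of the duplicate-hostname check, using simplified stand-in types rather than the real HyperShift API structs:

```
package validation

// publishedService is a simplified stand-in for a ServicePublishingStrategy
// entry of type LoadBalancer or Route that carries a hostname.
type publishedService struct {
	Name     string
	Hostname string
}

// conflictingHostnames returns, for every hostname used more than once,
// the list of services that requested it, so the caller can surface the
// conflict instead of silently skipping DNS record creation.
func conflictingHostnames(services []publishedService) map[string][]string {
	byHostname := map[string][]string{}
	for _, svc := range services {
		if svc.Hostname == "" {
			continue
		}
		byHostname[svc.Hostname] = append(byHostname[svc.Hostname], svc.Name)
	}
	conflicts := map[string][]string{}
	for hostname, names := range byHostname {
		if len(names) > 1 {
			conflicts[hostname] = names
		}
	}
	return conflicts
}
```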
DoD:
This feature is supported by ROSA.
To have an e2e to validate publicAndPrivate <-> Private in the presubmits.
Once the HostedCluster and NodePool are paused via the PausedUntil field, the awsprivatelink controller still continues reconciling.
How to test this:
DoD:
If a NodePool is changed from having .replicas to autoscaler min/max, and the min is set beyond the current replicas, that might leave the MachineDeployment in a state where it cannot be autoscaled. This requires the consumer to ensure the min is <= current replicas, which is poor UX. Ideally we should be able to automate this.
The HyperShift operator deployment fails when we try to deploy it on the RootCI server, which has PSA enabled. So we need to make the HyperShift operator deployment compliant with the restricted PSA profile.
Event:
0s Warning FailedCreate replicaset/operator-66cc5794c9 (combined from similar events): Error creating: pods "operator-66cc5794c9-k2sq7" is forbidden: violates PodSecurity "restricted:latest": allowPrivilegeEscalation != false (container "operator" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "operator" must set securityContext.capabilities.drop=["ALL"]), seccompProfile (pod or container "operator" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
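A minimal sketch of the securityContext fields the restricted PodSecurity checks in the event above are asking for, expressed with the upstream corev1 types (where exactly this is wired into the HyperShift operator deployment is not shown here):

```
package install

import corev1 "k8s.io/api/core/v1"

// restrictedSecurityContext returns a container securityContext that
// satisfies the restricted PodSecurity checks quoted in the event:
// no privilege escalation, all capabilities dropped, RuntimeDefault seccomp.
func restrictedSecurityContext() *corev1.SecurityContext {
	allowPrivilegeEscalation := false
	runAsNonRoot := true
	return &corev1.SecurityContext{
		AllowPrivilegeEscalation: &allowPrivilegeEscalation,
		RunAsNonRoot:             &runAsNonRoot,
		Capabilities: &corev1.Capabilities{
			Drop: []corev1.Capability{"ALL"},
		},
		SeccompProfile: &corev1.SeccompProfile{
			Type: corev1.SeccompProfileTypeRuntimeDefault,
		},
	}
}
```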
DoD:
A recurring question is how IAM works in HyperShift.
We should document in https://hypershift-docs.netlify.app/ or https://github.com/openshift/enhancements/tree/master/enhancements/hypershift how we handle permissions in AWS
https://redhat-internal.slack.com/archives/C02LM9FABFW/p1674631577577369
OCP components could change their image key in the release payload, which might not be immediately visible to us and would break Hypershift.
DOD:
Validate release contains all the images required by Hypershift and report missing images in a condition
AC:
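A minimal sketch of the payload check described in the DOD above, assuming the release image references are already available as a name-to-pullspec map (hypothetical helper, not the actual HyperShift code):

```
package releasecheck

import "sort"

// missingImages returns the sorted list of required image names that are not
// present in the release payload's image map, so the caller can report them
// in a HostedCluster condition.
func missingImages(payloadImages map[string]string, required []string) []string {
	var missing []string
	for _, name := range required {
		if _, ok := payloadImages[name]; !ok {
			missing = append(missing, name)
		}
	}
	sort.Strings(missing)
	return missing
}
```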
We have the connectDirectlyToCloudAPIs flag in the konnectivity socks5 proxy to dial directly to cloud providers without going through konnectivity.
This introduces another exception path: https://github.com/openshift/hypershift/pull/1722
We should consolidate both by keeping connectDirectlyToCloudAPIs until there's a reason not to.
AWS has a hard limit of 100 OIDC providers globally.
Currently each HostedCluster created by e2e creates its own OIDC provider, which results in hitting the quota limit frequently and causing the tests to fail as a result.
DOD:
Only a single OIDC provider should be created and shared between all e2e HostedClusters.
Most of our condition statuses are driven by the programmatic output of reconciliation loops.
E.g. the HostedCluster Available condition.
This is a good signal for day 1, but we might be missing relevant real state of the world for day 2. E.g:
DoD:
Reproduce and review the behaviour of the examples above.
Consider adding additional knowledge when computing the HCAvailable condition: health check on expected day 2 holistic e2e behaviour rather than on the particular status of subcomponents.
E.g. actually query the KAS through the URL we expose.
This is a placeholder to capture the necessary CI changes to do every release cut.
There are a few places in our CI config which require pinning to the new release at every release cut:
DOD:
Make sure we have this documented in hypershift repo and that all needed is done for current release branch.
DoD:
At the moment, if the input etcd KMS encryption config (key and role) is invalid, the failure is not surfaced.
We should check that both the key and the role are compatible/operational for a given cluster and report a failure in a condition otherwise.
This is intended to be a place to capture general "tech debt" items so they don't get lost. I very much doubt that this will ever get completed as a feature, but that's okay; the desire is more that stories get pulled out of here and put with feature work "opportunistically" when it makes sense.
If you find a "tech debt" item, and it doesn't have an obvious home with something else (e.g. with MCO-1 if it's metrics and alerting) then put it here, and we can start splitting these out/marrying them up with other epics when it makes sense.
As part of https://github.com/openshift/machine-config-operator/pull/3270, Joel moved us to ConfigMapsLeases for our lease because the old way of using ConfigMaps was being deprecated in favor of the "Leases" resource.
ConfigMapsLeases were meant to be the first phase of the migration, eventually ending up on LeasesResourceLock, so at some point we need to finish.
Since we've already had ConfigMapsLeases for at least a release, we should now be able to complete the migration by changing the type of resource lock here https://github.com/openshift/machine-config-operator/blob/4f48e1737ffc01b3eb991f22154fc3696da53737/cmd/common/helpers.go#L43 to LeasesResourceLock
We should probably also clean up after ourselves so nobody has to open something like https://bugzilla.redhat.com/show_bug.cgi?id=1975545 again
(Yes this really should be as easy as it looks, but someone needs to pay attention to make sure something weird doesn't happen when we do it)
Some supporting information is here, if curious:
https://github.com/kubernetes/kubernetes/pull/106852
https://github.com/kubernetes/kubernetes/issues/80289
Finish lease lock type migration by changing lease lock type to LeaseResourceLock
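A minimal sketch of what the finished migration could look like with client-go's resourcelock package (identities, names, and durations are illustrative; the real change is the single lock-type constant in helpers.go):

```
package main

import (
	"context"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func runWithLeaderElection(ctx context.Context, client kubernetes.Interface, identity string, run func(context.Context)) error {
	// LeasesResourceLock uses only the coordination.k8s.io/v1 Lease resource,
	// replacing the transitional ConfigMapsLeases lock.
	lock, err := resourcelock.New(
		resourcelock.LeasesResourceLock,
		"openshift-machine-config-operator", // illustrative namespace
		"machine-config",                    // illustrative lock name
		client.CoreV1(),
		client.CoordinationV1(),
		resourcelock.ResourceLockConfig{Identity: identity},
	)
	if err != nil {
		return err
	}
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		LeaseDuration:   137 * time.Second,
		RenewDeadline:   107 * time.Second,
		RetryPeriod:     26 * time.Second,
		ReleaseOnCancel: true,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: run,
			OnStoppedLeading: func() {},
		},
	})
	return nil
}
```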
Currently, adding a forcefile (/run/machine-config-daemon-force) will start an update, but it doesn't necessarily do a complete upgrade; if it fits into one of the carve-outs we have for a rebootless update (e.g. the OSImageURL is the same), it won't do an OS update. We have had a few customers whose clusters are stuck in a quasi state and need to do a complete OS upgrade, even if the "conditions" on the cluster indicate that this isn't necessary.
The goal of this story is to update this behavior so that it will also do an OS upgrade (execute applyOSChanges() in its entirety).
This has been broken for a long time, and the actual functionality is quite useless. We have put out a deprecation notice in 4.12, and now we should look to remove it.
The MCD read/writes items to the journal. We should look to remove unnecessary reads from the journal and just log important info, so a broken journal doesn't break the MCD.
Spun off of https://issues.redhat.com/browse/OCPBUGS-8716
The MCD today writes pending configs to journal, which the next boot then uses to read the state.
This is mostly redundant since we also read/write the updated config to disk. The pending config was originally implemented very early on, and today causes more trouble than it helps, since the journal could be broken, or the config could not be found, which is very troublesome to debug and recover.
We should remove the workflow entirely
As an OpenShift infrastructure owner, I want to use the Zero Touch Provisioning flow with RHACM, where RHACM is in a dual-stack hub cluster and the deployed cluster is an IPv6-only cluster.
Currently ZTP doesn't work when provisioning IPv6 clusters from a dual-stack hub cluster. We have customers who aim to deploy new clusters via ZTP that don't have IPv4 and work exclusively over IPv6. To enable this use case, work on the metal platform has been identified as a requirement.
Converge IPI and ZTP Boot Flows: METAL-10
We are missing event notifications on creation of some resources. We need to make sure they are notified
Due to a change of Kafka provider, SASL/PLAIN is no longer supported.
We now need SASL/SCRAM for the app-interface integrated MSK.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
Do not change the cluster platform in the background due to networking configuration.
Remove user_managed_networking from assisted service
Yes
Allow the user to decide which platform is compatible with each feature, especially UMN and CMN.
E.g. on the networking step, when a platform is being selected, the UI needs to know whether to show the user the UMN or CMN networking configuration without taking cluster.user_managed_networking into consideration.
The goal of this task is to give the UI the option to not use the current UMN implementation, and to give the BE the flexibility to "break" the API.
When creating a cluster in the UI, there is a checkbox that the user can set to indicate that they want to use custom manifests.
Presently this will cause the upload of an empty manifest, the presence of which is later used to determine whether the checkbox is checked or not (and whether the custom manifest tab should be shown in the UI).
This is a clunky approach that confuses the user and leads to validation issues.
This functionality needs to be changed to use a cluster tag for this purpose instead.
Presently, when creating a cluster in the UI, there is a checkbox that the user can set to indicate that they want to use custom manifests.
Presently this will cause the upload of an empty manifest, the presence of which is later used to determine whether the checkbox is checked or not (and whether the custom manifest tab should be shown in the UI).
This is a clunky approach that confuses the user and leads to validation issues.
To remedy this, we would like to give the UI team a facility to store raw JSON data containing freeform UI-specific settings for a cluster.
This PR enables that.
When the infrastructure operator is enabled, automatically import the cluster and enable users to add nodes to the cluster itself via the infrastructure operator.
Yes, it's a new functionality that will need to be documented
When assisted service is started in KubeAPI mode, we want to ensure that the local cluster is registered with ACM so that it may be managed in a similar fashion to a spoke, or to put it another way, register the Hub cluster as a Day 2 spoke cluster in ACM running on itself.
The purpose of this task is to create the secrets, AgentClusterInstall and ClusterDeployment CRs required to register the hub.
As referenced in the parent Epic, the following guide details the CR's that need to be created to import a "Day 2" spoke cluster https://github.com/openshift/assisted-service/blob/master/docs/hive-integration/import-installed-cluster.md
During this change, it should be ensured that this functionality is added to the reconcile loop of the service.
note: just a placeholder for now
It has already happened that operators configured Prometheus rules which aren't valid:
While we can't catch everything, it should be feasible to check for most common mistakes with the CI.
Update the severity of this origin test from flaky to failure
Exceptions for the following alerts can be cleared, as the Bugzillas are already fixed and released.
For the BZs not yet fixed, create new OCPBUGS Jira issues.
We added E2E tests for alerting style-guide issues in MON-1643, but a lot of components needed exceptions. We filed bugzillas for these, but we need to check on them and remove the exceptions for any that are fixed.
This has no link to a planning session, as it predates our Epic workflow definition.
CMO should expose a metric that gives insight into collection profile usage. We will add this signal to our telemetry payload.
The minimum solution here is to expose a metric about the collection profile configured.
Other optional metrics could include:
Give users a TopologySpreadConstraints field in the PrometheusRestrictedConfig field and propagate this to the pod that is created.
Give users a TopologySpreadConstraints field in the K8sPrometheusAdapter field and propagate this to the pod that is created.
Give users a TopologySpreadConstraints field in the KubeStateMetricsConfig field and propagate this to the pod that is created.
Give users a TopologySpreadConstraints field in the AlertmanagerUserWorkloadConfig field and propagate this to the pod that is created.
Give users a TopologySpreadConstraints field in the OpenShiftStateMetricsConfig field and propagate this to the pod that is created.
Give users a TopologySpreadConstraints field in the TelemeterClientConfig field and propagate this to the pod that is created.
Give users a TopologySpreadConstraints field in the PrometheusOperatorConfig field and propagate this to the pod that is created. This will take care of both the incluster PO and UWM PO.
Give users a TopologySpreadConstraints field in the ThanosQuerierConfig field and propagate this to the pod that is created.
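A minimal sketch of the propagation these stories describe: the operator copies the user-provided constraints from its config onto the workload's pod template (names are illustrative, not the actual CMO code):

```
package manifests

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// applyTopologySpreadConstraints copies the constraints configured for a
// component (e.g. a topologySpreadConstraints field in the CMO config) onto
// the Deployment's pod template so the scheduler spreads the replicas.
func applyTopologySpreadConstraints(d *appsv1.Deployment, constraints []corev1.TopologySpreadConstraint) {
	if len(constraints) == 0 {
		return
	}
	d.Spec.Template.Spec.TopologySpreadConstraints = constraints
}
```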
IIUC, before using hard affinities for HA components, we needed this to avoid scheduling problems during upgrade.
See https://github.com/openshift/cluster-monitoring-operator/pull/1431#issuecomment-960845938
Now that 4.8 is no longer supported, we can get rid of this logic to simplify the code.
This will reduce technical debt and improve CMO learning curve.
To support the transition from soft anti-affinity to hard anti-affinity (4.9 > 4.10), CMO gained the ability to rebalance PVCs for Prometheus pods. The capability isn't required anymore so we can safely remove it.
Proposed title of this feature request
Enable the processes_linux collector in node_exporter
What is the nature and description of the request?
Enable node_exporter's processes_linux collector to allow customers to monitor the number of PIDs on OCP nodes.
Why does the customer need this? (List the business requirements)
They need to be able to monitor the number of PIDs on the OCP nodes.
List any affected packages or components.
cluster-monitoring-operator, node-exporter
We will add a section for the "processes" collector in the "nodeExporter.collectors" section of the CMO ConfigMap.
It has a boolean field "enabled"; the default value is false.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    nodeExporter:
      collectors:
        # enable a collector which is disabled by default
        processes:
          enabled: true
Epic Goal
Why is this important?
Scenarios
1. …
Acceptance Criteria
Dependencies (internal and external)
1. …
Previous Work (Optional):
1. …
Open questions::
1. …
Done Checklist
Add new flags to utilise the existing resources in e2e test
The following issues need to be taken care of on cluster deletion with the resource reuse flags.
With the above commit in 4.13, storage is not handled for the PowerVS platform, which causes the cluster image-registry operator to not get installed.
We need to handle the PowerVS platform here.
The option discussed is to go with a PVC backed by CSI.
If that is not feasible, we will try to use the IBM COS used by the Satellite team.
Getting the below error while deleting infra with a failed PowerVS instance:
Failed to destroy infrastructure {"error": "error in destroying infra: provided cloud instance id is not in active state, current state: failed"}
We also need to take care of the create-infra process in case the PowerVS instance goes into a failed state; it currently loops, printing the same statement while waiting for the instance to become active.
2022-11-11T13:03:01+05:30 INFO hyp-dhar-osa-2 Waiting for cloud instance to up {"id": "crn:v1:bluemix:public:power-iaas:osa21:a/c265c8cefda241ca9c107adcbbacaa84:cd743ba9-195b-46ba-951e-639f97f443d2::", "state": "failed"}
With the latest changes, CAPI by default expects v1beta2 APIs. We need to upgrade the CAPI API from v1beta1 to v1beta2 in HyperShift.
When resources run short in the management cluster as we deploy new apps, the cloud-controller-manager pod in an existing HC's control plane gets evicted.
Flags similar to these (https://github.com/openshift/hypershift/blob/main/cmd/cluster/powervs/create.go#L57toL61) from the create command are missing in the destroy command, so the infra destroy functionality does not receive these flags and cannot properly destroy infra that uses existing resources.
Issue and Design: https://github.com/ovn-org/ovn-kubernetes/blob/master/docs/design/shared_gw_dgp.md
Upstream PR: https://github.com/ovn-org/ovn-kubernetes/pull/3160
Document that describes how to use the mgmt port VF rep for hardware offloading: https://docs.google.com/document/d/1yR4lphjPKd6qZ9sGzZITl0wH1r4ykfMKPjUnlzvWji4/edit#
==========================================================================
After the upstream PR has been merged, we need to find a way to make the user experience of configuring the mgmt port VF rep as streamlined as possible. The basic streamlining we have committed to is improving the config map to only require the DP resource name with the MGMT VF in the pool. OVN-K will also need to make use of DP resources.
Description of problem:
- Add support for Dynamic Creation Of DPU/Smart-NIC Daemon Sets and Device-Plugin Resources For OVN-K
- DPU/Smart-NIC daemon sets need a way to be dynamically created via specific node labels
- The config map needs to support device plugin resources (namely SR-IOV) to be used for the management port configuration in OVN-K
- This should enhance the performance of these flows (planned to be GA-ed in 4.14) for Smart-NIC:
  5-a: Pod -> NodePort Service traffic (Pod Backend - Same Node)
  4-a: Pod -> Cluster IP Service traffic (Host Backend - Same Node)
Version-Release number of selected component (if applicable):
4.14.0 (Merged D/S) https://github.com/openshift/ovn-kubernetes/commit/cad6ed35183a6a5b43c1550ceb8457601b53460b https://github.com/openshift/cluster-network-operator/commit/0bb035e57ac3fd0ef7b1a9451336bfd133fa8c1e
How reproducible:
Never been supported in the past.
Steps to Reproduce:
Please follow the documentation on how to configure this on NVIDIA Smart-NICs in OvS HWOL mode: https://issues.redhat.com/browse/NHE-550
Please also check the OVN-K daemon sets; there should be a new "smart-nic" daemon set for OVN-K.
Please check on the nodes that the ovn-k8s-mp0_0 interface exists alongside the ovn-k8s-mp0 interface.
Actual results:
Iperf3 performance:
5-a: Pod -> NodePort Service traffic (Pod Backend - Same Node) => ~22.5 Gbits/sec
4-a: Pod -> Cluster IP Service traffic (Host Backend - Same Node) => ~22.5 Gbits/sec
Expected results:
Iperf3 performance:
5-a: Pod -> NodePort Service traffic (Pod Backend - Same Node) => ~29 Gbits/sec
4-a: Pod -> Cluster IP Service traffic (Host Backend - Same Node) => ~29 Gbits/sec
As you can see we can gain an additional 6.5 Gbits/sec performance with these service flows.
Additional info:
https://docs.google.com/spreadsheets/d/1LHY-Af-2kQHVwtW4aVdHnmwZLTiatiyf-ySffC8O5NM/edit#gid=88193790 https://github.com/ovn-org/ovn-kubernetes/pull/3160
NVIDIA and Microsoft have partnered to provide instances on Azure that use the security of the NVIDIA Hopper GPU to create a Trusted Execution Environment (TEE) where the data is encrypted while processed. This is achieved by using AMD's SEV-SNP extension alongside the NVIDIA Hopper confidential computing capabilities.
The virtual machine created on Azure is the TEE, so any workload running within is protected from the Azure host. This is a good approach for customers to protect their data when running OpenShift on Azure, but it doesn't protect the data in a container from the OpenShift node. In this epic, we focus on protecting the OpenShift node from the Azure host.
Running workloads in CSP virtual machines doesn't protect the data from an attack on the virtualization host itself. If an attacker manages to read the host memory, they can get access to the virtual machines data, so it can break confidentiality or integrity. In the context of AI/ML, both the data and the model represent intellectual property and sensitive data, so customers will want to protect them from leaks.
NVIDIA and Microsoft are key partners for Red Hat for AI/ML in the public cloud. Being able to run workloads encrypted at rest, in transport and in process will allow creating a trusted solution for our customers, spanning from self-managed OpenShift clusters to Azure Red Hat OpenShift (ARO) clusters. This will strengthen OpenShift as the Kubernetes distribution of choice in public clouds.
Add support for OCP cluster creation with Confidential VMs on Azure to the OpenShift installer. The additional configuration options required are:
In addition, in order to create a Confidential VM in Azure, the OS image needs to have its Security Type defined as "Confidential VM" or "Confidential VM Supported".
The changes required are:
Resources:
As an OCM team member, I want to provide support for Cluster Service and improve the usability and interoperability of HyperShift.
Integration Testing:
Beta:
GA:
GREEN | YELLOW | RED
GREEN = On track, minimal risk to target date.
YELLOW = Moderate risk to target date.
RED = High risk to target date, or blocked and need to highlight potential risk to stakeholders.
Links to Gdocs, github, and any other relevant information about this epic.
The last version of OpenShift on RHV should target OpenShift 4.13. There are several factors for this requirement.
previous: The last OCP on RHV version will be 4.13. Remove RHV from OCP in OCP 4.14.
On August 31, 2022, Red Hat Virtualization enters the maintenance support phase, which runs until August 31, 2024. In accordance, Red Hat Virtualization (RHV) will be deprecated beginning with OpenShift v4.13. This means that RHV will be supported through OCP 4.13. RHV will be removed from OpenShift in OpenShift v4.14.
We will use this to address tech debt in OLM in the 4.10 timeframe.
Items to prioritize are:
CI e2e flakes
Update the downstream READMEs to better describe the downstreaming process.
Include help in the sync scripts as necessary.
It has been determined that "make verify" is a necessary part of the downstream process. The scripts that do the downstreaming do not run this command.
Add "make verify" somewhere in the downstreaming scripts, either as a last step in sync.sh or per commit (which might be both necessary yet overkill) in sync_pop_candidate.sh.
The client cert/key pair is a way of authenticating that will function even without live kube-apiserver connections so we can collect metrics if the kube-apiserver is unavailable.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
Description of problem:
The CU cluster of the Mavenir deployment has cluster-node-tuning-operator in a CrashLoopBackOff state and does not apply the performance profile.
Version-Release number of selected component (if applicable):
4.14rc0 and 4.14rc1
How reproducible:
100%
Steps to Reproduce:
1. Deploy the CU cluster with the ZTP GitOps method
2. Wait for the Policies to be compliant
3. Check the worker nodes and the cluster-node-tuning-operator status
Actual results:
Nodes do not have performance profile applied cluster-node-tuning-operator is crashing with following in logs: E0920 12:16:57.820680 1 runtime.go:79] Observed a panic: &runtime.TypeAssertionError{_interface:(*runtime._type)(nil), concrete:(*runtime._type)(nil), asserted:(*runtime._type)(0x1e68ec0), missingMethod:""} (interface conversion: interface is nil, not v1.Object) goroutine 615 [running]: k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1c98c20?, 0xc0006b7a70}) /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x99 k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc000d49500?}) /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x75 panic({0x1c98c20, 0xc0006b7a70}) /usr/lib/golang/src/runtime/panic.go:884 +0x213 github.com/openshift/cluster-node-tuning-operator/pkg/util.ObjectInfo({0x0?, 0x0}) /go/src/github.com/openshift/cluster-node-tuning-operator/pkg/util/objectinfo.go:10 +0x39 github.com/openshift/cluster-node-tuning-operator/pkg/operator.(*ProfileCalculator).machineConfigLabelsMatch(0xc000a23ca0?, 0xc000445620, {0xc0001b38e0, 0x1, 0xc0010bd480?}) /go/src/github.com/openshift/cluster-node-tuning-operator/pkg/operator/profilecalculator.go:374 +0xc7 github.com/openshift/cluster-node-tuning-operator/pkg/operator.(*ProfileCalculator).calculateProfile(0xc000607290, {0xc000a40900, 0x33}) /go/src/github.com/openshift/cluster-node-tuning-operator/pkg/operator/profilecalculator.go:208 +0x2b9 github.com/openshift/cluster-node-tuning-operator/pkg/operator.(*Controller).syncProfile(0xc000195b00, 0x0?, {0xc000a40900, 0x33}) /go/src/github.com/openshift/cluster-node-tuning-operator/pkg/operator/controller.go:664 +0x6fd github.com/openshift/cluster-node-tuning-operator/pkg/operator.(*Controller).sync(0xc000195b00, {{0x1f48661, 0x7}, {0xc000000fc0, 0x26}, {0xc000a40900, 0x33}, {0x0, 0x0}}) /go/src/github.com/openshift/cluster-node-tuning-operator/pkg/operator/controller.go:371 +0x1571 github.com/openshift/cluster-node-tuning-operator/pkg/operator.(*Controller).eventProcessor.func1(0xc000195b00, {0x1dd49c0?, 0xc000d49500?}) /go/src/github.com/openshift/cluster-node-tuning-operator/pkg/operator/controller.go:193 +0x1de github.com/openshift/cluster-node-tuning-operator/pkg/operator.(*Controller).eventProcessor(0xc000195b00) /go/src/github.com/openshift/cluster-node-tuning-operator/pkg/operator/controller.go:212 +0x65 k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?) /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x3e k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x0?, {0x224ee20, 0xc000c48ab0}, 0x1, 0xc00087ade0) /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xb6 k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x0?, 0xc0004e6710?) /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x89 k8s.io/apimachinery/pkg/util/wait.Until(0xc0004e67d0?, 0x91af86?, 0xc000ace0c0?) 
/go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161 +0x25 created by github.com/openshift/cluster-node-tuning-operator/pkg/operator.(*Controller).run /go/src/github.com/openshift/cluster-node-tuning-operator/pkg/operator/controller.go:1407 +0x1ba5 panic: interface conversion: interface is nil, not v1.Object [recovered] panic: interface conversion: interface is nil, not v1.Object
Expected results:
cluster-node-tuning-operator is functional, performance profiles applied to worker nodes
Additional info:
There is no issue on a DU node of the same deployment coming from same repository, DU node is configured as requested and cluster-node-tuning-operator is functioning correctly. must gather from rc0: https://drive.google.com/file/d/1DlzrjQiKTVnQKXdcRIijBkEKjAGsOFn1/view?usp=sharing must gather from rc1: https://drive.google.com/file/d/1qSqQtIunQe5e1hDVDYwa90L9MpEjEA4j/view?usp=sharing performance profile: https://gitlab.cee.redhat.com/agurenko/mavenir-ztp/-/blob/airtel-4.14/policygentemplates/group-cu-mno-ranGen.yaml
Revived from OCSCNV-56 which was archived.
Need a solution to support OCS encrypted volume for CNV so that smart cloning across namespaces can be achieved for encrypted volume.
Now the problem with encrypted OCS volumes is secrets are stored in the original namespace and will get left behind. (The cloned metadata still points to the original namespace)
The annotation required is `cdi.kubevirt.io/clone-strategy=copy`.
Tasks:
Need to update: "console.storage-class/provisioner" extension.
Ref: https://github.com/openshift/console/pull/11931
Something like:
"properties" : { "CSI" : { . . "parameter" : { . . } "annotations" : { [annotationKey: string] : { "value" ?: string, "annotate" ?: CodeRef<(arg) => boolean | boolean> } . .
We can do the same for `properties.others.annotations` as well (not a requirement, but to have consistency with `properties.csi.annotations`).
OpenShift Container Platform is shipping a finely tuned set of alerts to inform the cluster's owner and/or operator of events and bad conditions in the cluster.
Runbooks are associated with alerts and help SREs take action to resolve an alert. This is critical to share engineering best practices following an incident.
Goal 1: Current alerts/runbooks for HyperShift need to be evaluated to ensure we have sufficient coverage before HyperShift hits GA.
Goal 2: Actionable runbooks need to be provided for all alerts; therefore, we should attempt to cover as many as possible in this epic.
Goal 3: Continue adding alerts/runbooks to cover existing OVN-K functionality.
This epic will NOT cover refactors needed to alerts/runbooks due to new arch (OVN IC).
In order to scale, we (engineering) must share our institutional knowledge.
In order for SREs to respond to alerts, they must have the knowledge to do so.
SD needs to have actionable runbooks to respond to alerts; otherwise, they will require engineering to engage more frequently.
Depends on https://issues.redhat.com/browse/SDN-3432
The OVN controller disconnection alert is a warning alert and therefore requires a runbook.
DoD: runbook(s) merged to https://github.com/openshift/runbooks/blob/master/alerts/cluster-network-operator/ and the runbook link added to CNO for the aforementioned alerts.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
As an administrator of a cluster utilizing AWS STS with a public S3 bucket OIDC provider, I would like a documented procedure with steps that can be followed to migrate to a private S3 bucket with CloudFront Distribution so that I do not have to recreate my cluster.
ccoctl documentation including parameter `--create-private-s3-bucket`: https://github.com/openshift/cloud-credential-operator/blob/a8ee8a426d38cca3f7339ecd0eac88f922b6d5a0/docs/ccoctl.md
Existing manual procedure for configuring private S3 bucket with CloudFront Distribution: https://github.com/openshift/cloud-credential-operator/blob/master/docs/sts-private-bucket.md
Goal:
The participation on SPLAT will be:
ACCEPTANCE CRITERIA
REFERENCES:
Supporting document: https://github.com/openshift/cloud-credential-operator/blob/master/docs/sts.md#steps-to-in-place-migrate-an-openshift-cluster-to-sts
NOTE: we should add that this step is not supported or recommended.
We have identified gaps in our test coverage that monitors for acceptable alerts firing during cluster upgrades; these need to be addressed to make sure we are not allowing regressions into the product.
This epic is to group that work.
This will make transitioning to new releases very simple because ci-tools doesn't need logic; it just makes sure to include the current + previous release data in the file and PR going to origin. Origin is then responsible for the logic to determine which to use: origin will check if we have at least 100 runs and, if not, try to fall back to the previous release data. All other fallback logic should exist.
legacy apiserver disruption
legacy network pod sandbox creation
kubelet logs through /api/v1/nodes/<node>/proxy/logs/
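A minimal sketch of the fallback selection described above, with a hypothetical data type and the 100-run threshold from the text:

```
package disruption

// runData is a hypothetical stand-in for the per-release disruption samples
// that ci-tools writes into the file shipped to origin.
type runData struct {
	JobRuns int
	P95     float64
}

// selectReleaseData implements origin's side of the contract: prefer the
// current release's data when there are at least 100 runs, otherwise fall
// back to the previous release's data.
func selectReleaseData(current, previous runData) runData {
	const minRuns = 100
	if current.JobRuns >= minRuns {
		return current
	}
	return previous
}
```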
This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were completed when this image was assembled
Description of the problem:
Feature support of LSO currently supports only x86; this was an error due to https://github.com/openshift/assisted-service/blob/ca339ae3515df6c1394af8b43187e5be13d6306e/internal/operators/lso/ls_operator.go#L103
Description of problem:
When CNO is managed by Hypershift, its deployment has the "hypershift.openshift.io/release-image" template metadata annotation. The annotation's value is used to track progress of cluster control plane version upgrades. Example:
apiVersion: apps/v1
kind: Deployment
metadata:
  generation: 24
  labels:
    hypershift.openshift.io/managed-by: control-plane-operator
  name: cluster-network-operator
  namespace: master-cg319sf10ghnddkvo8j0
  ...
spec:
  progressDeadlineSeconds: 600
  ...
  template:
    metadata:
      annotations:
        hypershift.openshift.io/release-image: us.icr.io/armada-master/ocp-release:4.12.7-x86_64
        target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'
  ...
The same annotation must be set by CNO on the multus-admission-controller deployment so that service providers can track its version upgrades as well. CNO needs a code fix to implement this annotation propagation logic.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create OCP cluster using Hypershift
2. Check deployment template metadata annotations on multus-admission-controller
Actual results:
No "hypershift.openshift.io/release-image" deployment template metadata annotation exists
Expected results:
"hypershift.openshift.io/release-image" annotation must be present
Additional info:
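A minimal sketch of the propagation CNO needs, using the upstream appsv1 types (the exact place this hooks into CNO's rendering of multus-admission-controller is not shown):

```
package network

import appsv1 "k8s.io/api/apps/v1"

const releaseImageAnnotation = "hypershift.openshift.io/release-image"

// propagateReleaseImageAnnotation copies the release-image pod template
// annotation from the CNO deployment onto another deployment (here,
// multus-admission-controller) so control plane upgrades can be tracked.
func propagateReleaseImageAnnotation(cno, target *appsv1.Deployment) {
	value, ok := cno.Spec.Template.Annotations[releaseImageAnnotation]
	if !ok {
		return
	}
	if target.Spec.Template.Annotations == nil {
		target.Spec.Template.Annotations = map[string]string{}
	}
	target.Spec.Template.Annotations[releaseImageAnnotation] = value
}
```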
Description of problem:
Control plane upgrades take about 23 minutes on average. The shortest time I saw was 14 minutes, and the longest was 43 minutes.
The requirement is < 10 min for a successful complete control plane upgrade.
Version-Release number of selected component (if applicable): 4.12.12
How reproducible:
100 %
Steps to Reproduce:
1. Install a hosted cluster on 4.12.12. Wait for it to be 'ready'.
2. Upgrade the control plane to 4.12.13 via OCM.
Actual results: upgrade completes on average after 23 minutes.
Expected results: upgrade completes after < 10 min
Additional info:
N/A
When the user is providing ZTP manifests, a missing value for userManagedNetworking (in AgentClusterInstall) should be defaulted based on the platform type - for platform None this should default to true.
This is only happening if the platform type is misspelled as none instead of None. (Both are accepted for backwards compat with OCPBUGS-7495, but they should not result in different behaviour.)
When the user starts from an install-config, we set the correct value explicitly in the generated AgentClusterInstall, so this is not a problem so long as the user doesn't edit it.
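A minimal sketch of the case-insensitive defaulting described above (field names and the surrounding wiring are hypothetical, not the actual assisted-service/agent-installer code):

```
package agent

import "strings"

// defaultUserManagedNetworking fills in userManagedNetworking when the user
// left it unset in AgentClusterInstall: platform "None" (compared
// case-insensitively, so "none" behaves the same) defaults to true,
// everything else defaults to false. An explicit value is always kept.
func defaultUserManagedNetworking(platformType string, userManagedNetworking *bool) *bool {
	if userManagedNetworking != nil {
		return userManagedNetworking
	}
	value := strings.EqualFold(platformType, "None")
	return &value
}
```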
Description of problem:
This test is permafailing on techpreview since https://github.com/openshift/origin/pull/27915 landed [sig-instrumentation][Late] Alerts shouldn't exceed the 650 series limit of total series sent via telemetry from each cluster [Suite:openshift/conformance/parallel] s: "promQL query returned unexpected results:\navg_over_time(cluster:telemetry_selected_series:count[49m15s]) >= 650\n[\n {\n \"metric\": {\n \"prometheus\": \"openshift-monitoring/k8s\"\n },\n \"value\": [\n 1685504058.881,\n \"700.3636363636364\"\n ]\n }\n]",
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Run conformance tests on a techpreview cluster
Actual results:
Test fails
Expected results:
Test succeeds
Additional info:
Example job https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-e2e-azure-sdn-techpreview/1663723476923453440
Description of problem:
Due to a security vulnerability [1] affecting Azure CLI versions prior to 2.40.0 (not included), it is recommended to update the Azure CLI to a higher version to avoid this issue. Currently, the Azure CLI in CI is 2.38.0. [1] https://github.com/Azure/azure-cli/security/advisories/GHSA-47xc-9rr2-q7p4
Version-Release number of selected component (if applicable):
All supported version
How reproducible:
Always
Steps to Reproduce:
1. Trigger CI jobs on the Azure platform that contain steps using the Azure CLI.
Actual results:
azure cli 2.38.0 is used now.
Expected results:
azure cli 2.40.0+ to be used in CI on all supported version
Additional info:
As Azure CLI 2.40.0+ is only available in a RHEL 8-based repository, we need to update its repo in the upi-installer RHEL 8-based Dockerfile [1]. [1] https://github.com/openshift/installer/blob/master/images/installer/Dockerfile.upi.ci.rhel8#L23
Description of problem:
We suspect that https://github.com/openshift/oc/pull/1521 has broken all Metal jobs, an example of a failure: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/355/pull-ci-openshift-cluster-baremetal-operator-master-e2e-metal-ipi-ovn-ipv6/1691359315740332032.
Details:
The testing scripts we use set KUBECONFIG in advance to the location where we'll create it. At the time "oc adm extract" is called, the file does not exist yet. While you could argue that we should not do it, it has worked for years, and it's quite possible that customers have similar automation (e.g. setting KUBECONFIG as a global variable in their playbooks). In any case, I don't think "oc adm extract" should try to read the configuration if it does not explicitly need it.
Updated details:
After the change, "oc adm extract" expects KUBECONFIG to be present, but at the point when we call it, there is no cluster yet. I initially assumed that unsetting KUBECONFIG would help, but it does not.
Please review the following PR: https://github.com/openshift/cloud-provider-aws/pull/37
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Our telemetry contains only the vCenter version ("7.0.3") and not the exact build number. We need the build number to know what exact vCenter build the user has and what bugs are fixed there (e.g. https://issues.redhat.com/browse/OCPBUGS-5817).
CI is flaky because the TestClientTLS test fails.
I have seen these failures in 4.13 and 4.14 CI jobs.
Presently, search.ci reports the following stats for the past 14 days:
Found in 16.07% of runs (20.93% of failures) across 56 total runs and 13 jobs (76.79% failed) in 185ms
1. Post a PR and have bad luck.
2. Check https://search.ci.openshift.org/?search=FAIL%3A+TestAll%2Fparallel%2FTestClientTLS&maxAge=336h&context=1&type=all&name=cluster-ingress-operator&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job.
The test fails:
=== RUN TestAll/parallel/TestClientTLS === PAUSE TestAll/parallel/TestClientTLS === CONT TestAll/parallel/TestClientTLS === CONT TestAll/parallel/TestClientTLS stdout: Healthcheck requested 200 stderr: * Added canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com:443:172.30.53.236 to DNS cache * Rebuilt URL to: https://canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com/ * Hostname canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com was found in DNS cache * Trying 172.30.53.236... * TCP_NODELAY set % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed * ALPN, offering h2 * ALPN, offering http/1.1 * successfully set certificate verify locations: * CAfile: /etc/pki/tls/certs/ca-bundle.crt CApath: none } [5 bytes data] * TLSv1.3 (OUT), TLS handshake, Client hello (1): } [512 bytes data] * TLSv1.3 (IN), TLS handshake, Server hello (2): { [122 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8): { [10 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Request CERT (13): { [82 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Certificate (11): { [1763 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, CERT verify (15): { [264 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Finished (20): { [36 bytes data] * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, [no content] (0): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, Certificate (11): } [8 bytes data] * TLSv1.3 (OUT), TLS handshake, [no content] (0): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, Finished (20): } [36 bytes data] * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256 * ALPN, server did not agree to a protocol * Server certificate: * subject: CN=*.client-tls.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com * start date: Mar 22 18:55:46 2023 GMT * expire date: Mar 21 18:55:47 2025 GMT * issuer: CN=ingress-operator@1679509964 * SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway. 
} [5 bytes data] * TLSv1.3 (OUT), TLS app data, [no content] (0): } [1 bytes data] > GET / HTTP/1.1 > Host: canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com > User-Agent: curl/7.61.1 > Accept: */* > { [5 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4): { [313 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4): { [313 bytes data] * TLSv1.3 (IN), TLS app data, [no content] (0): { [1 bytes data] < HTTP/1.1 200 OK < x-request-port: 8080 < date: Wed, 22 Mar 2023 18:56:24 GMT < content-length: 22 < content-type: text/plain; charset=utf-8 < set-cookie: c6e529a6ab19a530fd4f1cceb91c08a9=683c60a6110214134bed475edc895cb9; path=/; HttpOnly; Secure; SameSite=None < cache-control: private < { [22 bytes data] * Connection #0 to host canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com left intact stdout: Healthcheck requested 200 stderr: * Added canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com:443:172.30.53.236 to DNS cache * Rebuilt URL to: https://canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com/ * Hostname canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com was found in DNS cache * Trying 172.30.53.236... * TCP_NODELAY set % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed * ALPN, offering h2 * ALPN, offering http/1.1 * successfully set certificate verify locations: * CAfile: /etc/pki/tls/certs/ca-bundle.crt CApath: none } [5 bytes data] * TLSv1.3 (OUT), TLS handshake, Client hello (1): } [512 bytes data] * TLSv1.3 (IN), TLS handshake, Server hello (2): { [122 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8): { [10 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Request CERT (13): { [82 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Certificate (11): { [1763 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, CERT verify (15): { [264 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Finished (20): { [36 bytes data] * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, [no content] (0): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, Certificate (11): } [799 bytes data] * TLSv1.3 (OUT), TLS handshake, [no content] (0): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, CERT verify (15): } [264 bytes data] * TLSv1.3 (OUT), TLS handshake, [no content] (0): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, Finished (20): } [36 bytes data] * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256 * ALPN, server did not agree to a protocol * Server certificate: * subject: CN=*.client-tls.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com * start date: Mar 22 18:55:46 2023 GMT * expire date: Mar 21 18:55:47 2025 GMT * issuer: CN=ingress-operator@1679509964 * SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway. 
} [5 bytes data] * TLSv1.3 (OUT), TLS app data, [no content] (0): } [1 bytes data] > GET / HTTP/1.1 > Host: canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com > User-Agent: curl/7.61.1 > Accept: */* > { [5 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4): { [1097 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4): { [1097 bytes data] * TLSv1.3 (IN), TLS app data, [no content] (0): { [1 bytes data] < HTTP/1.1 200 OK < x-request-port: 8080 < date: Wed, 22 Mar 2023 18:56:24 GMT < content-length: 22 < content-type: text/plain; charset=utf-8 < set-cookie: c6e529a6ab19a530fd4f1cceb91c08a9=eb40064e54af58007f579a6c82f2bcd7; path=/; HttpOnly; Secure; SameSite=None < cache-control: private < { [22 bytes data] * Connection #0 to host canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com left intact stdout: Healthcheck requested 200 stderr: * Added canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com:443:172.30.53.236 to DNS cache * Rebuilt URL to: https://canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com/ * Hostname canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com was found in DNS cache * Trying 172.30.53.236... * TCP_NODELAY set % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed * ALPN, offering h2 * ALPN, offering http/1.1 * successfully set certificate verify locations: * CAfile: /etc/pki/tls/certs/ca-bundle.crt CApath: none } [5 bytes data] * TLSv1.3 (OUT), TLS handshake, Client hello (1): } [512 bytes data] * TLSv1.3 (IN), TLS handshake, Server hello (2): { [122 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8): { [10 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Request CERT (13): { [82 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Certificate (11): { [1763 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, CERT verify (15): { [264 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Finished (20): { [36 bytes data] * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, [no content] (0): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, Certificate (11): } [802 bytes data] * TLSv1.3 (OUT), TLS handshake, [no content] (0): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, CERT verify (15): } [264 bytes data] * TLSv1.3 (OUT), TLS handshake, [no content] (0): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, Finished (20): } [36 bytes data] * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256 * ALPN, server did not agree to a protocol * Server certificate: * subject: CN=*.client-tls.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com * start date: Mar 22 18:55:46 2023 GMT * expire date: Mar 21 18:55:47 2025 GMT * issuer: CN=ingress-operator@1679509964 * SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway. 
} [5 bytes data] * TLSv1.3 (OUT), TLS app data, [no content] (0): } [1 bytes data] > GET / HTTP/1.1 > Host: canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com > User-Agent: curl/7.61.1 > Accept: */* > { [5 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4): { [1097 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4): { [1097 bytes data] * TLSv1.3 (IN), TLS app data, [no content] (0): { [1 bytes data] < HTTP/1.1 200 OK < x-request-port: 8080 < date: Wed, 22 Mar 2023 18:56:25 GMT < content-length: 22 < content-type: text/plain; charset=utf-8 < set-cookie: c6e529a6ab19a530fd4f1cceb91c08a9=104beed63d6a19782a5559400bd972b6; path=/; HttpOnly; Secure; SameSite=None < cache-control: private < { [22 bytes data] * Connection #0 to host canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com left intact stdout: 000 stderr: * Added canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com:443:172.30.53.236 to DNS cache * Rebuilt URL to: https://canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com/ * Hostname canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com was found in DNS cache * Trying 172.30.53.236... * TCP_NODELAY set % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed * ALPN, offering h2 * ALPN, offering http/1.1 * successfully set certificate verify locations: * CAfile: /etc/pki/tls/certs/ca-bundle.crt CApath: none } [5 bytes data] * TLSv1.3 (OUT), TLS handshake, Client hello (1): } [512 bytes data] * TLSv1.3 (IN), TLS handshake, Server hello (2): { [122 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8): { [10 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Request CERT (13): { [82 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Certificate (11): { [1763 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, CERT verify (15): { [264 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Finished (20): { [36 bytes data] * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, [no content] (0): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, Certificate (11): } [799 bytes data] * TLSv1.3 (OUT), TLS handshake, [no content] (0): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, CERT verify (15): } [264 bytes data] * TLSv1.3 (OUT), TLS handshake, [no content] (0): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, Finished (20): } [36 bytes data] * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256 * ALPN, server did not agree to a protocol * Server certificate: * subject: CN=*.client-tls.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com * start date: Mar 22 18:55:46 2023 GMT * expire date: Mar 21 18:55:47 2025 GMT * issuer: CN=ingress-operator@1679509964 * SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway. 
} [5 bytes data] * TLSv1.3 (OUT), TLS app data, [no content] (0): } [1 bytes data] > GET / HTTP/1.1 > Host: canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com > User-Agent: curl/7.61.1 > Accept: */* > { [5 bytes data] * TLSv1.3 (IN), TLS alert, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS alert, unknown CA (560): { [2 bytes data] * OpenSSL SSL_read: error:14094418:SSL routines:ssl3_read_bytes:tlsv1 alert unknown ca, errno 0 * Closing connection 0 curl: (56) OpenSSL SSL_read: error:14094418:SSL routines:ssl3_read_bytes:tlsv1 alert unknown ca, errno 0 === CONT TestAll/parallel/TestClientTLS stdout: 000 stderr: * Added canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com:443:172.30.53.236 to DNS cache * Rebuilt URL to: https://canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com/ * Hostname canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com was found in DNS cache * Trying 172.30.53.236... * TCP_NODELAY set % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed * ALPN, offering h2 * ALPN, offering http/1.1 * successfully set certificate verify locations: * CAfile: /etc/pki/tls/certs/ca-bundle.crt CApath: none } [5 bytes data] * TLSv1.3 (OUT), TLS handshake, Client hello (1): } [512 bytes data] * TLSv1.3 (IN), TLS handshake, Server hello (2): { [122 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8): { [10 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Request CERT (13): { [82 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Certificate (11): { [1763 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, CERT verify (15): { [264 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Finished (20): { [36 bytes data] * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, [no content] (0): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, Certificate (11): } [8 bytes data] * TLSv1.3 (OUT), TLS handshake, [no content] (0): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, Finished (20): } [36 bytes data] * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256 * ALPN, server did not agree to a protocol * Server certificate: * subject: CN=*.client-tls.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com * start date: Mar 22 18:55:46 2023 GMT * expire date: Mar 21 18:55:47 2025 GMT * issuer: CN=ingress-operator@1679509964 * SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway. 
} [5 bytes data] * TLSv1.3 (OUT), TLS app data, [no content] (0): } [1 bytes data] > GET / HTTP/1.1 > Host: canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com > User-Agent: curl/7.61.1 > Accept: */* > { [5 bytes data] * TLSv1.3 (IN), TLS alert, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS alert, unknown (628): { [2 bytes data] * OpenSSL SSL_read: error:1409445C:SSL routines:ssl3_read_bytes:tlsv13 alert certificate required, errno 0 * Closing connection 0 curl: (56) OpenSSL SSL_read: error:1409445C:SSL routines:ssl3_read_bytes:tlsv13 alert certificate required, errno 0 === CONT TestAll/parallel/TestClientTLS stdout: Healthcheck requested 200 stderr: * Added canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com:443:172.30.53.236 to DNS cache * Rebuilt URL to: https://canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com/ * Hostname canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com was found in DNS cache * Trying 172.30.53.236... * TCP_NODELAY set % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed * ALPN, offering h2 * ALPN, offering http/1.1 * successfully set certificate verify locations: * CAfile: /etc/pki/tls/certs/ca-bundle.crt CApath: none } [5 bytes data] * TLSv1.3 (OUT), TLS handshake, Client hello (1): } [512 bytes data] * TLSv1.3 (IN), TLS handshake, Server hello (2): { [122 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8): { [10 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Request CERT (13): { [82 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Certificate (11): { [1763 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, CERT verify (15): { [264 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Finished (20): { [36 bytes data] * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, [no content] (0): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, Certificate (11): } [799 bytes data] * TLSv1.3 (OUT), TLS handshake, [no content] (0): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, CERT verify (15): } [264 bytes data] * TLSv1.3 (OUT), TLS handshake, [no content] (0): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, Finished (20): } [36 bytes data] * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256 * ALPN, server did not agree to a protocol * Server certificate: * subject: CN=*.client-tls.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com * start date: Mar 22 18:55:46 2023 GMT * expire date: Mar 21 18:55:47 2025 GMT * issuer: CN=ingress-operator@1679509964 * SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway. 
} [5 bytes data] * TLSv1.3 (OUT), TLS app data, [no content] (0): } [1 bytes data] > GET / HTTP/1.1 > Host: canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com > User-Agent: curl/7.61.1 > Accept: */* > { [5 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4): { [1097 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4): { [1097 bytes data] * TLSv1.3 (IN), TLS app data, [no content] (0): { [1 bytes data] < HTTP/1.1 200 OK < x-request-port: 8080 < date: Wed, 22 Mar 2023 18:57:00 GMT < content-length: 22 < content-type: text/plain; charset=utf-8 < set-cookie: c6e529a6ab19a530fd4f1cceb91c08a9=683c60a6110214134bed475edc895cb9; path=/; HttpOnly; Secure; SameSite=None < cache-control: private < { [22 bytes data] * Connection #0 to host canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com left intact === CONT TestAll/parallel/TestClientTLS stdout: Healthcheck requested 200 stderr: * Added canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com:443:172.30.53.236 to DNS cache * Rebuilt URL to: https://canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com/ * Hostname canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com was found in DNS cache * Trying 172.30.53.236... * TCP_NODELAY set % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed * ALPN, offering h2 * ALPN, offering http/1.1 * successfully set certificate verify locations: * CAfile: /etc/pki/tls/certs/ca-bundle.crt CApath: none } [5 bytes data] * TLSv1.3 (OUT), TLS handshake, Client hello (1): } [512 bytes data] * TLSv1.3 (IN), TLS handshake, Server hello (2): { [122 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8): { [10 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Request CERT (13): { [82 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Certificate (11): { [1763 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, CERT verify (15): { [264 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Finished (20): { [36 bytes data] * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, [no content] (0): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, Certificate (11): } [802 bytes data] * TLSv1.3 (OUT), TLS handshake, [no content] (0): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, CERT verify (15): } [264 bytes data] * TLSv1.3 (OUT), TLS handshake, [no content] (0): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, Finished (20): } [36 bytes data] * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256 * ALPN, server did not agree to a protocol * Server certificate: * subject: CN=*.client-tls.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com * start date: Mar 22 18:55:46 2023 GMT * expire date: Mar 21 18:55:47 2025 GMT * issuer: CN=ingress-operator@1679509964 * SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway. 
} [5 bytes data] * TLSv1.3 (OUT), TLS app data, [no content] (0): } [1 bytes data] > GET / HTTP/1.1 > Host: canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com > User-Agent: curl/7.61.1 > Accept: */* > { [5 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4): { [1097 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4): { [1097 bytes data] * TLSv1.3 (IN), TLS app data, [no content] (0): { [1 bytes data] < HTTP/1.1 200 OK < x-request-port: 8080 < date: Wed, 22 Mar 2023 18:57:00 GMT < content-length: 22 < content-type: text/plain; charset=utf-8 < set-cookie: c6e529a6ab19a530fd4f1cceb91c08a9=eb40064e54af58007f579a6c82f2bcd7; path=/; HttpOnly; Secure; SameSite=None < cache-control: private < { [22 bytes data] * Connection #0 to host canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com left intact === CONT TestAll/parallel/TestClientTLS stdout: 000 stderr: * Added canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com:443:172.30.53.236 to DNS cache * Rebuilt URL to: https://canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com/ * Hostname canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com was found in DNS cache * Trying 172.30.53.236... * TCP_NODELAY set % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed * ALPN, offering h2 * ALPN, offering http/1.1 * successfully set certificate verify locations: * CAfile: /etc/pki/tls/certs/ca-bundle.crt CApath: none } [5 bytes data] * TLSv1.3 (OUT), TLS handshake, Client hello (1): } [512 bytes data] * TLSv1.3 (IN), TLS handshake, Server hello (2): { [122 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8): { [10 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Request CERT (13): { [82 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Certificate (11): { [1763 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, CERT verify (15): { [264 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Finished (20): { [36 bytes data] * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, [no content] (0): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, Certificate (11): } [799 bytes data] * TLSv1.3 (OUT), TLS handshake, [no content] (0): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, CERT verify (15): } [264 bytes data] * TLSv1.3 (OUT), TLS handshake, [no content] (0): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, Finished (20): } [36 bytes data] * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256 * ALPN, server did not agree to a protocol * Server certificate: * subject: CN=*.client-tls.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com * start date: Mar 22 18:55:46 2023 GMT * expire date: Mar 21 18:55:47 2025 GMT * issuer: CN=ingress-operator@1679509964 * SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway. 
} [5 bytes data] * TLSv1.3 (OUT), TLS app data, [no content] (0): } [1 bytes data] > GET / HTTP/1.1 > Host: canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com > User-Agent: curl/7.61.1 > Accept: */* > { [5 bytes data] * TLSv1.3 (IN), TLS alert, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS alert, unknown CA (560): { [2 bytes data] * OpenSSL SSL_read: error:14094418:SSL routines:ssl3_read_bytes:tlsv1 alert unknown ca, errno 0 * Closing connection 0 curl: (56) OpenSSL SSL_read: error:14094418:SSL routines:ssl3_read_bytes:tlsv1 alert unknown ca, errno 0 === CONT TestAll/parallel/TestClientTLS --- FAIL: TestAll (1538.53s) --- FAIL: TestAll/parallel (0.00s) --- FAIL: TestAll/parallel/TestClientTLS (123.10s)
CI passes, or it fails on a different test.
I saw that TestClientTLS failed on the test case with no client certificate and ClientCertificatePolicy set to "Required". My best guess is that the test is racy and is hitting a terminating router pod. The test uses waitForDeploymentComplete to wait until all new pods are available, but perhaps waitForDeploymentComplete should also wait until all old pods are terminated.
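A minimal sketch of what that extra wait could look like, assuming a client-go clientset; the helper name and polling intervals are illustrative and not the actual test utilities in the ingress operator repository:

package e2e

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForOldPodsGone polls until the deployment is fully rolled out and no
// pod matching its selector is terminating or left over from an older
// ReplicaSet. Hypothetical helper, not the real waitForDeploymentComplete.
func waitForOldPodsGone(ctx context.Context, kc kubernetes.Interface, namespace, name string) error {
	return wait.PollImmediate(5*time.Second, 5*time.Minute, func() (bool, error) {
		deploy, err := kc.AppsV1().Deployments(namespace).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		want := int32(1)
		if deploy.Spec.Replicas != nil {
			want = *deploy.Spec.Replicas
		}
		// Rollout finished: every replica is updated and available.
		if deploy.Status.ObservedGeneration < deploy.Generation ||
			deploy.Status.UpdatedReplicas != want ||
			deploy.Status.AvailableReplicas != want {
			return false, nil
		}
		sel, err := metav1.LabelSelectorAsSelector(deploy.Spec.Selector)
		if err != nil {
			return false, err
		}
		pods, err := kc.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: sel.String()})
		if err != nil {
			return false, err
		}
		// Old router pods linger with a deletion timestamp while they drain;
		// an extra pod beyond the desired count is also a leftover.
		if int32(len(pods.Items)) != want {
			return false, nil
		}
		for _, p := range pods.Items {
			if p.DeletionTimestamp != nil {
				return false, nil
			}
		}
		return true, nil
	})
}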
Description of problem:
During a fresh install of an operator with conversion webhooks enabled, `crd.spec.conversion.webhook.clientConfig` is initially updated dynamically, as expected, with the proper webhook namespace, name, and caBundle. However, within a few seconds those critical settings are overwritten with the bundle's packaged CRD conversion settings. This breaks the operator and stops the installation from completing successfully. Oddly, though, if that same operator version is installed as part of an upgrade from a prior release, the dynamic clientConfig settings are retained and everything works as expected.
Version-Release number of selected component (if applicable):
OCP 4.10.36 OCP 4.11.18
How reproducible:
Consistently
Steps to Reproduce:
1. oc apply -f https://gist.githubusercontent.com/tchughesiv/0951d40f58f2f49306cc4061887e8860/raw/3c7979b58705ab3a9e008b45a4ed4abc3ef21c2b/conversionIssuesFreshInstall.yaml
2. oc get crd dbaasproviders.dbaas.redhat.com --template '{{ .spec.conversion.webhook.clientConfig }}' -w
Actual results:
Eventually, the clientConfig settings will revert to the following and stay that way.
$ oc get crd dbaasproviders.dbaas.redhat.com --template '{{ .spec.conversion.webhook.clientConfig }}'
map[service:map[name:dbaas-operator-webhook-service namespace:openshift-dbaas-operator path:/convert port:443]]
conversion:
  strategy: Webhook
  webhook:
    clientConfig:
      service:
        namespace: openshift-dbaas-operator
        name: dbaas-operator-webhook-service
        path: /convert
        port: 443
    conversionReviewVersions:
      - v1alpha1
      - v1beta1
Expected results:
The `crd.spec.conversion.webhook.clientConfig` should instead retain the following settings. $ oc get crd dbaasproviders.dbaas.redhat.com --template '{{ .spec.conversion.webhook.clientConfig }}' map[caBundle:LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJpRENDQVMyZ0F3SUJBZ0lJUVA1b1ZtYTNqUG93Q2dZSUtvWkl6ajBFQXdJd0dERVdNQlFHQTFVRUNoTU4KVW1Wa0lFaGhkQ3dnU1c1akxqQWVGdzB5TWpFeU1UWXhPVEEwTWpsYUZ3MHlOREV5TVRVeE9UQTBNamxhTUJneApGakFVQmdOVkJBb1REVkpsWkNCSVlYUXNJRWx1WXk0d1dUQVRCZ2NxaGtqT1BRSUJCZ2dxaGtqT1BRTUJCd05DCkFBVGcxaEtPWW40MStnTC9PdmVKT21jbkx5MzZNWTBEdnRGcXF3cjJFdlZhUWt2WnEzWG9ZeWlrdlFlQ29DZ3QKZ2VLK0UyaXIxNndzSmRSZ2paYnFHc3pGbzJFd1h6QU9CZ05WSFE4QkFmOEVCQU1DQW9Rd0hRWURWUjBsQkJZdwpGQVlJS3dZQkJRVUhBd0lHQ0NzR0FRVUZCd01CTUE4R0ExVWRFd0VCL3dRRk1BTUJBZjh3SFFZRFZSME9CQllFCkZPMWNXNFBrbDZhcDdVTVR1UGNxZWhST1gzRHZNQW9HQ0NxR1NNNDlCQU1DQTBrQU1FWUNJUURxN0pkUjkxWlgKeWNKT0hyQTZrL0M0SG9sSjNwUUJ6bmx3V3FXektOd0xiZ0loQU5ObUd6RnBqaHd6WXpVY2RCQ3llU3lYYkp3SAphYllDUXFkSjBtUGFha28xCi0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K service:map[name:dbaas-operator-controller-manager-service namespace:redhat-dbaas-operator path:/convert port:443]]
conversion: strategy: Webhook webhook: clientConfig: service: namespace: redhat-dbaas-operator name: dbaas-operator-controller-manager-service path: /convert port: 443 caBundle: >- LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJoekNDQVMyZ0F3SUJBZ0lJZXdhVHNLS0hhbWd3Q2dZSUtvWkl6ajBFQXdJd0dERVdNQlFHQTFVRUNoTU4KVW1Wa0lFaGhkQ3dnU1c1akxqQWVGdzB5TWpFeU1UWXhPVEF5TURkYUZ3MHlOREV5TVRVeE9UQXlNRGRhTUJneApGakFVQmdOVkJBb1REVkpsWkNCSVlYUXNJRWx1WXk0d1dUQVRCZ2NxaGtqT1BRSUJCZ2dxaGtqT1BRTUJCd05DCkFBUVRFQm8zb1BWcjRLemF3ZkE4MWtmaTBZQTJuVGRzU2RpMyt4d081ZmpKQTczdDQ2WVhOblFzTjNCMVBHM04KSXJ6N1dKVkJmVFFWMWI3TXE1anpySndTbzJFd1h6QU9CZ05WSFE4QkFmOEVCQU1DQW9Rd0hRWURWUjBsQkJZdwpGQVlJS3dZQkJRVUhBd0lHQ0NzR0FRVUZCd01CTUE4R0ExVWRFd0VCL3dRRk1BTUJBZjh3SFFZRFZSME9CQllFCkZJemdWbC9ZWkFWNmltdHl5b0ZkNFRkLzd0L3BNQW9HQ0NxR1NNNDlCQU1DQTBnQU1FVUNJRUY3ZXZ0RS95OFAKRnVrTUtGVlM1VkQ3a09DRzRkdFVVOGUyc1dsSTZlNEdBaUVBZ29aNmMvYnNpNEwwcUNrRmZSeXZHVkJRa25SRwp5SW1WSXlrbjhWWnNYcHM9Ci0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K
Additional info:
If the operator is instead installed as an upgrade rather than a fresh install, the webhook settings are properly and permanently set and everything works as expected. This can be tested in a fresh cluster like this:
1. oc apply -f https://gist.githubusercontent.com/tchughesiv/703109961f22ab379a45a401be0cf351/raw/2d0541b76876a468757269472e8e3a31b86b3c68/conversionWorksUpgrade.yaml
2. oc get crd dbaasproviders.dbaas.redhat.com --template '{{ .spec.conversion.webhook.clientConfig }}' -w
Description of problem:
On a fresh 4.12.0-0.nightly-2022-09-20-095559 cluster, the alertmanager pods restart once before becoming ready. This is a 4.12 regression; we should make sure /etc/alertmanager/config_out/alertmanager.env.yaml exists before Alertmanager tries to load it.
# oc -n openshift-monitoring get pod NAME READY STATUS RESTARTS AGE alertmanager-main-0 6/6 Running 1 (118m ago) 118m alertmanager-main-1 6/6 Running 1 (118m ago) 118m ... # oc -n openshift-monitoring describe pod alertmanager-main-0 ... Containers: alertmanager: Container ID: cri-o://31b6f3231f5a24fe85188b8b8e26c45b660ebc870ee6915919031519d493d7f8 Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:34003d434c6f07e4af6e7a52e94f703c68e1f881e90939702c764729e2b513aa Image ID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:34003d434c6f07e4af6e7a52e94f703c68e1f881e90939702c764729e2b513aa Ports: 9094/TCP, 9094/UDP Host Ports: 0/TCP, 0/UDP Args: --config.file=/etc/alertmanager/config_out/alertmanager.env.yaml --storage.path=/alertmanager --data.retention=120h --cluster.listen-address=[$(POD_IP)]:9094 --web.listen-address=127.0.0.1:9093 --web.external-url=https:/console-openshift-console.apps.qe-daily1-412-0922.qe.azure.devcluster.openshift.com/monitoring --web.route-prefix=/ --cluster.peer=alertmanager-main-0.alertmanager-operated:9094 --cluster.peer=alertmanager-main-1.alertmanager-operated:9094 --cluster.reconnect-timeout=5m --web.config.file=/etc/alertmanager/web_config/web-config.yaml State: Running Started: Wed, 21 Sep 2022 19:40:14 -0400 Last State: Terminated Reason: Error Message: s=2022-09-21T23:40:06.507Z caller=main.go:231 level=info msg="Starting Alertmanager" version="(version=0.24.0, branch=rhaos-4.12-rhel-8, revision=4efb3c1f9bc32ba0cce7dd163a639ca8759a4190)" ts=2022-09-21T23:40:06.507Z caller=main.go:232 level=info build_context="(go=go1.18.4, user=root@b2df06f7fbc3, date=20220916-18:08:09)" ts=2022-09-21T23:40:07.119Z caller=cluster.go:260 level=warn component=cluster msg="failed to join cluster" err="2 errors occurred:\n\t* Failed to resolve alertmanager-main-0.alertmanager-operated:9094: lookup alertmanager-main-0.alertmanager-operated on 172.30.0.10:53: no such host\n\t* Failed to resolve alertmanager-main-1.alertmanager-operated:9094: lookup alertmanager-main-1.alertmanager-operated on 172.30.0.10:53: no such host\n\n" ts=2022-09-21T23:40:07.119Z caller=cluster.go:262 level=info component=cluster msg="will retry joining cluster every 10s" ts=2022-09-21T23:40:07.119Z caller=main.go:329 level=warn msg="unable to join gossip mesh" err="2 errors occurred:\n\t* Failed to resolve alertmanager-main-0.alertmanager-operated:9094: lookup alertmanager-main-0.alertmanager-operated on 172.30.0.10:53: no such host\n\t* Failed to resolve alertmanager-main-1.alertmanager-operated:9094: lookup alertmanager-main-1.alertmanager-operated on 172.30.0.10:53: no such host\n\n" ts=2022-09-21T23:40:07.119Z caller=cluster.go:680 level=info component=cluster msg="Waiting for gossip to settle..." 
interval=2s ts=2022-09-21T23:40:07.173Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml ts=2022-09-21T23:40:07.174Z caller=coordinator.go:118 level=error component=configuration msg="Loading configuration file failed" file=/etc/alertmanager/config_out/alertmanager.env.yaml err="open /etc/alertmanager/config_out/alertmanager.env.yaml: no such file or directory" ts=2022-09-21T23:40:07.174Z caller=cluster.go:689 level=info component=cluster msg="gossip not settled but continuing anyway" polls=0 elapsed=54.469985ms Exit Code: 1 Started: Wed, 21 Sep 2022 19:40:06 -0400 Finished: Wed, 21 Sep 2022 19:40:07 -0400 Ready: True Restart Count: 1 Requests: cpu: 4m memory: 40Mi Startup: exec [sh -c exec curl --fail http://localhost:9093/-/ready] delay=20s timeout=3s period=10s #success=1 #failure=40 ... # oc -n openshift-monitoring exec -c alertmanager alertmanager-main-0 -- cat /etc/alertmanager/config_out/alertmanager.env.yaml "global": "resolve_timeout": "5m" "inhibit_rules": - "equal": - "namespace" - "alertname" "source_matchers": - "severity = critical" "target_matchers": - "severity =~ warning|info" - "equal": - "namespace" - "alertname" ...
Version-Release number of selected component (if applicable):
# oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2022-09-20-095559   True        False         109m    Cluster version is 4.12.0-0.nightly-2022-09-20-095559
How reproducible:
always
Steps to Reproduce:
1. see the steps 2. 3.
Actual results:
alertmanager pod restarted once to become ready
Expected results:
no restart
Additional info:
no issue with 4.11
# oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-09-20-140029   True        False         16m     Cluster version is 4.11.0-0.nightly-2022-09-20-140029
# oc -n openshift-monitoring get pod | grep alertmanager-main
alertmanager-main-0   6/6   Running   0   54m
alertmanager-main-1   6/6   Running   0   55m
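A minimal sketch of the guard being suggested, waiting for the reloader-generated config to exist before Alertmanager loads it; the file path comes from the report, while the package and function here are purely illustrative and not the actual cluster-monitoring-operator change:

package main

import (
	"fmt"
	"os"
	"time"
)

// waitForConfig blocks until the config-reloader has written the given file,
// so the first Alertmanager start does not fail and trigger a restart.
func waitForConfig(path string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		if _, err := os.Stat(path); err == nil {
			return nil // file exists, safe to start
		} else if !os.IsNotExist(err) {
			return err // unexpected error (permissions, etc.)
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("timed out waiting for %s", path)
		}
		time.Sleep(time.Second)
	}
}

func main() {
	if err := waitForConfig("/etc/alertmanager/config_out/alertmanager.env.yaml", 2*time.Minute); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// exec the real alertmanager binary here (omitted).
}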
Description of problem:
library-go should use Lease for leader election by default. In 4.10 we switched from configmaps to configmapsleases; now we can switch to leases and change library-go to use Lease by default. We already have an open PR for that: https://github.com/openshift/library-go/pull/1448. Once the PR merges, we should revendor library-go for:
- kas operator
- oas operator
- etcd operator
- kcm operator
- openshift controller manager operator
- scheduler operator
- auth operator
- cluster policy controller
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
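For reference, a minimal sketch of Lease-based leader election with client-go, the pattern the operators land on after the revendor; the namespace, lock name, and timings here are illustrative, not the library-go implementation itself:

package main

import (
	"context"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	hostname, _ := os.Hostname()

	// Lock backed by a coordination.k8s.io/v1 Lease instead of a ConfigMap.
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Namespace: "openshift-example-operator", Name: "example-operator-lock"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: hostname},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true,
		LeaseDuration:   137 * time.Second,
		RenewDeadline:   107 * time.Second,
		RetryPeriod:     26 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { log.Println("became leader, starting controllers") },
			OnStoppedLeading: func() { log.Println("lost leadership, shutting down") },
		},
	})
}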
Description of problem:
Critical alert rules do not have a runbook URL
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
This bug is being raised by the OpenShift Monitoring team as part of an effort to detect invalid alert rules in OCP.
1. Check the details of the MultipleDefaultStorageClasses alert rule.
Actual results:
The MultipleDefaultStorageClasses alert rule has critical severity but does not have a runbook_url annotation.
Expected results:
All critical alert rules must have a runbook_url annotation
Additional info:
Critical alerts must have a runbook; please refer to the style guide at https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#style-guide. The runbooks are located at github.com/openshift/runbooks. To resolve the bug:
- Add runbooks for the relevant alerts at github.com/openshift/runbooks
- Add the link to the runbook in the alert annotation 'runbook_url'
- Remove the exception in the origin test, added in PR https://github.com/openshift/origin/pull/27933
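As a rough illustration of the consistency check involved, a self-contained sketch of the kind of validation the origin test performs; the rule struct here is a stand-in rather than the actual monitoring API types:

package main

import "fmt"

// alertRule is a stand-in for a Prometheus alerting rule as the check sees
// it; the real test walks PrometheusRule objects retrieved from the cluster.
type alertRule struct {
	Alert       string
	Labels      map[string]string
	Annotations map[string]string
}

// missingRunbooks returns the names of critical alerts that lack a
// runbook_url annotation, which is what the style guide requires.
func missingRunbooks(rules []alertRule) []string {
	var missing []string
	for _, r := range rules {
		if r.Labels["severity"] != "critical" {
			continue
		}
		if r.Annotations["runbook_url"] == "" {
			missing = append(missing, r.Alert)
		}
	}
	return missing
}

func main() {
	rules := []alertRule{
		{Alert: "MultipleDefaultStorageClasses", Labels: map[string]string{"severity": "critical"}, Annotations: map[string]string{}},
	}
	fmt.Println("critical alerts without runbook_url:", missingRunbooks(rules))
}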
Description of problem:
The vsphere-problem-detector feature is triggering VSphereOpenshiftClusterHealthFail alerts for the “CheckFolderPermissions” and “CheckDefaultDatastore” checks after upgrading from 4.9.54, forcing users to update their configuration solely to get around the problem detector. Depending on customer policies around vCenter passwords or configuration updates, this can be a major obstacle for a user who wants to keep the current vSphere settings, since they worked correctly in previous OpenShift versions.
Version-Release number of selected component (if applicable):
4.10.55
How reproducible:
Consistently
Steps to Reproduce:
1. Upgrade a cluster with invalid vSphere credentials to 4.10
Actual results:
The cluster-storage-operator fires alerts about the vSphere configuration in OpenShift.
Expected results:
Bypass the vsphere-problem-detector checks if the user doesn't want to make a config change, since the setup is working and upgrades like this succeeded for users prior to 4.10.
Additional info:
Description of problem:
Create Serverless Function Form is Broken
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always on Master.
Steps to Reproduce:
1. Go to the Add page
2. Click the Create Serverless Function form
Actual results:
The form throws an error.
Expected results:
The form should open and submit successfully
Screenshot of Error: https://drive.google.com/file/d/1uyzGHktfr8tEGWPyYkv9ISYI6BhdnK6f/view?usp=sharing
Additional info:
One of the 4.13 nightly payload tests is failing, and it seems that kernel-uname-r is needed in base RHCOS.
Error message from the rpm-ostree rebase:
Problem: package kernel-modules-core-5.14.0-284.25.1.el9_2.x86_64 requires kernel-uname-r = 5.14.0-284.25.1.el9_2.x86_64, but none of the providers can be installed
- conflicting requests
Perhaps something changed recently in packaging.
Description of problem:
A test in the periodic jobs of the 4.13 release fails in about 30% of runs: [rfe_id:27363][performance] CPU Management Hyper-thread aware scheduling for guaranteed pods Verify Hyper-Thread aware scheduling for guaranteed pods [test_id:46959] Number of CPU requests as multiple of SMT count allowed when HT enabled
Version-Release number of selected component (if applicable):
4.13
How reproducible:
In periodic jobs
Steps to Reproduce:
Run cnf tests on 4.13
Actual results:
Expected results:
Additional info:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.13-e2e-telco5g-cnftests/1628395172440051712/artifacts/e2e-telco5g-cnftests/telco5g-cnf-tests/artifacts/test_results.html
Baremetal IPI jobs have been failing in 4.14 CI since May 12th
bootkube is failing to start with
May 15 10:11:56 localhost.localdomain systemd[1]: Started Bootstrap a Kubernetes cluster.
May 15 10:12:04 localhost.localdomain bootkube.sh[82661]: Rendering Kubernetes Controller Manager core manifests...
May 15 10:12:09 localhost.localdomain bootkube.sh[84029]: F0515 10:12:09.396398 1 render.go:45] error getting FeatureGates: error creating feature accessor: unable to determine features: missing desired version "4.14.0-0.nightly-2023-05-12-121801" in featuregates.config.openshift.io/cluster
May 15 10:12:09 localhost.localdomain systemd[1]: bootkube.service: Main process exited, code=exited, status=255/EXCEPTION
May 15 10:12:09 localhost.localdomain systemd[1]: bootkube.service: Failed with result 'exit-code'.
Description of problem:
Cluster deployment of 4.14.0-0.nightly-2023-06-20-065807 fails as worker nodes are stuck in INSPECTING state despite being reported as MANAGEABLE
From the logs of machine-controller container in machine-api-controllers pod:
I0621 06:12:02.779472 1 request.go:682] Waited for 2.095824347s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/performance.openshift.io/v2?timeout=32s
E0621 06:12:02.781540 1 logr.go:270] controller-runtime/source "msg"="if kind is a CRD, it should be installed before calling Start" "error"="no matches for kind \"Metal3Remediation\" in version \"infrastructure.cluster.x-k8s.io/v1beta1\"" "kind"={"Group":"infrastructure.cluster.x-k8s.io","Kind":"Metal3Remediation"}
I0621 06:12:02.783418 1 controller.go:179] kni-qe-4-tj65t-worker-0-h6s8g: reconciling Machine
2023/06/21 06:12:02 Checking if machine kni-qe-4-tj65t-worker-0-h6s8g exists.
2023/06/21 06:12:02 Machine kni-qe-4-tj65t-worker-0-h6s8g does not exist.
I0621 06:12:02.783439 1 controller.go:372] kni-qe-4-tj65t-worker-0-h6s8g: reconciling machine triggers idempotent create
2023/06/21 06:12:02 Creating machine kni-qe-4-tj65t-worker-0-h6s8g
2023/06/21 06:12:02 0 hosts available while choosing host for machine 'kni-qe-4-tj65t-worker-0-h6s8g'
2023/06/21 06:12:02 No available BareMetalHost found
W0621 06:12:02.783735 1 controller.go:374] kni-qe-4-tj65t-worker-0-h6s8g: failed to create machine: requeue in: 30s
I0621 06:12:02.783748 1 controller.go:404] Actuator returned requeue-after error: requeue in: 30s
I0621 06:12:02.783780 1 controller.go:179] kni-qe-4-tj65t-worker-0-j259x: reconciling Machine
2023/06/21 06:12:02 Checking if machine kni-qe-4-tj65t-worker-0-j259x exists.
2023/06/21 06:12:02 Machine kni-qe-4-tj65t-worker-0-j259x does not exist.
I0621 06:12:02.783792 1 controller.go:372] kni-qe-4-tj65t-worker-0-j259x: reconciling machine triggers idempotent create
2023/06/21 06:12:02 Creating machine kni-qe-4-tj65t-worker-0-j259x
2023/06/21 06:12:02 0 hosts available while choosing host for machine 'kni-qe-4-tj65t-worker-0-j259x'
2023/06/21 06:12:02 No available BareMetalHost found
W0621 06:12:02.783971 1 controller.go:374] kni-qe-4-tj65t-worker-0-j259x: failed to create machine: requeue in: 30s
I0621 06:12:02.783976 1 controller.go:404] Actuator returned requeue-after error: requeue in: 30s
BMH Resources:
oc get bmh -A
NAMESPACE               NAME                 STATE                    CONSUMER                  ONLINE   ERROR   AGE
openshift-machine-api   openshift-master-0   externally provisioned   kni-qe-4-tj65t-master-0   true             175m
openshift-machine-api   openshift-master-1   externally provisioned   kni-qe-4-tj65t-master-1   true             175m
openshift-machine-api   openshift-master-2   externally provisioned   kni-qe-4-tj65t-master-2   true             175m
openshift-machine-api   openshift-worker-0   inspecting                                         true             175m
openshift-machine-api   openshift-worker-1   inspecting                                         true             175m
From Ironic:
baremetal node list
+--------------------------------------+------------------------------------------+--------------------------------------+-------------+--------------------+-------------+
| UUID                                 | Name                                     | Instance UUID                        | Power State | Provisioning State | Maintenance |
+--------------------------------------+------------------------------------------+--------------------------------------+-------------+--------------------+-------------+
| 86f146e3-3e48-4a7a-b0ef-57c42083fc92 | openshift-machine-api~openshift-master-0 | 7eeb9e57-2df2-4710-82d9-d3f99a20348e | power on    | active             | False       |
| 2380f211-934f-4193-8cb1-d09e7008410c | openshift-machine-api~openshift-master-2 | fd856ced-2912-4800-848c-256c00a1fdb7 | power on    | active             | False       |
| 9ad70c58-de44-4d56-9304-4bf7c95de6fb | openshift-machine-api~openshift-master-1 | aa1a4c89-4215-44ec-90c7-9c5f3de95ab8 | power on    | active             | False       |
| bb5ea5f4-016c-4bdd-834d-61d575284bf3 | openshift-machine-api~openshift-worker-0 | None                                 | power off   | manageable         | False       |
| 3045a07a-09d6-43a0-ab9c-d856b54bad6c | openshift-machine-api~openshift-worker-1 | None                                 | power off   | manageable         | False       |
+--------------------------------------+------------------------------------------+--------------------------------------+-------------+--------------------+-------------+
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-06-20-065807
How reproducible:
so far once
Steps to Reproduce:
1. Deploy baremetal dualstack cluster with day1 networking
Actual results:
Deployment fails as worker nodes are not provisioned
Expected results:
Deployment succeeds
Description of problem: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.14-e2e-openstack-sdn/1682353286402805760 failed with:
fail [github.com/openshift/origin/test/extended/authorization/scc.go:69]: 2 pods failed before test on SCC errors Error creating: pods "openstack-cinder-csi-driver-controller-7c4878484d-" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, provider restricted-v2: .spec.securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[0].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[0].containers[0].hostPort: Invalid value: 10301: Host ports are not allowed to be used, provider restricted-v2: .containers[0].containers[2].hostPort: Invalid value: 9202: Host ports are not allowed to be used, provider restricted-v2: .containers[0].containers[4].hostPort: Invalid value: 9203: Host ports are not allowed to be used, provider restricted-v2: .containers[0].containers[6].hostPort: Invalid value: 9204: Host ports are not allowed to be used, provider restricted-v2: .containers[0].containers[8].hostPort: Invalid value: 9205: Host ports are not allowed to be used, provider restricted-v2: .containers[1].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[1].containers[0].hostPort: Invalid value: 10301: Host ports are not allowed to be used, provider restricted-v2: .containers[1].containers[2].hostPort: Invalid value: 9202: Host ports are not allowed to be used, provider restricted-v2: .containers[1].containers[4].hostPort: Invalid value: 9203: Host ports are not allowed to be used, provider restricted-v2: .containers[1].containers[6].hostPort: Invalid value: 9204: Host ports are not allowed to be used, provider restricted-v2: .containers[1].containers[8].hostPort: Invalid value: 9205: Host ports are not allowed to be used, provider restricted-v2: .containers[2].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[2].containers[0].hostPort: Invalid value: 10301: Host ports are not allowed to be used, provider restricted-v2: .containers[2].containers[2].hostPort: Invalid value: 9202: Host ports are not allowed to be used, provider restricted-v2: .containers[2].containers[4].hostPort: Invalid value: 9203: Host ports are not allowed to be used, provider restricted-v2: .containers[2].containers[6].hostPort: Invalid value: 9204: Host ports are not allowed to be used, provider restricted-v2: .containers[2].containers[8].hostPort: Invalid value: 9205: Host ports are not allowed to be used, provider restricted-v2: .containers[3].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[3].containers[0].hostPort: Invalid value: 10301: Host ports are not allowed to be used, provider restricted-v2: .containers[3].containers[2].hostPort: Invalid value: 9202: Host ports are not allowed to be used, provider restricted-v2: .containers[3].containers[4].hostPort: Invalid value: 9203: Host ports are not allowed to be used, provider restricted-v2: .containers[3].containers[6].hostPort: Invalid value: 9204: Host ports are not allowed to be used, provider restricted-v2: .containers[3].containers[8].hostPort: Invalid value: 9205: Host ports are not allowed to be used, provider restricted-v2: .containers[4].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[4].containers[0].hostPort: Invalid 
value: 10301: Host ports are not allowed to be used, provider restricted-v2: .containers[4].containers[2].hostPort: Invalid value: 9202: Host ports are not allowed to be used, provider restricted-v2: .containers[4].containers[4].hostPort: Invalid value: 9203: Host ports are not allowed to be used, provider restricted-v2: .containers[4].containers[6].hostPort: Invalid value: 9204: Host ports are not allowed to be used, provider restricted-v2: .containers[4].containers[8].hostPort: Invalid value: 9205: Host ports are not allowed to be used, provider restricted-v2: .containers[5].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[5].containers[0].hostPort: Invalid value: 10301: Host ports are not allowed to be used, provider restricted-v2: .containers[5].containers[2].hostPort: Invalid value: 9202: Host ports are not allowed to be used, provider restricted-v2: .containers[5].containers[4].hostPort: Invalid value: 9203: Host ports are not allowed to be used, provider restricted-v2: .containers[5].containers[6].hostPort: Invalid value: 9204: Host ports are not allowed to be used, provider restricted-v2: .containers[5].containers[8].hostPort: Invalid value: 9205: Host ports are not allowed to be used, provider restricted-v2: .containers[6].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[6].containers[0].hostPort: Invalid value: 10301: Host ports are not allowed to be used, provider restricted-v2: .containers[6].containers[2].hostPort: Invalid value: 9202: Host ports are not allowed to be used, provider restricted-v2: .containers[6].containers[4].hostPort: Invalid value: 9203: Host ports are not allowed to be used, provider restricted-v2: .containers[6].containers[6].hostPort: Invalid value: 9204: Host ports are not allowed to be used, provider restricted-v2: .containers[6].containers[8].hostPort: Invalid value: 9205: Host ports are not allowed to be used, provider restricted-v2: .containers[7].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[7].containers[0].hostPort: Invalid value: 10301: Host ports are not allowed to be used, provider restricted-v2: .containers[7].containers[2].hostPort: Invalid value: 9202: Host ports are not allowed to be used, provider restricted-v2: .containers[7].containers[4].hostPort: Invalid value: 9203: Host ports are not allowed to be used, provider restricted-v2: .containers[7].containers[6].hostPort: Invalid value: 9204: Host ports are not allowed to be used, provider restricted-v2: .containers[7].containers[8].hostPort: Invalid value: 9205: Host ports are not allowed to be used, provider restricted-v2: .containers[8].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[8].containers[0].hostPort: Invalid value: 10301: Host ports are not allowed to be used, provider restricted-v2: .containers[8].containers[2].hostPort: Invalid value: 9202: Host ports are not allowed to be used, provider restricted-v2: .containers[8].containers[4].hostPort: Invalid value: 9203: Host ports are not allowed to be used, provider restricted-v2: .containers[8].containers[6].hostPort: Invalid value: 9204: Host ports are not allowed to be used, provider restricted-v2: .containers[8].containers[8].hostPort: Invalid value: 9205: Host ports are not allowed to be used, provider restricted-v2: .containers[9].hostNetwork: Invalid value: true: Host network is not 
allowed to be used, provider restricted-v2: .containers[9].containers[0].hostPort: Invalid value: 10301: Host ports are not allowed to be used, provider restricted-v2: .containers[9].containers[2].hostPort: Invalid value: 9202: Host ports are not allowed to be used, provider restricted-v2: .containers[9].containers[4].hostPort: Invalid value: 9203: Host ports are not allowed to be used, provider restricted-v2: .containers[9].containers[6].hostPort: Invalid value: 9204: Host ports are not allowed to be used, provider restricted-v2: .containers[9].containers[8].hostPort: Invalid value: 9205: Host ports are not allowed to be used, provider "restricted": Forbidden: not usable by user or serviceaccount, provider "nonroot-v2": Forbidden: not usable by user or serviceaccount, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork-v2": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount] for ReplicaSet.apps/v1/openstack-cinder-csi-driver-controller-7c4878484d -n openshift-cluster-csi-drivers happened 13 times Error creating: pods "openstack-cinder-csi-driver-node-" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, provider restricted-v2: .spec.securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[2]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[3]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[7]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[8]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, provider restricted-v2: .containers[0].privileged: Invalid value: true: Privileged containers are not allowed, provider restricted-v2: .containers[0].capabilities.add: Invalid value: "SYS_ADMIN": capability may not be added, provider restricted-v2: .containers[0].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[0].containers[0].hostPort: Invalid value: 10300: Host ports are not allowed to be used, provider restricted-v2: .containers[0].allowPrivilegeEscalation: Invalid value: true: Allowing privilege escalation for containers is not allowed, provider restricted-v2: .containers[1].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[1].containers[0].hostPort: Invalid value: 10300: Host ports are not allowed to be used, provider restricted-v2: .containers[2].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[2].containers[0].hostPort: Invalid value: 10300: Host ports are not allowed to be used, provider "restricted": Forbidden: not usable by user or serviceaccount, provider "nonroot-v2": Forbidden: not usable by user or serviceaccount, provider "nonroot": Forbidden: not 
usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork-v2": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount] for DaemonSet.apps/v1/openstack-cinder-csi-driver-node -n openshift-cluster-csi-drivers happened 12 times
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
OKD/FCOS uses FCOS as its bootimage, i.e. when booting cluster nodes the first time during installation. FCOS does not provide tools such as the OpenShift Client (oc) or hyperkube, which are used during single-node cluster installation at first boot (e.g. oc in bootkube.sh), and thus setup fails.
Version-Release number of selected component (if applicable):
4.14
Please review the following PR: https://github.com/openshift/cluster-control-plane-machine-set-operator/pull/197
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
As a user, I would like to see the type of technology used by the samples on the samples view similar to the all services view.
On the samples view:
It shows different types of samples, e.g. Devfile and Helm, all displayed as .NET, so it is difficult for the user to decide which .NET entry to select from the list. We need something like the all-services view, which shows the type of technology at the top right of each card, so users can differentiate between the entries:
Remove list bullets
Need space between "Phase" and status icon
Description of problem:
The ExternalLink 'OpenShift Pipelines based on Tekton' in the Pipeline Build Strategy deprecation Alert is incorrect; it is currently defined as https://openshift.github.io/pipelines-docs/ and redirects to a 'Not found' page
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-06-04-133505
How reproducible:
Always
Steps to Reproduce:
1. $ oc new-app -n test https://github.com/openshift/cucushift/blob/master/testdata/pipeline/samplepipeline.yaml
   OR create a Jenkins server and Pipeline BC:
   $ oc new-app https://raw.githubusercontent.com/openshift/origin/master/examples/jenkins/jenkins-ephemeral-template.json
   $ oc new-app -f https://raw.githubusercontent.com/openshift/origin/master/examples/jenkins/pipeline/samplepipeline.yaml
2. As an admin user, log in to the console and navigate to Builds -> Build Configs -> sample-pipeline Details page
3. Check the External link 'OpenShift Pipelines based on Tekton' in the 'Pipeline build strategy deprecation' Alert
Actual results:
The user is currently redirected to a 'Not found' page
Expected results:
The link should be correct and point to an existing page
Additional info:
Impacted file: build.tsx https://github.com/openshift/console/blob/a0e7e98e5ffe4aca73f9f1f441d15cc4e9b33ee6/frontend/public/components/build.tsx#LL238C17-L238C60
Base bug: https://bugzilla.redhat.com/show_bug.cgi?id=1768350
Description of the problem:
Debug info is not printed for data collection
How reproducible:
Always
Steps to reproduce:
1. Deploy MCE multicluster-engine.v2.3.0-81.
2. Enable log level debug for AI
3. Deploy spoke multinode 4.12
Actual results:
No debug info printed.
Expected results:
Should print this debug info:
log.Debugf("Red Hat Insights Request ID: %+v", res.Header.Get("X-Rh-Insights-Request-Id"))
Please review the following PR: https://github.com/openshift/kube-rbac-proxy/pull/64
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
This issue was supposed to be fixed in 4.13.4 but is happening again. Manually creating the directory "/etc/systemd/network" allows the upgrade to complete, but this is not a sustainable workaround when there are several clusters to update.
Version-Release number of selected component (if applicable):
4.13.4
How reproducible:
In a customer environment.
Steps to Reproduce:
1. Update to 4.13.4 from 4.12.21 2. 3.
Actual results:
MCO degraded blocking the upgrade.
Expected results:
Upgrade to complete.
Additional info:
Description of problem:
The HCP Create NodePool AWS Render command does not work correctly since it does not render a specification with the arch and instance type defined.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
No arch or instance type defined in specification.
Expected results:
Arch and instance type defined in specification.
Additional info:
When we create an HCP, the Root CA in the HCP namespaces has the certificate and key named as
Done criteria: The Root CA should have the certificate and key named as the cert manager expects.
Description of problem:
Once a user changes the log component in the master node's Logs section, they are unable to change or select a different log component from the dropdown.
To select a different log component, the user needs to revisit the Logs section under the master node again; this refreshes the pane and reloads the default options.
Version-Release number of selected components (if applicable):
4.11.0-0.nightly-2022-08-15-152346
How reproducible:
Always
Steps to Reproduce:
Actual results:
The user is unable to select or change the log component once a selection has already been made from the dropdown under the master node's Logs section.
Expected results:
Users should be able to change or select the log component from the master node's Logs section whenever required, using the available dropdown.
Additional info:
Reproduced in both chrome[103.0.5060.114 (Official Build) (64-bit)] and firefox[91.11.0esr (64-bit)] browsers
Attached is a screen capture of the same: ScreenRecorder_2022-08-16_26457662-aea5-4a00-aeb4-0fbddf8f16f0.mp4
Description of problem:
Azure CCM should be GA before the end of 4.14. When we previously tried to promote it there were issues, so we need to improve the feature gate promotion so that we can promote all components in a single release, and then promote the CCM to GA once those changes are in place.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/machine-api-operator/pull/1137
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
All the DaemonSets defined within the openshift-multus namespace have a node selector predicate on the kubernetes.io/os label to schedule the daemonset's pods only on Linux workers. The whereabouts-reconciler seems to be missing it. We might need to add the `kubernetes.io/os: linux` label selector to stay consistent with the other daemonset definitions and avoid risks in clusters with Windows workers.
Version-Release number of selected component (if applicable):
4.13+
How reproducible:
Always
Steps to Reproduce:
1. oc get daemonsets -n openshift-multus
NAME                            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
multus                          6         6         6       6            6           kubernetes.io/os=linux   4h1m
multus-additional-cni-plugins   6         6         6       6            6           kubernetes.io/os=linux   4h1m
multus-networkpolicy            6         6         6       6            6           kubernetes.io/os=linux   19s
Actual results:
network-metrics-daemon          6         6         6       6            6           kubernetes.io/os=linux   4h1m
whereabouts-reconciler          6         6         6       6            6           <none>                   23s
Note the missing kubernetes.io/os node selector.
Expected results:
The whereabouts-reconciler should also have the nodeSelector term kubernetes.io/os: linux.
Additional info:
https://redhat-internal.slack.com/archives/CFFSAHWHF/p1687158805205059
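A minimal sketch of the pod spec field in question, built with the upstream apps/v1 Go types rather than the operator's actual manifest template, to show where the kubernetes.io/os selector would go; the labels and image are placeholders:

package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	labels := map[string]string{"app": "whereabouts-reconciler"}
	ds := appsv1.DaemonSet{
		TypeMeta:   metav1.TypeMeta{APIVersion: "apps/v1", Kind: "DaemonSet"},
		ObjectMeta: metav1.ObjectMeta{Name: "whereabouts-reconciler", Namespace: "openshift-multus"},
		Spec: appsv1.DaemonSetSpec{
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					// The missing piece: keep the pods off Windows workers,
					// consistent with the other openshift-multus daemonsets.
					NodeSelector: map[string]string{"kubernetes.io/os": "linux"},
					Containers: []corev1.Container{{
						Name:  "whereabouts-reconciler",
						Image: "example.invalid/whereabouts:tag", // placeholder image
					}},
				},
			},
		},
	}
	out, _ := yaml.Marshal(ds)
	fmt.Println(string(out))
}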
Description of problem:
The oc binary stored at /usr/local/bin in the cli-artifacts image of a non-amd64 payload is not the one for the architecture bound to the payload. It is an amd64 binary.
Version-Release number of selected component (if applicable):
4.11.4
How reproducible:
always
Steps to Reproduce:
1. CLI_ARTIFACTS_IMAGE=$(oc adm release info quay.io/openshift-release-dev/ocp-release:4.11.4-aarch64 --image-for=cli-artifacts)
2. CONTAINER=$(podman create $CLI_ARTIFACTS_IMAGE)
3. podman cp $CONTAINER:/usr/bin/oc /tmp/oc
4. file /tmp/oc
Actual results:
/tmp/oc: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked,.....
Expected results:
It should be a binary built for the architecture the payload targets; i.e., using the above aarch64 payload should lead to an arm64 binary at /usr/bin/oc, with the binaries for the other architectures in /usr/share/openshift.
Additional info:
https://github.com/openshift/oc/blob/master/images/cli-artifacts/Dockerfile.rhel#L13
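A quick way to repeat the check above across architectures (a sketch; the per-architecture release tags such as 4.11.4-x86_64 are assumptions based on the usual naming of multi-arch release pullspecs):
for arch in x86_64 aarch64 ppc64le s390x; do
  img=$(oc adm release info "quay.io/openshift-release-dev/ocp-release:4.11.4-${arch}" --image-for=cli-artifacts)
  cid=$(podman create "$img")
  podman cp "$cid":/usr/bin/oc "/tmp/oc-${arch}"
  podman rm "$cid" >/dev/null
  file "/tmp/oc-${arch}"   # each binary should match its payload architecture
done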
Description of problem:
Create two custom SCCs with different permissions, for example, custom-scc-1 with 'privileged' and custom-scc-2 with 'restricted'. Deploy a pod with annotations "openshift.io/required-scc: custom-scc-1, custom-scc-2". Pod deployment failed with error "Error creating: pods "test-747555b669-" is forbidden: required scc/custom-restricted-v2-scc, custom-privileged-scc not found". The system fails to provide appropriate error messages for multiple required SCC annotations, leaving users unable to identify the cause of the failure effectively.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-07-31-181848
How reproducible:
Always
Steps to Reproduce:
$ oc login -u testuser-0
$ oc new-project scc-test
$ oc create sa scc-test -n scc-test
serviceaccount/scc-test created
$ oc get scc restricted-v2 -o yaml --context=admin > custom-restricted-v2-scc.yaml
$ sed -i -e 's/restricted-v2/custom-restricted-v2-scc/g' -e "s/MustRunAsRange/RunAsAny/" -e "s/priority: null/priority: 10/" custom-restricted-v2-scc.yaml
$ oc create -f custom-restricted-v2-scc.yaml --context=admin
securitycontextconstraints.security.openshift.io/custom-restricted-v2-scc created
$ oc adm policy add-scc-to-user custom-restricted-v2-scc system:serviceaccount:scc-test:scc-test --context=admin
clusterrole.rbac.authorization.k8s.io/system:openshift:scc:custom-restricted-v2-scc added: "scc-test"
$ oc get scc privileged -o yaml --context=admin > custom-privileged-scc.yaml
$ sed -i -e 's/privileged/custom-privileged-scc/g' -e "s/priority: null/priority: 5/" custom-privileged-scc.yaml
$ oc create -f custom-privileged-scc.yaml --context=admin
securitycontextconstraints.security.openshift.io/custom-privileged-scc created
$ oc adm policy add-scc-to-user custom-privileged-scc system:serviceaccount:scc-test:scc-test --context=admin
clusterrole.rbac.authorization.k8s.io/system:openshift:scc:custom-privileged-scc added: "scc-test"
$ cat deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test
spec:
  selector:
    matchLabels:
      deployment: test
  template:
    metadata:
      annotations:
        openshift.io/required-scc: custom-restricted-v2-scc, custom-privileged-scc
      labels:
        deployment: test
    spec:
      containers:
      - args:
        - infinity
        command:
        - sleep
        image: fedora:latest
        name: sleeper
        securityContext:
          runAsNonRoot: true
      serviceAccountName: scc-test
$ oc create -f deployment.yaml
deployment.apps/test created
$ oc describe rs test-747555b669 | grep FailedCreate
ReplicaFailure  True  FailedCreate
Warning  FailedCreate  61s (x15 over 2m23s)  replicaset-controller  Error creating: pods "test-747555b669-" is forbidden: required scc/custom-restricted-v2-scc, custom-privileged-scc not found
Actual results:
Pod deployment failed with "Error creating: pods "test-747555b669-" is forbidden: required scc/custom-restricted-v2-scc, custom-privileged-scc not found"
Expected results:
Either it should ignore the second SCC instead of reporting "not found", or it should show a proper error message explaining why the annotation is rejected.
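As a workaround sketch, keeping a single SCC name in the annotation avoids the "not found" error, assuming the required-scc annotation only accepts one SCC name (which is what the error message suggests):
$ oc patch deployment/test --type merge \
    -p '{"spec":{"template":{"metadata":{"annotations":{"openshift.io/required-scc":"custom-restricted-v2-scc"}}}}}'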
Additional info:
This is a clone of issue OCPBUGS-17589. The following is the description of the original issue:
—
This bug has been seen during the analysis of another issue
If the Server Internal IP is not defined, CBO crashes as nil is not handled in https://github.com/openshift/cluster-baremetal-operator/blob/release-4.12/provisioning/utils.go#L99
I0809 17:33:09.683265 1 provisioning_controller.go:540] No Machines with cluster-api-machine-role=master found, set provisioningMacAddresses if the metal3 pod fails to start
I0809 17:33:09.690304 1 clusteroperator.go:217] "new CO status" reason=SyncingResources processMessage="Applying metal3 resources" message=""
I0809 17:33:10.488862 1 recorder_logging.go:37] &Event{ObjectMeta:{dummy.1779c769624884f4 dummy 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},InvolvedObject:ObjectReference{Kind:Pod,Namespace:dummy,Name:dummy,UID:,APIVersion:v1,ResourceVersion:,FieldPath:,},Reason:ValidatingWebhookConfigurationUpdated,Message:Updated ValidatingWebhookConfiguration.admissionregistration.k8s.io/baremetal-operator-validating-webhook-configuration because it changed,Source:EventSource{Component:,Host:,},FirstTimestamp:2023-08-09 17:33:10.488745204 +0000 UTC m=+5.906952556,LastTimestamp:2023-08-09 17:33:10.488745204 +0000 UTC m=+5.906952556,Count:1,Type:Normal,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1768fd4]
goroutine 574 [running]:
github.com/openshift/cluster-baremetal-operator/provisioning.getServerInternalIP({0x1e774d0?, 0xc0001e8fd0?})
	/go/src/github.com/openshift/cluster-baremetal-operator/provisioning/utils.go:75 +0x154
github.com/openshift/cluster-baremetal-operator/provisioning.GetIronicIP({0x1ea2378?, 0xc000856840?}, {0x1bc1f91, 0x15}, 0xc0004c4398, {0x1e774d0, 0xc0001e8fd0})
	/go/src/github.com/openshift/cluster-baremetal-operator/provisioning/utils.go:98 +0xfb
Description of problem:
Reported in https://github.com/openshift/cluster-ingress-operator/issues/911
When you open a new issue, it still directs you to Bugzilla, which no longer works.
It can be changed here: https://github.com/openshift/cluster-ingress-operator/blob/master/.github/ISSUE_TEMPLATE/config.yml, but to what?
The correct Jira link is
https://issues.redhat.com/secure/CreateIssueDetails!init.jspa?pid=12332330&issuetype=1&components=12367900&priority=10300&customfield_12316142=26752
But can the public use this mechanism? Yes - https://redhat-internal.slack.com/archives/CB90SDCAK/p1682527645965899
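One possible shape for the updated template, using GitHub's standard contact_links mechanism with the Jira URL above (a sketch of the file content, not the final wording):
$ cat > .github/ISSUE_TEMPLATE/config.yml <<'EOF'
blank_issues_enabled: false
contact_links:
  - name: Report a bug in Jira
    url: https://issues.redhat.com/secure/CreateIssueDetails!init.jspa?pid=12332330&issuetype=1&components=12367900&priority=10300&customfield_12316142=26752
    about: Open an issue in the OCPBUGS Jira project instead of Bugzilla
EOF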
Version-Release number of selected component (if applicable):
n/a
How reproducible:
May be in other repos too.
Steps to Reproduce:
1. Open Issue in the repo - click on New Issue
2. Follow directions and click on link to open Bugzilla
3. Get message that this doesn't work anymore
Actual results:
You get instructions that don't work to open a bug from an Issue.
Expected results:
You get instructions to just open an Issue, or get correct instructions on how to open a bug using Jira.
Additional info:
Description of problem:
In HA mode there are two dedicated nodes, but ignition-server-proxy and konnectivity-server have only one replica each. I expect them to have two replicas, with each replica running on one dedicated node.
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. allocate two dedicated nodes
2. create a cluster in HA mode
3. check ignition-server-proxy and konnectivity-server in the control plane
Actual results:
ignition-server-proxy and konnectivity-server have one replica
Expected results:
ignition-server-proxy and konnectivity-server have two replicas, each replica runs on one dedicated node
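A quick way to verify the replica placement (a sketch; replace the namespace placeholder with the hosted control plane namespace):
$ oc -n <hosted-control-plane-namespace> get pods -o wide | grep -E 'ignition-server-proxy|konnectivity-server'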
Additional info:
Description of problem:
More than one cluster can be created in openshift-cluster-api
$ oc get cluster
NAME                          PHASE          AGE   VERSION
ci-ln-kv1gj4b-72292-jn4rw     Provisioning   19m
ci-ln-kv1gj4b-72292-jn4rw-1   Provisioning   7s
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2022-11-25-204445
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
More than one cluster can be created in openshift-cluster-api
$ oc get cluster
NAME                          PHASE          AGE   VERSION
ci-ln-kv1gj4b-72292-jn4rw     Provisioning   19m
ci-ln-kv1gj4b-72292-jn4rw-1   Provisioning   7s
Expected results:
The openshift-cluster-api namespace should contain only the cluster you're running on; users should be allowed to use the cluster API to create other clusters only in other namespaces.
Additional info:
Related to https://issues.redhat.com/browse/OCPBUGS-1493
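A minimal way to reproduce the description above (a sketch; the Cluster name is arbitrary and the spec is left empty for brevity):
$ cat <<'EOF' | oc apply -f -
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: extra-cluster
  namespace: openshift-cluster-api
spec: {}
EOF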
Description of the problem:
When machines have multiple IP addresses assigned to the same network interface the assisted service will create the bare metal host configuration using the first IP address of the interface. That IP address may or may not be inside the machine CIDR of the cluster. If it isn't then the bare metal host will have an IP address that is different to the IP address of the corresponding node. As a result of that the machine operator will not link the machine and the node, and the machine will never move to the `Running` phase. In that situation the corresponding machine pool will never have the minimum required number of replicas. For worker machine pools that means that the cluster will never be considered completely installed.
How reproducible:
Note that this is easy to reproduce using the current zero touch provisioning factory workflow, because when machines have a single NIC they will have two IP addresses assigned. It may be harder to reproduce in other scenarios.
Steps to reproduce:
1. Create a bare metal cluster with three control plane nodes and one worker node, where nodes have one NIC and two IP addresses assigned to that NIC. In the ZTPFW scenario that will be a static IP address in the 192.168.7.0/24 range (which is the machine CIDR of the cluster) and another IP address assigned via DHCP, say in the 192.168.150.0/24 range (which is not the machine CIDR of the cluster).
2. Start the installation.
3. Check the manifests generated by the assisted service, in particular the `99_openshift-cluster-api_hosts-*.yaml` files. Those will contain the definition of the bare metal hosts, together with a `baremetalhost.metal3.io/status` annotation that contains the status that they should have. Check that it contains the wrong IP address in the 192.168.150.0/24 range, outside of the machine CIDR of the cluster.
4. Check that all the machines (oc get machine -A) didn't move to the `Running` phase. That is because the machine API operator can't link them to the nodes due to the mismatched IP addresses: nodes have 192.168.7.* and machines have 192.168.150.* (copied from the bare metal hosts).
5. Check that the worker machine pool doesn't have the minimum required number of replicas.
6. Check that the installation doesn't complete.
Actual results:
The machines aren't in the `Running` phase, the worker pool doesn't have the minimum required number of replicas and the installation doesn't complete.
Expected results:
All the machines should move to the `Running` phase, the worker pool should have the minimum required number of replicas and the installation should complete.
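A quick way to see the mismatch described above on a live cluster (a sketch; the jsonpath field names follow the machine-api conventions and may vary by version):
$ oc -n openshift-machine-api get machines -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}'
$ oc get nodes -o wide   # compare the INTERNAL-IP column with the machine addresses above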
Description of problem:
The agent create sub-command shows a fatal error when it is executed with an invalid argument.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Execute `openshift-install agent create invalid`
Actual results:
FATA[0000] Error executing openshift-install: accepts 0 arg(s), received 1
Expected results:
It should return the help of the create command.
Additional info:
As a developer, I would like a Make file command that performs all the pre-commit checks that should be run before committing any code to GitHub. This includes updating Golang and API dependencies, building the source code, building the e2e's, verifying source code formatting, and running unit tests.
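One possible shape of the checks such a target would chain together, expressed as the underlying shell steps (the individual target names and exact commands are assumptions; the repo's existing Makefile targets should be reused where they exist):
go mod tidy && go mod vendor   # update Golang and API dependencies
make build                     # build the source code (assumed target name)
make e2e                       # build the e2e tests (assumed target name)
gofmt -l .                     # verify source code formatting (prints offending files)
go test ./...                  # run unit tests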
Please review the following PR: https://github.com/openshift/vmware-vsphere-csi-driver/pull/62
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
What happens:
When deploying OpenShift 4.13 with Failure Domains, the PrimarySubnet in the ProviderSpec of the Machine is set to the MachinesSubnet set in install-config.yaml.
What is expected:
Machines in failure domains with a control-plane port target should not use the MachinesSubnet as the primary subnet in the provider spec. It should be the ID of the subnet that is actually used for the control plane in that domain.
How to reproduce:
install-config.yaml:
apiVersion: v1
baseDomain: shiftstack.com
compute:
- name: worker
  platform:
    openstack:
      type: m1.xlarge
  replicas: 1
controlPlane:
  name: master
  platform:
    openstack:
      type: m1.xlarge
      failureDomains:
      - portTargets:
        - id: control-plane
          network:
            id: fb6f8fea-5063-4053-81b3-6628125ed598
          fixedIPs:
          - subnet:
              id: b02175dd-95c6-4025-8ff3-6cf6797e5f86
        computeAvailabilityZone: nova-az1
        storageAvailabilityZone: cinder-az1
      - portTargets:
        - id: control-plane
          network:
            id: 9a5452a8-41d9-474c-813f-59b6c34194b6
          fixedIPs:
          - subnet:
              id: 5fe5b54a-217c-439d-b8eb-441a03f7636d
        computeAvailabilityZone: nova-az1
        storageAvailabilityZone: cinder-az1
      - portTargets:
        - id: control-plane
          network:
            id: 3ed980a6-6f8e-42d3-8500-15f18998c434
          fixedIPs:
          - subnet:
              id: a7d57db6-f896-475f-bdca-c3464933ec02
        computeAvailabilityZone: nova-az1
        storageAvailabilityZone: cinder-az1
  replicas: 3
metadata:
  name: mycluster
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineNetwork:
  - cidr: 192.168.10.0/24
  - cidr: 192.168.20.0/24
  - cidr: 192.168.30.0/24
  - cidr: 192.168.72.0/24
  - cidr: 192.168.100.0/24
platform:
  openstack:
    cloud: foch_openshift
    machinesSubnet: b02175dd-95c6-4025-8ff3-6cf6797e5f86
    apiVIPs:
    - 192.168.100.240
    ingressVIPs:
    - 192.168.100.250
    loadBalancer:
      type: UserManaged
featureSet: TechPreviewNoUpgrade
Machine spec:
Provider Spec:
Value:
API Version: machine.openshift.io/v1alpha1
Cloud Name: openstack
Clouds Secret:
Name: openstack-cloud-credentials
Namespace: openshift-machine-api
Flavor: m1.xlarge
Image: foch-bgp-2fnjz-rhcos
Kind: OpenstackProviderSpec
Metadata:
Creation Timestamp: <nil>
Networks:
Filter:
Subnets:
Filter:
Id: 5fe5b54a-217c-439d-b8eb-441a03f7636d
Uuid: 9a5452a8-41d9-474c-813f-59b6c34194b6
Primary Subnet: b02175dd-95c6-4025-8ff3-6cf6797e5f86
Security Groups:
Filter:
Name: foch-bgp-2fnjz-master
Filter:
Uuid: 1b142123-c085-4e14-b03a-cdf5ef028d91
Server Group Name: foch-bgp-2fnjz-master
Server Metadata:
Name: foch-bgp-2fnjz-master
Openshift Cluster ID: foch-bgp-2fnjz
Tags:
openshiftClusterID=foch-bgp-2fnjz
Trunk: true
User Data Secret:
Name: master-user-data
Status:
Addresses:
Address: 192.168.20.20
Type: InternalIP
Address: foch-bgp-2fnjz-master-1
Type: Hostname
Address: foch-bgp-2fnjz-master-1
Type: InternalDNS
The machine is connected to the right subnet, but has a wrong PrimarySubnet configured.
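A quick check of the field in question (a sketch; the jsonpath field name primarySubnet is inferred from the dump above and may differ by API version):
$ oc -n openshift-machine-api get machine foch-bgp-2fnjz-master-1 \
    -o jsonpath='{.spec.providerSpec.value.primarySubnet}{"\n"}'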
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
It is better for the pod-security admission config to use v1, like upstream, instead of still using v1beta1.
Version-Release number of selected component (if applicable):
4.12, 4.13
How reproducible:
Always
Steps to Reproduce:
1. In upstream, when it was 1.24, https://v1-24.docs.kubernetes.io/docs/tasks/configure-pod-container/enforce-standards-admission-controller/#configure-the-admission-controller shows "pod-security.admission.config.k8s.io/v1beta1".
When it was 1.25 (OCP 4.12), https://v1-25.docs.kubernetes.io/docs/tasks/configure-pod-container/enforce-standards-admission-controller/#configure-the-admission-controller no longer shows "pod-security.admission.config.k8s.io/v1beta1". At the bottom, it notes: pod-security.admission.config.k8s.io/v1 configuration requires v1.25+. For v1.23 and v1.24, use v1beta1.
In OCP 4.12 (1.25) and 4.13 (1.26), it is still v1beta1; we should align with upstream:
4.12:
$ oc version
..
Server Version: 4.12.9
Kubernetes Version: v1.25.7+eab9cc9
$ jq "" $(oc extract cm/config -n openshift-kube-apiserver --confirm) | jq '.admission.pluginConfig.PodSecurity'
{
  "configuration": {
    "apiVersion": "pod-security.admission.config.k8s.io/v1beta1",
    "defaults": {
      "audit": "restricted",
      "audit-version": "latest",
      "enforce": "privileged",
      "enforce-version": "latest",
      "warn": "restricted",
      "warn-version": "latest"
    },
    "exemptions": {
      "usernames": [
        "system:serviceaccount:openshift-infra:build-controller"
      ]
    },
    "kind": "PodSecurityConfiguration"
  }
}
4.13:
$ oc version
...
Server Version: 4.13.0-0.nightly-2023-03-23-204038
Kubernetes Version: v1.26.2+dc93b13
$ jq "" $(oc extract cm/config -n openshift-kube-apiserver --confirm) | jq '.admission.pluginConfig.PodSecurity'
{
  "configuration": {
    "apiVersion": "pod-security.admission.config.k8s.io/v1beta1",
    "defaults": {
      "audit": "restricted",
      "audit-version": "latest",
      "enforce": "privileged",
      "enforce-version": "latest",
      "warn": "restricted",
      "warn-version": "latest"
    },
    "exemptions": {
      "usernames": [
        "system:serviceaccount:openshift-infra:build-controller"
      ]
    },
    "kind": "PodSecurityConfiguration"
  }
}
Actual results:
See above.
Expected results:
The pod-security admission config should align with upstream and use v1 rather than v1beta1.
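After aligning with upstream, the same extraction used in the steps above would be expected to report the v1 API group (a sketch of the expected output):
$ jq "" $(oc extract cm/config -n openshift-kube-apiserver --confirm) | jq -r '.admission.pluginConfig.PodSecurity.configuration.apiVersion'
pod-security.admission.config.k8s.io/v1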
Additional info:
Description of problem:
InfraStructureRef* is dereferenced without checking for nil value
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Run TechPreview cluster
2. Try to create a Cluster object with an empty spec:
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: example
  namespace: openshift-cluster-api
spec: {}
3. Observe panic in cluster-capi-operator
Actual results:
2023/03/10 14:13:31 http: panic serving 10.129.0.2:39614: runtime error: invalid memory address or nil pointer dereference goroutine 3619 [running]: net/http.(*conn).serve.func1() /usr/lib/golang/src/net/http/server.go:1850 +0xbf panic({0x16cada0, 0x2948bc0}) /usr/lib/golang/src/runtime/panic.go:890 +0x262 github.com/openshift/cluster-capi-operator/pkg/webhook.(*ClusterWebhook).ValidateCreate(0xc000ceac00?, {0x24?, 0xc00090fff0?}, {0x1b72d68?, 0xc0010831e0?}) /go/src/github.com/openshift/cluster-capi-operator/pkg/webhook/cluster.go:32 +0x39 sigs.k8s.io/controller-runtime/pkg/webhook/admission.(*validatorForType).Handle(_, {_, _}, {{{0xc000ceac00, 0x24}, {{0xc00090fff0, 0x10}, {0xc000838000, 0x7}, {0xc000838007, ...}}, ...}}) /go/src/github.com/openshift/cluster-capi-operator/vendor/sigs.k8s.io/controller-runtime/pkg/webhook/admission/validator_custom.go:79 +0x2dd sigs.k8s.io/controller-runtime/pkg/webhook/admission.(*Webhook).Handle(_, {_, _}, {{{0xc000ceac00, 0x24}, {{0xc00090fff0, 0x10}, {0xc000838000, 0x7}, {0xc000838007, ...}}, ...}}) /go/src/github.com/openshift/cluster-capi-operator/vendor/sigs.k8s.io/controller-runtime/pkg/webhook/admission/webhook.go:169 +0xfd sigs.k8s.io/controller-runtime/pkg/webhook/admission.(*Webhook).ServeHTTP(0xc000630e80, {0x7f26f94b5580?, 0xc000f80280}, 0xc000750800) /go/src/github.com/openshift/cluster-capi-operator/vendor/sigs.k8s.io/controller-runtime/pkg/webhook/admission/http.go:98 +0xeb5 github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerInFlight.func1({0x7f26f94b5580, 0xc000f80280}, 0x1b7ff00?) /go/src/github.com/openshift/cluster-capi-operator/vendor/github.com/prometheus/client_golang/prometheus/promhttp/instrument_server.go:60 +0xd4 net/http.HandlerFunc.ServeHTTP(0x1b7ffb0?, {0x7f26f94b5580?, 0xc000f80280?}, 0x7afe60?) /usr/lib/golang/src/net/http/server.go:2109 +0x2f github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerCounter.func1({0x1b7ffb0?, 0xc000a72000?}, 0xc000750800) /go/src/github.com/openshift/cluster-capi-operator/vendor/github.com/prometheus/client_golang/prometheus/promhttp/instrument_server.go:146 +0xb8 net/http.HandlerFunc.ServeHTTP(0x0?, {0x1b7ffb0?, 0xc000a72000?}, 0xc00056f0e1?) /usr/lib/golang/src/net/http/server.go:2109 +0x2f github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerDuration.func2({0x1b7ffb0, 0xc000a72000}, 0xc000750800) /go/src/github.com/openshift/cluster-capi-operator/vendor/github.com/prometheus/client_golang/prometheus/promhttp/instrument_server.go:108 +0xbf net/http.HandlerFunc.ServeHTTP(0xc000a72000?, {0x1b7ffb0?, 0xc000a72000?}, 0x18e45d1?) /usr/lib/golang/src/net/http/server.go:2109 +0x2f net/http.(*ServeMux).ServeHTTP(0xc00056f0c0?, {0x1b7ffb0, 0xc000a72000}, 0xc000750800) /usr/lib/golang/src/net/http/server.go:2487 +0x149 net/http.serverHandler.ServeHTTP({0x1b71dc8?}, {0x1b7ffb0, 0xc000a72000}, 0xc000750800) /usr/lib/golang/src/net/http/server.go:2947 +0x30c net/http.(*conn).serve(0xc00039af00, {0x1b81198, 0xc000416c00}) /usr/lib/golang/src/net/http/server.go:1991 +0x607 created by net/http.(*Server).Serve /usr/lib/golang/src/net/http/server.go:3102 +0x4db
Expected results:
The webhook returns an error but does not panic.
Additional info:
Description of problem:
On August 24th, a bugfix was merged into the hypershift repo to address OCPBUGS-16813 (https://github.com/openshift/hypershift/pull/2942). This resulted in a change in the konnectivity server within the HCP namespace: we went from a single konnectivity server to multiple when HA HCPs are in use. The konnectivity agents within the HCP worker nodes connect to the server through a route. When connecting through this route, the agents on the workers are supposed to discover all the HA konnectivity servers through round-robin load balancing, meaning that if the agents try to connect to the route endpoint enough times, they should eventually discover all the servers. With the kubevirt platform, only a single konnectivity server is discovered by the agents in the worker nodes, which leads to the inability of the kas on the HCP to reliably contact kubelets within the worker nodes. The outcome of this issue is that webhooks (and other connections that require the kas (api server) in the HCP to contact worker nodes) fail the majority of the time.
Version-Release number of selected component (if applicable):
How reproducible:
create a kubevirt platform HCP using the `hcp` cli tool. This will default to HA mode, and the cluster will never fully roll out. The ingress, monitoring, and console clusteroperators will flap back and forth between failing and success. Usually we'll see an error about webhook connectivity failing. During this time, any `oc` command that attempts to tunnel a connection through the kas to the kubelets will fail the majority of the time. This means `oc logs`, `oc exec`, etc... will not work.
Actual results:
kas -> kubelet connections are unreliable
Expected results:
kas -> kubelet connections are reliable
Additional info:
Description of problem:
Updating the cpms vmSize on ASH got the error "The value 1024 of parameter 'osDisk.diskSizeGB' is out of range. The value must be between '1' and '1023', inclusive." Target="osDisk.diskSizeGB" when provisioning the new control plane node. After changing diskSizeGB to 1023, new nodes are provisioned. But for a fresh install, the default diskSizeGB is 1024 for masters.
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2023-01-27-165107
How reproducible:
Always
Steps to Reproduce:
1. Update cpms vmSize to Standard_DS3_v2 2. Check new machine state $ oc get machine NAME PHASE TYPE REGION ZONE AGE jima28b-r9zht-master-h7g67-1 Running Standard_DS5_v2 mtcazs 11h jima28b-r9zht-master-hhfzl-0 Failed 24s jima28b-r9zht-master-qtb9j-0 Running Standard_DS5_v2 mtcazs 11h jima28b-r9zht-master-tprc7-2 Running Standard_DS5_v2 mtcazs 11h $ oc get machine jima28b-r9zht-master-hhfzl-0 -o yaml errorMessage: 'failed to reconcile machine "jima28b-r9zht-master-hhfzl-0": failed to create vm jima28b-r9zht-master-hhfzl-0: failure sending request for machine jima28b-r9zht-master-hhfzl-0: cannot create vm: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="InvalidParameter" Message="The value 1024 of parameter ''osDisk.diskSizeGB'' is out of range. The value must be between ''1'' and ''1023'', inclusive." Target="osDisk.diskSizeGB"' errorReason: InvalidConfiguration lastUpdated: "2023-01-29T02:35:13Z" phase: Failed providerStatus: conditions: - lastTransitionTime: "2023-01-29T02:35:13Z" message: 'failed to create vm jima28b-r9zht-master-hhfzl-0: failure sending request for machine jima28b-r9zht-master-hhfzl-0: cannot create vm: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="InvalidParameter" Message="The value 1024 of parameter ''osDisk.diskSizeGB'' is out of range. The value must be between ''1'' and ''1023'', inclusive." Target="osDisk.diskSizeGB"' reason: MachineCreationFailed status: "False" type: MachineCreated metadata: {} 3. Checke logs $ oc logs -f machine-api-controllers-84444d49f-mlldl -c machine-controller I0129 02:35:15.047784 1 recorder.go:103] events "msg"="InvalidConfiguration: failed to reconcile machine \"jima28b-r9zht-master-hhfzl-0\": failed to create vm jima28b-r9zht-master-hhfzl-0: failure sending request for machine jima28b-r9zht-master-hhfzl-0: cannot create vm: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code=\"InvalidParameter\" Message=\"The value 1024 of parameter 'osDisk.diskSizeGB' is out of range. The value must be between '1' and '1023', inclusive.\" Target=\"osDisk.diskSizeGB\"" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"jima28b-r9zht-master-hhfzl-0","uid":"6cb07114-41a6-40bc-8e83-d9f27931bc8c","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"451889"} "reason"="FailedCreate" "type"="Warning" $ oc logs -f control-plane-machine-set-operator-69b756df4f-skv4x E0129 02:35:13.282358 1 controller.go:818] "msg"="Observed failed replacement control plane machines" "error"="found replacement control plane machines in an error state, the following machines(s) are currently reporting an error: jima28b-r9zht-master-hhfzl-0" "controller"="controlplanemachineset" "failedReplacements"="jima28b-r9zht-master-hhfzl-0" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="a988d699-8ddc-4880-9930-0db64ca51653" I0129 02:35:13.282380 1 controller.go:264] "msg"="Cluster state is degraded. The control plane machine set will not take any action until issues have been resolved." "controller"="controlplanemachineset" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="a988d699-8ddc-4880-9930-0db64ca51653" 4. Change diskSizeGB to 1023, new machine Provisioned. 
osDisk: diskSettings: {} diskSizeGB: 1023 $ oc get machine NAME PHASE TYPE REGION ZONE AGE jima28b-r9zht-master-h7g67-1 Running Standard_DS5_v2 mtcazs 11h jima28b-r9zht-master-hhfzl-0 Deleting 7m1s jima28b-r9zht-master-qtb9j-0 Running Standard_DS5_v2 mtcazs 12h jima28b-r9zht-master-tprc7-2 Running Standard_DS5_v2 mtcazs 11h jima28b-r9zht-worker-mtcazs-p8d79 Running Standard_DS3_v2 mtcazs 18h jima28b-r9zht-worker-mtcazs-x5gvh Running Standard_DS3_v2 mtcazs 18h jima28b-r9zht-worker-mtcazs-xmdvw Running Standard_DS3_v2 mtcazs 18h $ oc get machine NAME PHASE TYPE REGION ZONE AGE jima28b-r9zht-master-h7g67-1 Running Standard_DS5_v2 mtcazs 11h jima28b-r9zht-master-qtb9j-0 Running Standard_DS5_v2 mtcazs 12h jima28b-r9zht-master-tprc7-2 Running Standard_DS5_v2 mtcazs 11h jima28b-r9zht-master-vqd7r-0 Provisioned Standard_DS3_v2 mtcazs 16s jima28b-r9zht-worker-mtcazs-p8d79 Running Standard_DS3_v2 mtcazs 18h jima28b-r9zht-worker-mtcazs-x5gvh Running Standard_DS3_v2 mtcazs 18h jima28b-r9zht-worker-mtcazs-xmdvw Running Standard_DS3_v2 mtcazs 18h
Actual results:
For a fresh install, the default diskSizeGB is 1024 for masters. But after updating the cpms vmSize, the new master failed to be created, reporting the error "The value 1024 of parameter 'osDisk.diskSizeGB' is out of range. The value must be between '1' and '1023', inclusive". When changing diskSizeGB to 1023, the new machine got Provisioned.
Expected results:
A new master should be created when the vmSize is changed, without needing to update diskSizeGB to 1023.
Additional info:
Minimum recommendation for control plane nodes is 1024 GB https://docs.openshift.com/container-platform/4.12/installing/installing_azure_stack_hub/installing-azure-stack-hub-network-customizations.html#installation-azure-stack-hub-config-yaml_installing-azure-stack-hub-network-customizations
Description of problem:
When the releaseImage is a digest, for example quay.io/openshift-release-dev/ocp-release@sha256:bbf1f27e5942a2f7a0f298606029d10600ba0462a09ab654f006ce14d314cb2c, a spurious warning is output when running openshift-install agent create image.
It's not calculating the releaseImage properly (see the '@sha' suffix below), so it causes this spurious message:
WARNING The ImageContentSources configuration in install-config.yaml should have at-least one source field matching the releaseImage value quay.io/openshift-release-dev/ocp-release@sha256
This can cause confusion for users.
Version-Release number of selected component (if applicable):
4.13
How reproducible:
Every time when using a release image with a digest is used
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Not able to convert a deployment to Serverless because the Make Serverless form in the console is broken.
Version-Release number of selected component (if applicable):
4.13
How reproducible:
Steps to Reproduce:
1. Create a deployment using a Container image flow
2. Select the Make Serverless option from the topology actions menu of the created deployment
Actual results:
After clicking on Create, it throws an error.
Expected results:
Should create a Serverless resource.
Additional info:
Description of problem:
OpenStack features SG rules opening traffic from `0.0.0.0/0` on NodePorts. This was required for the OVN loadbalancers to work properly, as they keep the source IP of the traffic when traffic reaches the LB members. This isn't needed anymore because in 4.14 OSASINFRA-3067 implemented and enabled the `manage-security-groups` option in cloud-provider-openstack, which creates and attaches the proper SG on its own to make sure only the necessary NodePorts are open.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Check for existence of rules opening traffic from 0.0.0.0/0 on the master and worker nodes.
Actual results:
Rules are still there.
Expected results:
Rules are not needed anymore.
Additional info:
Description of the problem:
According to swagger.yaml cpu_architecture in infra-envs can include 'multi', but that only makes sense in the cluster entity.
(Slack thread: https://redhat-internal.slack.com/archives/CUPJTHQ5P/p1680095368006089)
How reproducible:
100%
Steps to reproduce:
1. Check out the swagger.yaml here
Actual results:
enum: ['x86_64', 'aarch64', 'arm64','ppc64le','s390x','multi']
Expected results:
enum: ['x86_64', 'aarch64', 'arm64','ppc64le','s390x']
Description of problem:
Running `openshift-install cluster destroy` defeats an OpenStack cloud with many Swift objects, if said cloud is low on resources. In particular, testing the teardown of an OCP cluster with 500,000 objects in the image registry caused RabbitMQ to crash on a standalone (single-host) OpenStack deployment backed with NVMe storage.
Version-Release number of selected component (if applicable):
How reproducible:
On a constrained (single-host) OpenStack cloud, with the default limit of 10000 objects per bulk-deletion request in Swift.
Steps to Reproduce:
1. Install OpenShift
2. Upload 500000 arbitrary objects into the image-registry container
3. Launch cluster teardown
4. Enjoy Swift responding with 504 errors, and the rest of the cluster becoming unstable
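To confirm the bulk-deletion limit the destroyer runs into on a given cloud, something like the following can be used (a sketch; requires the python-swiftclient CLI and cloud credentials):
$ swift capabilities | grep -A 3 bulk_delete   # max_deletes_per_request defaults to 10000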
Description of problem:
The ingress operator is constantly reverting internal Services when it detects a service change, even when the changed fields are just default values.
Version-Release number of selected component (if applicable):
4.13, 4.14
How reproducible:
100%
Steps to Reproduce:
1. Create an ingress controller
2. Watch ingress operator logs for excess "updated internal service" updates
[I'll provide a more specific reproducer if needed]
Actual results:
Excess: 2023-05-04T02:08:02.331Z INFO operator.ingress_controller ingress/internal_service.go:44 updated internal service ...
Expected results:
No updates
Additional info:
The diff looks like: 2023-05-05T15:12:06.668Z INFO operator.ingress_controller ingress/internal_service.go:44 updated internal service {"namespace": "openshift-ingress", "name": "router-internal-default", "diff": " &v1.Service{ TypeMeta: {}, ObjectMeta: {Name: \"router-internal-default\", Namespace: \"openshift-ingress\", UID: \"815f1499-a4d4-4cb8-9a5b-9905580e0ffd\", ResourceVersion: \"8031\", ...}, Spec: v1.ServiceSpec{ Ports: {{Name: \"http\", Protocol: \"TCP\", Port: 80, TargetPort: {Type: 1, StrVal: \"http\"}, ...}, {Name: \"https\", Protocol: \"TCP\", Port: 443, TargetPort: {Type: 1, StrVal: \"https\"}, ...}, {Name: \"metrics\", Protocol: \"TCP\", Port: 1936, TargetPort: {Type: 1, StrVal: \"metrics\"}, ...}}, Selector: {\"ingresscontroller.operator.openshift.io/deployment-ingresscontroller\": \"default\"}, ClusterIP: \"172.30.56.107\", - ClusterIPs: []string{\"172.30.56.107\"}, + ClusterIPs: nil, Type: \"ClusterIP\", ExternalIPs: nil, - SessionAffinity: \"None\", + SessionAffinity: \"\", LoadBalancerIP: \"\", LoadBalancerSourceRanges: nil, ... // 3 identical fields PublishNotReadyAddresses: false, SessionAffinityConfig: nil, - IPFamilies: []v1.IPFamily{\"IPv4\"}, + IPFamilies: nil, - IPFamilyPolicy: &\"SingleStack\", + IPFamilyPolicy: nil, AllocateLoadBalancerNodePorts: nil, LoadBalancerClass: nil, - InternalTrafficPolicy: &\"Cluster\", + InternalTrafficPolicy: nil, }, Status: {}, } "}
Messing around with unit testing, it looks like internalServiceChanged triggers true when spec.IPFamilies, spec.IPFamilyPolicy, and spec.InternalTrafficPolicy are set to the default values that you see in the diff above.
Ingress operator then resets back to nil, then the API server sets them to their defaults, and this process repeats.
internalServiceChanged should either ignore these fields or explicitly set them.
Description of the problem:
In the Create cluster wizard -> Networking page, an error is shown saying that the cluster is not ready yet. The warning message suggests to define the API or Ingress IP but they are already input in the form and in the YAML (see screenshots attached)
Also, the hosts are oscillating between "Pending input" and "Insufficient" states, with the errors shown in the images
Found this error while testing epic MGMT-9907
MCE image 2.3.0-DOWNANDBACK-2023-03-28-23-01-58
Now that https://issues.redhat.com//browse/OCPBUGS-13153 sets a default value of GOMAXPROCS before running node exporter (see https://github.com/openshift/cluster-monitoring-operator/pull/1996), the doc at https://github.com/openshift/cluster-monitoring-operator/blob/45bdf6f0148b771618d0dd89c432e7a1932e7a0a/pkg/manifests/types.go#L289-L295 should be adjusted.
Please review the following PR: https://github.com/openshift/machine-api-provider-aws/pull/62
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Should not need any special QA.
Description of problem:
The OpenShift DNS daemonset uses the rolling update strategy. The "maxSurge" parameter is set to a non-zero value, which means that the "maxUnavailable" parameter is set to zero. When the user replaces the toleration in the daemonset's template spec (via the OpenShift DNS config API) from the one which allows scheduling on the master nodes to any other toleration, the new pods still try to be scheduled on the master nodes. The old pods on the still-tolerated nodes may get recreated, but only if they are processed before any pod from a node that is no longer tolerated. The new pods are not expected to be scheduled on the nodes which are not tolerated by the new daemonset's template spec. The daemonset controller should just delete the old pods from the nodes which cannot be tolerated anymore. The old pods from the nodes which can still be tolerated should be recreated according to the rolling update parameters.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create the daemonset which tolerates "node-role.kubernetes.io/master" taint and has the following rolling update parameters:
$ oc -n openshift-dns get ds dns-default -o yaml | yq .spec.updateStrategy
rollingUpdate:
  maxSurge: 10%
  maxUnavailable: 0
type: RollingUpdate
$ oc -n openshift-dns get ds dns-default -o yaml | yq .spec.template.spec.tolerations
- key: node-role.kubernetes.io/master
  operator: Exists
2. Let the daemonset to be scheduled on all the target nodes (e.g. all masters and all workers)
$ oc -n openshift-dns get pods -o wide | grep dns-default
dns-default-6bfmf   2/2   Running   0   119m    10.129.0.40   ci-ln-sb5ply2-72292-qlhc8-master-2         <none>   <none>
dns-default-9cjdf   2/2   Running   0   2m35s   10.129.2.15   ci-ln-sb5ply2-72292-qlhc8-worker-c-m5wzq   <none>   <none>
dns-default-c6j9x   2/2   Running   0   119m    10.128.0.13   ci-ln-sb5ply2-72292-qlhc8-master-0         <none>   <none>
dns-default-fhqrs   2/2   Running   0   2m12s   10.131.0.29   ci-ln-sb5ply2-72292-qlhc8-worker-a-6q7hs   <none>   <none>
dns-default-lx2nf   2/2   Running   0   119m    10.130.0.15   ci-ln-sb5ply2-72292-qlhc8-master-1         <none>   <none>
dns-default-mmc78   2/2   Running   0   112m    10.128.2.7    ci-ln-sb5ply2-72292-qlhc8-worker-b-bpjdk   <none>   <none>
3. Update the daemonset's tolerations by removing "node-role.kubernetes.io/master" and adding any other toleration (not existing works too):
$ oc -n openshift-dns get ds dns-default -o yaml | yq .spec.template.spec.tolerations
- key: test-taint
  operator: Exists
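For reference, step 3 can be performed through the DNS operator config rather than by editing the DaemonSet directly (a sketch, assuming spec.nodePlacement.tolerations is the field the operator propagates to the template spec):
$ oc patch dnses.operator.openshift.io/default --type merge \
    -p '{"spec":{"nodePlacement":{"tolerations":[{"key":"test-taint","operator":"Exists"}]}}}'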
Actual results:
$ oc -n openshift-dns get pods -o wide | grep dns-default
dns-default-6bfmf   2/2   Running   0   124m    10.129.0.40   ci-ln-sb5ply2-72292-qlhc8-master-2         <none>   <none>
dns-default-76vjz   0/2   Pending   0   3m2s    <none>        <none>                                     <none>   <none>
dns-default-9cjdf   2/2   Running   0   7m24s   10.129.2.15   ci-ln-sb5ply2-72292-qlhc8-worker-c-m5wzq   <none>   <none>
dns-default-c6j9x   2/2   Running   0   124m    10.128.0.13   ci-ln-sb5ply2-72292-qlhc8-master-0         <none>   <none>
dns-default-fhqrs   2/2   Running   0   7m1s    10.131.0.29   ci-ln-sb5ply2-72292-qlhc8-worker-a-6q7hs   <none>   <none>
dns-default-lx2nf   2/2   Running   0   124m    10.130.0.15   ci-ln-sb5ply2-72292-qlhc8-master-1         <none>   <none>
dns-default-mmc78   2/2   Running   0   117m    10.128.2.7    ci-ln-sb5ply2-72292-qlhc8-worker-b-bpjdk   <none>   <none>
Expected results:
$ oc -n openshift-dns get pods -o wide | grep dns-default
dns-default-9cjdf   2/2   Running   0   7m24s   10.129.2.15   ci-ln-sb5ply2-72292-qlhc8-worker-c-m5wzq   <none>   <none>
dns-default-fhqrs   2/2   Running   0   7m1s    10.131.0.29   ci-ln-sb5ply2-72292-qlhc8-worker-a-6q7hs   <none>   <none>
dns-default-mmc78   2/2   Running   0   7m54s   10.128.2.7    ci-ln-sb5ply2-72292-qlhc8-worker-b-bpjdk   <none>   <none>
Additional info:
Upstream issue: https://github.com/kubernetes/kubernetes/issues/118823
Slack discussion: https://redhat-internal.slack.com/archives/CKJR6200N/p1687455135950439
Description of problem:
We shouldn't enforce PSa in 4.14, neither by label sync nor by global cluster config.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
100%
Steps to Reproduce:
As a cluster admin:
1. Create two new namespaces/projects: pokus, openshift-pokus
2. Attempt to create a privileged pod in both of the namespaces from step 1
Actual results:
pod creation is blocked by pod security admission
Expected results:
only a warning about pod violating the namespace pod security level should be emitted
Additional info:
Description of problem:
When you have an HCP running and it's creating the HostedCluster pods, it renders this IgnitionProxy config:
defaults
  mode http
  timeout connect 5s
  timeout client 30s
  timeout server 30s
frontend ignition-server
  bind *:8443 ssl crt /tmp/tls.pem
  default_backend ignition_servers
backend ignition_servers
  server ignition-server ignition-server:443 check ssl ca-file /etc/ssl/root-ca/ca.crt
This configuration is not supported on IPv6, causing the worker nodes to fail to download the ignition payload.
Version-Release number of selected component (if applicable):
MCE 2.4 OCP 4.14
How reproducible:
Always
Steps to Reproduce:
1. Create a HostedCluster with the networking parameters set to IPv6 networks.
2. Check the IgnitionProxy config using: oc rsh <pod> cat /tmp/haproxy.conf
Actual results:
Agent pod in the destination workers fails with: Jul 26 10:23:44 localhost.localdomain next_step_runne[4242]: time="26-07-2023 10:23:44" level=error msg="ignition file download failed: request failed: Get \"https://ignition-server-clusters-hosted.apps.ocp-edge-cluster-0.qe.lab.redhat.com/ignition\": EOF" file="apivip_check.go:160"
Expected results:
The worker should download the ignition payload properly
Additional info:
N/A
4.14 e2e-metal-ipi jobs are failing with
: [sig-instrumentation] Prometheus [apigroup:image.openshift.io] when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early][apigroup:config.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
This is the alert that is firing,
promQL query returned unexpected results:
ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards|KubeJobFailed|Watchdog|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|KubePodNotReady|etcdMembersDown|etcdMembersDown|etcdGRPCRequestsSlow|etcdGRPCRequestsSlow|etcdHighNumberOfFailedGRPCRequests|etcdHighNumberOfFailedGRPCRequests|etcdMemberCommunicationSlow|etcdMemberCommunicationSlow|etcdNoLeader|etcdNoLeader|etcdHighFsyncDurations|etcdHighFsyncDurations|etcdHighCommitDurations|etcdHighCommitDurations|etcdInsufficientMembers|etcdInsufficientMembers|etcdHighNumberOfLeaderChanges|etcdHighNumberOfLeaderChanges|KubeAPIErrorBudgetBurn|KubeAPIErrorBudgetBurn|KubeClientErrors|KubeClientErrors|KubePersistentVolumeErrors|KubePersistentVolumeErrors|MCDDrainError|MCDDrainError|MCDPivotError|MCDPivotError|PrometheusOperatorWatchErrors|PrometheusOperatorWatchErrors|RedhatOperatorsCatalogError|RedhatOperatorsCatalogError|VSphereOpenshiftNodeHealthFail|VSphereOpenshiftNodeHealthFail|SamplesImagestreamImportFailing|SamplesImagestreamImportFailing",alertstate="firing",severity!="info"} >= 1
[
{
"metric":
,
"value": [
1680670057.374,
"1"
]
},
Please review the following PR: https://github.com/openshift/cloud-provider-vsphere/pull/37
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Currently, we unconditionally use an image mapping from the management cluster if a mapping exists for ocp-release-dev or ocp/release. When the individual images do not use those registries, the wrong mapping is used.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Create an ICSP on a management cluster:
apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: image-policy-39
spec:
  repositoryDigestMirrors:
  - mirrors:
    - quay.io/openshift-release-dev/ocp-release
    - pull.q1w2.quay.rhcloud.com/openshift-release-dev/ocp-release
    source: quay.io/openshift-release-dev/ocp-release
2. Create a HostedCluster that uses a CI release
Actual results:
Nodes never join because ignition server is looking up the wrong image for the CCO and MCO.
Expected results:
Nodes can join the cluster.
Additional info:
Please review the following PR: https://github.com/openshift/coredns/pull/89
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Tracker issue for bootimage bump in 4.14. This issue should block issues which need a bootimage bump to fix.
The previous bump was OCPBUGS-11788.
Description of problem:
TRT has identified a likely regression in Metal IPv6 installations. 4.14 installs are statistically worse than 4.13. We are working on a new tool called Component Readiness that does cross-release comparisons to ensure nothing gets worse. I think it has found something in metal. At GA, 4.13 metal installs for ipv6 upgrade-micro jobs were 100%. They are now around 89% in 4.14. All the failures seem to have the same mode where no workers come up, with PXE errors in the serial console.
!image-2023-06-06-10-13-13-310.png|thumbnail!
You can view the report here: https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?arch=amd64&baseEndTime=2023-05-16%2023%3A59%3A59&baseRelease=4.13&baseStartTime=2023-04-18%2000%3A00%3A00&capability=Other&component=Installer%20%2F%20openshift-installer&confidence=95&environment=ovn%20upgrade-micro%20amd64%20metal-ipi%20standard&excludeArches=arm64&excludeClouds=alibaba%2Cibmcloud%2Clibvirt%2Covirt&groupBy=cloud%2Carch%2Cnetwork&ignoreDisruption=true&ignoreMissing=false&minFail=3&network=ovn&pity=5&platform=metal-ipi&sampleEndTime=2023-06-06%2023%3A59%3A59&sampleRelease=4.14&sampleStartTime=2023-05-09%2000%3A00%3A00&testId=cluster%20install%3A0cb1bb27e418491b1ffdacab58c5c8c0&testName=install%20should%20succeed%3A%20overall&upgrade=upgrade-micro&variant=standard
The serial console on the workers shows PXE errors:
>>Start PXE over IPv4.
PXE-E18: Server response timeout.
BdsDxe: failed to load Boot0001 "UEFI PXEv4 (MAC:00962801D023)" from PciRoot(0x0)/Pci(0x2,0x0)/Pci(0x0,0x0)/MAC(00962801D023,0x1)/IPv4(0.0.0.0,0x0,DHCP,0.0.0.0,0.0.0.0,0.0.0.0): Not Found
>>Start PXE over IPv6..
Station IP address is FD00:1101:0:0:2EE1:8456:96FB:68B1
Server IP address is FD00:1101:0:0:0:0:0:3
NBP filename is snponly.efi
NBP filesize is 0 Bytes
PXE-E18: Server response timeout.
BdsDxe: failed to load Boot0002 "UEFI PXEv6 (MAC:00962801D023)" from PciRoot(0x0)/Pci(0x2,0x0)/Pci(0x0,0x0)/MAC(00962801D023,0x1)/IPv6(0000:0000:0000:0000:0000:0000:0000:0000,0x0,Static,0000:0000:0000:0000:0000:0000:0000:0000,0x40,0000:0000:0000:0000:0000:0000:0000:0000): Not Found
>>Start HTTP Boot over IPv4.
Error: Could not retrieve NBP file size from HTTP server.
Error: Server response timeout.
BdsDxe: failed to load Boot0003 "UEFI HTTPv4 (MAC:00962801D023)" from PciRoot(0x0)/Pci(0x2,0x0)/Pci(0x0,0x0)/MAC(00962801D023,0x1)/IPv4(0.0.0.0,0x0,DHCP,0.0.0.0,0.0.0.0,0.0.0.0)/Uri(): Not Found
>>Start HTTP Boot over IPv6..
Error: Could not retrieve NBP file size from HTTP server.
Error: Remote boot cancelled.
BdsDxe: failed to load Boot0004 "UEFI HTTPv6 (MAC:00962801D023)" from PciRoot(0x0)/Pci(0x2,0x0)/Pci(0x0,0x0)/MAC(00962801D023,0x1)/IPv6(0000:0000:0000:0000:0000:0000:0000:0000,0x0,Static,0000:0000:0000:0000:0000:0000:0000:0000,0x40,0000:0000:0000:0000:0000:0000:0000:0000)/Uri(): Not Found
BdsDxe: No bootable option or device was found.
BdsDxe: Press any key to enter the Boot Manager Menu.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
10%
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Example failures:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-metal-ipi-upgrade-ovn-ipv6/1665428719952465920
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-metal-ipi-upgrade-ovn-ipv6/1664711616538611712
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-metal-ipi-upgrade-ovn-ipv6/1664645418744549376
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-metal-ipi-upgrade-ovn-ipv6/1663915360878858240
Description of the problem:
Creating a host without any disks will cause the error log message below in the service, without any indicative error message being displayed to the user.
In this case the status remains Discovering and the user cannot know what the issue is.
Log from the service:
time="2023-06-07T12:36:09Z" level=error msg="failed to create new validation context for host e0b465cc-e91f-4ca6-9594-27052a9a6f28" func="github.com/openshift/assisted-service/internal/host.(*Manager).IsValidMasterCandidate" file="/assisted-service/internal/host/host.go:1280" error="Inventory is not valid" pkg=cluster-state
Example inventory:
{ "bmc_address": "0.0.0.0", "bmc_v6address": ":: /0", "boot": { "current_boot_mode": "uefi" }, "cpu": { "architecture": "x86_64", "count": 8, "flags": [ "fpu", "vme", "de", "pse", "tsc", "msr", "pae", "mce", "cx8", "apic", "sep", "mtrr", "pge", "mca", "cmov", "pat", "pse36", "clflush", "mmx", "fxsr", "sse", "sse2", "ht", "syscall", "nx", "mmxext", "fxsr_opt", "pdpe1gb", "rdtscp", "lm", "rep_good", "nopl", "cpuid", "extd_apicid", "tsc_known_freq", "pni", "pclmulqdq", "ssse3", "fma", "cx16", "pcid", "sse4_1", "sse4_2", "x2apic", "movbe", "popcnt", "tsc_deadline_timer", "aes", "xsave", "avx", "f16c", "rdrand", "hypervisor", "lahf_lm", "cmp_legacy", "cr8_legacy", "abm", "sse4a", "misalignsse", "3dnowprefetch", "osvw", "topoext", "perfctr_core", "ssbd", "ibrs", "ibpb", "stibp", "vmmcall", "fsgsbase", "tsc_adjust", "bmi1", "avx2", "smep", "bmi2", "rdseed", "adx", "smap", "clflushopt", "clwb", "sha_ni", "xsaveopt", "xsavec", "xgetbv1", "xsaves", "clzero", "xsaveerptr", "wbnoinvd", "arat", "umip", "vaes", "vpclmulqdq", "rdpid", "arch_capabilities" ], "frequency": 2545.214, "model_name": "AMD EPYC 7J13 64-Core Processor" }, "disks": [], "gpus": [ { "address": "0000: 00: 02.0" } ], "hostname": "02-00-17-01-2c-cf", "interfaces": [ { "flags": [ "up", "broadcast", "multicast" ], "has_carrier": true, "ipv4_addresses": [ "10.0.28.205/20" ], "ipv6_addresses": [], "mac_address": "02: 00: 17: 01: 2c: cf", "mtu": 9000, "name": "ens3", "product": "0x101e", "speed_mbps": 50000, "type": "physical", "vendor": "0x15b3" } ], "memory": { "physical_bytes": 17179869184, "physical_bytes_method": "dmidecode", "usable_bytes": 16765730816 }, "routes": [ { "destination": "0.0.0.0", "family": 2, "gateway": "10.0.16.1", "interface": "ens3", "metric": 100 }, { "destination": "10.0.16.0", "family": 2, "interface": "ens3", "metric": 100 }, { "destination": "10.88.0.0", "family": 2, "interface": "cni-podman0" }, { "destination": "169.254.0.0", "family": 2, "interface": "ens3", "metric": 100 }, { "destination": ":: 1", "family": 10, "interface": "lo", "metric": 256 }, { "destination": "fe80:: ", "family": 10, "interface": "cni-podman0", "metric": 256 }, { "destination": "fe80:: ", "family": 10, "interface": "ens3", "metric": 1024 } ], "system_vendor": { "manufacturer": "QEMU", "product_name": "Standard PC (i440FX + PIIX, 1996)", "virtual": true }, "tpm_version": "none" }
Steps to reproduce:
1. Register a new cluster
2. Generate image and deploy nodes without disks
Actual results:
Expected results:
Fail validation if the inventory is invalid.
Description of problem:
`cluster-reader` ClusterRole should have ["get", "list", "watch"] permissions for a number of privileged CRs, but lacks them for the API Group "k8s.ovn.org", which includes CRs such as EgressFirewalls, EgressIPs, etc.
Version-Release number of selected component (if applicable):
OCP 4.10 - 4.12 OVN
How reproducible:
Always
Steps to Reproduce:
1. Create a cluster with OVN components, e.g. EgressFirewall
2. Check permissions of ClusterRole `cluster-reader`
Actual results:
No permissions for OVN resources
Expected results:
Get, list, and watch verb permissions for OVN resources
Additional info:
Looks like a similar bug was opened for "network-attachment-definitions" in OCPBUGS-6959 (whose closure is being contested).
Description of problem:
The HostedCluster name is not currently validated against RFC1123.
Version-Release number of selected component (if applicable):
How reproducible:
Every time
Steps to Reproduce:
1. 2. 3.
Actual results:
Any HostedCluster name is allowed
Expected results:
Only HostedCluster names meeting RFC1123 validation should be allowed.
Additional info:
Hypershift needs to be able to specify a different release payload for control plane components without redeploying anything in the hosted cluster.
ovnkube-node DaemonSet pods in the hosted cluster and the ovnkube-master pods that run in the control plane both use the same ovn-kubernetes image passed to the CNO.
We need a way to specify these images separately for ovnkube-node and ovnkube-master.
Background:
https://docs.google.com/document/d/1a3tAS_K6lQ2iicjvuIvPIK5lervXFEVQBCAXopBAJ6o/edit
Description of problem:
The CoreDNS template implementation uses an incorrect regex for resolving the dot [.] character.
Version-Release number of selected component (if applicable):
NA
How reproducible:
100% when you use router sharding with domains including apps
Steps to Reproduce:
1. Create an additional IngressController with domain names including "apps", for example: example.test-apps.<clustername>.<clusterdomain>
2. Create and configure the external LB corresponding to the additional IngressController
3. Configure the corporate DNS server and create records for this additional IngressController resolving to the LB IP set up in step 2 above
4. Try resolving the additional domain routes from outside the cluster and within the cluster. The DNS resolution works fine from outside the cluster. However, within the cluster all additional domains containing "apps" in the domain name resolve to the default ingress VIP instead of their corresponding LB IPs configured on the corporate DNS server.
As an alternate and simple test, you can reproduce it by using the dig command on a cluster node with the additional domain, for example:
sh-4.4# dig test.apps-test..<clustername>.<clusterdomain>
Actual results:
DNS resolves all the domains containing "apps" to the default ingress VIP. For example, example.test-apps.<clustername>.<clusterdomain> resolves to the default ingress VIP instead of its actual corresponding LB IP.
Expected results:
DNS should resolve it to the corresponding LB IP configured on the DNS server.
Additional info:
The DNS resolution happens via the Corefile template used on the node, which treats dot (.) as the regex "any character" instead of a literal dot [.]. This is a regex configuration bug inside the Corefile used on vSphere IPI clusters.
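A small, self-contained Go illustration of the underlying regex issue (the domain names are made up; the real pattern lives in the CoreDNS Corefile template): an unescaped dot matches any character, so a host containing "-apps." is wrongly captured by a pattern intended only for ".apps." subdomains.

    package main

    import (
    	"fmt"
    	"regexp"
    )

    func main() {
    	// Hypothetical patterns for a cluster domain "mycluster.example.com":
    	// the unescaped dot before "apps" matches any character, including "-".
    	unescaped := regexp.MustCompile(`.apps.mycluster.example.com`)
    	escaped := regexp.MustCompile(`\.apps\.mycluster\.example\.com`)

    	host := "example.test-apps.mycluster.example.com"
    	fmt.Println(unescaped.MatchString(host)) // true  - "-apps." satisfies ".apps."
    	fmt.Println(escaped.MatchString(host))   // false - only real ".apps." subdomains match
    }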
Description of problem:
We currently do some frontend logic to list and search CatalogSources for the source associated with the CSV and Subscription on the CSV details page. If we can't find the CatalogSource, we show an error message and prevent updates from the Subscription tab.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Create an htpasswd idp with any user
2. Create a project admin role binding for this user
3. Install an operator in the namespace where the user has project admin permissions
4. Visit the CSV details page while logged in as the project admin user
5. View the subscriptions tab
Actual results:
An alert is shown indicating that the CatalogSource is missing, and updates to the operator are prevented.
Expected results:
If the Subscription shows the catalog source as healthy in its status stanza, we shouldn't show an alert or prevent updates.
Additional info:
Reproducing this bug is dependent on the fix for OCPBUGS-3036 which prevents project admin users from viewing the Subscription tab at all.
Description of problem:
While investigating issue [1] we noticed a few problems with CNO error reporting on the ClusterOperator status [2]: that's fine, but I think there are a couple of bugs to write up:
1. When a panic happens, the operator doesn't go Degraded. This can definitely be done.
2. When the status cannot be updated, the operator should go Degraded.
3. When the service network and/or cluster network in status is missing, the operator should go Available=false.
[1] https://github.com/openshift/cluster-network-operator/pull/1669 [2] https://coreos.slack.com/archives/CB48XQ4KZ/p1671207248527519?thread_ts=1671197854.825529&cid=CB48XQ4KZ
Version-Release number of selected component (if applicable):
4.13 and previous.
How reproducible:
Always
Steps to Reproduce:
1. Cause a deliberate panic e.g. in the bootstrap code.
Actual results:
Operator keeps getting restarted and is not Degraded.
Expected results:
Operator goes Degraded.
Additional info:
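A minimal sketch of item 1 above, assuming a placeholder setDegraded helper rather than the CNO's real status-setting code: recover from a panic in the reconcile path and surface it as Degraded instead of only crash-looping.

    package main

    import (
    	"fmt"
    	"runtime/debug"
    )

    // setDegraded stands in for whatever the operator uses to update the
    // ClusterOperator status; it is a placeholder, not the real CNO API.
    func setDegraded(reason, message string) {
    	fmt.Printf("Degraded=True reason=%s message=%s\n", reason, message)
    }

    // reconcileWithRecovery wraps a reconcile pass so that a panic is reported
    // as Degraded instead of only crashing and restarting the operator.
    func reconcileWithRecovery(reconcile func() error) (err error) {
    	defer func() {
    		if r := recover(); r != nil {
    			setDegraded("ReconcilePanic", fmt.Sprintf("panic: %v", r))
    			debug.PrintStack()
    			err = fmt.Errorf("reconcile panicked: %v", r)
    		}
    	}()
    	return reconcile()
    }

    func main() {
    	_ = reconcileWithRecovery(func() error {
    		panic("boom") // simulate a panic in the bootstrap code
    	})
    }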
Description of problem:
The advertise address configured for our hcp etcd clusters is not resolvable via DNS (ie. etcd-0.etcd-client.namespace.svc:2379). This impacts some of the etcd tooling that expects to access each member by their advertise address.
Version-Release number of selected component (if applicable):
4.14 (and earlier)
How reproducible:
Always
Steps to Reproduce:
1. Create a HostedCluster and wait for it to come up.
2. Exec into an etcd pod and query cluster endpoint health:
$ oc rsh etcd-0
$ etcdctl --cacert /etc/etcd/tls/etcd-ca/ca.crt \
    --cert /etc/etcd/tls/server/server.crt \
    --key /etc/etcd/tls/server/server.key \
    --endpoints https://localhost:2379 \
    endpoint health --cluster -w table
Actual results:
An error is returned similar to: {"level":"warn","ts":"2023-08-07T20:40:49.890254Z","logger":"client","caller":"v3@v3.5.9/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000378fc0/etcd-0.etcd-client.clusters-test-cluster.svc:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp: lookup etcd-0.etcd-client.clusters-test-cluster.svc on 172.30.0.10:53: no such host\""}
Expected results:
Actual cluster health is returned: +--------------------------------------------------------------+--------+-------------+-------+ | ENDPOINT | HEALTH | TOOK | ERROR | +--------------------------------------------------------------+--------+-------------+-------+ | https://etcd-0.etcd-discovery.clusters-cewong-guest.svc:2379 | true | 9.372168ms | | | https://etcd-2.etcd-discovery.clusters-cewong-guest.svc:2379 | true | 12.269226ms | | | https://etcd-1.etcd-discovery.clusters-cewong-guest.svc:2379 | true | 12.291392ms | | +--------------------------------------------------------------+--------+-------------+-------+
Additional info:
The etcd statefulset is created with spec.serviceName set to `etcd-discovery`. This means that pods in the statefulset get the subdomain `etcd-discovery`, and names like etcd-0.etcd-discovery.[ns].svc are resolvable. However, the same is not true for the etcd-client service: etcd-0.etcd-client.[ns].svc is not resolvable. The fix would be to change the advertise address of each member to a resolvable name (i.e. etcd-0.etcd-discovery.[ns].svc) and adjust the server certificate to allow those names as well.
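A minimal sketch of the proposed fix, building the advertise URL from the statefulset's headless `etcd-discovery` service so it is DNS-resolvable; the port and URL shape follow the examples above.

    package main

    import "fmt"

    // advertiseClientURL sketches building a DNS-resolvable advertise address
    // for an etcd member: pods in a StatefulSet with
    // spec.serviceName=etcd-discovery resolve as <pod>.etcd-discovery.<ns>.svc.
    func advertiseClientURL(podName, namespace string) string {
    	return fmt.Sprintf("https://%s.etcd-discovery.%s.svc:2379", podName, namespace)
    }

    func main() {
    	fmt.Println(advertiseClientURL("etcd-0", "clusters-test-cluster"))
    	// https://etcd-0.etcd-discovery.clusters-test-cluster.svc:2379
    }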
Description of problem:
While/after upgrading to OKD 4.11-2023-01-14, CoreDNS has a problem with UDP overflows, so DNS lookups are very slow and cause the ingress operator upgrade to stall. We needed to work around it with force_tcp following this: https://access.redhat.com/solutions/5984291
Version-Release number of selected component (if applicable):
How reproducible:
100%, but it seems to depend on the network environment (exact cause unknown)
Steps to Reproduce:
1. install cluster with OKD 4.11-2022-12-02 or earlier 2. initiate upgrade to OKD 4.11-2023-01-14 3. upgrade will stall after upgrading CoreDNS
Actual results:
CoreDNS logs: [ERROR] plugin/errors: 2 oauth-openshift.apps.okd-admin.muc.lv1871.de. AAAA: dns: overflowing header size
Expected results:
Additional info:
Needed for FIPS compliance
Description of problem:
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2023-03-17-161027
How reproducible:
Always
Steps to Reproduce:
1. Create a GCP XPN cluster with flexy job template ipi-on-gcp/versioned-installer-xpn-ci, then run 'oc describe node'
2. Check logs for cloud-network-config-controller pods
Actual results:
% oc get nodes NAME STATUS ROLES AGE VERSION huirwang-0309d-r85mj-master-0.c.openshift-qe.internal Ready control-plane,master 173m v1.26.2+06e8c46 huirwang-0309d-r85mj-master-1.c.openshift-qe.internal Ready control-plane,master 173m v1.26.2+06e8c46 huirwang-0309d-r85mj-master-2.c.openshift-qe.internal Ready control-plane,master 173m v1.26.2+06e8c46 huirwang-0309d-r85mj-worker-a-wsrls.c.openshift-qe.internal Ready worker 162m v1.26.2+06e8c46 huirwang-0309d-r85mj-worker-b-5txgq.c.openshift-qe.internal Ready worker 162m v1.26.2+06e8c46 `oc describe node`, there is no related egressIP annotations % oc describe node huirwang-0309d-r85mj-worker-a-wsrls.c.openshift-qe.internal Name: huirwang-0309d-r85mj-worker-a-wsrls.c.openshift-qe.internal Roles: worker Labels: beta.kubernetes.io/arch=amd64 beta.kubernetes.io/instance-type=n2-standard-4 beta.kubernetes.io/os=linux failure-domain.beta.kubernetes.io/region=us-central1 failure-domain.beta.kubernetes.io/zone=us-central1-a kubernetes.io/arch=amd64 kubernetes.io/hostname=huirwang-0309d-r85mj-worker-a-wsrls.c.openshift-qe.internal kubernetes.io/os=linux machine.openshift.io/interruptible-instance= node-role.kubernetes.io/worker= node.kubernetes.io/instance-type=n2-standard-4 node.openshift.io/os_id=rhcos topology.gke.io/zone=us-central1-a topology.kubernetes.io/region=us-central1 topology.kubernetes.io/zone=us-central1-a Annotations: csi.volume.kubernetes.io/nodeid: {"pd.csi.storage.gke.io":"projects/openshift-qe/zones/us-central1-a/instances/huirwang-0309d-r85mj-worker-a-wsrls"} k8s.ovn.org/host-addresses: ["10.0.32.117"] k8s.ovn.org/l3-gateway-config: {"default":{"mode":"shared","interface-id":"br-ex_huirwang-0309d-r85mj-worker-a-wsrls.c.openshift-qe.internal","mac-address":"42:01:0a:00:... k8s.ovn.org/node-chassis-id: 7fb1870c-4315-4dcb-910c-0f45c71ad6d3 k8s.ovn.org/node-gateway-router-lrp-ifaddr: {"ipv4":"100.64.0.5/16"} k8s.ovn.org/node-mgmt-port-mac-address: 16:52:e3:8c:13:e2 k8s.ovn.org/node-primary-ifaddr: {"ipv4":"10.0.32.117/32"} k8s.ovn.org/node-subnets: {"default":["10.131.0.0/23"]} machine.openshift.io/machine: openshift-machine-api/huirwang-0309d-r85mj-worker-a-wsrls machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable machineconfiguration.openshift.io/currentConfig: rendered-worker-bec5065070ded51e002c566a9c5bd16a machineconfiguration.openshift.io/desiredConfig: rendered-worker-bec5065070ded51e002c566a9c5bd16a machineconfiguration.openshift.io/desiredDrain: uncordon-rendered-worker-bec5065070ded51e002c566a9c5bd16a machineconfiguration.openshift.io/lastAppliedDrain: uncordon-rendered-worker-bec5065070ded51e002c566a9c5bd16a machineconfiguration.openshift.io/reason: machineconfiguration.openshift.io/state: Done volumes.kubernetes.io/controller-managed-attach-detach: true % oc logs cloud-network-config-controller-5cd96d477d-2kmc9 -n openshift-cloud-network-config-controller W0320 03:00:08.981493 1 client_config.go:618] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work. I0320 03:00:08.982280 1 leaderelection.go:248] attempting to acquire leader lease openshift-cloud-network-config-controller/cloud-network-config-controller-lock... 
E0320 03:00:38.982868 1 leaderelection.go:330] error retrieving resource lock openshift-cloud-network-config-controller/cloud-network-config-controller-lock: Get "https://api-int.huirwang-0309d.qe.gcp.devcluster.openshift.com:6443/api/v1/namespaces/openshift-cloud-network-config-controller/configmaps/cloud-network-config-controller-lock": dial tcp: lookup api-int.huirwang-0309d.qe.gcp.devcluster.openshift.com: i/o timeout E0320 03:01:23.863454 1 leaderelection.go:330] error retrieving resource lock openshift-cloud-network-config-controller/cloud-network-config-controller-lock: Get "https://api-int.huirwang-0309d.qe.gcp.devcluster.openshift.com:6443/api/v1/namespaces/openshift-cloud-network-config-controller/configmaps/cloud-network-config-controller-lock": dial tcp: lookup api-int.huirwang-0309d.qe.gcp.devcluster.openshift.com on 172.30.0.10:53: read udp 10.129.0.14:52109->172.30.0.10:53: read: connection refused I0320 03:02:19.249359 1 leaderelection.go:258] successfully acquired lease openshift-cloud-network-config-controller/cloud-network-config-controller-lock I0320 03:02:19.250662 1 controller.go:88] Starting node controller I0320 03:02:19.250681 1 controller.go:91] Waiting for informer caches to sync for node workqueue I0320 03:02:19.250693 1 controller.go:88] Starting secret controller I0320 03:02:19.250703 1 controller.go:91] Waiting for informer caches to sync for secret workqueue I0320 03:02:19.250709 1 controller.go:88] Starting cloud-private-ip-config controller I0320 03:02:19.250715 1 controller.go:91] Waiting for informer caches to sync for cloud-private-ip-config workqueue I0320 03:02:19.258642 1 controller.go:182] Assigning key: huirwang-0309d-r85mj-master-2.c.openshift-qe.internal to node workqueue I0320 03:02:19.258671 1 controller.go:182] Assigning key: huirwang-0309d-r85mj-master-1.c.openshift-qe.internal to node workqueue I0320 03:02:19.258682 1 controller.go:182] Assigning key: huirwang-0309d-r85mj-master-0.c.openshift-qe.internal to node workqueue I0320 03:02:19.351258 1 controller.go:96] Starting node workers I0320 03:02:19.351303 1 controller.go:102] Started node workers I0320 03:02:19.351298 1 controller.go:96] Starting secret workers I0320 03:02:19.351331 1 controller.go:102] Started secret workers I0320 03:02:19.351265 1 controller.go:96] Starting cloud-private-ip-config workers I0320 03:02:19.351508 1 controller.go:102] Started cloud-private-ip-config workers E0320 03:02:19.589704 1 controller.go:165] error syncing 'huirwang-0309d-r85mj-master-1.c.openshift-qe.internal': error retrieving the private IP configuration for node: huirwang-0309d-r85mj-master-1.c.openshift-qe.internal, err: error retrieving the network interface subnets, err: googleapi: Error 404: The resource 'projects/openshift-qe/regions/us-central1/subnetworks/installer-shared-vpc-subnet-1' was not found, notFound, requeuing in node workqueue E0320 03:02:19.615551 1 controller.go:165] error syncing 'huirwang-0309d-r85mj-master-0.c.openshift-qe.internal': error retrieving the private IP configuration for node: huirwang-0309d-r85mj-master-0.c.openshift-qe.internal, err: error retrieving the network interface subnets, err: googleapi: Error 404: The resource 'projects/openshift-qe/regions/us-central1/subnetworks/installer-shared-vpc-subnet-1' was not found, notFound, requeuing in node workqueue E0320 03:02:19.644628 1 controller.go:165] error syncing 'huirwang-0309d-r85mj-master-2.c.openshift-qe.internal': error retrieving the private IP configuration for node: 
huirwang-0309d-r85mj-master-2.c.openshift-qe.internal, err: error retrieving the network interface subnets, err: googleapi: Error 404: The resource 'projects/openshift-qe/regions/us-central1/subnetworks/installer-shared-vpc-subnet-1' was not found, notFound, requeuing in node workqueue E0320 03:02:19.774047 1 controller.go:165] error syncing 'huirwang-0309d-r85mj-master-0.c.openshift-qe.internal': error retrieving the private IP configuration for node: huirwang-0309d-r85mj-master-0.c.openshift-qe.internal, err: error retrieving the network interface subnets, err: googleapi: Error 404: The resource 'projects/openshift-qe/regions/us-central1/subnetworks/installer-shared-vpc-subnet-1' was not found, notFound, requeuing in node workqueue E0320 03:02:19.783309 1 controller.go:165] error syncing 'huirwang-0309d-r85mj-master-1.c.openshift-qe.internal': error retrieving the private IP configuration for node: huirwang-0309d-r85mj-master-1.c.openshift-qe.internal, err: error retrieving the network interface subnets, err: googleapi: Error 404: The resource 'projects/openshift-qe/regions/us-central1/subnetworks/installer-shared-vpc-subnet-1' was not found, notFound, requeuing in node workqueue E0320 03:02:19.816430 1 controller.go:165] error syncing 'huirwang-0309d-r85mj-master-2.c.openshift-qe.internal': error retrieving the private IP configuration for node: huirwang-0309d-r85mj-master-2.c.openshift-qe.internal, err: error retrieving the network interface subnets, err: googleapi: Error 404: The resource 'projects/openshift-qe/regions/us-central1/subnetworks/installer-shared-vpc-subnet-1' was not found, notFound, requeuing in node workqueue
Expected results:
EgressIP should work
Additional info:
It can be reproduced in 4.12 as well; it is not a regression issue.
Description of problem:
documentationBaseURL is still linking to 4.13
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-04-05-183601
How reproducible:
Always
Steps to Reproduce:
1. Get documentationBaseURL in cm/console-config:
$ oc get cm console-config -n openshift-console -o yaml | grep documentationBaseURL
      documentationBaseURL: https://access.redhat.com/documentation/en-us/openshift_container_platform/4.13/
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2023-04-05-183601   True        False         68m     Cluster version is 4.14.0-0.nightly-2023-04-05-183601
Actual results:
documentationBaseURL: https://access.redhat.com/documentation/en-us/openshift_container_platform/4.13/
Expected results:
documentationBaseURL should be https://access.redhat.com/documentation/en-us/openshift_container_platform/4.14/
Additional info:
We should adjust CSI RPC call timeout from sidecars to CSI driver. We seem to be using default values which are just too short and hence can cause unintended side-effects.
I am using a BuildConfig with git source and the Docker strategy. The git repo contains a large zip file via LFS, and that zip file is not getting downloaded; instead, just the ASCII metadata is downloaded. I've created a simple reproducer (https://github.com/selrahal/buildconfig-git-lfs) on my personal GitHub. If you clone the repo
git clone git@github.com:selrahal/buildconfig-git-lfs.git
and apply the bc.yaml file with
oc apply -f bc.yaml
Then start the build with
oc start-build test-git-lfs
You will see the build fails at the unzip step in the docker file
STEP 3/7: RUN unzip migrationtoolkit-mta-cli-5.3.0-offline.zip End-of-central-directory signature not found. Either this file is not a zipfile, or it constitutes one disk of a multi-part archive. In the latter case the central directory and zipfile comment will be found on the last disk(s) of this archive.
I've attached the full build logs to this issue.
Description of problem:
Pages should have unique page titles, so that we can gather accurate user telemetry data via segment. The page title should differ based on the selected tab.
In order to do proper analysis, branding should not be included in the page title.
Currently the following pages have this title "Red Hat OpenShift Dedicated" (or the respective brand name):
Dev perspective:
The following tabs all have the same page title Observe · Red Hat OpenShift Dedicated:
Dev perspective:
The following tabs all have the same page title Project Details · Red Hat OpenShift Dedicated:
Dev perspective:
All the user preferences tabs have the same page title : User Preferences · Red Hat OpenShift Dedicated
The Topology page in the Dev Perspective and the workloads tab of the Project Details/Workloads tab both share the same title: Topology · Red Hat OpenShift Dedicated
The following tabs on the Admin Project page all share the same title. Unsure if we can handle this since it is including the namespace name: sdoyle-dev · Details · Red Hat OpenShift Dedicated. If not, we can drop til 4.14.
Description of the problem:
As discussed on the Github PR, we want to align the severities filter with the previous implementation. Therefore the severity counts in the response headers should be:
In addition to that, we need a new response header with a total number of events with all current filters (severities included) applied.
Please review the following PR: https://github.com/openshift/cluster-api-provider-vsphere/pull/12
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Updated Description:
The MCD, during a node's lifespan, can go through multiple iterations of RHEL8 and RHEL9. This was not a problem until we turned on FIPS-enabled golang with dynamic linking, which requires the running MCD binary (either in a container or on the host) to always match the host's build version. As an additional complication, we have an early boot process (machine-config-daemon-pull/firstboot.service) that can be on a different version from the rest of the cluster nodes (the bootimage version is not updated), and we also chroot (dynamically go from rhel8 to rhel9) in the container, so we need a better process to ensure the right binary is always used.
Current testing of this flow in https://github.com/openshift/machine-config-operator/pull/3799
Description of problem:
MCO CI started failing this week, and the failures have also made it into 4.14 nightlies. See also: https://issues.redhat.com/browse/TRT-1143. The failure manifests as a warning in the MCO. Looking at an MCD log, you will see a failure like: W0712 08:52:15.475268 7971 daemon.go:1089] Got an error from auxiliary tools: kubelet health check has failed 3 times: Get "http://localhost:10248/healthz": dial tcp: lookup localhost: device or resource busy The root cause so far seems to be that 4.14 switched from a regular 1.20.3 golang to 1.20.5 with FIPS and dynamic linking in the builder, causing the failures to begin. Most functionality is not broken, but the daemon subroutine that does the kubelet health check appears to be unable to reach the localhost endpoint. One possibility is that the rhel8-daemon chroot'ing into the rhel9-host and running these commands is causing the issue. Regardless, there are a bunch of issues with rhel8/rhel9 duality in the MCD that we would need to address in 4.13/4.14. Also tangentially related: https://issues.redhat.com/browse/MCO-663
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
When using oc image mirror, oc creates a new manifest list when filtering platforms. When this happens, oc still tries to push and tag the original manifest list.
Version-Release number of selected component (if applicable):
4.8
How reproducible:
Consistent
Steps to Reproduce:
1. Run oc image mirror --filter-by-os 'linux/arm' docker.io/library/busybox@sha256:7b3ccabffc97de872a30dfd234fd972a66d247c8cfc69b0550f276481852627c yourregistry.io/busybox:target
2. Check the plan, see that the original manifest digest is being used for the tag
Actual results:
jammy:Downloads$ oc image mirror --filter-by-os 'linux/arm' docker.io/library/busybox@sha256:7b3ccabffc97de872a30dfd234fd972a66d247c8cfc69b0550f276481852627c sparse-registry1.fyre.ibm.com/jammy/busybox:target sparse-registry1.fyre.ibm.com/ jammy/busybox blobs: docker.io/library/busybox sha256:1d57ab16f681953c15d7485bf3ee79a49c2838e5f9394c43e20e9accbb1a2b20 1.436KiB docker.io/library/busybox sha256:99ee43e96ff50e90c5753954d7ce2dfdbd7eb9711c1cd96de56d429cb628e343 1.436KiB docker.io/library/busybox sha256:a22ab831b2b2565a624635af04e5f76b4554d9c84727bf7e6bc83306b3b339a9 1.436KiB docker.io/library/busybox sha256:abaa813f94fdeebd3b8e6aeea861ab474a5c4724d16f1158755ff1e3a4fde8b0 1.438KiB docker.io/library/busybox sha256:b203a35cab50f0416dfdb1b2260f83761cb82197544b9b7a2111eaa9c755dbe7 937.1KiB docker.io/library/busybox sha256:46758452d3eef8cacb188405495d52d265f0c3a7580dfec51cb627c04c7bafc4 1.604MiB docker.io/library/busybox sha256:4c45e4bb3be9dbdfb27c09ac23c050b9e6eb4c16868287c8c31d34814008df80 1.847MiB docker.io/library/busybox sha256:f78e6840ded1aafb6c9f265f52c2fc7c0a990813ccf96702df84a7dcdbe48bea 1.908MiB manifests: sha256:4ff685e2bcafdab0d2a9b15cbfd9d28f5dfe69af97e3bb1987ed483b0abf5a99 sha256:5e42fbc46b177f10319e8937dd39702e7891ce6d8a42d60c1b4f433f94200bd2 sha256:7128d7c7704fb628f1cedf161c01d929d3d831f2a012780b8191dae49f79a5fc sha256:77ed5ebc3d9d48581e8afcb75b4974978321bd74f018613483570fcd61a15de8 sha256:dde8e930c7b6a490f728e66292bc9bce42efc9bbb5278bae40e4f30f6e00fe8c sha256:7b3ccabffc97de872a30dfd234fd972a66d247c8cfc69b0550f276481852627c -> target
Expected results:
jammy:~$ oc-devel image mirror --filter-by-os 'linux/arm' docker.io/library/busybox@sha256:7b3ccabffc97de872a30dfd234fd972a66d247c8cfc69b0550f276481852627c sparse-registry1.fyre.ibm.com/jammy/busybox:target sparse-registry1.fyre.ibm.com/ jammy/busybox blobs: docker.io/library/busybox sha256:1d57ab16f681953c15d7485bf3ee79a49c2838e5f9394c43e20e9accbb1a2b20 1.436KiB docker.io/library/busybox sha256:99ee43e96ff50e90c5753954d7ce2dfdbd7eb9711c1cd96de56d429cb628e343 1.436KiB docker.io/library/busybox sha256:a22ab831b2b2565a624635af04e5f76b4554d9c84727bf7e6bc83306b3b339a9 1.436KiB docker.io/library/busybox sha256:abaa813f94fdeebd3b8e6aeea861ab474a5c4724d16f1158755ff1e3a4fde8b0 1.438KiB docker.io/library/busybox sha256:b203a35cab50f0416dfdb1b2260f83761cb82197544b9b7a2111eaa9c755dbe7 937.1KiB docker.io/library/busybox sha256:46758452d3eef8cacb188405495d52d265f0c3a7580dfec51cb627c04c7bafc4 1.604MiB docker.io/library/busybox sha256:4c45e4bb3be9dbdfb27c09ac23c050b9e6eb4c16868287c8c31d34814008df80 1.847MiB docker.io/library/busybox sha256:f78e6840ded1aafb6c9f265f52c2fc7c0a990813ccf96702df84a7dcdbe48bea 1.908MiB manifests: sha256:4ff685e2bcafdab0d2a9b15cbfd9d28f5dfe69af97e3bb1987ed483b0abf5a99 sha256:5e42fbc46b177f10319e8937dd39702e7891ce6d8a42d60c1b4f433f94200bd2 sha256:7128d7c7704fb628f1cedf161c01d929d3d831f2a012780b8191dae49f79a5fc sha256:77ed5ebc3d9d48581e8afcb75b4974978321bd74f018613483570fcd61a15de8 sha256:dde8e930c7b6a490f728e66292bc9bce42efc9bbb5278bae40e4f30f6e00fe8c sha256:7128d7c7704fb628f1cedf161c01d929d3d831f2a012780b8191dae49f79a5fc -> target
Additional info:
Description of problem:
The IPI installation in some regions fails at bootstrap, with no nodes available/ready.
Version-Release number of selected component (if applicable):
12-22 16:22:27.970 ./openshift-install 4.12.0-0.nightly-2022-12-21-202045 12-22 16:22:27.970 built from commit 3f9c38a5717c638f952df82349c45c7d6964fcd9 12-22 16:22:27.970 release image registry.ci.openshift.org/ocp/release@sha256:2d910488f25e2638b6d61cda2fb2ca5de06eee5882c0b77e6ed08aa7fe680270 12-22 16:22:27.971 release architecture amd64
How reproducible:
Always
Steps to Reproduce:
1. try the IPI installation in the problem regions (so far tried and failed with ap-southeast-2, ap-south-1, eu-west-1, ap-southeast-6, ap-southeast-3, ap-southeast-5, eu-central-1, cn-shanghai, cn-hangzhou and cn-beijing)
Actual results:
Bootstrap failed to complete
Expected results:
Installation in those regions should succeed.
Additional info:
FYI the QE flexy-install job: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/166672/ No any node available/ready, and no any operator available. $ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version False True 30m Unable to apply 4.12.0-0.nightly-2022-12-21-202045: an unknown error has occurred: MultipleErrors $ oc get nodes No resources found $ oc get machines -n openshift-machine-api -o wide NAME PHASE TYPE REGION ZONE AGE NODE PROVIDERID STATE jiwei-1222f-v729x-master-0 30m jiwei-1222f-v729x-master-1 30m jiwei-1222f-v729x-master-2 30m $ oc get co NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE authentication baremetal cloud-controller-manager cloud-credential cluster-autoscaler config-operator console control-plane-machine-set csi-snapshot-controller dns etcd image-registry ingress insights kube-apiserver kube-controller-manager kube-scheduler kube-storage-version-migrator machine-api machine-approver machine-config marketplace monitoring network node-tuning openshift-apiserver openshift-controller-manager openshift-samples operator-lifecycle-manager operator-lifecycle-manager-catalog operator-lifecycle-manager-packageserver service-ca storage $ Mater nodes don't run for example kubelet and crio services. [core@jiwei-1222f-v729x-master-0 ~]$ sudo crictl ps FATA[0000] unable to determine runtime API version: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/run/crio/crio.sock: connect: no such file or directory" [core@jiwei-1222f-v729x-master-0 ~]$ The machine-config-daemon firstboot tells "failed to update OS". [jiwei@jiwei log-bundle-20221222085846]$ grep -Ei 'error|failed' control-plane/10.0.187.123/journals/journal.log Dec 22 16:24:16 localhost kernel: GPT: Use GNU Parted to correct GPT errors. Dec 22 16:24:16 localhost kernel: GPT: Use GNU Parted to correct GPT errors. Dec 22 16:24:18 localhost ignition[867]: failed to fetch config: resource requires networking Dec 22 16:24:18 localhost ignition[891]: GET error: Get "http://100.100.100.200/latest/user-data": dial tcp 100.100.100.200:80: connect: network is unreachable Dec 22 16:24:18 localhost ignition[891]: GET error: Get "http://100.100.100.200/latest/user-data": dial tcp 100.100.100.200:80: connect: network is unreachable Dec 22 16:24:19 localhost.localdomain NetworkManager[919]: <info> [1671726259.0329] hostname: hostname: hostnamed not used as proxy creation failed with: Could not connect: No such file or directory Dec 22 16:24:19 localhost.localdomain NetworkManager[919]: <warn> [1671726259.0464] sleep-monitor-sd: failed to acquire D-Bus proxy: Could not connect: No such file or directory Dec 22 16:24:19 localhost.localdomain ignition[891]: GET error: Get "https://api-int.jiwei-1222f.alicloud-qe.devcluster.openshift.com:22623/config/master": dial tcp 10.0.187.120:22623: connect: connection refused ...repeated logs omitted... Dec 22 16:27:46 jiwei-1222f-v729x-master-0 ovs-ctl[1888]: 2022-12-22T16:27:46Z|00001|dns_resolve|WARN|Failed to read /etc/resolv.conf: No such file or directory Dec 22 16:27:46 jiwei-1222f-v729x-master-0 ovs-vswitchd[1888]: ovs|00001|dns_resolve|WARN|Failed to read /etc/resolv.conf: No such file or directory Dec 22 16:27:46 jiwei-1222f-v729x-master-0 dbus-daemon[1669]: [system] Activation via systemd failed for unit 'dbus-org.freedesktop.resolve1.service': Unit dbus-org.freedesktop.resolve1.service not found. 
Dec 22 16:27:46 jiwei-1222f-v729x-master-0 nm-dispatcher[1924]: Error: Device '' not found. Dec 22 16:27:46 jiwei-1222f-v729x-master-0 nm-dispatcher[1937]: Error: Device '' not found. Dec 22 16:27:46 jiwei-1222f-v729x-master-0 nm-dispatcher[2037]: Error: Device '' not found. Dec 22 08:35:32 jiwei-1222f-v729x-master-0 machine-config-daemon[2181]: Warning: failed, retrying in 1s ... (1/2)I1222 08:35:32.477770 2181 run.go:19] Running: nice -- ionice -c 3 oc image extract --path /:/run/mco-extensions/os-extensions-content-910221290 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:259d8c6b9ec714d53f0275db9f2962769f703d4d395afb9d902e22cfe96021b0 Dec 22 08:56:06 jiwei-1222f-v729x-master-0 rpm-ostree[2288]: Txn Rebase on /org/projectatomic/rpmostree1/rhcos failed: remote error: Get "https://quay.io/v2/openshift-release-dev/ocp-v4.0-art-dev/blobs/sha256:27f262e70d98996165748f4ab50248671d4a4f97eb67465cd46e1de2d6bd24d0": net/http: TLS handshake timeout Dec 22 08:56:06 jiwei-1222f-v729x-master-0 machine-config-daemon[2181]: W1222 08:56:06.785425 2181 firstboot_complete_machineconfig.go:46] error: failed to update OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:411e6e3be017538859cfbd7b5cd57fc87e5fee58f15df19ed3ec11044ebca511 : error running rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:411e6e3be017538859cfbd7b5cd57fc87e5fee58f15df19ed3ec11044ebca511: Warning: The unit file, source configuration file or drop-ins of rpm-ostreed.service changed on disk. Run 'systemctl daemon-reload' to reload units. Dec 22 08:56:06 jiwei-1222f-v729x-master-0 machine-config-daemon[2181]: error: remote error: Get "https://quay.io/v2/openshift-release-dev/ocp-v4.0-art-dev/blobs/sha256:27f262e70d98996165748f4ab50248671d4a4f97eb67465cd46e1de2d6bd24d0": net/http: TLS handshake timeout Dec 22 08:57:31 jiwei-1222f-v729x-master-0 machine-config-daemon[2181]: Warning: failed, retrying in 1s ... (1/2)I1222 08:57:31.244684 2181 run.go:19] Running: nice -- ionice -c 3 oc image extract --path /:/run/mco-extensions/os-extensions-content-4021566291 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:259d8c6b9ec714d53f0275db9f2962769f703d4d395afb9d902e22cfe96021b0 Dec 22 08:59:20 jiwei-1222f-v729x-master-0 systemd[2353]: /usr/lib/systemd/user/podman-kube@.service:10: Failed to parse service restart specifier, ignoring: never Dec 22 08:59:21 jiwei-1222f-v729x-master-0 podman[2437]: Error: open default: no such file or directory Dec 22 08:59:21 jiwei-1222f-v729x-master-0 podman[2450]: Error: failed to start API service: accept unixgram @00026: accept4: operation not supported Dec 22 08:59:21 jiwei-1222f-v729x-master-0 systemd[2353]: podman-kube@default.service: Failed with result 'exit-code'. Dec 22 08:59:21 jiwei-1222f-v729x-master-0 systemd[2353]: Failed to start A template for running K8s workloads via podman-play-kube. Dec 22 08:59:21 jiwei-1222f-v729x-master-0 systemd[2353]: podman.service: Failed with result 'exit-code'. [jiwei@jiwei log-bundle-20221222085846]$
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The Route checkbox gets checked again even if it was unchecked while editing the Serverless Function form.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Install Serverless Operator and create a KN Serving instance
2. Create a Serverless Function and open the Edit form of the KSVC
3. Uncheck the Create Route option and save.
4. Reopen the Edit form again.
Actual results:
The checkbox still shows checked.
Expected results:
It should retain the previous state.
Additional info:
Description of problem:
Opened the web-console and navigated to Dashboards; with the default API performance V2 option selected, each sub-page shows "No datapoints found".
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-06-27-000502
How reproducible:
always
Steps to Reproduce:
1. Open the web-console and navigate to Dashboards; with the default API performance V2 option selected, each sub-page shows "No datapoints found".
Actual results:
"No datapoints found" is shown for the Dashboards default API performance V2 option, and the page is blank.
Expected results:
Should show diagrams for Dashboards default API performance V2 option
Additional info:
This blocks bug https://issues.redhat.com/browse/OCPBUGS-14940; when I filed OCPBUGS-14940, this issue was not seen.
Description of problem:
OVN image pre-puller blocks upgrades in environments where the images have already been pulled but the registry server is not available.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Always
Steps to Reproduce:
1. Create a cluster in a disconnected environment.
2. Manually pre-pull all the images required for the upgrade. For example, get the list of images needed:
# oc adm release info quay.io/openshift-release-dev/ocp-release:4.12.10-x86_64 -o json > release-info.json
And then pull them in all the nodes of the cluster:
# crio pull $(cat release-info.json | jq -r '.references.spec.tags[].from.name')
3. Stop or somehow make the registry unreachable, then trigger the upgrade.
Actual results:
The upgrade blocks with the following error reported by the cluster version operator:
# oc get clusterversion; oc get co network NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.12.10 True True 62m Working towards 4.12.11: 483 of 830 done (58% complete), waiting on network NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE network 4.12.10 True True False 133m DaemonSet "/openshift-ovn-kubernetes/ovnkube-upgrades-prepuller" is not available (awaiting 1 nodes)
The reason for that is that the `ovnkube-upgrades-prepuller-...` pod uses `imagePullPolicy: Always` and that fails if there is no registry, even if the image has already been pulled:
# oc get pods -n openshift-ovn-kubernetes ovnkube-upgrades-prepuller-5s2cn NAME READY STATUS RESTARTS AGE ovnkube-upgrades-prepuller-5s2cn 0/1 ImagePullBackOff 0 44m # oc get events -n openshift-ovn-kubernetes --field-selector involvedObject.kind=Pod,involvedObject.name=ovnkube-upgrades-prepuller-5s2cn,reason=Failed LAST SEEN TYPE REASON OBJECT MESSAGE 43m Warning Failed pod/ovnkube-upgrades-prepuller-5s2cn Failed to pull image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:52f189797a83cae8769f1a4dc6dfd46d586914575ee99de6566fc23c77282071": rpc error: code = Unknown desc = (Mirrors also failed: [server.home.arpa:8443/openshift/release@sha256:52f189797a83cae8769f1a4dc6dfd46d586914575ee99de6566fc23c77282071: pinging container registry server.home.arpa:8443: Get "https://server.home.arpa:8443/v2/": dial tcp 192.168.100.1:8443: connect: connection refused]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:52f189797a83cae8769f1a4dc6dfd46d586914575ee99de6566fc23c77282071: pinging container registry quay.io: Get "https://quay.io/v2/": dial tcp: lookup quay.io on 192.168.100.1:53: server misbehaving 43m Warning Failed pod/ovnkube-upgrades-prepuller-5s2cn Error: ErrImagePull 43m Warning Failed pod/ovnkube-upgrades-prepuller-5s2cn Error: ImagePullBackOff # oc get pod -n openshift-ovn-kubernetes ovnkube-upgrades-prepuller-5s2cn -o json | jq -r '.spec.containers[] | .imagePullPolicy + " " + .image' Always quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:52f189797a83cae8769f1a4dc6dfd46d586914575ee99de6566fc23c77282071
Expected results:
The upgrade should not block.
Additional info:
We detected this in a situation where we want to be able to perform upgrades in a disconnected environment and without the registry server running. See MGMT-13733 for details.
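One possible direction, sketched with the core/v1 Go types rather than the CNO's actual manifest generation: since the pre-puller image is pinned by digest, a pull policy of IfNotPresent would let an already-present image satisfy the pod without reaching the registry.

    package main

    import (
    	"fmt"

    	corev1 "k8s.io/api/core/v1"
    )

    // prepullerContainer sketches the relevant part of the pre-puller pod spec:
    // with PullIfNotPresent, an image that is already on the node does not
    // require the registry to be reachable. The image value is illustrative.
    func prepullerContainer(image string) corev1.Container {
    	return corev1.Container{
    		Name:            "ovnkube-upgrades-prepuller",
    		Image:           image,
    		ImagePullPolicy: corev1.PullIfNotPresent,
    	}
    }

    func main() {
    	c := prepullerContainer("quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:<digest>")
    	fmt.Println(c.Name, c.ImagePullPolicy)
    }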
When using the --oci-registries-config flag explicitly or getting registries.conf from the environment, execution time when processing related images via the addRelatedImageToMapping function serially can drastically impact performance depending on the number of images involved. In my testing of a large catalog, there were approximately 470 images and this took approximately 13 minutes. This processing occurs prior to letting the underlying oc mirror code plan out the images that should be mirrored. Actual planning time is consistent at around 1 min 30 seconds.
The cause of this is due to the need to determine mirrors for each one of the related images based on the configuration provided in registries.conf, and this action is done serially in a loop. If I introduce parallel execution, the processing time for addRelatedImageToMapping is reduced from ~13 min to ~14 seconds.
Note: the catalog used here is publicly available, but the related images are not so this may be difficult to reproduce.
mkdir -p /tmp/oci/registriesconf/performance
skopeo --override-os linux copy docker://quay.io/jhunkins/ocp13762:v1 oci:///tmp/oci/registriesconf/performance --format v2s2
[[registry]]
location = "icr.io/cpopen"
insecure = false
blocked = false
mirror-by-digest-only = true
prefix = ""
[[registry.mirror]]
location = "quay.io/jhunkins"
insecure = false
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
mirror:
  operators:
  - catalog: oci:///tmp/oci/registriesconf/performance
    full: true
    targetTag: latest
    targetCatalog: ibm-catalog
storageConfig:
  local:
    path: /tmp/oc-mirror-temp
oc mirror --config [path to isc]/isc-registriesconf-performance.yaml --include-local-oci-catalogs --oci-insecure-signature-policy --dest-use-http docker://localhost:5000/oci --skip-cleanup --dry-run
roughly 13 minutes elapses before the planning phase begins
much faster execution before the planning phase begins
I intend to create a PR which adds parallel execution around the addRelatedImageToMapping function
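A sketch of the intended fan-out, with a placeholder signature for addRelatedImageToMapping and a bounded worker count; this shows the shape of the change, not the actual oc-mirror code:

    package main

    import (
    	"context"
    	"fmt"

    	"golang.org/x/sync/errgroup"
    )

    // relatedImage and addRelatedImageToMapping are placeholders for the
    // oc-mirror types/function mentioned above; only the fan-out pattern is the
    // point here.
    type relatedImage struct{ Name string }

    func addRelatedImageToMapping(ctx context.Context, img relatedImage) error {
    	// ... resolve mirrors from registries.conf for this image ...
    	return nil
    }

    // addRelatedImagesParallel fans the per-image work out over a bounded number
    // of goroutines instead of processing the slice serially.
    func addRelatedImagesParallel(ctx context.Context, images []relatedImage, workers int) error {
    	g, ctx := errgroup.WithContext(ctx)
    	g.SetLimit(workers)
    	for _, img := range images {
    		img := img // capture loop variable
    		g.Go(func() error {
    			return addRelatedImageToMapping(ctx, img)
    		})
    	}
    	return g.Wait()
    }

    func main() {
    	images := []relatedImage{{Name: "registry.example.com/a"}, {Name: "registry.example.com/b"}}
    	fmt.Println(addRelatedImagesParallel(context.Background(), images, 8))
    }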
Description of problem:
A cluster was installed via ACM and its nodes are showing as Unmanaged. When trying to set the BMH credentials via the console, the Apply button is not clickable (greyed out).
Version-Release number of selected component (if applicable): 4.11
How reproducible: Always
Steps to Reproduce:
1. Install a cluster via ACM
2. Setting a BMH credential on console
3.
Actual results:
The Apply button on the console screen is greyed out, unclickable.
Expected results:
Should be able to configure the BMH credentials
Additional info:
Based on a suggestion from Omer
"Now that we can tell apart user manifests from our own service manifests, I think it's best that this function deletes the service manifests.
https://github.com/openshift/assisted-service/blob/master/internal/cluster/cluster.go#L1418
The original motivation for this skip was that we didn't want to destroy user uploaded manifests when the user resets their installation, but preserving the service generated ones is useless, and was just an unfortunate side-effect of protecting the user manifests. The service ones would anyway get regenerated when the user hits install again, there's no point in protecting them. If anything, clearing those manifests I think this might solve some edge case bugs I can think of"
We will need to wait for https://github.com/openshift/assisted-service/pull/5278/files to be merged before starting this as this depends on changes made in this PR
Instead of creating a new MC 97-{master/worker}-generated-kubelet to set the default cgroups version, it is better to set it via a template.
Description of problem:
The alerts table displays incorrect values (Prometheus) in the source column
Version-Release number of selected component (if applicable):
4.13
How reproducible:
Always
Steps to Reproduce:
1. Install LokiOperator and the Cluster Logging operator, and enable the logging view plugin with the alerts feature toggle enabled
2. Add a log-based alert
3. Check the alerts table source in the Observe -> Alerts section
Actual results:
Incorrect "Prometheus" value is displayed for non log-based alerts
Expected results:
"Platform" or "User" value is displayed for non log-based alerts
Additional info:
Description of problem:
When HyperShift HostedClusters are created with "OLMCatalogPlacement" set to "guest" and if the desired release is pre-GA, the CatalogSource pods cannot pull their images due to using unreleased images.
Version-Release number of selected component (if applicable):
4.13
How reproducible:
Common
Steps to Reproduce:
1. Create a HyperShift 4.13 HostedCluster with spec.OLMCatalogPlacement = "guest" 2. See the openshift-marketplace/community-operator-* pods in the guest cluster in ImagePullBackoff
Actual results:
openshift-marketplace/community-operator-* pods in the guest cluster in ImagePullBackoff
Expected results:
All CatalogSource pods to be running and to use n-1 images if pre-GA
Additional info:
This is a clone of issue OCPBUGS-18800. The following is the description of the original issue:
—
Description of problem:
currently the mco updates its image registry certificate configmap by deleting and re-creating it on each MCO sync. Instead, we should be patching it
Version-Release number of selected component (if applicable):
4.14
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
When processing an install-config containing either BMC passwords in the baremetal platform config, or a vSphere password in the vsphere platform config, we log a warning message to say that the value is ignored.
This warning currently includes the value in the password field, which may be inconvenient for users reusing IPI configs who don't want their password values to appear in logs.
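A minimal sketch of the intended behavior, not the installer's actual logging code: warn about the ignored field by its path only, without echoing the secret value.

    package main

    import "fmt"

    // warnIgnoredField sketches emitting the "ignored field" warning without
    // echoing the secret back: log the field path, never the value.
    func warnIgnoredField(fieldPath string) {
    	fmt.Printf("WARNING %s is ignored\n", fieldPath)
    }

    func main() {
    	// Field path below is illustrative of a BMC password in the baremetal
    	// platform config; the value itself is never printed.
    	warnIgnoredField("Platform.BareMetal.Hosts[0].BMC.Password")
    }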
Description of problem:
Once https://issues.redhat.com/browse/OCPBUGS-14783 was fixed, we found another issue which prevents the kube-apiserver's init container from finishing successfully. The init container tries to reach the kube-apiserver on an IPv4-based URL, which is not up; it should use the IPv6 one.
Description of problem:
Hypershift kubevirt provider hosted cluster cannot start up after activating ovn-k interconnect at hosted cluster. The issue is that ovn-k configurations missmatch: The cluster manager config in the hosted cluster namespace: ovnkube.conf: |- [default] mtu="8801" cluster-subnets="10.132.0.0/14/23" encap-port="9880" enable-lflow-cache=true lflow-cache-limit-kb=1048576 [kubernetes] service-cidrs="172.31.0.0/16" ovn-config-namespace="openshift-ovn-kubernetes" cacert="/hosted-ca/ca.crt" apiserver="https://kube-apiserver:6443" host-network-namespace="openshift-host-network" platform-type="KubeVirt" dns-service-namespace="openshift-dns" dns-service-name="dns-default" [ovnkubernetesfeature] enable-egress-ip=true enable-egress-firewall=true enable-egress-qos=true enable-egress-service=true egressip-node-healthcheck-port=9107 [gateway] mode=shared nodeport=true v4-join-subnet="100.65.0.0/16" [masterha] election-lease-duration=137 election-renew-deadline=107 election-retry-period=26 The controller config in the hosted cluster ovnkube.conf: |- [default] mtu="8801" cluster-subnets="10.132.0.0/14/23" encap-port="9880" enable-lflow-cache=true lflow-cache-limit-kb=1048576 enable-udp-aggregation=true [kubernetes] service-cidrs="172.31.0.0/16" ovn-config-namespace="openshift-ovn-kubernetes" apiserver="https://a392ee248c42a4ffca67f2909823466e-18e866c0f5fb5880.elb.us-west-2.amazonaws.com:6443" host-network-namespace="openshift-host-network" platform-type="KubeVirt" healthz-bind-address="0.0.0.0:10256" dns-service-namespace="openshift-dns" dns-service-name="dns-default" [ovnkubernetesfeature] enable-egress-ip=true enable-egress-firewall=true enable-egress-qos=true enable-egress-service=true egressip-node-healthcheck-port=9107 enable-multi-network=true [gateway] mode=shared nodeport=true [masterha] election-lease-duration=137 election-renew-deadline=107 election-retry-period=26
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Deploy the latest 4.14 OCP cluster
2. Install the latest hypershift operator
3. Deploy a hosted cluster with the latest 4.14 OCP release image
Actual results:
Hosted cluster gets stuck at: network 4.14.0-0.ci-2023-08-20-221659 True True False 3h53m DaemonSet "/openshift-multus/network-metrics-daemon" is waiting for other operators to become ready...
Expected results:
All the hosted clusters operators should be ok
Additional info:
This is a clone of issue OCPBUGS-19699. The following is the description of the original issue:
—
Description of problem:
When CPUPartitioning is not set in install-config.yaml, a warning message is still generated: WARNING CPUPartitioning: is ignored. This warning is both incorrect, since the check is against "None" and the value is an empty string when not set, and also no longer relevant now that https://issues.redhat.com/browse/OCPBUGS-18876 has been fixed.
Version-Release number of selected component (if applicable):
How reproducible:
Every time
Steps to Reproduce:
1. Create an install config with CPUPartitioning not set 2. Run "openshift-install agent create image --dir cluster-manifests/ --log-level debug"
Actual results:
See the output "WARNING CPUPartitioning: is ignored"
Expected results:
No warning
Additional info:
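A sketch of the tightened condition (type and constants mirrored locally for illustration); the alternative fix, per the description above, is simply dropping the warning now that CPUPartitioning is supported.

    package main

    import "fmt"

    // CPUPartitioningMode mirrors the install-config field type; the values here
    // are illustrative of the two interesting cases.
    type CPUPartitioningMode string

    const (
    	CPUPartitioningNone     CPUPartitioningMode = "None"
    	CPUPartitioningAllNodes CPUPartitioningMode = "AllNodes"
    )

    // shouldWarnIgnored sketches the corrected condition: only warn when the user
    // explicitly set a value other than the default, not when the field is unset
    // (empty string) or "None".
    func shouldWarnIgnored(mode CPUPartitioningMode) bool {
    	return mode != "" && mode != CPUPartitioningNone
    }

    func main() {
    	fmt.Println(shouldWarnIgnored(""))                      // false - unset, no warning
    	fmt.Println(shouldWarnIgnored(CPUPartitioningNone))     // false
    	fmt.Println(shouldWarnIgnored(CPUPartitioningAllNodes)) // true
    }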
Description of problem:
Since the `registry.centos.org` is closed, all the unit tests in oc relying on this registry started failing.
Version-Release number of selected component (if applicable):
all versions
How reproducible:
trigger CI jobs and see unit tests are failing
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
We use the state machine design pattern to have explicit clear rules for how hosts can move in and out of states depending on the things that are happening.
This makes it relatively easy to follow / understand host behavior.
We should ensure our code doesn't contain places where we force a host into a state without going through the state machine 🍝; otherwise it defeats the purpose of having a state machine.
One example that personally confused me is this switch statement, which contains updates like this one, this one and this one, and also this one.
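A minimal sketch, not the assisted-service implementation, of why routing every update through a transition table matters: callers cannot force an arbitrary state, they can only request transitions the table allows. State names are illustrative.

    package main

    import "fmt"

    type hostState string

    const (
    	stateDiscovering hostState = "discovering"
    	stateKnown       hostState = "known"
    	stateInstalling  hostState = "installing"
    )

    // allowedTransitions declares which moves are legal; anything else is refused.
    var allowedTransitions = map[hostState][]hostState{
    	stateDiscovering: {stateKnown},
    	stateKnown:       {stateInstalling, stateDiscovering},
    	stateInstalling:  {},
    }

    // transition refuses any move not declared in the table, instead of letting
    // callers assign the host status directly.
    func transition(current, next hostState) (hostState, error) {
    	for _, s := range allowedTransitions[current] {
    		if s == next {
    			return next, nil
    		}
    	}
    	return current, fmt.Errorf("transition %s -> %s is not allowed", current, next)
    }

    func main() {
    	s, err := transition(stateDiscovering, stateKnown)
    	fmt.Println(s, err) // known <nil>
    	s, err = transition(stateKnown, stateKnown)
    	fmt.Println(s, err) // known transition known -> known is not allowed
    }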
Description of problem:
After the changes for OCPBUGS-3036 and OCPBUGS-11596, a user who has project admin permission should be able to check all the subscription information on the operator details page. But currently the InstallPlan information is shown as "None" on the page, which is incorrect.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-05-03-163151
How reproducible:
Always
Steps to Reproduce:
1. Configure IDP and add a user.
2. Install any operator in a specific namespace.
3. Assign project admin permission to the user for the same namespace:
$ oc adm policy add-role-to-user admin <username> -n <projectname>
4. Check that the user has enough permission to check InstallPlans via CLI:
$ oc get clusterrole admin -o yaml | grep -C10 installplan
- apiGroups:
  - operators.coreos.com
  resources:
  - clusterserviceversions
  - catalogsources
  - installplans
  - subscriptions
  verbs:
  - delete
- apiGroups:
  - operators.coreos.com
  resources:
  - clusterserviceversions
  - catalogsources
  - installplans
  - subscriptions
  - operatorgroups
  verbs:
  - get
  - list
  - watch
5. Log in to OCP with the user and go to the InstallPlan page; the user is able to check the InstallPlan list without any error: /k8s/ns/<projectname>/operators.coreos.com~v1alpha1~InstallPlan
6. Navigate to Operator Details -> Subscription tab and check whether the 'InstallPlan' name is shown on the page.
Actual results:
Only 'None' is shown on the InstallPlan section
Expected results:
The InstallPlan name should be shown on the Subscription page.
Additional info:
Description of problem:
Since registry.centos.org is closed, tests relying on this registry in e2e-agnostic-ovn-cmd job are failing.
Version-Release number of selected component (if applicable):
all
How reproducible:
Trigger e2e-agnostic-ovn-cmd job
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
As part of https://issues.redhat.com/browse/OCPBUGS-14352, pipeline e2e tests were disabled. Enable pipeline e2e tests again.
Description of problem:
Cluster does not finish rolling out on a 4.13 management cluster because of pod security constraints.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Install the 4.14 hypershift operator on a recent 4.13 mgmt cluster
2. Create an AWS PublicAndPrivate hosted cluster on that hypershift cluster
Actual results:
Hosted cluster stalls rollout because the private router never gets created
Expected results:
Hosted cluster comes up successfully
Additional info:
Pod security enforcement is preventing the private router from getting created.
PRs were previously merged to add SC2S support via AWS SDK here:
However, further updates to add support for SC2S region (us-isob-east-1) and new TC2S region (us-iso-west-1) are still required.
There are still hard-coded references to the old regions in the following locations.
Description of problem:
Altering the ImageURL or ExtraKernelParams values in a PreprovisioningImage CR should cause the host to boot using the new image or parameters, but currently the host doesn't respond at all to changes in those fields.
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2023-01-11-225449
How reproducible:
Always
Steps to Reproduce:
1. Create a BMH
2. Set the preprovisioning image URL
3. Allow the host to boot
4. Change the image URL or extra kernel params
Actual results:
Host does not reboot
Expected results:
Host reboots using the newly provided image or parameters
Additional info:
BMH:
- apiVersion: metal3.io/v1alpha1 kind: BareMetalHost metadata: annotations: inspect.metal3.io: disabled creationTimestamp: "2023-01-13T16:06:12Z" finalizers: - baremetalhost.metal3.io generation: 4 labels: infraenvs.agent-install.openshift.io: myinfraenv name: ostest-extraworker-0 namespace: assisted-installer resourceVersion: "61077" uid: 444d7246-3d0a-4188-a8c4-f407ee4f741f spec: automatedCleaningMode: disabled bmc: address: redfish+http://192.168.111.1:8000/redfish/v1/Systems/6f45ba9f-251a-46f7-a7a8-10c6ca9231dd credentialsName: ostest-extraworker-0-bmc-secret bootMACAddress: 00:b2:71:b8:14:4f customDeploy: method: start_assisted_install online: true status: errorCount: 0 errorMessage: "" goodCredentials: credentials: name: ostest-extraworker-0-bmc-secret namespace: assisted-installer credentialsVersion: "44478" hardwareProfile: unknown lastUpdated: "2023-01-13T16:06:22Z" operationHistory: deprovision: end: null start: null inspect: end: null start: null provision: end: null start: "2023-01-13T16:06:22Z" register: end: "2023-01-13T16:06:22Z" start: "2023-01-13T16:06:12Z" operationalStatus: OK poweredOn: false provisioning: ID: b5e8c1a9-8061-420b-8c32-bb29a8b35a0b bootMode: UEFI image: url: "" raid: hardwareRAIDVolumes: null softwareRAIDVolumes: [] rootDeviceHints: deviceName: /dev/sda state: provisioning triedCredentials: credentials: name: ostest-extraworker-0-bmc-secret namespace: assisted-installer credentialsVersion: "44478"
Preprovisioning Image (with changes)
- apiVersion: metal3.io/v1alpha1 kind: PreprovisioningImage metadata: creationTimestamp: "2023-01-13T16:06:22Z" generation: 1 labels: infraenvs.agent-install.openshift.io: myinfraenv name: ostest-extraworker-0 namespace: assisted-installer ownerReferences: - apiVersion: metal3.io/v1alpha1 blockOwnerDeletion: true controller: true kind: BareMetalHost name: ostest-extraworker-0 uid: 444d7246-3d0a-4188-a8c4-f407ee4f741f resourceVersion: "56838" uid: 37f4da76-0d1c-4e05-b618-2f0ab9d5c974 spec: acceptFormats: - initrd architecture: x86_64 status: architecture: x86_64 conditions: - lastTransitionTime: "2023-01-13T16:34:26Z" message: Image has been created observedGeneration: 1 reason: ImageCreated status: "True" type: Ready - lastTransitionTime: "2023-01-13T16:06:24Z" message: Image has been created observedGeneration: 1 reason: ImageCreated status: "False" type: Error extraKernelParams: coreos.live.rootfs_url=https://assisted-image-service-assisted-installer.apps.ostest.test.metalkube.org/boot-artifacts/rootfs?arch=x86_64&version=4.12 rd.break=initqueue format: initrd imageUrl: https://assisted-image-service-assisted-installer.apps.ostest.test.metalkube.org/images/79ef3924-ee94-42c6-96c3-2d784283120d/pxe-initrd?api_key=eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJpbmZyYV9lbnZfaWQiOiI3OWVmMzkyNC1lZTk0LTQyYzYtOTZjMy0yZDc4NDI4MzEyMGQifQ.YazOZS01NoI7g_eVhLmRNmM6wKVVaZJdWbxuePia46Fo0GMLYtSOp1JTvtcStoT51g7VkSnTf8LBJ0zmbGu3HQ&arch=x86_64&version=4.12 kernelUrl: https://assisted-image-service-assisted-installer.apps.ostest.test.metalkube.org/boot-artifacts/kernel?arch=x86_64&version=4.12 networkData: {}
This was found while testing ZTP so in this case the assisted-service controllers are altering the preprovisioning image in response to changes made in the assisted-specific CRs, but I don't think this issue is ZTP specific.
Please review the following PR: https://github.com/openshift/k8s-prometheus-adapter/pull/68
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Add the snyk-secret parameter to the push & pull tekton files so that a snyk scan will be performed on HO RHTAP builds.
The 3.0.1 version seems to have some important fixes for vSphere CSI driver crashes. We should backport those fixes to 4.13 and 4.14.
Porting rhbz#2057740 to Jira. Pods without a controller: true entry in ownerReferences are not gracefully drained by the autoscaler (and potentially other drain-library drainers). Checking a recent 4.13 CI run:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.13-e2e-aws-ovn/1625150492994703360/artifacts/e2e-aws-ovn/gather-extra/artifacts/pods.json | jq -r '.items[].metadata | select([(.ownerReferences // [])[] | select(.controller)] | length == 0) | .namespace + " " + .name + " " + (.ownerReferences | tostring)' | grep -v '^\(openshift-etcd\|openshift-kube-apiserver\|openshift-kube-controller-manager\|openshift-kube-scheduler\) ' openshift-marketplace certified-operators-fnm5z [{"apiVersion":"operators.coreos.com/v1alpha1","blockOwnerDeletion":false,"controller":false,"kind":"CatalogSource","name":"certified-operators","uid":"4eb36072-7c56-4663-9b5a-fd23cee85432"}] openshift-marketplace community-operators-nrfl6 [{"apiVersion":"operators.coreos.com/v1alpha1","blockOwnerDeletion":false,"controller":false,"kind":"CatalogSource","name":"community-operators","uid":"0e164593-5656-4592-9915-1a5367a6a548"}] openshift-marketplace redhat-marketplace-7j7k9 [{"apiVersion":"operators.coreos.com/v1alpha1","blockOwnerDeletion":false,"controller":false,"kind":"CatalogSource","name":"redhat-marketplace","uid":"14b910c4-0e45-4188-ab57-671070b6a9f1"}] openshift-marketplace redhat-operators-hxhxw [{"apiVersion":"operators.coreos.com/v1alpha1","blockOwnerDeletion":false,"controller":false,"kind":"CatalogSource","name":"redhat-operators","uid":"ca9028e5-affb-4537-81f1-15e3a5129c6e"}]
At least 4.11 and 4.13 (above). Likely all OpenShift 4.y which have had these openshift-marketplace pods.
100%
1. Launch a cluster.
2. Inspect the openshift-marketplace pods with: oc -n openshift-marketplace get -o json pods | jq -r '.items[].metadata | select(.namespace == "openshift-marketplace" and (([.ownerReferences[] | select(.controller == true)]) | length) == 0) | .name + " " + (.ownerReferences | tostring)'
certified-operators-fnm5z [{"apiVersion":"operators.coreos.com/v1alpha1","blockOwnerDeletion":false,"controller":false,"kind":"CatalogSource","name":"certified-operators","uid":"4eb36072-7c56-4663-9b5a-fd23cee85432"}] community-operators-nrfl6 [{"apiVersion":"operators.coreos.com/v1alpha1","blockOwnerDeletion":false,"controller":false,"kind":"CatalogSource","name":"community-operators","uid":"0e164593-5656-4592-9915-1a5367a6a548"}] redhat-marketplace-7j7k9 [{"apiVersion":"operators.coreos.com/v1alpha1","blockOwnerDeletion":false,"controller":false,"kind":"CatalogSource","name":"redhat-marketplace","uid":"14b910c4-0e45-4188-ab57-671070b6a9f1"}] redhat-operators-hxhxw [{"apiVersion":"operators.coreos.com/v1alpha1","blockOwnerDeletion":false,"controller":false,"kind":"CatalogSource","name":"redhat-operators","uid":"ca9028e5-affb-4537-81f1-15e3a5129c6e"}]
No output.
Figuring out which resource to list as the controller is tricky, but there are workarounds, including pointing at the triggering resource or a ClusterOperator as the controller.
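For illustration only (a hedged sketch, not the marketplace operator's actual code): the workaround amounts to marking one of the existing owner references as the controller so drain libraries treat the pod as managed. The UID below is a placeholder.

package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

// boolPtr returns a pointer to b; OwnerReference uses *bool fields.
func boolPtr(b bool) *bool { return &b }

func main() {
	// Owner reference with controller: true, so drain libraries treat the
	// CatalogSource as the pod's managing controller.
	ref := metav1.OwnerReference{
		APIVersion:         "operators.coreos.com/v1alpha1",
		Kind:               "CatalogSource",
		Name:               "certified-operators",
		UID:                types.UID("placeholder-uid"),
		Controller:         boolPtr(true),
		BlockOwnerDeletion: boolPtr(false),
	}
	fmt.Printf("%+v\n", ref)
}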
Description of problem:
The chk_default_ingress.sh script for keepalived is no longer correctly matching the default ingress pod name. The pod name in a recently deployed dev-scripts cluster is router-default-97fb6b94c-wfxfk, which does not match our grep pattern of router-default-[[:xdigit:]]\{10\}-[[:alnum:]]\{5\}. The main issue seems to be that the first hash is only 9 characters, not 10.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Unsure, but has been seen at least twice
Steps to Reproduce:
1. Deploy recent nightly build 2. Look at chk_default_ingress status 3.
Actual results:
Always failing, even on nodes with the default ingress pod
Expected results:
Passes on nodes with default ingress pod
Additional info:
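For illustration only (not necessarily the fix that was adopted): a Go regexp sketch of a pattern that tolerates the shorter pod-template hash observed above instead of requiring exactly 10 hex digits.

package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Accept a pod-template hash of any length (the observed one was 9
	// characters) followed by the usual 5-character pod suffix.
	re := regexp.MustCompile(`^router-default-[[:alnum:]]+-[[:alnum:]]{5}$`)
	fmt.Println(re.MatchString("router-default-97fb6b94c-wfxfk"))  // true (9-char hash)
	fmt.Println(re.MatchString("router-default-6b94c97fb6-abcde")) // true (10-char hash)
}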
DoD:
e2e to run NodePool without setting SG in spec
Description of problem:
ci job "amd64-nightly-4.13-upgrade-from-stable-4.12-vsphere-ipi-proxy-workers-rhel8" failed at rhel node upgrade stage with following error: TASK [openshift_node : Apply machine config] ***********************************3583task path: /usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/apply_machine_config.yml:683584Using module file /opt/python-env/ansible-core/lib64/python3.8/site-packages/ansible/modules/command.py3585Pipelining is enabled.3586<192.168.233.236> ESTABLISH SSH CONNECTION FOR USER: test3587<192.168.233.236> SSH: EXEC ssh -o ControlMaster=auto -o ControlPersist=600s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="test"' -o ConnectTimeout=30 -o IdentityFile=/var/run/secrets/ci.openshift.io/cluster-profile/ssh-privatekey -o StrictHostKeyChecking=no -o 'ControlPath="/alabama/.ansible/cp/%h-%r"' 192.168.233.236 '/bin/sh -c '"'"'sudo -H -S -n -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-vwugynewkogzaosazvikpnplnmjoluxs ; http_proxy=http://XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX@192.168.221.228:3128 https_proxy=http://XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX@192.168.221.228:3128 no_proxy=.cluster.local,.svc,10.128.0.0/14,127.0.0.1,172.30.0.0/16,192.168.233.0/25,api-int.ci-op-ssnlf4qb-1dacf.vmc-ci.devcluster.openshift.com,localhost /usr/libexec/platform-python'"'"'"'"'"'"'"'"' && sleep 0'"'"''3588Escalation succeeded3589<192.168.233.236> (1, b'\n{"changed": XXXX, "stdout": "I0726 23:36:56.436283 27240 start.go:61] Version: v4.13.0-202307242035.p0.g7b54f1d.assembly.stream-dirty (7b54f1dcce4ea9f69f300d0e1cf2316def45bf72)\\r\\nI0726 23:36:56.437075 27240 daemon.go:478] not chrooting for source=rhel-8 target=rhel-8\\r\\nF0726 23:36:56.437240 27240 start.go:75] failed to re-exec: writing /rootfs/run/bin/machine-config-daemon: open /rootfs/run/bin/machine-config-daemon: text file busy", "stderr": "time=\\"2023-07-26T19:36:55-04:00\\" level=warning msg=\\"The input device is not a TTY. The --tty and --interactive flags might not work properly\\"", "rc": 255, "cmd": ["podman", "run", "-v", "/:/rootfs", "--pid=host", "--privileged", "--rm", "--entrypoint=/usr/bin/machine-config-daemon", "-ti", "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0110276ce82958a105cdd59028043bcdb1e5c33a77e550a13a1dc51aee08b032", "start", "--node-name", "ci-op-ssnlf4qb-1dacf-bbmqt-rhel-1", "--once-from", "/tmp/ansible.mlldlsm5/worker_ignition_config.json", "--skip-reboot"], "start": "2023-07-26 19:36:55.852527", "end": "2023-07-26 19:36:56.827081", "delta": "0:00:00.974554", "failed": XXXX, "msg": "non-zero return code", "invocation": {"module_args": {"_raw_params": "podman run -v /:/rootfs --pid=host --privileged --rm --entrypoint=/usr/bin/machine-config-daemon -ti quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0110276ce82958a105cdd59028043bcdb1e5c33a77e550a13a1dc51aee08b032 start --node-name ci-op-ssnlf4qb-1dacf-bbmqt-rhel-1 --once-from /tmp/ansible.mlldlsm5/worker_ignition_config.json --skip-reboot", "_uses_shell": false, "warn": false, "stdin_add_newline": XXXX, "strip_empty_ends": XXXX, "argv": null, "chdir": null, "executable": null, "creates": null, "removes": null, "stdin": null}}}\n', b'')3590<192.168.233.236> Failed to connect to the host via ssh: 3591fatal: [192.168.233.236]: FAILED! 
=> {3592 "changed": XXXX,3593 "cmd": [3594 "podman",3595 "run",3596 "-v",3597 "/:/rootfs",3598 "--pid=host",3599 "--privileged",3600 "--rm",3601 "--entrypoint=/usr/bin/machine-config-daemon",3602 "-ti",3603 "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0110276ce82958a105cdd59028043bcdb1e5c33a77e550a13a1dc51aee08b032",3604 "start",3605 "--node-name",3606 "ci-op-ssnlf4qb-1dacf-bbmqt-rhel-1",3607 "--once-from",3608 "/tmp/ansible.mlldlsm5/worker_ignition_config.json",3609 "--skip-reboot"3610 ],3611 "delta": "0:00:00.974554",3612 "end": "2023-07-26 19:36:56.827081",3613 "invocation": {3614 "module_args": {3615 "_raw_params": "podman run -v /:/rootfs --pid=host --privileged --rm --entrypoint=/usr/bin/machine-config-daemon -ti quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0110276ce82958a105cdd59028043bcdb1e5c33a77e550a13a1dc51aee08b032 start --node-name ci-op-ssnlf4qb-1dacf-bbmqt-rhel-1 --once-from /tmp/ansible.mlldlsm5/worker_ignition_config.json --skip-reboot",3616 "_uses_shell": false,3617 "argv": null,3618 "chdir": null,3619 "creates": null,3620 "executable": null,3621 "removes": null,3622 "stdin": null,3623 "stdin_add_newline": XXXX,3624 "strip_empty_ends": XXXX,3625 "warn": false3626 }3627 },3628 "msg": "non-zero return code",3629 "rc": 255,3630 "start": "2023-07-26 19:36:55.852527",3631 "stderr": "time=\"2023-07-26T19:36:55-04:00\" level=warning msg=\"The input device is not a TTY. The --tty and --interactive flags might not work properly\"",3632 "stderr_lines": [3633 "time=\"2023-07-26T19:36:55-04:00\" level=warning msg=\"The input device is not a TTY. The --tty and --interactive flags might not work properly\""3634 ],3635 "stdout": "I0726 23:36:56.436283 27240 start.go:61] Version: v4.13.0-202307242035.p0.g7b54f1d.assembly.stream-dirty (7b54f1dcce4ea9f69f300d0e1cf2316def45bf72)\r\nI0726 23:36:56.437075 27240 daemon.go:478] not chrooting for source=rhel-8 target=rhel-8\r\nF0726 23:36:56.437240 27240 start.go:75] failed to re-exec: writing /rootfs/run/bin/machine-config-daemon: open /rootfs/run/bin/machine-config-daemon: text file busy",3636 "stdout_lines": [3637 "I0726 23:36:56.436283 27240 start.go:61] Version: v4.13.0-202307242035.p0.g7b54f1d.assembly.stream-dirty (7b54f1dcce4ea9f69f300d0e1cf2316def45bf72)",3638 "I0726 23:36:56.437075 27240 daemon.go:478] not chrooting for source=rhel-8 target=rhel-8",3639 "F0726 23:36:56.437240 27240 start.go:75] failed to re-exec: writing /rootfs/run/bin/machine-config-daemon: open /rootfs/run/bin/machine-config-daemon: text file busy"3640 ]3641}3642
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2023-07-26-101700
How reproducible:
always
Steps to Reproduce:
Found in CI: 1. Install a v4.13.6 cluster with a RHEL 8 node 2. Upgrade OCP successfully 3. Upgrade the RHEL node
Actual results:
The RHEL node upgrade fails.
Expected results:
The RHEL node upgrade succeeds.
Additional info:
job link: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.13-amd64-nightly-4.13-upgrade-from-stable-4.12-vsphere-ipi-proxy-workers-rhel8-p2-f28/1684288836412116992
Description of problem:
A customer is raising security concerns about using port 80 for bootstrap
Version-Release number of selected component (if applicable):
4.13
We should include HostedClusterDegraded in hypershift_hostedclusters_failure_conditions metric so it's obvious when there's an issue across the fleet.
Description of problem:
This issue is triggered by the lack of the file "/etc/kubernetes/kubeconfig" in the node, but what i found interesting is the aesthetic error that follows: 2023-01-04T10:56:50.807982171Z I0104 10:56:50.807918 18013 start.go:112] Version: v4.11.0-202212070335.p0.g60746a8.assembly.stream-dirty (60746a843e7ef8855ae00f2ffcb655c53e0e8296) 2023-01-04T10:56:50.810326376Z I0104 10:56:50.810190 18013 start.go:125] Calling chroot("/rootfs") 2023-01-04T10:56:50.810326376Z I0104 10:56:50.810274 18013 update.go:1972] Running: systemctl start rpm-ostreed 2023-01-04T10:56:50.855151883Z I0104 10:56:50.854666 18013 rpm-ostree.go:353] Running captured: rpm-ostree status --json 2023-01-04T10:56:50.899635929Z I0104 10:56:50.899574 18013 rpm-ostree.go:353] Running captured: rpm-ostree status --json 2023-01-04T10:56:50.941236704Z I0104 10:56:50.941179 18013 daemon.go:236] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:318187717bd19ef265000570d5580ea680dfbe99c3bece6dd180537a6f268f e1 (410.84.202210061459-0) 2023-01-04T10:56:50.973206073Z I0104 10:56:50.973131 18013 start.go:101] Copied self to /run/bin/machine-config-daemon on host 2023-01-04T10:56:50.973259966Z E0104 10:56:50.973196 18013 start.go:177] failed to load kubelet kubeconfig: open /etc/kubernetes/kubeconfig: no such file or directory 2023-01-04T10:56:50.975399571Z panic: runtime error: invalid memory address or nil pointer dereference 2023-01-04T10:56:50.975399571Z [signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x173d84f] 2023-01-04T10:56:50.975399571Z 2023-01-04T10:56:50.975399571Z goroutine 1 [running]: 2023-01-04T10:56:50.975399571Z main.runStartCmd(2023-01-04T10:56:50.975436752Z 0x2c3da80?, {0x1bc0b3b?, 0x0?, 0x0?}) 2023-01-04T10:56:50.975436752Z /go/src/github.com/openshift/machine-config-operator/cmd/machine-config-daemon/start.go:179 +0x70f 2023-01-04T10:56:50.975436752Z github.com/spf13/cobra.(*Command).execute(0x2c3da80, {0x2c89310, 0x0, 0x0}) 2023-01-04T10:56:50.975436752Z /go/src/github.com/openshift/machine-config-operator/vendor/github.com/spf13/cobra/command.go:860 +0x663 2023-01-04T10:56:50.975448580Z github.com/spf13/cobra.(*Command).ExecuteC(0x2c3d580) 2023-01-04T10:56:50.975448580Z /go/src/github.com/openshift/machine-config-operator/vendor/github.com/spf13/cobra/command.go:974 +0x3b4 2023-01-04T10:56:50.975456464Z github.com/spf13/cobra.(*Command).Execute(...) 2023-01-04T10:56:50.975456464Z 2023-01-04T10:56:50.975464649Z /go/src/github.com/openshift/machine-config-operator/vendor/github.com/spf13/cobra/command.go:902 2023-01-04T10:56:50.975464649Z k8s.io/component-base/cli.Run(2023-01-04T10:56:50.975472575Z 0x2c3d580) 2023-01-04T10:56:50.975472575Z /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/component-base/cli/run.go:105 +0x385 2023-01-04T10:56:50.975485076Z main.main() 2023-01-04T10:56:50.975485076Z /go/src/github.com/openshift/machine-config-operator/cmd/machine-config-daemon/main.go:28 +0x25
Version-Release number of selected component (if applicable):
4.11.20
How reproducible:
Always
Steps to Reproduce:
1. Remove / change the name of the file "/etc/kubernetes/kubeconfig" 2. Delete machine-config-daemon pod 3.
Actual results:
2023-01-04T10:56:50.973259966Z E0104 10:56:50.973196 18013 start.go:177] failed to load kubelet kubeconfig: open /etc/kubernetes/kubeconfig: no such file or directory 2023-01-04T10:56:50.975399571Z panic: runtime error: invalid memory address or nil pointer dereference
Expected results:
A fatal error ("failed to load kubelet kubeconfig: open /etc/kubernetes/kubeconfig: no such file or directory") but no runtime panic.
Additional info:
https://github.com/openshift/machine-config-operator/blob/92012a837e2ed0ed3c9e61c715579ac82ad0a464/cmd/machine-config-daemon/start.go#L179
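A hedged sketch of the expected behavior (this is not the MCO's actual code): fail fast with a clear fatal error when the kubeconfig cannot be loaded, instead of continuing and hitting the nil pointer dereference shown above.

package main

import (
	"log"

	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Path taken from the bug report.
	const kubeconfigPath = "/etc/kubernetes/kubeconfig"

	config, err := clientcmd.BuildConfigFromFlags("", kubeconfigPath)
	if err != nil {
		// Exit with a clear message instead of falling through to a nil
		// pointer dereference later on.
		log.Fatalf("failed to load kubelet kubeconfig: %v", err)
	}
	_ = config
}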
Description of problem:
The installer gets stuck at the beginning of installation if a BYO private hosted zone is configured in the install-config; from the CI logs, the installer takes no action for 2 hours. Errors: level=info msg=Credentials loaded from the "default" profile in file "/var/run/secrets/ci.openshift.io/cluster-profile/.awscred" 185 {"component":"entrypoint","file":"k8s.io/test-infra/prow/entrypoint/run.go:164","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 2h0m0s timeout","severity":"error","time":"2023-03-05T16:44:27Z"}
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2023-03-23-000343
How reproducible:
Always
Steps to Reproduce:
1. Create an install-config.yaml, and config byo private hosted zone 2. Create the cluster
Actual results:
The installer showed the following message and then got stuck; the cluster cannot be created. level=info msg=Credentials loaded from the "default" profile in file "/var/run/secrets/ci.openshift.io/cluster-profile/.awscred"
Expected results:
The cluster is created successfully.
Additional info:
Description of problem:
It's not currently possible to override the base image selected by the command: $ openshift-install agent create image. Also, defining the OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE variable does not have any effect.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
By defining the OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE when creating the image
Steps to Reproduce:
1. $ OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE=<valid url to rhcos image> 2. $ openshift-install agent create image 3.
Actual results:
The agent ISO is built using the embedded rhcos.json metadata instead of the RHCOS image specified in the OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE variable.
Expected results:
Defining OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE should allow overriding the base image selected for creating the agent ISO
Additional info:
Description of the problem:
In staging (UI 2.18.2, BE 2.18.0), Day 2 add hosts: getting the following error when assigning the auto-assign role:
Failed to set role:
Requested role (auto-assign) is invalid for host c746e34f-f44a-4291-9064-402ab95b5831 from infraEnv 2b4ee2bf-ee45-4f57-b64e-715bc955f92e
How reproducible:
100%
Steps to reproduce:
1. install day1 cluster
2. In OCM, go to add host and discover new host
3. Assign the auto-assign role to this host
Actual results:
Expected results:
Description of the problem:
Please see Screening
Once installation of a cluster with a valid custom manifest has started, the manifest is no longer listable: it is not shown in the UI, not mentioned in the cluster logs, and not returned when calling api/assisted-install/v2/clusters/{}/manifests.
Before installation the manifest is listed; however, after installation starts the HTTP API returns an error.
How reproducible:
100%
Steps to reproduce:
1. created cluster with custom manifest
2. was able to see manifest in cluster details in installation page (before installation started)
3.also able to retrieve it via http get request
4. started installation
Actual results:
The custom manifest is no longer visible and is not mentioned in the logs.
The HTTP GET request returns the above-mentioned error (500).
It seems the custom manifest was not added.
Expected results:
manifest should still be visible and applied
Description of problem:
For https://issues.redhat.com//browse/OCPBUGS-4998, additional logging was added to the wait-for command when the state is in pending-user-action in order to show the particular host errors preventing installation. This additional host info should be added at the WARNING level.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Test this in the same way as bug https://issues.redhat.com//browse/OCPBUGS-4998, i.e. by swapping the boot order of the disks.
2. When the log message with additional info is logged, it is logged at DEBUG level, for example: DEBUG Host master-2 Expected the host to boot from disk, but it booted the installation image - please reboot and fix boot order to boot from disk Virtual_disk 6000c295b246decdbb4f4e691c185fcf (sda, /dev/disk/by-id/wwn-0x6000c295b246decdbb4f4e691c185fcf) INFO cluster has stopped installing... working to recover installation
3. This has now been changed to log at WARNING level.
4. In addition, multiple messages are logged: "level=info msg=cluster has stopped installing... working to recover installation". This will change to only log it one time.
Actual results:
Expected results:
1. The message is now logged at WARNING level 2. Only one message for "cluster has stopped installing... working to recover installation" will appear
Additional info:
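A minimal illustrative sketch of the intended logging behavior (not the installer's actual code): per-host detail at Warning level, and the recovery message emitted only once. The message text is taken from the example above.

package main

import (
	"github.com/sirupsen/logrus"
)

func main() {
	// Per-host detail at Warning level (previously logged at Debug).
	logrus.Warn("Host master-2: expected the host to boot from disk, but it booted the installation image")

	// Emit the recovery message only once, even if recovery is attempted repeatedly.
	recoveryLogged := false
	for attempt := 0; attempt < 3; attempt++ {
		if !recoveryLogged {
			logrus.Info("cluster has stopped installing... working to recover installation")
			recoveryLogged = true
		}
	}
}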
From a recent PR run of the recovery suite:
> event happened 49 times, something is wrong: ns/openshift-etcd-operator deployment/etcd-operator hmsg/593a6eb603 - pathological/true reason/UnstartedEtcdMember unstarted members: NAME-PENDING-10.0.167.169 From: 10:39:53Z To: 10:39:54Z result=reject
Since the remainder of the test has passed, the event might not be reconciled correctly when a member is coming back in CEO. We should fix this event.
This is a clone of issue OCPBUGS-19492. The following is the description of the original issue:
—
Description of problem:
Keepalived constantly fails on bootstrap causing installation failure
It seems the bootstrap node doesn't have a keepalived.conf file, and the keepalived monitor fails on
Version-Release number of selected component (if applicable):
4.13.12
How reproducible:
Regular installation through assisted installer
Steps to Reproduce:
1. 2. 3.
Actual results:
keepalived fails to start
Expected results:
Success
Additional info:
*
Extend multus resource collection so that we gather all resources on a per namespace basis with oc adm inspect.
This way, users can create a combined must-gather with all resources in one place.
We might have to revisit this once the reconciler and other changes land in a more recent version of Multus, but for the time being this is a good change to make, and one we can also backport to older versions.
Due to the removal of the in-tree AWS provider (https://github.com/kubernetes/kubernetes/pull/115838), we need to ensure that KCM sets the --external-cloud-volume-plugin flag accordingly, especially since CSI migration was GA'd in 4.12/1.25.
The original PR that fixed this (https://github.com/openshift/cluster-kube-controller-manager-operator/pull/721) got reverted by mistake. We need to bring it back to unblock the kube rebase.
Description of problem:
When there is no public DNS zone, the lookup fails during installation. During the installation of a private cluster, there is no need for a public zone.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
FATAL failed to fetch Terraform Variables: failed to generate asset "Terraform Variables": failed to get GCP public zone: no matching public DNS Zone found
Expected results:
Installation complete
Additional info:
Description of problem:
cluster-dns-operator startup has an error message: [controller-runtime] log.SetLogger(...) was never called, logs will not be displayed:
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Start cluster-dns-operator 2. oc edit dnses.operator.openshift.io default -> Change operatorLogLevel to "Trace" or "Debug" (it doesn't matter which, we just want to trigger an update) 3. Observe backtrace in logs
Actual results:
[controller-runtime] log.SetLogger(...) was never called, logs will not be displayed: goroutine 201 [running]: runtime/debug.Stack() /usr/lib/golang/src/runtime/debug/stack.go:24 +0x65 sigs.k8s.io/controller-runtime/pkg/log.eventuallyFulfillRoot() /dns-operator/vendor/sigs.k8s.io/controller-runtime/pkg/log/log.go:59 +0xbd sigs.k8s.io/controller-runtime/pkg/log.(*delegatingLogSink).WithValues(0xc0000bae40, {0xc000768ae0, 0x6, 0x6}) /dns-operator/vendor/sigs.k8s.io/controller-runtime/pkg/log/deleg.go:168 +0x54 github.com/go-logr/logr.Logger.WithValues(...) /dns-operator/vendor/github.com/go-logr/logr/logr.go:323 sigs.k8s.io/controller-runtime/pkg/controller.NewUnmanaged.func1(0xc000991980) /dns-operator/vendor/sigs.k8s.io/controller-runtime/pkg/controller/controller.go:121 +0x1f6 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0003265a0, {0x1bddf28, 0xc00049d7c0}, {0x17b6120?, 0xc000991960?}) /dns-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:305 +0x18b sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0003265a0, {0x1bddf28, 0xc00049d7c0}) /dns-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:265 +0x1d9 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2() /dns-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:226 +0x85 created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2 /dns-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:222 +0x587
Expected results:
No error message
Additional info:
This is due to 1.27 rebase: https://github.com/openshift/cluster-dns-operator/pull/368
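For context, a minimal sketch of the kind of change controller-runtime expects (not necessarily the exact fix in the operator): set a logger early during startup so the "log.SetLogger(...) was never called" backtrace is not printed.

package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

func main() {
	// Register a logger before any controller-runtime code runs.
	ctrl.SetLogger(zap.New())

	// ... the rest of the operator setup would follow here ...
}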
Will require following
https://azure.github.io/azure-workload-identity/docs/installation/mutating-admission-webhook.html
Background
Please review the following PR: https://github.com/openshift/cluster-api-provider-gcp/pull/193
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-dns-operator/pull/363
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When running the local bridge with auth disabled, we see the error GET http://localhost:9000/api/request-token 404 (Not Found).
Version-Release number of selected component (if applicable):
latest master
How reproducible:
Always
Steps to Reproduce:
1. fetch latest openshift/console code and build 2. run local bridge './bin/bridge' 3.
Actual results:
Visiting localhost:9000, we see the error GET http://localhost:9000/api/request-token 404 (Not Found).
Expected results:
We should probably skip the /api/request-token request when auth is disabled, as suggested in https://github.com/openshift/console/pull/12553#discussion_r1103151813
Additional info:
Nodes in Ironic are created following pattern <namespace>~<host name>.
However, when creating nodes in Ironic, baremetal-operator first creates them without a namespace and only prepends the namespace prefix later. This opens up the possibility of node name clashes, especially in the ACM context.
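A minimal sketch of the idea (hypothetical helper, not baremetal-operator's actual code): construct the Ironic node name with the namespace prefix from the start, so hosts with the same name in different namespaces cannot clash.

package main

import "fmt"

// ironicNodeName builds the <namespace>~<host name> form up front instead of
// renaming the node after creation.
func ironicNodeName(namespace, hostName string) string {
	return fmt.Sprintf("%s~%s", namespace, hostName)
}

func main() {
	fmt.Println(ironicNodeName("openshift-machine-api", "ostest-extraworker-0"))
}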
This is a clone of issue OCPBUGS-19313. The following is the description of the original issue:
—
As a user, I don't want to see the option of "DeploymentConfigs" in any form I am filling in when I have not installed them in the cluster.
Description of problem:
The issue concerns the Add Pipeline checkbox. When there are two pipelines displayed in the dropdown menu, selecting one unchecks the Add Pipeline checkbox.
Version-Release number of selected component (if applicable):
4.13
How reproducible:
Always, when two pipelines exist in the namespace
Steps to Reproduce:
1. Go to the Git Import Page. Create the application with Add Pipelines checked and a pipeline selected. 2. Go to the Serverless Function Page. Select Add Pipelines checkbox and try to select a pipeline from the drop-down.
Actual results:
The Add Pipelines checkbox automatically gets unchecked on selecting a Pipeline from the drop-down (in case of multiple pipelines in the dropdown)
Expected results:
The Add Pipelines checkbox must not get unchecked.
Additional info:
Video Link: https://drive.google.com/file/d/1OPRXbMw-EiihO3LAlDiOsh8qvhhiJK5H/view?usp=sharing
Description of problem:
In the agent TUI, setting IPv6 Configuration to Automatic and enabling "Require IPv6 addressing for this connection" generates a message saying that the feature is not supported. The user is still allowed to quit the TUI (formally correct, given that we select 'Quit' from the menu; perhaps the 'Quit' option should remain greyed out until a valid config is applied?) and the boot process proceeds using an unsupported/non-working network configuration.
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2023-03-07-131556
How reproducible:
Steps to Reproduce:
1. Feed the agent ISO with an agent-config.yaml file that defines an ipv6 only, static network configuration 2. Boot from the generated agent ISO, wait for the agent TUI to appear, select 'Edit a connection', than change Ipv6 configuration from Manual to Automatic, contextually enable the 'Require IPV6 addressing for this connection' option. Accept the changes. 3. (Not sure if this step is necessary) Once back in the main agent TUI screen, select 'Activate a connection'. Select the currently active connection, de-activate and re-activate it. 4. Go back to main agent TUI screen, select Quit
Actual results:
The agent TUI displays the following message, then quits: "Failed to generate network state view: support for multiple default routes not yet implemented in agent-tui". Once the TUI quits, the boot process proceeds.
Expected results:
The TUI should block the possibility of enabling unsupported configurations. The agent TUI should inform the user about the unsupported configuration the moment it is applied (instead of informing the user the moment they select 'Quit') and stay open until a valid network configuration is applied. The TUI should put the boot process on hold until a valid network config is applied.
Additional info:
OCP Version: 4.13.0-0.nightly-2023-03-07-131556 agent-config.yaml snippet networkConfig: interfaces: - name: eno1 type: ethernet state: up mac-address: 34:73:5A:9E:59:10 ipv6: enabled: true address: - ip: 2620:52:0:1eb:3673:5aff:fe9e:5910 prefix-length: 64 dhcp: false
Description of problem:
I found an old shell error while checking logs. We don't quote the variable in the [ -z ] test.
if [ -z $DHCP6_IP6_ADDRESS ]
then
    >&2 echo "Not a DHCP6 address. Ignoring."
    exit 0
fi
Dec 05 12:05:02 master-0-2 nm-dispatcher[1365]: time="2022-12-05T12:05:02Z" level=debug msg="Ignoring filtered route {Ifindex: 10 Dst: fd2e:6f44:5dd8::59/128 Src: <nil> Gw: <nil> Flags: [] Table: 254}" Dec 05 12:05:02 master-0-2 nm-dispatcher[1365]: time="2022-12-05T12:05:02Z" level=debug msg="Ignoring filtered route {Ifindex: 10 Dst: fd2e:6f44:5dd8::5a/128 Src: <nil> Gw: <nil> Flags: [] Table: 254}" Dec 05 12:05:27 master-0-2 nm-dispatcher[1365]: req:19 'up' [br-ex], "/etc/NetworkManager/dispatcher.d/30-static-dhcpv6": run script Dec 05 12:05:27 master-0-2 nm-dispatcher[1365]: + '[' -z fd2e:6f44:5dd8::5a fd2e:6f44:5dd8::59 ']' Dec 05 12:05:27 master-0-2 nm-dispatcher[1365]: /etc/NetworkManager/dispatcher.d/30-static-dhcpv6: line 4: [: fd2e:6f44:5dd8::5a: binary operator expected Dec 05 12:05:27 master-0-2 nm-dispatcher[1365]: ++ ip -j -6 a show br-ex Dec 05 12:05:27 master-0-2 nm-dispatcher[1365]: ++ jq -r '.[].addr_info[] | select(.scope=="global") | select(.deprecated!=true) | select(.local=="fd2e:6f44:5dd8::5a fd2e:6f44:5dd8::59") | .preferred_life_time' Dec 05 12:05:27 master-0-2 nm-dispatcher[1365]: + LEASE_TIME= Dec 05 12:05:27 master-0-2 nm-dispatcher[1365]: ++ ip -j -6 a show br-ex Dec 05 12:05:27 master-0-2 nm-dispatcher[1365]: ++ jq -r '.[].addr_info[] | select(.scope=="global") | select(.deprecated!=true) | select(.local=="fd2e:6f44:5dd8::5a fd2e:6f44:5dd8::59") | .prefixlen' Dec 05 12:05:27 master-0-2 nm-dispatcher[1365]: + PREFIX_LEN= Dec 05 12:05:27 master-0-2 nm-dispatcher[1365]: + '[' 0 -lt 4294967295 ']' Dec 05 12:05:27 master-0-2 nm-dispatcher[1365]: + echo 'Not an infinite DHCP6 lease. Ignoring.' Dec 05 12:05:27 master-0-2 nm-dispatcher[1365]: Not an infinite DHCP6 lease. Ignoring. Dec 05 12:05:27 master-0-2 nm-dispatcher[1365]: + exit 0 Dec 05 12:05:27 master-0-2 nm-dispatcher[1365]: req:19 'up' [
Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2022-11-30-111136
How reproducible:
Twice
Steps to Reproduce:
1. Somehow DHCPv6 provides two IPv6 leases 2. NetworkManager sets $DHCP6_IP6_ADDRESS to all IPv6 addresses with spaces in between 3. Bash error
Actual results:
/etc/NetworkManager/dispatcher.d/30-static-dhcpv6: line 4: [: fd2e:6f44:5dd8::5a: binary operator expected
Expected results:
shell inputs are sanitized or properly quoted.
Additional info:
This is a clone of issue OCPBUGS-19868. The following is the description of the original issue:
—
The cluster-version operator should not crash while trying to evaluate a bogus condition.
4.10 and later are exposed to the bug. It's possible that the OCPBUGS-19512 series increases exposure.
Unclear.
1. Create a cluster.
2. Point it at https://raw.githubusercontent.com/shellyyang1989/upgrade-cincy/master/cincy-conditional-edge.json (you may need to adjust version strings and digests for your test-cluster's release).
3. Wait around 30 minutes.
4. Point it at https://raw.githubusercontent.com/shellyyang1989/upgrade-cincy/master/cincy-conditional-edge-invalid-promql.json (again, may need some customization).
$ grep -B1 -A15 'too fresh' previous.log I0927 12:07:55.594222 1 cincinnati.go:114] Using a root CA pool with 0 root CA subjects to request updates from https://raw.githubusercontent.com/shellyyang1989/upgrade-cincy/master/cincy-conditional-edge-invalid-promql.json?arch=amd64&channel=stable-4.15&id=dc628f75-7778-457a-bb69-6a31a243c3a9&version=4.15.0-0.test-2023-09-27-091926-ci-ln-01zw7kk-latest I0927 12:07:55.726463 1 cache.go:118] {"type":"PromQL","promql":{"promql":"0 * group(cluster_version)"}} is the most stale cached cluster-condition match entry, but it is too fresh (last evaluated on 2023-09-27 11:37:25.876804482 +0000 UTC m=+175.082381015). However, we don't have a cached evaluation for {"type":"PromQL","promql":{"promql":"group(cluster_version_available_updates{channel=buggy})"}}, so attempt to evaluate that now. I0927 12:07:55.726602 1 cache.go:129] {"type":"PromQL","promql":{"promql":"0 * group(cluster_version)"}} is stealing this cluster-condition match call for {"type":"PromQL","promql":{"promql":"group(cluster_version_available_updates{channel=buggy})"}}, because its last evaluation completed 30m29.849594461s ago I0927 12:07:55.758573 1 cvo.go:703] Finished syncing available updates "openshift-cluster-version/version" (170.074319ms) E0927 12:07:55.758847 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference) goroutine 194 [running]: k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1c4df00?, 0x32abc60}) /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x99 k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc001489d40?}) /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x75 panic({0x1c4df00, 0x32abc60}) /usr/lib/golang/src/runtime/panic.go:884 +0x213 github.com/openshift/cluster-version-operator/pkg/clusterconditions/promql.(*PromQL).Match(0xc0004860e0, {0x220ded8, 0xc00041e550}, 0x0) /go/src/github.com/openshift/cluster-version-operator/pkg/clusterconditions/promql/promql.go:134 +0x419 github.com/openshift/cluster-version-operator/pkg/clusterconditions/cache.(*Cache).Match(0xc0002d3ae0, {0x220ded8, 0xc00041e550}, 0xc0033948d0) /go/src/github.com/openshift/cluster-version-operator/pkg/clusterconditions/cache/cache.go:132 +0x982 github.com/openshift/cluster-version-operator/pkg/clusterconditions.(*conditionRegistry).Match(0xc000016760, {0x220ded8, 0xc00041e550}, {0xc0033948a0, 0x1, 0x0?})
No panics.
I'm still not entirely clear on how OCPBUGS-19512 would have increased exposure.
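A hedged sketch of the kind of guard that avoids this class of panic (the types below are placeholders, not the CVO's real ones): check for a nil PromQL result before dereferencing it and surface an error instead of crashing.

package main

import (
	"errors"
	"fmt"
)

// queryResult stands in for whatever a PromQL evaluation returns; the real
// type in the CVO differs.
type queryResult struct {
	value float64
}

// match reports whether the cluster condition holds, refusing to dereference
// a nil result.
func match(result *queryResult) (bool, error) {
	if result == nil {
		// The panic in the bug came from dereferencing a nil result; report
		// the bogus condition as an error instead.
		return false, errors.New("PromQL evaluation returned no result")
	}
	return result.value != 0, nil
}

func main() {
	ok, err := match(nil)
	fmt.Println(ok, err)
}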
There are Prometheus rules defined in the kube-state rules which trigger the `Kube*QuotaOvercommit` alerts.
These alerts are triggered when the sum of memory/CPU resource quotas for the default/kube-/openshift- namespaces exceeds the capacity of the cluster.
Since there are no quotas defined inside the default OCP projects, and the customer is not expected to create any quota for the default OCP projects, these alerts do not add any value; it would be good to have them removed.
This is a clone of issue OCPBUGS-18267. The following is the description of the original issue:
—
Description of problem:
'404: Not Found' will show on Knative-serving Details page
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-06-13-223353
How reproducible:
Always
Steps to Reproduce:
1. Installed 'Serveless' Operator, make sure the operator has been installed successfully, and the Knative Serving instance is created without any error 2. Navigate to Administration -> Cluster Settings -> Global Configuration 3. Go to Knative-serving Details page, check if 404 not found message is there 3.
Actual results:
Page will show 404 not found
Expected results:
the 404 not found page should not show
Additional info:
The dependency ticket is OCPBUGS-15008; more information can be found in the comments.
Description of problem:
When deploying KafkaMirrorMaker through the OLM form (in the AMQ Streams and Strimzi operators), we have to specify fields which already have defaults and are optional:
For all other components this works correctly.
Version-Release number of selected component (if applicable):
4.6
4.7
4.8
4.9
How reproducible:
Steps to Reproduce:
1. Deploy Strimzi 0.27.0 or AMQ Streams 1.8.4 via OLM
2. Try to deploy KafkaMirrorMaker via Form view without any changes
Actual results:
The CR cannot be created because several required fields (all in the Liveness probe, Readiness probe, and Tracing sections) are not filled in.
Expected results:
The CR is created, because all required fields are set (whitelist/include, Kafka bootstrap address, and replica count); nothing else is needed.
Additional info:
openshift-azure-routes.path has the following [Path] section:
[Path]
PathExistsGlob=/run/cloud-routes/*
PathChanged=/run/cloud-routes/
MakeDirectory=true
There was a change in systemd that re-checks the files watched with PathExistsGlob once the service finishes:
With this commit, systemd rechecks all paths specs whenever the triggered unit deactivates. If any PathExists=, PathExistsGlob= or DirectoryNotEmpty= predicate passes, the triggered unit is reactivated
This means that openshift-azure-routes will get triggered all the time as long as there are files in /run/cloud-routes.
Description of problem:
Backport https://github.com/kubernetes/kubernetes/pull/117371
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Web-terminal tests are constantly failing on CI. Disable them until they are fixed.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-console-master-e2e-gcp-console https://search.ci.openshift.org/?search=Web+Terminal+for+Admin+user&maxAge=336h&context=1&type=junit&name=pull-ci-openshift-console-master-e2e-gcp-console&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Expected results:
Additional info:
Description of problem:
kubevirt digest missing from RHCOS boot image
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Unable to create kubevirt cluster
Expected results:
Able to create kubevirt cluster
Additional info:
Description of problem:
aws-proxy jobs are failing with workers unable to come up. Example job run[1]. On the console, the workers report 500 errors trying to retrieve the worker ignition[2]. Is it possible https://github.com/openshift/machine-config-operator/pull/3662 broke things? See logs below. [1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-proxy/1648560213655031808 [2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-proxy/1648560213655031808/artifacts/e2e-aws-ovn-proxy/gather-aws-console/artifacts/i-071b5af3ddb12e55c
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Install with a proxy
Actual results:
No workers come up
Expected results:
Additional info:
Logs are reporting: 2023-04-19T12:29:38.244051716Z I0419 12:29:38.244006 1 container_runtime_config_controller.go:415] Error syncing image config openshift-config: could not get ControllerConfig controllerconfig.machineconfiguration.openshift .io "machine-config-controller" not found 2023-04-19T12:29:56.507515526Z I0419 12:29:56.507472 1 render_controller.go:377] Error syncing machineconfigpool worker: controllerconfig.machineconfiguration.openshift.io "machine-config-controller" not found ./pods/machine-config-operator-6d7c6c8ccf-m7c57/machine-config-operator/machine-config-operator/logs/current.log:2023-04-19T12:38:15.240508503Z E0419 12:38:15.240437 1 operator.go:342] ControllerConfig.machineconfiguration.openshift.io "machine-config-controller" is invalid: [spec.proxy.apiVersion: Required value: must not be empty, spec.proxy.kind: Required value: must not be empty, <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]
csi-snapshot-controller ServiceAccount does not include the HCP pull-secret in its imagePullSecrets. Thus, if a HostedCluster is created with a `pullSecret` that contains creds that the management cluster pull secret does not have, the image pull fails.
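A minimal sketch of the expected behavior (hypothetical helper, not HyperShift's actual reconciliation code): ensure the ServiceAccount lists the hosted control plane pull secret in its imagePullSecrets. The secret name below is a placeholder.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// ensurePullSecret adds secretName to the ServiceAccount's imagePullSecrets
// if it is not already present.
func ensurePullSecret(sa *corev1.ServiceAccount, secretName string) {
	for _, ref := range sa.ImagePullSecrets {
		if ref.Name == secretName {
			return // already present
		}
	}
	sa.ImagePullSecrets = append(sa.ImagePullSecrets, corev1.LocalObjectReference{Name: secretName})
}

func main() {
	sa := &corev1.ServiceAccount{}
	ensurePullSecret(sa, "pull-secret") // placeholder secret name
	fmt.Println(sa.ImagePullSecrets)
}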
Description of problem:
Configure diskEncryptionSet as below in install-config.yaml, and not set subscriptionID as it is optional parameter. install-config.yaml -------------------------------- compute: - architecture: amd64 hyperthreading: Enabled name: worker platform: azure: encryptionAtHost: true osDisk: diskEncryptionSet: resourceGroup: jima07a-rg name: jima07a-des replicas: 3 controlPlane: architecture: amd64 hyperthreading: Enabled name: master platform: azure: encryptionAtHost: true osDisk: diskEncryptionSet: resourceGroup: jima07a-rg name: jima07a-des replicas: 3 platform: azure: baseDomainResourceGroupName: os4-common cloudName: AzurePublicCloud outboundType: Loadbalancer region: centralus defaultMachinePlatform: osDisk: diskEncryptionSet: resourceGroup: jima07a-rg name: jima07a-des Then create manifests file and create cluster, installer failed with error: $ ./openshift-install create cluster --dir ipi --log-level debug ... INFO Credentials loaded from file "/home/fedora/.azure/osServicePrincipal.json" FATAL failed to fetch Terraform Variables: failed to fetch dependency of "Terraform Variables": failed to generate asset "Platform Provisioning Check": platform.azure.defaultMachinePlatform.osDisk.diskEncryptionSet: Invalid value: azure.DiskEncryptionSet{SubscriptionID:"", ResourceGroup:"jima07a-rg", Name:"jima07a-des"}: failed to get disk encryption set: compute.DiskEncryptionSetsClient#Get: Failure responding to request: StatusCode=400 -- Original Error: autorest/azure: Service returned an error. Status=400 Code="InvalidSubscriptionId" Message="The provided subscription identifier 'resourceGroups' is malformed or invalid." Checked manifest file cluster-config.yaml, and found that subscriptionId is not filled out automatically under defaultMachinePlatform $ cat cluster-config.yaml apiVersion: v1 data: install-config: | additionalTrustBundlePolicy: Proxyonly apiVersion: v1 baseDomain: qe.azure.devcluster.openshift.com compute: - architecture: amd64 hyperthreading: Enabled name: worker platform: azure: encryptionAtHost: true osDisk: diskEncryptionSet: name: jima07a-des resourceGroup: jima07a-rg subscriptionId: 53b8f551-f0fc-4bea-8cba-6d1fefd54c8a diskSizeGB: 0 diskType: "" osImage: offer: "" publisher: "" sku: "" version: "" type: "" replicas: 3 controlPlane: architecture: amd64 hyperthreading: Enabled name: master platform: azure: encryptionAtHost: true osDisk: diskEncryptionSet: name: jima07a-des resourceGroup: jima07a-rg subscriptionId: 53b8f551-f0fc-4bea-8cba-6d1fefd54c8a diskSizeGB: 0 diskType: "" osImage: offer: "" publisher: "" sku: "" version: "" type: "" replicas: 3 metadata: creationTimestamp: null name: jimadesa networking: clusterNetwork: - cidr: 10.128.0.0/14 hostPrefix: 23 machineNetwork: - cidr: 10.0.0.0/16 networkType: OVNKubernetes serviceNetwork: - 172.30.0.0/16 platform: azure: baseDomainResourceGroupName: os4-common cloudName: AzurePublicCloud defaultMachinePlatform: osDisk: diskEncryptionSet: name: jima07a-des resourceGroup: jima07a-rg diskSizeGB: 0 diskType: "" osImage: offer: "" publisher: "" sku: "" version: "" type: "" outboundType: Loadbalancer region: centralus publish: External It works well when setting disk encryption set without subscriptionId under defalutMachinePlatform or controlPlane/compute.
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2023-03-05-104719
How reproducible:
Always on 4.11, 4.12, 4.13
Steps to Reproduce:
1. Prepare install-config, configure diskEncrpytionSet under defaultMchinePlatform, controlPlane and compute without subscriptionId 2. Install cluster 3.
Actual results:
The installer fails with the error above; the subscriptionId is not filled in automatically under defaultMachinePlatform.
Expected results:
The cluster is installed successfully.
Additional info:
Description of problem:
The OCP installer's OpenStack Ironic iRMC driver doesn't work with FIPS mode enabled, as it requires the SNMP version to be set to v3. However, there is no way to set the SNMP version parameter in the RHOCP installer YAML file, so it falls back to the default v2, and it fails 100% of the time.
Version-Release number of selected component (if applicable):
Release Number: 14.0-ec.0 Drivers or hardware or architecture dependency: Deploy baremetal node with BMC using iRMC protocol(When RHOCP installer uses OpenStack Ironic iRMC driver) Hardware configuration: Model/Hypervisor: PRIMERGY RX2540 M6 CPU Info: Intel(R) Xeon(R) Gold 5317 CPU @ 3.00GHz Memory Info: 125G Hardware Component Information: None Configuration Info: None Guest Configuration Info: None
How reproducible:
Always
Steps to Reproduce:
1. Enable FIPS mode of RHOCP nodes through setting "fips" to "true" at install-config.yaml. 2. In install-config.yaml, set platform.baremetal.hosts.bmc.address to start with 'irmc://' 3. Run OpenShift Container Platform installer.
Actual results:
The OpenStack Ironic iRMC driver used in the OpenShift Container Platform installer doesn't work and installation fails. The log message suggests setting the SNMP version parameter of the Ironic iRMC driver to v3 (a non-default value) when FIPS mode is enabled.
Expected results:
When FIPS mode is enabled on RHOCP, OpenStack Ironic iRMC driver used in RHOCP installer checks whether iRMC driver is configured to use SNMP (current OCP installer configures iRMC driver not to use SNMP) and if iRMC driver is configured not to use SNMP, driver doesn't require setting SNMP version parameter to v3 and installation proceeds. If iRMC driver is configured to use SNMP, driver requires setting SNMP version parameter to v3.
Additional info:
When FIPS mode is enabled, installation of RHOCP into Fujitsu server fails because OpenStack Ironic iRMC driver, which is used in RHOCP installer, requires iRMC driver's SNMP version parameter to be set to v3 even though iRMC driver isn't configured to use SNMP and there is no way to set it to v3. Installing RHOCP with IPI to baremetal node uses install-config.yaml. User sets configuration related to RHOCP in install-config.yaml. This installation uses OpenStack Ironic internally and values in install-config.yaml affect behavior of Ironic. During installation, Ironic connects to BMC(Baseboard management controller) and does operation related to RHOCP installation (e.g. power management). Ironic uses iRMC driver to operate on Fujitsu server's BMC. And iRMC driver checks iRMC-driver-specific Ironic parameters stored at Ironic component. When FIPS is enabled (i.e. "fips" is set to "true" in install-config.yaml), iRMC driver checks whether SNMP version specified in Ironic parameter to be set to v3 even though iRMC driver isn't configured to use SNMP internally. Currently, default value of SNMP version parameter of Ironic, which is iRMC driver specific parameter, is v2c and not v3. And iRMC driver fails with error if SNMP version is set to other than v3 when FIPS enabled. However, there is no way to set SNMP version parameter in RHOCP and that parameter is set to v2c by default. So when FIPS is enabled, deployment of OpenShift to Fujitsu server always fails. Cause of problem is, when FIPS is enabled, iRMC driver always requires SNMP version parameter to be set to v3 even though iRMC driver is not configured to use SNMP (current RHOCP installer configures iRMC driver not to use SNMP). To solve this problem, iRMC driver should be modified to check whether iRMC driver is configured to use SNMP internally and, if iRMC driver is configured to use SNMP and FIPS is enabled, requires SNMP version parameter to be set to v3. Such modification patch is already submitted to OpenStack Ironic community[1]. Summary of actions taken to resolve issue: Use OpenStack Ironic iRMC driver which incorporates bug fix patch[1] submitted on OpenStack Ironic community. [1] https://review.opendev.org/c/openstack/ironic/+/881358
Description of problem:
Currently PowerVS uses a DefaultMachineCIDR of 192.168.0.0/24. This will create network conflicts if another cluster is created in the same zone.
Version-Release number of selected component (if applicable):
current master branch
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The fix is to use a random number for DefaultMachineCIDR: 192.168.%d.0/24. This should significantly reduce the chance of collisions.
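A minimal sketch of the proposed approach (the function name is hypothetical): pick a random third octet so each cluster gets a different /24 by default.

package main

import (
	"fmt"
	"math/rand"
)

// defaultMachineCIDR returns a random /24 inside 192.168.0.0/16 instead of
// always returning 192.168.0.0/24.
func defaultMachineCIDR() string {
	octet := rand.Intn(256) // 0-255
	return fmt.Sprintf("192.168.%d.0/24", octet)
}

func main() {
	fmt.Println(defaultMachineCIDR())
}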
This is a clone of issue OCPBUGS-13829. The following is the description of the original issue:
—
Description of problem:
The configured accessTokenInactivityTimeout under tokenConfig in HostedCluster doesn't have any effect: 1. The value is not updated in the oauth-openshift configmap. 2. HostedCluster allows the user to set an accessTokenInactivityTimeout value < 300s, whereas in a master cluster the value should be > 300s.
Version-Release number of selected component (if applicable):
4.13
How reproducible:
Always
Steps to Reproduce:
1. Install a fresh 4.13 hypershift cluster 2. Configure accessTokenInactivityTimeout as below: $ oc edit hc -n clusters ... spec: configuration: oauth: identityProviders: ... tokenConfig: accessTokenInactivityTimeout: 100s ... 3. Check the hcp: $ oc get hcp -oyaml ... tokenConfig: accessTokenInactivityTimeout: 1m40s ... 4. Login to guest cluster with testuser-1 and get the token $ oc login https://a8890bba21c9b48d4a05096eee8d4edd-738276775c71fb8f.elb.us-east-2.amazonaws.com:6443 -u testuser-1 -p xxxxxxx $ TOKEN=`oc whoami -t` $ oc login --token="$TOKEN" WARNING: Using insecure TLS client config. Setting this option is not supported! Logged into "https://a8890bba21c9b48d4a05096eee8d4edd-738276775c71fb8f.elb.us-east-2.amazonaws.com:6443" as "testuser-1" using the token provided. You don't have any projects. You can try to create a new project, by running oc new-project <projectname>
Actual results:
1. hostedcluster will allow user to set the value < 300s for accessTokenInactivityTimeout which is not possible on master cluster. 2. The value is not updated in oauth-openshift configmap: $ oc get cm oauth-openshift -oyaml -n clusters-hypershift-ci-25785 ... tokenConfig: accessTokenMaxAgeSeconds: 86400 authorizeTokenMaxAgeSeconds: 300 ... 3. Login doesn't fail even if the user is not active for more than the set accessTokenInactivityTimeout seconds.
Expected results:
Login fails if the user is not active within the accessTokenInactivityTimeout seconds.
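A hedged sketch of the missing validation (not HyperShift's actual code): reject timeout values below the 300-second minimum that standalone clusters enforce.

package main

import (
	"fmt"
	"time"
)

// minInactivityTimeout mirrors the 300s minimum described in the bug report.
const minInactivityTimeout = 300 * time.Second

func validateInactivityTimeout(d time.Duration) error {
	if d < minInactivityTimeout {
		return fmt.Errorf("accessTokenInactivityTimeout %s is below the minimum of %s", d, minInactivityTimeout)
	}
	return nil
}

func main() {
	fmt.Println(validateInactivityTimeout(100 * time.Second)) // rejected, as in the bug
	fmt.Println(validateInactivityTimeout(600 * time.Second)) // accepted
}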
Description of problem:
In the administrator console UI, an admin user goes to Workloads -> Pods, selects one project (for example, openshift-console), selects one pod, goes to the Pod details page, clicks the "Metrics" tab, and then clicks on the "Network in" or "Network out" graph. The displayed Prometheus expression has spaces before and after "pod_network_name_info", i.e. "( pod_network_name_info )"; "pod_network_name_info" is enough.
"Network in" expression
(sum(irate(container_network_receive_bytes_total{pod='console-5f4978747c-vmxqf', namespace='openshift-console'}[5m])) by (pod, namespace, interface)) + on(namespace,pod,interface) group_left(network_name) ( pod_network_name_info )
"Network out" expression
(sum(irate(container_network_transmit_bytes_total{pod='console-5f4978747c-vmxqf', namespace='openshift-console'}[5m])) by (pod, namespace, interface)) + on(namespace,pod,interface) group_left(network_name) ( pod_network_name_info )
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-05-19-234822
How reproducible:
always
Steps to Reproduce:
1. see the description 2. 3.
Actual results:
there are spaces before and after pod_network_name_info
Expected results:
no additional spaces
Additional info:
The bug has no functional impact.
Description of problem:
Using an agent-config.yaml in DHCP network mode (i.e., without the 'hosts' property) throws this error when loading the config image: load-config-iso.sh[1656]: Expected file /etc/assisted/manifests/nmstateconfig.yaml is not in archive
Version-Release number of selected component (if applicable):
4.14 (master)
How reproducible:
100%
Steps to Reproduce:
1. Create an agent-config.yaml without 'hosts' property. 2. Generate a config-image. 3. Boot the machine and mount the ISO.
Actual results:
Installation can't continue due to an error on config-iso load: load-config-iso.sh[1656]: Expected file /etc/assisted/manifests/nmstateconfig.yaml is not in archive
Expected results:
The installation should continue as normal.
Additional info:
The issue is probably due to a fix introduced for static networking: https://issues.redhat.com/browse/OCPBUGS-15637 I.e. since '/etc/assisted/manifests/nmstateconfig.yaml' was added to GetConfigImageFiles, it's now mandatory on load-config.iso.sh (see 'copy_archive_contents' func). The failure was missed on dev-scripts tests probably due to this issue: https://github.com/openshift-metal3/dev-scripts/pull/1551
Description of problem:
https://github.com/kubernetes/kubernetes/issues/118916
Version-Release number of selected component (if applicable):
4.14
How reproducible:
100%
Steps to Reproduce:
1. compare memory usage from v1 and v2 and notice differences with the same workloads 2. 3.
Actual results:
they slightly differ because of accounting differences
Expected results:
they should be largely the same
Additional info:
Description of problem:
Since the operator watches plugins to enable dynamic plugins, it should list that resource under `status.relatedObjects` in its ClusterOperator.
Additional info:
Migrated from bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2044588
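A minimal sketch of what the issue asks for (the resource entries are illustrative, not the operator's exact list): include the ConsolePlugin resource in the ClusterOperator's status.relatedObjects so it is collected by oc adm inspect and must-gather.

package main

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
)

func main() {
	related := []configv1.ObjectReference{
		{Group: "operator.openshift.io", Resource: "consoles", Name: "cluster"},
		// The entry below is the addition this issue asks for: the watched
		// ConsolePlugin resources.
		{Group: "console.openshift.io", Resource: "consoleplugins"},
	}
	fmt.Println(related)
}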
Description of problem:
In a freshly installed cluster, we can see hot-looping on the Service openshift-monitoring/cluster-monitoring-operator.
Looking at the CronJob hot-looping
# grep -A60 'Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff' cvo2.log | tail -n61 I0110 06:32:44.489277 1 generic.go:109] Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff: &unstructured.Unstructured{ Object: map[string]interface{}{ "apiVersion": string("batch/v1"), "kind": string("CronJob"), "metadata": map[string]interface{}{"annotations": map[string]interface{}{"include.release.openshift.io/ibm-cloud-managed": string("true"), "include.release.openshift.io/self-managed-high-availability": string("true")}, "creationTimestamp": string("2022-01-10T04:35:19Z"), "generation": int64(1), "managedFields": []interface{}{map[string]interface{}{"apiVersion": string("batch/v1"), "fieldsType": string("FieldsV1"), "fieldsV1": map[string]interface{}{"f:metadata": map[string]interface{}{"f:annotations": map[string]interface{}{".": map[string]interface{}{}, "f:include.release.openshift.io/ibm-cloud-managed": map[string]interface{}{}, "f:include.release.openshift.io/self-managed-high-availability": map[string]interface{}{}}, "f:ownerReferences": map[string]interface{}{".": map[string]interface{}{}, `k:{"uid":"334d6c04-126d-4271-96ec-d303e93b7d1c"}`: map[string]interface{}{}}}, "f:spec": map[string]interface{}{"f:concurrencyPolicy": map[string]interface{}{}, "f:failedJobsHistoryLimit": map[string]interface{}{}, "f:jobTemplate": map[string]interface{}{"f:spec": map[string]interface{}{"f:template": map[string]interface{}{"f:spec": map[string]interface{}{"f:containers": map[string]interface{}{`k:{"name":"collect-profiles"}`: map[string]interface{}{".": map[string]interface{}{}, "f:args": map[string]interface{}{}, "f:command": map[string]interface{}{}, "f:image": map[string]interface{}{}, ...}}, "f:dnsPolicy": map[string]interface{}{}, "f:priorityClassName": map[string]interface{}{}, "f:restartPolicy": map[string]interface{}{}, ...}}}}, "f:schedule": map[string]interface{}{}, ...}}, "manager": string("cluster-version-operator"), ...}, map[string]interface{}{"apiVersion": string("batch/v1"), "fieldsType": string("FieldsV1"), "fieldsV1": map[string]interface{}{"f:status": map[string]interface{}{"f:lastScheduleTime": map[string]interface{}{}, "f:lastSuccessfulTime": map[string]interface{}{}}}, "manager": string("kube-controller-manager"), ...}}, ...}, "spec": map[string]interface{}{ + "concurrencyPolicy": string("Allow"), + "failedJobsHistoryLimit": int64(1), "jobTemplate": map[string]interface{}{ + "metadata": map[string]interface{}{"creationTimestamp": nil}, "spec": map[string]interface{}{ "template": map[string]interface{}{ + "metadata": map[string]interface{}{"creationTimestamp": nil}, "spec": map[string]interface{}{ "containers": []interface{}{ map[string]interface{}{ ... 
// 4 identical entries "name": string("collect-profiles"), "resources": map[string]interface{}{"requests": map[string]interface{}{"cpu": string("10m"), "memory": string("80Mi")}}, + "terminationMessagePath": string("/dev/termination-log"), + "terminationMessagePolicy": string("File"), "volumeMounts": []interface{}{map[string]interface{}{"mountPath": string("/etc/config"), "name": string("config-volume")}, map[string]interface{}{"mountPath": string("/var/run/secrets/serving-cert"), "name": string("secret-volume")}}, }, }, + "dnsPolicy": string("ClusterFirst"), "priorityClassName": string("openshift-user-critical"), "restartPolicy": string("Never"), + "schedulerName": string("default-scheduler"), + "securityContext": map[string]interface{}{}, + "serviceAccount": string("collect-profiles"), "serviceAccountName": string("collect-profiles"), + "terminationGracePeriodSeconds": int64(30), "volumes": []interface{}{ map[string]interface{}{ "configMap": map[string]interface{}{ + "defaultMode": int64(420), "name": string("collect-profiles-config"), }, "name": string("config-volume"), }, map[string]interface{}{ "name": string("secret-volume"), "secret": map[string]interface{}{ + "defaultMode": int64(420), "secretName": string("pprof-cert"), }, }, }, }, }, }, }, "schedule": string("*/15 * * * *"), + "successfulJobsHistoryLimit": int64(3), + "suspend": bool(false), }, "status": map[string]interface{}{"lastScheduleTime": string("2022-01-10T06:30:00Z"), "lastSuccessfulTime": string("2022-01-10T06:30:11Z")}, }, } I0110 06:32:44.499764 1 sync_worker.go:771] Done syncing for cronjob "openshift-operator-lifecycle-manager/collect-profiles" (574 of 765) I0110 06:32:44.499814 1 sync_worker.go:759] Running sync for deployment "openshift-operator-lifecycle-manager/olm-operator" (575 of 765)
Extract the manifest:
# cat 0000_50_olm_07-collect-profiles.cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  annotations:
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
  name: collect-profiles
  namespace: openshift-operator-lifecycle-manager
spec:
  schedule: "*/15 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: collect-profiles
          priorityClassName: openshift-user-critical
          containers:
          - name: collect-profiles
            image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2a8d116943a7c1eb32cd161a0de5cb173713724ff419a03abe0382a2d5d9c9a7
            imagePullPolicy: IfNotPresent
            command:
            - bin/collect-profiles
            args:
            - -n
            - openshift-operator-lifecycle-manager
            - --config-mount-path
            - /etc/config
            - --cert-mount-path
            - /var/run/secrets/serving-cert
            - olm-operator-heap-:https://olm-operator-metrics:8443/debug/pprof/heap
            - catalog-operator-heap-:https://catalog-operator-metrics:8443/debug/pprof/heap
            volumeMounts:
            - mountPath: /etc/config
              name: config-volume
            - mountPath: /var/run/secrets/serving-cert
              name: secret-volume
            resources:
              requests:
                cpu: 10m
                memory: 80Mi
          volumes:
          - name: config-volume
            configMap:
              name: collect-profiles-config
          - name: secret-volume
            secret:
              secretName: pprof-cert
          restartPolicy: Never
Looking at the in-cluster object:
# oc get cronjob.batch/collect-profiles -oyaml -n openshift-operator-lifecycle-manager
apiVersion: batch/v1
kind: CronJob
metadata:
  annotations:
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
  creationTimestamp: "2022-01-10T04:35:19Z"
  generation: 1
  name: collect-profiles
  namespace: openshift-operator-lifecycle-manager
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: 334d6c04-126d-4271-96ec-d303e93b7d1c
  resourceVersion: "450801"
  uid: d0b92cd3-3213-466c-921c-d4c4c77f7a6b
spec:
  concurrencyPolicy: Allow
  failedJobsHistoryLimit: 1
  jobTemplate:
    metadata:
      creationTimestamp: null
    spec:
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - args:
            - -n
            - openshift-operator-lifecycle-manager
            - --config-mount-path
            - /etc/config
            - --cert-mount-path
            - /var/run/secrets/serving-cert
            - olm-operator-heap-:https://olm-operator-metrics:8443/debug/pprof/heap
            - catalog-operator-heap-:https://catalog-operator-metrics:8443/debug/pprof/heap
            command:
            - bin/collect-profiles
            image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2a8d116943a7c1eb32cd161a0de5cb173713724ff419a03abe0382a2d5d9c9a7
            imagePullPolicy: IfNotPresent
            name: collect-profiles
            resources:
              requests:
                cpu: 10m
                memory: 80Mi
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            volumeMounts:
            - mountPath: /etc/config
              name: config-volume
            - mountPath: /var/run/secrets/serving-cert
              name: secret-volume
          dnsPolicy: ClusterFirst
          priorityClassName: openshift-user-critical
          restartPolicy: Never
          schedulerName: default-scheduler
          securityContext: {}
          serviceAccount: collect-profiles
          serviceAccountName: collect-profiles
          terminationGracePeriodSeconds: 30
          volumes:
          - configMap:
              defaultMode: 420
              name: collect-profiles-config
            name: config-volume
          - name: secret-volume
            secret:
              defaultMode: 420
              secretName: pprof-cert
  schedule: '*/15 * * * *'
  successfulJobsHistoryLimit: 3
  suspend: false
status:
  lastScheduleTime: "2022-01-11T03:00:00Z"
  lastSuccessfulTime: "2022-01-11T03:00:07Z"
Version-Release number of the following components:
4.10.0-0.nightly-2022-01-09-195852
How reproducible:
1/1
Steps to Reproduce:
1.Install a 4.10 cluster
2. Grep 'Updating .*due to diff' in the cvo log to check hot-loopings
3.
Actual results:
CVO hotloops on CronJob openshift-operator-lifecycle-manager/collect-profiles
Expected results:
CVO should not hotloop on it in a fresh installed cluster
Additional info:
attachment 1850058 CVO log file
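For reference, a minimal sketch of how this kind of hotloop can be spotted on a live cluster; the grep pattern matches the log line shown above, and a steadily growing count on an otherwise idle cluster is what hotlooping looks like in practice:

```bash
# Count how often the CVO reports a diff for the collect-profiles CronJob.
oc -n openshift-cluster-version logs deployment/cluster-version-operator \
  | grep -c 'Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff'
```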
Reproduced locally, the failure is:
level=error msg=Attempted to gather debug logs after installation failure: must provide bootstrap host address level=info msg=Cluster operator cloud-controller-manager TrustedCABundleControllerControllerAvailable is True with AsExpected: Trusted CA Bundle Controller works as expected level=info msg=Cluster operator cloud-controller-manager TrustedCABundleControllerControllerDegraded is False with AsExpected: Trusted CA Bundle Controller works as expected level=info msg=Cluster operator cloud-controller-manager CloudConfigControllerAvailable is True with AsExpected: Cloud Config Controller works as expected level=info msg=Cluster operator cloud-controller-manager CloudConfigControllerDegraded is False with AsExpected: Cloud Config Controller works as expected level=error msg=Cluster operator network Degraded is True with ApplyOperatorConfig: Error while updating operator configuration: could not apply (rbac.authorization.k8s.io/v1, Kind=RoleBindi ng) openshift-config-managed/openshift-network-public-role-binding: failed to apply / update (rbac.authorization.k8s.io/v1, Kind=RoleBinding) openshift-config-managed/openshift-network-publi c-role-binding: Patch "https://api-int.ostest.test.metalkube.org:6443/apis/rbac.authorization.k8s.io/v1/namespaces/openshift-config-managed/rolebindings/openshift-network-public-role-binding ?fieldManager=cluster-network-operator%2Foperconfig&force=true": dial tcp 192.168.111.5:6443: connect: connection refused
I haven't gone back to pin down all affected versions, but I wouldn't be surprised if we've had this exposure for a while. On a 4.12.0-ec.2 cluster, we have:
cluster:usage:resources:sum{resource="podnetworkconnectivitychecks.controlplane.operator.openshift.io"}
currently clocking in around 67983. I've gathered a dump with:
$ oc --as system:admin -n openshift-network-diagnostics get podnetworkconnectivitychecks.controlplane.operator.openshift.io | gzip >checks.gz
And many, many of these reference nodes which no longer exist (the cluster is aggressively autoscaled, with nodes coming and going all the time). We should fix garbage collection on this resource, to avoid consuming excessive amounts of memory in the Kube API server and etcd as they attempt to list the large resource set.
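A rough heuristic for gauging the leak (a sketch only; it simply compares object count against node count, without tying individual checks to specific nodes):

```bash
# Count PodNetworkConnectivityCheck objects vs. current nodes; a count that is
# orders of magnitude larger than the node count suggests checks for deleted
# nodes are never garbage collected.
oc --as system:admin -n openshift-network-diagnostics \
  get podnetworkconnectivitychecks.controlplane.operator.openshift.io --no-headers | wc -l
oc get nodes --no-headers | wc -l
```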
Description of problem:
Machine config pool selection fails when a single node has master+custom roles. The controller logs the error, but the node is not marked as degraded, so the end user does not see the error and no config can be applied on the node.
Version-Release number of selected component (if applicable):
4.12, 4.11.z
Steps to Reproduce:
1. Set up an SNO cluster
2. Create a custom MCP (a sketch is shown after these steps)
3. Add the custom MCP label on the node
4. Check the MCC pod log to see the error message about pool selection
5. Create an MC to apply config
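For illustration, a minimal sketch of steps 2 and 3; the pool name `infra` is an arbitrary example and the selector keys follow the usual MCO conventions:

```bash
# Step 2: create a custom MachineConfigPool that selects both the master and custom roles.
cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: infra
spec:
  machineConfigSelector:
    matchExpressions:
    - key: machineconfiguration.openshift.io/role
      operator: In
      values: [master, infra]
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/infra: ""
EOF

# Step 3: add the custom role label to the single node.
oc label node <sno-node-name> node-role.kubernetes.io/infra=""
```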
Actual results:
The node state appears good, but the single node cannot be assigned to any MCP.
Expected results:
The node should be marked as degraded with an error message.
Additional info:
Description of problem:
Azure MAG install fails with the Terraform error 'Error ensuring Resource Providers are registered'.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-07-27-172239
How reproducible:
Always
Steps to Reproduce:
1. Create MAG Azure cluster with IPI
Actual results:
Fail to create the installer when ‘Creating infrastructure resources…’ In terraform.log: 2023-07-29T11:33:02.938Z [ERROR] provider.terraform-provider-azurerm: Response contains error diagnostic: @module=sdk.proto tf_proto_version=5.3 tf_provider_addr=provider tf_req_id=45c10824-360b-b211-1ba1-9c3a722014af @caller=/go/src/github.com/openshift/installer/terraform/providers/azurerm/vendor/github.com/hashicorp/terraform-plugin-go/tfprotov5/internal/diag/diagnostics.go:55 diagnostic_detail= diagnostic_severity=ERROR diagnostic_summary="Error ensuring Resource Providers are registered.Terraform automatically attempts to register the Resource Providers it supports to ensure it's able to provision resources.If you don't have permission to register Resource Providers you may wish to use the "skip_provider_registration" flag in the Provider block to disable this functionality.Please note that if you opt out of Resource Provider Registration and Terraform tries to provision a resource from a Resource Provider which is unregistered, then the errors may appear misleading - for example:> API version 2019-XX-XX was not found for Microsoft.FooCould indicate either that the Resource Provider "Microsoft.Foo" requires registration, but this could also indicate that this Azure Region doesn't support this API version.More information on the "skip_provider_registration" flag can be found here: https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs#skip_provider_registrationOriginal Error: determining which Required Resource Providers require registration: the required Resource Provider "Microsoft.CustomProviders" wasn't returned from the Azure API" tf_rpc=Configure timestamp=2023-07-29T11:33:02.937Z 2023-07-29T11:33:02.938Z [ERROR] vertex "provider[\"openshift/local/azurerm\"]" error: Error ensuring Resource Providers are registered.Terraform automatically attempts to register the Resource Providers it supports to ensure it's able to provision resources.If you don't have permission to register Resource Providers you may wish to use the "skip_provider_registration" flag in the Provider block to disable this functionality.Please note that if you opt out of Resource Provider Registration and Terraform tries to provision a resource from a Resource Provider which is unregistered, then the errors may appear misleading - for example:> API version 2019-XX-XX was not found for Microsoft.FooCould indicate either that the Resource Provider "Microsoft.Foo" requires registration, but this could also indicate that this Azure Region doesn't support this API version.More information on the "skip_provider_registration" flag can be found here: https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs#skip_provider_registrationOriginal Error: determining which Required Resource Providers require registration: the required Resource Provider "Microsoft.CustomProviders" wasn't returned from the Azure API
Expected results:
Creating the cluster should succeed.
Additional info:
Suspect that issue with https://github.com/openshift/installer/pull/7205/, IPI install on Azure MAG with 4.14.0-0.nightly-2023-07-27-051258 is OK
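As a possible diagnostic (not part of the original report), the registration state of the resource provider named in the error can be checked with the Azure CLI after pointing it at the MAG cloud:

```bash
# Target Azure US Government, then inspect the provider the error complains about.
az cloud set --name AzureUSGovernment
az provider show --namespace Microsoft.CustomProviders --query registrationState -o tsv
# If it reports "NotRegistered" and the account has permission, it can be registered:
az provider register --namespace Microsoft.CustomProviders
```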
In many cases, the /dev/disk/by-path symlink is the only way to stably identify a disk without having prior knowledge of the hardware from some external source (e.g. a spreadsheet of disk serial numbers). It should be possible to specify this path in the root device hints.
Metal³ is planning to allow these paths in the `name` hint (see OCPBUGS-13080), and assisted's implementation of root device hints (which is used in ZTP and the agent-based installer) should be changed to match.
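A sketch of how such a hint could look in an agent-based installer agent-config.yaml once by-path values are accepted; the field name (`deviceName`) mirrors the existing root device hint schema and the example path is hypothetical:

```bash
# Hypothetical agent-config.yaml host fragment using a by-path identifier as the hint.
cat <<'EOF' > agent-config-host-fragment.yaml
hosts:
- hostname: master-0
  rootDeviceHints:
    deviceName: /dev/disk/by-path/pci-0000:00:1f.2-ata-1
EOF
```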
Description of problem:
console-operator may panic when IncludeNamesFilter receives an object from a shared informer event of type cache.DeletedFinalStateUnknown. Example job with panic: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-e2e-aws-sdn-serial/1687876857824808960 Specific log that shows the full stack trace: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-e2e-aws-sdn-serial/1687876857824808960/artifacts/e2e-aws-sdn-serial/gather-extra/artifacts/pods/openshift-console-operator_console-operator-748d7c6cdd-vwxmx_console-operator.log
Version-Release number of selected component (if applicable):
How reproducible:
Sporadically
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of the problem:
Assisted installer namespace `assisted-installer` is not compliant with the `ocp4-cis-configure-network-policies-namespaces` Compliance Operator scan.
How reproducible:
Every time
Steps to reproduce:
1. Install a cluster with Assisted Installer
2. Confirm the `assisted-installer` Namespace is present and not removed
3. Install the Red Hat Compliance Operator
4. Run a compliance scan using the `ocp4-cis` profile
Actual results:
Cluster fails the scan with the following warning
```
Ensure that application Namespaces have Network Policies defined high
fail
```
Expected results:
Cluster does not fail the scan
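The failing rule essentially checks that application namespaces define at least one NetworkPolicy. A minimal sketch of a policy that would satisfy the check for this namespace (whether a default-deny posture is actually appropriate for the assisted-installer pods needs separate review):

```bash
cat <<'EOF' | oc apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: assisted-installer
spec:
  podSelector: {}    # applies to all pods in the namespace
  policyTypes:
  - Ingress          # no ingress rules listed, so all ingress is denied
EOF
```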
Please review the following PR: https://github.com/openshift/ibm-powervs-block-csi-driver/pull/28
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Cluster Monitoring Operator (CMO) lacks golangci-lint checking and has several violations for linters. The ones we'd be specifically interested in are the staticcheck ones, as they are tied to deprecated libraries in Go.
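A minimal sketch of a .golangci.yaml that would enable the linter we care about here; the extra linters and the timeout are assumptions:

```bash
cat <<'EOF' > .golangci.yaml
run:
  timeout: 5m
linters:
  disable-all: true
  enable:
    - staticcheck   # flags deprecated libraries/APIs, the main concern here
    - govet
    - gofmt
EOF
golangci-lint run ./...
```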
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Links for both markdown documents in console-dynamic-plugin-sdk/docs are not working. Check https://github.com/openshift/console/blob/master/frontend/packages/console-dynamic-plugin-sdk/docs
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Clicking on a link in any markdown doc is not taking user to the appropriate section.
Expected results:
Clicking on a link in any markdown doc should take user to the appropriate section.
Additional info:
Description of problem:
We have observed a situation where:
- A workload mounting multiple EBS volumes gets stuck in a Terminating state when it finishes.
- The node that the workload ran on eventually gets stuck draining, because it gets stuck on unmounting one of the volumes from that workload, despite no containers from the workload now running on the node.
What we observe via the node logs is that the volume seems to unmount successfully. Then it attempts to unmount a second time, unsuccessfully. This unmount attempt then repeats and holds up the node. Specific examples from the node's logs to illustrate this will be included in a private comment.
Version-Release number of selected component (if applicable):
4.11.5
How reproducible:
Has occurred on four separate nodes on one specific cluster, but the mechanism to reproduce it is not known.
Steps to Reproduce:
1. 2. 3.
Actual results:
A volume gets stuck unmounting, holding up removal of the node and completed deletion of the pod.
Expected results:
The volume should not get stuck unmounting.
Additional info:
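A diagnostic sketch (the node name is a placeholder): the lingering attachment and the repeating unmount attempts described above can typically be observed with:

```bash
# VolumeAttachment objects still referencing the draining node
oc get volumeattachments -o wide | grep <node-name>
# kubelet logs on the node show the second, failing unmount attempt repeating
oc adm node-logs <node-name> -u kubelet | grep -i 'unmount'
```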
CI is flaky because the TestAWSELBConnectionIdleTimeout test fails. Example failures:
I have seen these failures in 4.14 and 4.13 CI jobs.
Presently, search.ci reports the following stats for the past 14 days:
Found in 1.24% of runs (3.52% of failures) across 404 total runs and 34 jobs (35.15% failed)
This includes two jobs:
1. Post a PR and have bad luck.
2. Check https://search.ci.openshift.org/?search=FAIL%3A+TestAll%2Fparallel%2FTestAWSELBConnectionIdleTimeout&maxAge=336h&context=1&type=all&name=cluster-ingress-operator&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job.
The test fails because it times out waiting for DNS to resolve:
=== RUN TestAll/parallel/TestAWSELBConnectionIdleTimeout operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup 
idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 
172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2650: lookup 
idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host operator_test.go:2656: failed to observe expected condition: timed out waiting for the condition panic.go:522: deleted ingresscontroller test-idle-timeout
The above output comes from build-log.txt from https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-ingress-operator/917/pull-ci-openshift-cluster-ingress-operator-release-4.13-e2e-aws-operator/1658840125502656512.
CI passes, or it fails on a different test.
Description of problem:
'hostedcluster.spec.configuration.ingress.loadBalancer.platform.aws.type' is ignored
Version-Release number of selected component (if applicable):
How reproducible:
set field to 'NLB'
Steps to Reproduce:
1. Set hostedcluster.spec.configuration.ingress.loadBalancer.platform.aws.type to 'NLB' (a patch sketch is shown below) 2. 3.
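A sketch of the patch implied by step 1, reconstructed from the field path above; the namespace and HostedCluster name are placeholders:

```bash
oc -n clusters patch hostedcluster <name> --type=merge \
  -p '{"spec":{"configuration":{"ingress":{"loadBalancer":{"platform":{"aws":{"type":"NLB"}}}}}}}'
```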
Actual results:
a classic load balancer is created
Expected results:
Should create a Network load balancer
Additional info:
Since the change we made in https://github.com/openshift/assisted-test-infra/pull/1989, whenever deploying assisted installer services using "make run" or "make deploy_assisted_service" we are deploying with only a single image - the default one (e.g. OPENSHIFT_VERSION=4.13).
Description of problem:
The EgressIP was NOT migrated to a correct worker after deleting the machine it was assigned to in a GCP XPN cluster.
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2023-03-29-235439
How reproducible:
Always
Steps to Reproduce:
1. Set up GCP XPN cluster. 2. Scale two new worker nodes % oc scale --replicas=2 machineset huirwang-0331a-m4mws-worker-c -n openshift-machine-api machineset.machine.openshift.io/huirwang-0331a-m4mws-worker-c scaled 3. Wait the two new workers node ready. % oc get machineset -n openshift-machine-api NAME DESIRED CURRENT READY AVAILABLE AGE huirwang-0331a-m4mws-worker-a 1 1 1 1 86m huirwang-0331a-m4mws-worker-b 1 1 1 1 86m huirwang-0331a-m4mws-worker-c 2 2 2 2 86m huirwang-0331a-m4mws-worker-f 0 0 86m % oc get nodes NAME STATUS ROLES AGE VERSION huirwang-0331a-m4mws-master-0.c.openshift-qe.internal Ready control-plane,master 82m v1.26.2+dc93b13 huirwang-0331a-m4mws-master-1.c.openshift-qe.internal Ready control-plane,master 82m v1.26.2+dc93b13 huirwang-0331a-m4mws-master-2.c.openshift-qe.internal Ready control-plane,master 82m v1.26.2+dc93b13 huirwang-0331a-m4mws-worker-a-hfqsn.c.openshift-qe.internal Ready worker 71m v1.26.2+dc93b13 huirwang-0331a-m4mws-worker-b-vbqf2.c.openshift-qe.internal Ready worker 71m v1.26.2+dc93b13 huirwang-0331a-m4mws-worker-c-rhbkr.c.openshift-qe.internal Ready worker 8m22s v1.26.2+dc93b13 huirwang-0331a-m4mws-worker-c-wnm4r.c.openshift-qe.internal Ready worker 8m22s v1.26.2+dc93b13 3. Label one new worker node as egress node % oc label node huirwang-0331a-m4mws-worker-c-rhbkr.c.openshift-qe.internal k8s.ovn.org/egress-assignable="" node/huirwang-0331a-m4mws-worker-c-rhbkr.c.openshift-qe.internal labeled 4. Create egressIP object oc get egressIP NAME EGRESSIPS ASSIGNED NODE ASSIGNED EGRESSIPS egressip-1 10.0.32.100 huirwang-0331a-m4mws-worker-c-rhbkr.c.openshift-qe.internal 10.0.32.100 5. Label second new worker node as egress node % oc label node huirwang-0331a-m4mws-worker-c-wnm4r.c.openshift-qe.internal k8s.ovn.org/egress-assignable="" node/huirwang-0331a-m4mws-worker-c-wnm4r.c.openshift-qe.internal labeled 6. 
Delete the assigned egress node % oc delete machines.machine.openshift.io huirwang-0331a-m4mws-worker-c-rhbkr -n openshift-machine-api machine.machine.openshift.io "huirwang-0331a-m4mws-worker-c-rhbkr" deleted % oc get nodes NAME STATUS ROLES AGE VERSION huirwang-0331a-m4mws-master-0.c.openshift-qe.internal Ready control-plane,master 87m v1.26.2+dc93b13 huirwang-0331a-m4mws-master-1.c.openshift-qe.internal Ready control-plane,master 86m v1.26.2+dc93b13 huirwang-0331a-m4mws-master-2.c.openshift-qe.internal Ready control-plane,master 87m v1.26.2+dc93b13 huirwang-0331a-m4mws-worker-a-hfqsn.c.openshift-qe.internal Ready worker 76m v1.26.2+dc93b13 huirwang-0331a-m4mws-worker-b-vbqf2.c.openshift-qe.internal Ready worker 76m v1.26.2+dc93b13 huirwang-0331a-m4mws-worker-c-wnm4r.c.openshift-qe.internal Ready worker 13m v1.26.2+dc93b13 29468 W0331 02:48:34.917391 1 egressip_healthcheck.go:162] Could not connect to huirwang-0331a-m4mws-worker-c-rhbkr.c.openshift-qe.internal (10.129.4.2:9107): context deadline exceeded 29469 W0331 02:48:34.917417 1 default_network_controller.go:903] Node: huirwang-0331a-m4mws-worker-c-rhbkr.c.openshift-qe.internal is not ready, deleting it from egre ss assignment 29470 I0331 02:48:34.917590 1 client.go:783] "msg"="transacting operations" "database"="OVN_Northbound" "operations"="[{Op:update Table:Logical_Switch_Port Row:map[o ptions:{GoMap:map[router-port:rtoe-GR_huirwang-0331a-m4mws-worker-c-rhbkr.c.openshift-qe.internal]}] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {6efd3c58-9458-44a2-a43b-e70e669efa72}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]" 29471 E0331 02:48:34.920766 1 egressip.go:993] Allocator error: EgressIP: egressip-1 assigned to node: huirwang-0331a-m4mws-worker-c-rhbkr.c.openshift-qe.internal whi ch is not reachable, will attempt rebalancing 29472 E0331 02:48:34.920789 1 egressip.go:997] Allocator error: EgressIP: egressip-1 assigned to node: huirwang-0331a-m4mws-worker-c-rhbkr.c.openshift-qe.internal whi ch is not ready, will attempt rebalancing 29473 I0331 02:48:34.920808 1 egressip.go:1212] Deleting pod egress IP status: {huirwang-0331a-m4mws-worker-c-rhbkr.c.openshift-qe.internal 10.0.32.100} for EgressIP: egressip-1
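For step 4 above, a minimal sketch of the EgressIP object used; the IP matches the output shown, while the namespaceSelector label is an assumption about how the test namespace is selected:

```bash
cat <<'EOF' | oc apply -f -
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: egressip-1
spec:
  egressIPs:
  - 10.0.32.100
  namespaceSelector:
    matchLabels:
      env: qe    # assumption: any label that selects the test namespace
EOF
```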
Actual results:
The egressIP was not migrated to a correct worker: oc get egressIP NAME EGRESSIPS ASSIGNED NODE ASSIGNED EGRESSIPS egressip-1 10.0.32.100 huirwang-0331a-m4mws-worker-c-rhbkr.c.openshift-qe.internal 10.0.32.100
Expected results:
The egressIP should be migrated to a correct worker away from the deleted node.
Additional info:
Description of problem:
In order to test proxy installations, the CI base image for OpenShift on OpenStack needs netcat.
Description of problem:
Installation failed when setting featureSet: LatencySensitive or featureSet: CustomNoUpgrade. When setting featureSet: CustomNoUpgrade in install-config and create cluster.See below error info: [core@bootstrap ~]$ journalctl -b -f -u release-image.service -u bootkube.service Apr 26 07:02:48 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[670367]: github.com/spf13/cobra@v1.6.0/command.go:968 Apr 26 07:02:48 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[670367]: k8s.io/component-base/cli.run(0xc00025c300) Apr 26 07:02:48 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[670367]: k8s.io/component-base@v0.26.1/cli/run.go:146 +0x317 Apr 26 07:02:48 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[670367]: k8s.io/component-base/cli.Run(0x2ce59e8?) Apr 26 07:02:48 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[670367]: k8s.io/component-base@v0.26.1/cli/run.go:46 +0x1d Apr 26 07:02:48 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[670367]: main.main() Apr 26 07:02:48 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[670367]: github.com/openshift/cluster-kube-controller-manager-operator/cmd/cluster-kube-controller-manager-operator/main.go:24 +0x2c Apr 26 07:02:48 bootstrap.wwei-426g.qe.devcluster.openshift.com systemd[1]: bootkube.service: Main process exited, code=exited, status=2/INVALIDARGUMENT Apr 26 07:02:48 bootstrap.wwei-426g.qe.devcluster.openshift.com systemd[1]: bootkube.service: Failed with result 'exit-code'. Apr 26 07:02:48 bootstrap.wwei-426g.qe.devcluster.openshift.com systemd[1]: bootkube.service: Consumed 1.935s CPU time. Apr 26 07:02:54 bootstrap.wwei-426g.qe.devcluster.openshift.com systemd[1]: bootkube.service: Scheduled restart job, restart counter is at 343. Apr 26 07:02:54 bootstrap.wwei-426g.qe.devcluster.openshift.com systemd[1]: Stopped Bootstrap a Kubernetes cluster. Apr 26 07:02:54 bootstrap.wwei-426g.qe.devcluster.openshift.com systemd[1]: bootkube.service: Consumed 1.935s CPU time. Apr 26 07:02:54 bootstrap.wwei-426g.qe.devcluster.openshift.com systemd[1]: Started Bootstrap a Kubernetes cluster. Apr 26 07:02:56 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[670489]: Rendering Kubernetes Controller Manager core manifests... Apr 26 07:02:56 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[672314]: panic: interface conversion: interface {} is nil, not []interface {} Apr 26 07:02:56 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[672314]: goroutine 1 [running]: Apr 26 07:02:56 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[672314]: github.com/openshift/cluster-kube-controller-manager-operator/pkg/operator/targetconfigcontroller.GetKubeControllerManagerArgs(0xc000746100?) 
Apr 26 07:02:56 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[672314]: github.com/openshift/cluster-kube-controller-manager-operator/pkg/operator/targetconfigcontroller/targetconfigcontroller.go:696 +0x379 Apr 26 07:02:56 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[672314]: github.com/openshift/cluster-kube-controller-manager-operator/pkg/cmd/render.(*renderOpts).Run(0xc0008d22c0) Apr 26 07:02:56 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[672314]: github.com/openshift/cluster-kube-controller-manager-operator/pkg/cmd/render/render.go:269 +0x85c Apr 26 07:02:56 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[672314]: github.com/openshift/cluster-kube-controller-manager-operator/pkg/cmd/render.NewRenderCommand.func1.1(0x0?) Apr 26 07:02:56 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[672314]: github.com/openshift/cluster-kube-controller-manager-operator/pkg/cmd/render/render.go:48 +0x32 Apr 26 07:02:56 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[672314]: github.com/openshift/cluster-kube-controller-manager-operator/pkg/cmd/render.NewRenderCommand.func1(0xc000bee600?, {0x285dffa?, 0x8?, 0x8?}) Apr 26 07:02:56 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[672314]: github.com/openshift/cluster-kube-controller-manager-operator/pkg/cmd/render/render.go:58 +0xc8 Apr 26 07:02:56 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[672314]: github.com/spf13/cobra.(*Command).execute(0xc000bee600, {0xc00071cb00, 0x8, 0x8}) Apr 26 07:02:56 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[672314]: github.com/spf13/cobra@v1.6.0/command.go:920 +0x847 Apr 26 07:02:56 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[672314]: github.com/spf13/cobra.(*Command).ExecuteC(0xc000bee000) Apr 26 07:02:56 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[672314]: github.com/spf13/cobra@v1.6.0/command.go:1040 +0x3bd Apr 26 07:02:56 bootstrap.wwei-426g.qe.devcluster.openshift.com bootkube.sh[672314]: github.com/spf13/cobra.(*Command).Execute(...) When setting featureSet: LatencySensitive in install-config and create cluster.See below error info: [core@bootstrap ~]$ journalctl -b -f -u release-image.service -u bootkube.service Apr 26 07:07:09 bootstrap.wwei-426h.qe.devcluster.openshift.com bootkube.sh[16835]: "cluster-infrastructure-02-config.yml": failed to create infrastructures.v1.config.openshift.io/cluster -n : the server could not find the requested resource Apr 26 07:07:09 bootstrap.wwei-426h.qe.devcluster.openshift.com bootkube.sh[16835]: Failed to create "cluster-infrastructure-02-config.yml" infrastructures.v1.config.openshift.io/cluster -n : the server could not find the requested resource Apr 26 07:07:09 bootstrap.wwei-426h.qe.devcluster.openshift.com bootkube.sh[16835]: [#1105] failed to create some manifests: Apr 26 07:07:09 bootstrap.wwei-426h.qe.devcluster.openshift.com bootkube.sh[16835]: "cluster-infrastructure-02-config.yml": failed to create infrastructures.v1.config.openshift.io/cluster -n : the server could not find the requested resource Apr 26 07:07:09 bootstrap.wwei-426h.qe.devcluster.openshift.com bootkube.sh[16835]: Failed to create "cluster-infrastructure-02-config.yml" infrastructures.v1.config.openshift.io/cluster -n : the server could not find the requested resource
Version-Release number of selected component (if applicable):
OCP version: 4.13.0-0.nightly-2023-04-21-084440
How reproducible:
always
Steps to Reproduce:
1. Create install-config.yaml like below (LatencySensitive):
apiVersion: v1
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  replicas: 3
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  replicas: 2
metadata:
  name: wwei-426h
platform:
  none: {}
pullSecret: xxxxx
featureSet: LatencySensitive
networking:
  clusterNetwork:
  - cidr: xxxxx
    hostPrefix: 23
  serviceNetwork:
  - xxxxx
  networkType: OpenShiftSDN
publish: External
baseDomain: xxxxxx
sshKey: xxxxxxx
2. Then continue to install the cluster: openshift-install create cluster --dir <install_folder> --log-level debug
3. Create install-config.yaml like below (CustomNoUpgrade); identical to the config above except for:
featureSet: CustomNoUpgrade
4. Then continue to install the cluster: openshift-install create cluster --dir <install_folder> --log-level debug
Actual results:
Installation failed.
Expected results:
Installation succeeded.
Additional info:
log-bundle can get from below link : https://drive.google.com/drive/folders/1kg1EeYR6ApWXbeRZTiM4DV205nwMfSQv?usp=sharing
Description of the problem:
Some validations are only related to agents that are bound to clusters. We had a case where an agent couldn't be bound due to failing validations, and the irrelevant validations added unnecessary noise. I attached the relevant agent CR to the ticket. You can see in the Conditions:
- lastTransitionTime: "2023-01-26T21:00:29Z"
  message: 'The agent''s validations are failing: Validation pending - no cluster,Host couldn''t synchronize with any NTP server,Missing inventory, or missing cluster'
  reason: ValidationsFailing
  status: "False"
  type: Validated
The only relevant validation is that there is no NTP server. "no cluster" and "Missing inventory, or missing cluster" are misleading.
How reproducible:
100%
Steps to reproduce:
1. Boot an unbound agent
2. Look at the CR
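A sketch for step 2, pulling only the Validated condition message from the Agent CR; the namespace and agent name are placeholders:

```bash
oc -n <agents-namespace> get agents.agent-install.openshift.io <agent-name> \
  -o jsonpath='{.status.conditions[?(@.type=="Validated")].message}{"\n"}'
```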
Actual results:
All validations are shown in the CR
Expected results:
Only relevant validations are shown in the CR
Description of problem:
Most recent nightly https://amd64.ocp.releases.ci.openshift.org/releasestream/4.14.0-0.nightly/release/4.14.0-0.nightly-2023-04-18-152947 has a lot of OAuth test failures Example runs: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-e2e-aws-ovn/1648348911074545664 https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-metal-ipi-sdn-bm/1648348885556400128 Error looks like: fail [github.com/openshift/origin/test/extended/oauth/expiration.go:105]: Unexpected error: <*tls.CertificateVerificationError | 0xc0023b6330>: { UnverifiedCertificates: [ {... Looking at changes in the last day or so, nothing sticks out to me. Although I believed ART bumped everything to be built with go1.20 and this error is new to go1.20: "For a handshake failure due to a certificate verification failure, the TLS client and server now return an error of the new type CertificateVerificationError, which includes the presented certificates." - https://go.dev/doc/go1.20
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-04-18-152947
How reproducible:
Looks repeatable
Steps to Reproduce:
1. Build oauth, origin, and related containers with go1.20 (not clear which is causing the test failure) 2. 3.
Actual results:
Tests fail
Expected results:
Additional info:
Description of problem:
https://github.com/openshift/hypershift/pull/2437 added the ability to override image registries with CR ImageDigestMirrorSet; however, ImageDigestMirrorSet is only valid for 4.13+.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Install HO on Mgmt Cluster 4.12
Steps to Reproduce:
1. 2. 3.
Actual results:
failed to populate image registry overrides: no matches for kind "ImageDigestMirrorSet" in version "config.openshift.io/v1"
Expected results:
No errors and HyperShift doesn't try to use ImageDigestMirrorSet prior to 4.13.
Additional info:
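An illustrative guard (not from the original report): before relying on ImageDigestMirrorSet, check whether the management cluster actually serves that API; on 4.12 only the older ImageContentSourcePolicy API is present:

```bash
# Returns nothing on 4.12, where the config.openshift.io ImageDigestMirrorSet API does not exist yet.
oc api-resources --api-group=config.openshift.io | grep -i imagedigestmirrorset
# The pre-4.13 equivalent that does exist on 4.12:
oc api-resources --api-group=operator.openshift.io | grep -i imagecontentsourcepolicies
```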
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The ACLs are disabled for all newly created s3 buckets, this causes all OCP installs to fail: the bootstrap ignition can not be uploaded: level=info msg=Creating infrastructure resources... level=error level=error msg=Error: error creating S3 bucket ACL for yunjiang-acl413-4dnhx-bootstrap: AccessControlListNotSupported: The bucket does not allow ACLs level=error msg= status code: 400, request id: HTB2HSH6XDG0Q3ZA, host id: V6CrEgbc6eyfJkUbLXLxuK4/0IC5hWCVKEc1RVonSbGpKAP1RWB8gcl5dfyKjbrLctVlY5MG2E4= level=error level=error msg= with aws_s3_bucket_acl.ignition, level=error msg= on main.tf line 62, in resource "aws_s3_bucket_acl" "ignition": level=error msg= 62: resource "aws_s3_bucket_acl" ignition { level=error level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failure applying terraform for "bootstrap" stage: failed to create cluster: failed to apply Terraform: exit status 1 level=error level=error msg=Error: error creating S3 bucket ACL for yunjiang-acl413-4dnhx-bootstrap: AccessControlListNotSupported: The bucket does not allow ACLs level=error msg= status code: 400, request id: HTB2HSH6XDG0Q3ZA, host id: V6CrEgbc6eyfJkUbLXLxuK4/0IC5hWCVKEc1RVonSbGpKAP1RWB8gcl5dfyKjbrLctVlY5MG2E4= level=error level=error msg= with aws_s3_bucket_acl.ignition, level=error msg= on main.tf line 62, in resource "aws_s3_bucket_acl" "ignition": level=error msg= 62: resource "aws_s3_bucket_acl" ignition {
Version-Release number of selected component (if applicable):
4.11+
How reproducible:
Always
Steps to Reproduce:
1.Create a cluster via IPI
Actual results:
install fail
Expected results:
install succeed
Additional info:
Heads-Up: Amazon S3 Security Changes Are Coming in April of 2023 - https://aws.amazon.com/blogs/aws/heads-up-amazon-s3-security-changes-are-coming-in-april-of-2023/ https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-ownership-error-responses.html - After you apply the bucket owner enforced setting for Object Ownership, ACLs are disabled.
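For context, the linked change means new buckets default to the 'bucket owner enforced' Object Ownership setting, which disables ACLs. A quick way to confirm what a freshly created bucket reports (the bucket name is a placeholder):

```bash
aws s3api get-bucket-ownership-controls --bucket <bootstrap-bucket-name>
# An ObjectOwnership of "BucketOwnerEnforced" means the ACL request terraform
# issues for the ignition object is rejected with AccessControlListNotSupported.
```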
As IBM, I would like to replace the --use-oci-feature flag with --include-oci-local-catalogs
--use-oci-feature implies to users that it is about using the OCI format for images rather than docker-v2, which is hard to understand and generates questions, bugs, and new misunderstood requests. For clarity, and before this feature goes GA, this flag will be replaced by --include-local-oci-catalog in 4.14. The --use-oci-feature flag will be marked deprecated in 4.13 and completely removed in 4.14.
As an oc-mirror user, I want a well-documented and intuitive process
so that I can effectively and efficiently deliver image artifacts in both connected and disconnected installs with no impact on my current workflow
Glossary:
References:
Acceptance criteria:
Description of problem:
When using the agent-based installer to provision OCP on baremetal, some of the machines fail to use the static nmconnection files and get IP addresses via DHCP instead. This may cause the network validation to fail.
Version-Release number of selected component (if applicable):
4.13.3
How reproducible:
100%
Steps to Reproduce:
1. Generate the agent ISO
2. Mount it to the BMC and reboot from the live CD
3. Use 'openshift-install agent wait-for' to monitor the progress
Actual results:
network validation fails due to an overlay IP address
Expected results:
validation succeeds
Additional info:
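A diagnostic sketch for an affected host booted from the agent ISO, comparing what NetworkManager actually activated against the generated static profiles (standard NetworkManager paths and commands):

```bash
# Profiles that were baked into the ISO vs. what NetworkManager activated
ls /etc/NetworkManager/system-connections/
nmcli connection show
nmcli device status
# Addresses actually configured; a DHCP lease here instead of the static IP
# indicates the static profile was not applied
ip -4 addr show
```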
Description of problem:
The dev console shows a list of samples. The user can create a sample based on a git repository. But some of these samples don't include a git repository reference and cannot be created.
Version-Release number of selected component (if applicable):
Tested different frontend versions against a 4.11 cluster and all (oldest tested frontend was 4.8) show the sample without git repository.
But the result also depends on the installed samples operator and installed ImageStreams.
How reproducible:
Always
Steps to Reproduce:
Actual results:
The git repository is not filled and the create button is disabled.
Expected results:
Samples without git repositories should not be displayed in the list.
Additional info:
The Git repository is saved as "sampleRepo" in the ImageStream tag section.
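A sketch of how to list which sample ImageStream tags carry a sampleRepo annotation; the `openshift` namespace is where the samples operator installs ImageStreams, and empty values correspond to samples that cannot be created from the form:

```bash
# Print each ImageStream with the sampleRepo values of its tags.
oc get imagestreams -n openshift \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.tags[*].annotations.sampleRepo}{"\n"}{end}'
```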
Description of problem:
Arm HCP's are currently broken. The following error message was observed in the ignition-server pod: {"level":"error","ts":"2023-06-29T13:38:19Z","msg":"Reconciler error","controller":"secret","controllerGroup":"","controllerKind":"Secret","secret":{"name":"token-brcox-hypershift-arm-us-east-1a-dbe0ce2a","namespace":"clusters-brcox-hypershift-arm"},"namespace":"clusters-brcox-hypershift-arm","name":"token-brcox-hypershift-arm-us-east-1a-dbe0ce2a","reconcileID":"ff813140-d10a-464e-a1b0-c05859b64ef9","error":"error getting ignition payload: failed to execute cluster-config-operator: cluster-config-operator process failed: /bin/bash: line 21: /payloads/get-payload1590526115/bin/cluster-config-operator: cannot execute binary file: Exec format error\n: exit status 126","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal...
Version-Release number of selected component (if applicable):
How reproducible:
Every time
Steps to Reproduce:
1. Create an Arm Mgmt Cluster 2. Create an Arm HCP
Actual results:
Error message in ignition-server pod and failure to generate appropriate payload.
Expected results:
ignition-server picks the appropriate arch based on the mgmt cluster.
Additional info:
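A sketch of how to confirm the architecture mismatch by inspecting the release image the ignition-server extracts binaries from; the pullspec is a placeholder:

```bash
# List per-architecture variants of the release image; on an Arm management
# cluster the ignition-server must pick the arm64 variant, not amd64.
oc image info <release-image-pullspec> --filter-by-os=linux/arm64
oc image info <release-image-pullspec> --filter-by-os=linux/amd64
```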
Testgrid for single-node-workers-upgrade-conformance shows that tests are failing due to the 'KubeMemoryOvercommit' alert.
We should avoid failing on this alert for single node environments assuming it's ok to overcommit memory on single node Openshift clusters.
Ref: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1687375398906129
Description of problem:
Fails to collect the VM serial log with 'openshift-install gather bootstrap'.
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2023-03-14-053612
How reproducible:
Always
Steps to Reproduce:
1. IPI install a private cluster. Once the bootstrap node boots up, before it is terminated,
2. ssh to the bastion, then try to get the bootstrap log: $ openshift-install gather bootstrap --key openshift-qe.pem --bootstrap 10.0.0.5 --master 10.0.0.7 --log-level debug
3.
Actual results:
Fail to get the vm serial logs, in the output: … DEBUG Gather remote logs DEBUG Collecting info from 10.0.0.6 DEBUG scp: ./installer-masters-gather.sh: Permission denied EBUG Warning: Permanently added '10.0.0.6' (ECDSA) to the list of known hosts.…DEBUG Waiting for logs ... DEBUG Log bundle written to /var/home/core/log-bundle-20230317033401.tar.gz WARNING Unable to stat /var/home/core/serial-log-bundle-20230317033401.tar.gz, skipping INFO Bootstrap gather logs captured here "/var/home/core/log-bundle-20230317033401.tar.gz"
Expected results:
The VM serial log is collected, and the output does not contain the above "WARNING Unable to stat…" message.
Additional info:
An IPI install run locally has the same issue: INFO Pulling VM console logs DEBUG attemping to download … INFO Failed to gather VM console logs: unable to download file: /root/temp/4.13.0-0.nightly-2023-03-14-053612/ipi/serial-log-bundle-20230317042338
We've had several forum cases and bugs already where a restart of the CEO fixed issues that could have been resolved automatically by a liveness probe.
We previously traced it down to stuck/deadlocked controllers, missing timeouts in grpc calls and other issues we haven't been able to find yet. Since the list of failures that can happen is pretty large, we should add a liveness probe to the CEO that will periodically health check:
This check should not indicate whether the etcd cluster itself is healthy, it's purely for the CEO itself.
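A minimal sketch of what such a probe could look like on the CEO container, assuming the operator exposes (or gains) a health endpoint; the path, port, and thresholds are all assumptions:

```bash
# Hypothetical livenessProbe fragment for the cluster-etcd-operator container spec.
cat <<'EOF' > ceo-livenessprobe-fragment.yaml
livenessProbe:
  httpGet:
    path: /healthz          # assumed endpoint that reflects only the operator's own health
    port: 8443
    scheme: HTTPS
  initialDelaySeconds: 45
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 6       # restart only after sustained unhealthiness
EOF
```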
Description of problem:
When creating a deployment with an image stream, the Save button on the Edit Deployment page is not enabled until the image stream tag is changed. On clicking the Reload button, the Save button is automatically enabled.
Version-Release number of selected component (if applicable):
4.13
How reproducible:
Always
Steps to Reproduce:
1. Search Deployment under resources 2. create deployment with Image stream 3. edit deployment
Actual results:
On the Edit Deployment page, the Save button stays disabled when values are changed.
Expected results:
On the Edit Deployment page, the Save button should be enabled when any value is changed.
Video Link - https://drive.google.com/file/d/1luqcjQS5Azc0XRjpMNfKKqbXYSc17Rxc/view?usp=share_link
Description of problem:
A cluster recently upgraded to OCP 4.12.19 is experiencing serious slowness on the Project > Project access page. The loading time of that page grows significantly faster than the number of entries and is very noticeable even at a relatively low number of entries.
Version-Release number of selected component (if applicable):
4.12.19
How reproducible:
Easily
Steps to Reproduce:
1. Create a namespace, and add RoleBindings for multiple users, for instance with: $ oc -n test-namespace create rolebinding test-load --clusterrole=view --user=user01 --user=user02 --user=... (a loop for creating many of them is sketched below)
2. In Developer view of that namespace, navigate to "Project" -> "Project access". The page will take a long time to load compared to the time an "oc get rolebinding" would take.
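The loop referenced in step 1, for generating enough RoleBindings to make the slowdown obvious (names and count are illustrative):

```bash
oc create namespace test-namespace
for i in $(seq -w 1 200); do
  oc -n test-namespace create rolebinding "test-load-$i" --clusterrole=view --user="user$i"
done
```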
Actual results:
0 RB => instantaneous loading
40 RB => about 10 seconds until the page loaded
100 RB => one try took 50 seconds, another 110 seconds
200 RB => nothing for 8 minutes, after which my web browser (Firefox) proposed to stop the page since it slowed the browser down, and after 10 minutes I stopped the attempt without ever seeing the page load.
Expected results:
Page should load almost instantly with only a few hundred role bindings
Run the isVSphereDiskUUIDEnabled validation also on baremetal platform installations.
From the description of https://issues.redhat.com/browse/OCPBUGS-16955:
Storage team has observed that if disk.EnableUUID flag is not enabled on vSphere VMs in any platform, including baremetal, then no symlinks are generated in /dev/disk/by-id for attached disks.
Installing ODF via LSO (or similar) on such a platform results in a somewhat fragile installation: disks could be renamed on reboot, and since no permanent IDs exist for the disks, the PVs could become invalid.
We should update baremetal installs - https://docs.openshift.com/container-platform/4.13/installing/installing_bare_metal/installing-bare-metal.html to always enable disk.EnableUUID in both IPI and UPI installs.
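For reference, the flag can be inspected and set per VM with govc, mirroring what the vSphere platform docs already require; the VM inventory path is a placeholder, and the change typically takes effect at the next power-on:

```bash
# Show ExtraConfig and look for disk.enableUUID (absent or FALSE means no stable /dev/disk/by-id symlinks).
govc vm.info -e <vm-inventory-path> | grep -i disk.enableUUID
# Enable it.
govc vm.change -vm <vm-inventory-path> -e disk.enableUUID=TRUE
```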
Description of problem:
After enabling realtime and high power consumption under workload hints in the performance profile, the test is failing since it cannot find the stalld pid: msg: "failed to run command [pidof stalld]: output \"\"; error \"\"; command terminated with exit code 1",
Version-Release number of selected component (if applicable):
Openshift 4.14, 4.13
How reproducible:
Often (Flaky test)
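For context, the hints mentioned above correspond to this PerformanceProfile fragment; field names follow the performance.openshift.io/v2 schema, and enabling realTime is what brings stalld into the picture:

```bash
cat <<'EOF' > performanceprofile-workloadhints-fragment.yaml
spec:
  workloadHints:
    realTime: true              # real-time tuning; stalld is expected to run with this enabled
    highPowerConsumption: true  # trade power savings for lowest latency
EOF
```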
Description of problem:
The environment variable OPENSHIFT_IMG_OVERRIDES is not retaining the order of mirrors listed under a source compared to the original mirror/source listing in the ICSP/IDMSs.
Version-Release number of selected component (if applicable):
How reproducible:
Every time
Steps to Reproduce:
1. Setup a mgmt cluster with either an ICSP like:
apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: image-policy-39
spec:
  repositoryDigestMirrors:
  - mirrors:
    - quay.io/openshift-release-dev/ocp-release
    - pull.q1w2.quay.rhcloud.com/openshift-release-dev/ocp-release
    source: quay.io/openshift-release-dev/ocp-release
2. Create a Hosted Cluster
Actual results:
Nodes cannot join the cluster because ignition cannot be generated
Expected results:
Nodes can join the cluster
Additional info:
Issue is most likely coming from here - https://github.com/openshift/hypershift/blob/dce6f51355317173be6bc525edfe059572c07690/support/util/util.go#L224
Description of problem:
Tested on GCP: there are 4 failureDomains (a, b, c, f) in the CPMS. After removing a, a new master is created in f. If a is then re-added to the CPMS, the instance is moved back from f to a.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
Before update cpms. failureDomains: gcp: - zone: us-central1-a - zone: us-central1-b - zone: us-central1-c - zone: us-central1-f $ oc get machine NAME PHASE TYPE REGION ZONE AGE zhsungcp22-4glmq-master-2 Running n2-standard-4 us-central1 us-central1-c 3h4m zhsungcp22-4glmq-master-hzsf2-0 Running n2-standard-4 us-central1 us-central1-b 90m zhsungcp22-4glmq-master-plch8-1 Running n2-standard-4 us-central1 us-central1-a 11m zhsungcp22-4glmq-worker-a-cxf5w Running n2-standard-4 us-central1 us-central1-a 3h zhsungcp22-4glmq-worker-b-d5vzm Running n2-standard-4 us-central1 us-central1-b 3h zhsungcp22-4glmq-worker-c-4d897 Running n2-standard-4 us-central1 us-central1-c 3h 1. Delete failureDomain "zone: us-central1-a" in cpms, new machine Running in zone f. failureDomains: gcp: - zone: us-central1-b - zone: us-central1-c - zone: us-central1-f $ oc get machine NAME PHASE TYPE REGION ZONE AGE zhsungcp22-4glmq-master-2 Running n2-standard-4 us-central1 us-central1-c 3h19m zhsungcp22-4glmq-master-b7pdl-1 Running n2-standard-4 us-central1 us-central1-f 13m zhsungcp22-4glmq-master-hzsf2-0 Running n2-standard-4 us-central1 us-central1-b 106m zhsungcp22-4glmq-worker-a-cxf5w Running n2-standard-4 us-central1 us-central1-a 3h16m zhsungcp22-4glmq-worker-b-d5vzm Running n2-standard-4 us-central1 us-central1-b 3h16m zhsungcp22-4glmq-worker-c-4d897 Running n2-standard-4 us-central1 us-central1-c 3h16m 2. Add failureDomain "zone: us-central1-a" again, new machine running in zone a, the machine in zone f will be deleted. failureDomains: gcp: - zone: us-central1-a - zone: us-central1-f - zone: us-central1-c - zone: us-central1-b $ oc get machine NAME PHASE TYPE REGION ZONE AGE zhsungcp22-4glmq-master-2 Running n2-standard-4 us-central1 us-central1-c 3h35m zhsungcp22-4glmq-master-5kltp-1 Running n2-standard-4 us-central1 us-central1-a 12m zhsungcp22-4glmq-master-hzsf2-0 Running n2-standard-4 us-central1 us-central1-b 121m zhsungcp22-4glmq-worker-a-cxf5w Running n2-standard-4 us-central1 us-central1-a 3h32m zhsungcp22-4glmq-worker-b-d5vzm Running n2-standard-4 us-central1 us-central1-b 3h32m zhsungcp22-4glmq-worker-c-4d897 Running n2-standard-4 us-central1 us-central1-c 3h32m
Actual results:
Instance is moved back from f to a
Expected results:
Instance shouldn't be moved back from f to a
Additional info:
https://issues.redhat.com//browse/OCPBUGS-7366
Description of the problem:
In staging (UI 2.20.6, BE 2.20.1), ODF cannot be enabled; the operation fails with "Failed to update the cluster", although according to the support-level API it should be supported.
How reproducible:
100%
Steps to reproduce:
1. Create a new OCP 4.13 cluster with P/Z CPU architecture
2. Try to enable ODF
Actual results:
Expected results:
Description of problem:
API fields that are defaulted by a controller should document what their default is for each release version. Currently the field documents that it is "if empty, subject to platform chosen default", but it does not state what that default is. To fix this, please add, after the platform-chosen-default prose: // The current default is XYZ. This will allow users to track the platform defaults over time from the API documentation. I would like to see this fixed before 4.13 and 4.14 are released; it should be quick to fix once we understand what those defaults are.
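For illustration, a minimal sketch of the requested doc convention on a hypothetical API field (the field name and default value are placeholders, not a real OpenShift API):

package v1

// ExampleSpec is a hypothetical API type used only to illustrate the convention:
// the field keeps the "platform chosen default" prose and also states the concrete
// current default, so the default can be tracked per release from the API docs.
type ExampleSpec struct {
    // profile selects the TLS profile used by the component.
    // If empty, it is subject to the platform chosen default.
    // The current default is "Intermediate".
    // +optional
    Profile string `json:"profile,omitempty"`
}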
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
When the ODF StorageSystem CR is created through the wizard, the LocalVolumeDiscovery does not discover/show devices of mpath type.
Version-Release number of selected component (if applicable):
OCP 4.11.31
How reproducible:
All the time
Steps to Reproduce:
1. Get OCP 4.11 running with the LSO and ODF operators
2. Configure and present mpath devices to nodes used for ODF
3. Use the ODF wizard to create a StorageSystem object
4. Inspect the LocalVolumeDiscovery results
Actual results:
There are no devices of mpath type shown by the ODF wizard / LocalVolumeDiscovery CR
Expected results:
LocalVolumeDiscovery should discover mpath device type
Additional info:
LocalVolumeSet already works with mpath if you manually define them in .spec or LocalVolume pointing to mpath devicePaths
Description of problem:
MCO depends on the image registry; if the image registry is not installed, installation fails because MCO goes Degraded.
Version-Release number of selected component (if applicable):
payload image built from https://github.com/openshift/installer/pull/7421
How reproducible:
always
Steps to Reproduce:
1.Set "baselineCapabilitySet: None" when install a cluster, all the optional operators will not be installed. 2. 3.
Actual results:
09-01 15:50:34.770 level=error msg=Cluster operator machine-config Degraded is True with RenderConfigFailed: Failed to resync 4.14.0-0.ci.test-2023-08-31-033001-ci-ln-7xhl7yt-latest because: clusteroperators.config.openshift.io "image-registry" not found
09-01 15:50:34.770 level=error msg=Cluster operator machine-config Available is False with RenderConfigFailed: Cluster not available for [{operator 4.14.0-0.ci.test-2023-08-31-033001-ci-ln-7xhl7yt-latest}]: clusteroperators.config.openshift.io "image-registry" not found
09-01 15:50:34.770 level=info msg=Cluster operator network ManagementStateDegraded is False with :
09-01 15:50:34.770 level=error msg=Cluster initialization failed because one or more operators are not functioning properly.
Expected results:
MCO should not be degraded if image registry is not installed
Additional info:
must-gather log https://drive.google.com/file/d/1E3FbPcVwZxBi33tHq7pyaHc8EM3eiTUa/view?usp=drive_link
Description of problem:
I am trying to build the operator image locally and it fails because the registry `registry.ci.openshift.org/ocp/` requires authorization.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. git clone git@github.com:openshift/cluster-ingress-operator.git
2. export REPO=<path to a repository to upload the image>
3. run `make release-local`
Actual results:
[skip several lines]
Step 1/10 : FROM registry.ci.openshift.org/ocp/builder:rhel-8-golang-1.19-openshift-4.12 AS builder
unauthorized: authentication required
Expected results:
The image is pulled and the build succeeds.
Additional info:
There are two images that are not available:
- registry.ci.openshift.org/ocp/builder:rhel-8-golang-1.19-openshift-4.12
- registry.ci.openshift.org/ocp/4.12:base
I was able to fix this by changing the images to:
- registry.ci.openshift.org/openshift/release:golang-1.19
- registry.ci.openshift.org/origin/4.12:base
see https://github.com/dudinea/cluster-ingress-operator/tree/fix-build-images-not-public
I am not sure whether what I did is OK, but I suppose that this project, being part of OKD, should be easily buildable by the public, or at least the issue should be documented somewhere. I wanted to post this to the OKD project, but I am unable to select it in Jira.
Description of problem:
Machine-config operator is not compliant with the CIS benchmark rule "Ensure Usage of Unique Service Accounts" [1], which is part of the "ocp4-cis" profile used by the Compliance Operator [2]. The machine-config operator uses the default service account; the default SA comes into play when no other service account is specified. OpenShift core operators should be compliant with the CIS benchmark, i.e. the operators should run with their own ServiceAccount rather than the "default" one.
[1] https://static.open-scap.org/ssg-guides/ssg-ocp4-guide-cis.html#xccdf_org.ssgproject.content_group_accounts
[2] https://docs.openshift.com/container-platform/4.11/security/compliance_operator/compliance-operator-supported-profiles.html
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Core operators are using default service account
Expected results:
Core operators should run with their own service account
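For illustration, a minimal sketch (assumed names, not the operator's actual manifest) of a pod spec that sets its own ServiceAccount explicitly rather than silently inheriting "default", using the k8s.io/api types:

package main

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
)

func main() {
    // Hypothetical pod spec fragment: naming the ServiceAccount explicitly so the
    // pod never falls back to the namespace's "default" service account.
    spec := corev1.PodSpec{
        ServiceAccountName: "machine-config-operator", // assumption: a dedicated SA created for the operator
        Containers: []corev1.Container{{
            Name:  "machine-config-operator",
            Image: "example.invalid/machine-config-operator:latest", // placeholder image
        }},
    }
    fmt.Println(spec.ServiceAccountName)
}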
Additional info:
Kubernetes 1.27 removes the long-deprecated --container-runtime kubelet flag, see https://github.com/kubernetes/kubernetes/pull/114017
To ensure the upgrade path from 4.13 to 4.14 isn't affected, we need to backport the changes to both 4.14 and 4.13.
Description of problem:
The 'Create' button on the image pull secret creation form cannot be re-enabled once it has been disabled.
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2023-02-17-090603
How reproducible:
Always
Steps to Reproduce:
1. User logs in to the console.
2. Go to Secrets -> Create Image pull secret. On the page, set:
   - Secret name: test-secret
   - Authentication type: Upload configuration file, and upload a file that is not valid JSON. The console shows the warning 'Configuration file should be in JSON format.' and the 'Create' button is disabled.
3. Change Authentication type to 'Image registry credentials' and fill in every required field: Registry server address, Username and Password. The 'Create' button is still disabled.
Actual results:
3. 'Create' button is still disabled, user has to cancel and fill the form again
Expected results:
3. The 'Create' button should be re-enabled, since the form is now being filled in a different way with all required fields correctly configured.
Additional info:
Description of problem:
Hide the Duplicate Pipelines Card in the DevConsole Add Page
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Visit +Add Page of Dev Perspective
Actual results:
Duplicate Entry
Expected results:
No duplicates
Additional info:
Description of problem:
The control-plane-operator pod gets stuck deleting an awsendpointservice if its hostedzone is already gone:
Logs:
{"level":"error","ts":"2023-07-13T03:06:58Z","msg":"Reconciler error","controller":"awsendpointservice","controllerGroup":"hypershift.openshift.io","controllerKind":"AWSEndpointService","aWSEndpointService":{"name":"private-router","namespace":"ocm-staging-24u87gg3qromrf8mg2r2531m41m0c1ji-diegohcp-west2"},"namespace":"ocm-staging-24u87gg3qromrf8mg2r2531m41m0c1ji-diegohcp-west2","name":"private-router","reconcileID":"59eea7b7-1649-4101-8686-78113f27567d","error":"failed to delete resource: NoSuchHostedZone: No hosted zone found with ID: Z05483711XJV23K8E97HK\n\tstatus code: 404, request id: f8686dd6-a906-4a5e-ba4a-3dd52ad50ec3","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234"}
Version-Release number of selected component (if applicable):
4.12.24
How reproducible:
Have not tried to reproduce yet, but should be fairly reproducible
Steps to Reproduce:
1. Install a PublicAndPrivate or Private HCP
2. Delete the Route53 Hosted Zone defined in its awsendpointservice's .status.dnsZoneID field
3. Start an uninstall
4. Observe the control-plane-operator looping on the above logs and the uninstall hanging
Actual results:
Uninstall hangs due to CPO being unable to delete the awsendpointservice
Expected results:
The awsendpointservice cleans up; if the hosted zone is already gone, the CPO shouldn't care that it can't list hosted zones.
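A minimal sketch of the tolerant deletion behavior described above, assuming the aws-sdk-go v1 error types; the helper is hypothetical, not the actual control-plane-operator code:

package main

import (
    "fmt"

    "github.com/aws/aws-sdk-go/aws/awserr"
    "github.com/aws/aws-sdk-go/service/route53"
)

// ignoreMissingHostedZone treats "hosted zone already gone" as success so that
// awsendpointservice finalization can proceed instead of looping forever.
func ignoreMissingHostedZone(err error) error {
    if err == nil {
        return nil
    }
    if aerr, ok := err.(awserr.Error); ok && aerr.Code() == route53.ErrCodeNoSuchHostedZone {
        // Nothing left to clean up in this zone; do not block deletion on it.
        return nil
    }
    return err
}

func main() {
    wrapped := awserr.New(route53.ErrCodeNoSuchHostedZone, "No hosted zone found with ID: Z05483711XJV23K8E97HK", nil)
    fmt.Println(ignoreMissingHostedZone(wrapped)) // prints <nil>
}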
Additional info:
Description of problem:
CredentialsRequest for Azure AD Workload Identity contains unnecessary network permissions:
- Microsoft.Network/applicationSecurityGroups/delete
- Microsoft.Network/applicationSecurityGroups/write
- Microsoft.Network/loadBalancers/delete
- Microsoft.Network/networkSecurityGroups/delete
- Microsoft.Network/routeTables/delete
- Microsoft.Network/routeTables/write
- Microsoft.Network/virtualNetworks/subnets/delete
- Microsoft.Network/virtualNetworks/subnets/write
- Microsoft.Network/virtualNetworks/write
- Microsoft.Resources/subscriptions/resourceGroups/delete
- Microsoft.Resources/subscriptions/resourceGroups/write
Version-Release number of selected component (if applicable):
4.14.0
How reproducible:
N/A
Steps to Reproduce:
1. Remove above permissions from the Azure Credentials request and validate that MAO continues to function in Azure AD Workload Identity cluster.
Actual results:
Unnecessary network write permissions enumerated in CredentialsRequest.
Expected results:
Only necessary permissions enumerated in CredentialsRequest.
Additional info:
Additional unnecessary permissions will be hard to pinpoint, but these specific permissions were questioned by MSFT and are likely only needed by the installer, as shown by the CORS-1870 investigation.
Description of problem:
The oc client has recently had functionality added to reference an icsp manifest with a variety of commands (using the --icsp flag).
The issue is that the registry/repo scope required in an ICSP to trigger a mapping is different between OCP and oc. An OCP ICSP will match an image at the registry level, whereas the oc client requires the exact registry + repo to match. This difference can cause major confusion (especially without adequate warning/error messages in the oc client).
Example image to mirror: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1631b0f0bf9c6dc4f9519ceb06b6ec9277f53f4599853fcfad3b3a47d2afd404
In OCP, registry.mirrorregistry.com:5000/openshift-release-dev/ will accurately mirror the image.
But using oc with --icsp, quay.io/openshift-release-dev/ocp-v4.0-art-dev is required or the mirroring will not match.
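To make the scope difference concrete, a small sketch with hypothetical helpers (not the actual oc implementation) contrasting registry-level prefix matching, as the cluster applies it, with the exact-repository match the oc client requires:

package main

import (
    "fmt"
    "strings"
)

// matchesRegistryScope reports whether an image falls under an ICSP source when the
// source is treated as a registry/repo prefix (the cluster's behavior).
func matchesRegistryScope(image, source string) bool {
    return image == source || strings.HasPrefix(image, source+"/") || strings.HasPrefix(image, source+"@")
}

// matchesExactRepo reports whether the image's repository equals the source exactly
// (the stricter behavior the oc client exhibits with --icsp-file).
func matchesExactRepo(image, source string) bool {
    repo := image
    if i := strings.IndexAny(image, "@"); i >= 0 {
        repo = image[:i] // strip the digest for comparison
    }
    return repo == source
}

func main() {
    image := "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1631b0f0bf9c6dc4f9519ceb06b6ec9277f53f4599853fcfad3b3a47d2afd404"
    fmt.Println(matchesRegistryScope(image, "quay.io/openshift-release-dev"))             // true
    fmt.Println(matchesExactRepo(image, "quay.io/openshift-release-dev"))                  // false
    fmt.Println(matchesExactRepo(image, "quay.io/openshift-release-dev/ocp-v4.0-art-dev")) // true
}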
Version-Release number of selected component (if applicable):
oc version
Client Version: 4.11.0-202212070335.p0.g1928ac4.assembly.stream-1928ac4
Kustomize Version: v4.5.4
Server Version: 4.12.0-rc.8
Kubernetes Version: v1.25.4+77bec7a
How reproducible:
100%
Steps to Reproduce:
1. Create an ICSP file with content similar to below (Replace with your mirror registry url)
apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  creationTimestamp: null
  name: image-policy
spec:
  repositoryDigestMirrors:
  - mirrors:
    - registry.mirrorregistry.com:5005/openshift-release-dev
    source: quay.io/openshift-release-dev
2. Add the ICSP to a bare-metal OpenShift cluster and wait for the MCP to finish node restarts
3. SSH to a cluster node
4. Try to podman pull the following image with debug log level
podman pull --log-level=debug quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1631b0f0bf9c6dc4f9519ceb06b6ec9277f53f4599853fcfad3b3a47d2afd404
5. The log will show that the mirror registry is attempted (which is the same behavior as OCP)
6. Now try to extract the payload image from the release using the oc client and the --icsp flag (the ICSP file should be the same manifest used at step 1)
oc adm release extract --command=openshift-baremetal-install --to=/data/install-config-generate/installercache/registry.mirrorregistry.com:5005/openshift-release-dev/ocp-release:4.12.0-rc.8-x86_64 --insecure=false --icsp-file=/tmp/icsp-file1635083302 registry.mirrorregistry.com:5005/openshift-release-dev/ocp-release:4.12.0-rc.8-x86_64 --registry-config=/tmp/registry-config1265925963
Expected results:
openshift-baremetal-install is extracted to the proper directory using the mirrored payload image
Actual result:
oc client does not match the payload image because the icsp is not exact, so it immediately tries quay.io rather than the mirror registry
ited with non-zero exit code 1: \nwarning: --icsp-file only applies to images referenced by digest and will be ignored for tags\nerror: unable to read image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1631b0f0bf9c6dc4f9519ceb06b6ec9277f53f4599853fcfad3b3a47d2afd404: Get \"https://quay.io/v2/\": dial tcp 52.203.129.140:443: i/o timeout\n" func=github.com/openshift/assisted-service/internal/oc.execute file="/remote-source/assisted-service/app/internal/oc/release.go:404" go-id=26228 request_id=
Additional info:
I understand that oc-mirror or oc adm release mirror provides an ICSP manifest to use, but as OCP itself allows a wider scope for mapping, it can cause great confusion that the oc ICSP scope is not in parity. At the very least, a warning/error message in the oc client when the ICSP partially matches an image (but is not used) would be VERY useful.
For reasons I still struggle to understand, in trying to mitigate issues stemming from the PSA changes to Kubernetes, we decided on a convoluted architecture where one reconciler owned by one team (cluster-policy-controller) ignores openshift-* namespaces unless they have a specific label and are not part of the payload, while a reconciler on our team labels non-payload openshift-* namespaces appropriately so that the first one will do its security magic and keep workloads stable during this transition. This scheme led to a dependency between OLM and CPC so that we can share the list of payload openshift-* namespaces.
This also means that we need to update the dependency at each release to keep parity between the OCP version of the dependency and OLM.
We need to update the CPC dependency, as the pipeline is blocked until we do (to avoid letting an old version of the dependency, perhaps with a different list of payload openshift-* namespaces, break customer clusters or impact their experience).
Note: this is currently blocking ART compliance PRs. We need to get this in ASAP.
1. Proposed title of this feature request
Allow the Ingress log length to be modified when using a sidecar
2. What is the nature and description of the request?
In the past we had RFE-1794, where an option was created to specify the length of the HAProxy log; however, this option was only available when redirecting the log to an external syslog. We need this option to be available when using a sidecar to collect the logs.
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: default
  namespace: openshift-ingress-operator
spec:
  replicas: 2
  logging:
    access:
      destination:
        type: Container
        container: {}
Unlike the Syslog type, the Container type does not have any sub-parameter that makes it possible to configure the log length.
As we can see in RFE-1794, the option to change the log length already exists in the HAProxy configuration, but when using the sidecar, only the default value (1024) is used.
3. Why does the customer need this? (List the business requirements here)
The default log length of HAProxy is 1024. When clients communicate with the application using long URI arguments, the full access log and the parameter info cannot be captured. An option to set 8192 or higher is required.
4. List any affected packages or components.
Description of problem:
The Multus mac-vlan/ipvlan/vlan CNI plugins panic when the master interface in the container is missing
Version-Release number of selected component (if applicable):
metallb-operator.v4.13.0-202304190216 MetalLB Operator 4.13.0-202304190216 Succeeded
How reproducible:
Create pod with multiple vlan interfaces connected to missing master interface.
Steps to Reproduce:
1. Create pod with multiple vlan interfaces connected to missing master interface in container 2. Make sure that pod stuck in ContainerCreating state 3. Run oc describe pod PODNAME and read crash message: Normal Scheduled 22s default-scheduler Successfully assigned cni-tests/pod-one to worker-0 Normal AddedInterface 21s multus Add eth0 [10.128.2.231/23] from ovn-kubernetes Normal AddedInterface 21s multus Add ext0 [] from cni-tests/tap-one Normal AddedInterface 21s multus Add ext0.1 [2001:100::1/64] from cni-tests/mac-vlan-one Warning FailedCreatePodSandBox 18s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_pod-one_cni-tests_2e831519-effc-4502-8ea7-749eda95bf1d_0(321d7181626b8bbfad062dd7c7cc2ef096f8547e93cb7481a18b7d3613eabffd): error adding pod cni-tests_pod-one to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [cni-tests/pod-one/2e831519-effc-4502-8ea7-749eda95bf1d:mac-vlan]: error adding container to network "mac-vlan": plugin type="macvlan" failed (add): netplugin failed: "panic: runtime error: invalid memory address or nil pointer dereference\n[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x54281a]\n\ngoroutine 1 [running, locked to thread]:\npanic({0x560b00, 0x6979d0})\n\t/usr/lib/golang/src/runtime/panic.go:987 +0x3ba fp=0xc0001ad8f0 sp=0xc0001ad830 pc=0x433d7a\nruntime.panicmem(...)\n\t/usr/lib/golang/src/runtime/panic.go:260\nruntime.sigpanic()\n\t/usr/lib/golang/src/runtime/signal_unix.go:835 +0x2f6 fp=0xc0001ad940 sp=0xc0001ad8f0 pc=0x449cd6\nmain.getMTUByName({0xc00001a978, 0x4}, {0xc00002004a, 0x33}, 0x1)\n\t/usr/src/plugins/plugins/main/macvlan/macvlan.go:167 +0x33a fp=0xc0001ada00 sp=0xc0001ad940 pc=0x54281a\nmain.loadConf(0xc000186770, {0xc00001e009, 0x19e})\n\t/usr/src/plugins/plugins/main/macvlan/macvlan.go:120 +0x155 fp=0xc0001ada80 sp=0xc0001ada00 pc=0x5422d5\nmain.cmdAdd(0xc000186770)\n\t/usr/src/plugins/plugins/main/macvlan/macvlan.go:287 +0x47 fp=0xc0001adcd0 sp=0xc0001ada80 pc=0x543b07\ngithub.com/containernetworking/cni/pkg/skel.(*dispatcher).checkVersionAndCall(0xc0000bdec8, 0xc000186770, {0x5c02b8, 0xc0000e4e40}, 0x592e80)\n\t/usr/src/plugins/vendor/github.com/containernetworking/cni/pkg/skel/skel.go:166 +0x20a fp=0xc0001add60 sp=0xc0001adcd0 pc=0x5371ca\ngithub.com/containernetworking/cni/pkg/skel.(*dispatcher).pluginMain(0xc0000bdec8, 0x698320?, 0xc0000bdeb0?, 0x44ed89?, {0x5c02b8, 0xc0000e4e40}, {0xc0000000f0, 0x22})\n\t/usr/src/plugins/vendor/github.com/containernetworking/cni/pkg/skel/skel.go:219 +0x2ca fp=0xc0001ade68 sp=0xc0001add60 pc=0x53772a\ngithub.com/containernetworking/cni/pkg/skel.PluginMainWithError(...)\n\t/usr/src/plugins/vendor/github.com/containernetworking/cni/pkg/skel/skel.go:273\ngithub.com/containernetworking/cni/pkg/skel.PluginMain(0x588e01?, 0x10?, 0xc0000bdf50?, {0x5c02b8?, 0xc0000e4e40?}, {0xc0000000f0?, 0x0?})\n\t/usr/src/plugins/vendor/github.com/containernetworking/cni/pkg/skel/skel.go:288 +0xd1 fp=0xc0001adf18 sp=0xc0001ade68 pc=0x537d51\nmain.main()\n\t/usr/src/plugins/plugins/main/macvlan/macvlan.go:432 +0xb6 fp=0xc0001adf80 sp=0xc0001adf18 pc=0x544b76\nruntime.main()\n\t/usr/lib/golang/src/runtime/proc.go:250 +0x212 fp=0xc0001adfe0 sp=0xc0001adf80 pc=0x436a12\nruntime.goexit()\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1594 +0x1 fp=0xc0001adfe8 sp=0xc0001adfe0 pc=0x462fc1\n\ngoroutine 2 [force gc (idle)]:\nruntime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)\n\t/usr/lib/golang/src/runtime/proc.go:363 +0xd6 
fp=0xc0000acfb0 sp=0xc0000acf90 pc=0x436dd6\nruntime.goparkunlock(...)\n\t/usr/lib/golang/src/runtime/proc.go:369\nruntime.forcegchelper()\n\t/usr/lib/golang/src/runtime/proc.go:302 +0xad fp=0xc0000acfe0 sp=0xc0000acfb0 pc=0x436c6d\nruntime.goexit()\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1594 +0x1 fp=0xc0000acfe8 sp=0xc0000acfe0 pc=0x462fc1\ncreated by runtime.init.6\n\t/usr/lib/golang/src/runtime/proc.go:290 +0x25\n\ngoroutine 3 [GC sweep wait]:\nruntime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)\n\t/usr/lib/golang/src/runtime/proc.go:363 +0xd6 fp=0xc0000ad790 sp=0xc0000ad770 pc=0x436dd6\nruntime.goparkunlock(...)\n\t/usr/lib/golang/src/runtime/proc.go:369\nruntime.bgsweep(0x0?)\n\t/usr/lib/golang/src/runtime/mgcsweep.go:278 +0x8e fp=0xc0000ad7c8 sp=0xc0000ad790 pc=0x423e4e\nruntime.gcenable.func1()\n\t/usr/lib/golang/src/runtime/mgc.go:178 +0x26 fp=0xc0000ad7e0 sp=0xc0000ad7c8 pc=0x418d06\nruntime.goexit()\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1594 +0x1 fp=0xc0000ad7e8 sp=0xc0000ad7e0 pc=0x462fc1\ncreated by runtime.gcenable\n\t/usr/lib/golang/src/runtime/mgc.go:178 +0x6b\n\ngoroutine 4 [GC scavenge wait]:\nruntime.gopark(0xc0000ca000?, 0x5bf2b8?, 0x1?, 0x0?, 0x0?)\n\t/usr/lib/golang/src/runtime/proc.go:363 +0xd6 fp=0xc0000adf70 sp=0xc0000adf50 pc=0x436dd6\nruntime.goparkunlock(...)\n\t/usr/lib/golang/src/runtime/proc.go:369\nruntime.(*scavengerState).park(0x6a0920)\n\t/usr/lib/golang/src/runtime/mgcscavenge.go:389 +0x53 fp=0xc0000adfa0 sp=0xc0000adf70 pc=0x421ef3\nruntime.bgscavenge(0x0?)\n\t/usr/lib/golang/src/runtime/mgcscavenge.go:617 +0x45 fp=0xc0000adfc8 sp=0xc0000adfa0 pc=0x4224c5\nruntime.gcenable.func2()\n\t/usr/lib/golang/src/runtime/mgc.go:179 +0x26 fp=0xc0000adfe0 sp=0xc0000adfc8 pc=0x418ca6\nruntime.goexit()\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1594 +0x1 fp=0xc0000adfe8 sp=0xc0000adfe0 pc=0x462fc1\ncreated by runtime.gcenable\n\t/usr/lib/golang/src/runtime/mgc.go:179 +0xaa\n\ngoroutine 5 [finalizer wait]:\nruntime.gopark(0x0?, 0xc0000ac670?, 0xab?, 0x61?, 0xc0000ac770?)\n\t/usr/lib/golang/src/runtime/proc.go:363 +0xd6 fp=0xc0000ac628 sp=0xc0000ac608 pc=0x436dd6\nruntime.goparkunlock(...)\n\t/usr/lib/golang/src/runtime/proc.go:369\nruntime.runfinq()\n\t/usr/lib/golang/src/runtime/mfinal.go:180 +0x10f fp=0xc0000ac7e0 sp=0xc0000ac628 pc=0x417e0f\nruntime.goexit()\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1594 +0x1 fp=0xc0000ac7e8 sp=0xc0000ac7e0 pc=0x462fc1\ncreated by runtime.createfing\n\t/usr/lib/golang/src/runtime/mfinal.go:157 +0x45\n"
Actual results:
The plugin crashes with a nil pointer dereference (see the panic above); a readable error message should be provided instead.
Expected results:
We should handle this scenario without crashing, and the following error should be returned instead: Error: Failed to create container due to the missing master interface XXX.
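A hedged sketch of the missing guard (the function name is taken from the stack trace; the body is illustrative, not the actual containernetworking/plugins source, and relies on the vishvananda/netlink library, so it is Linux-only):

package main

import (
    "fmt"

    "github.com/vishvananda/netlink"
)

// getMTUByName returns the MTU of the named master interface, or a readable error
// when the interface does not exist, instead of dereferencing a nil link.
func getMTUByName(name string) (int, error) {
    link, err := netlink.LinkByName(name)
    if err != nil {
        return 0, fmt.Errorf("failed to create container due to the missing master interface %q: %w", name, err)
    }
    return link.Attrs().MTU, nil
}

func main() {
    if _, err := getMTUByName("does-not-exist0"); err != nil {
        fmt.Println("Error:", err)
    }
}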
Additional info:
Description of problem:
Users are not able to upgrade a namespace-scoped operator in the OpenShift console. The Subscription tab is not visible in the web console to a user with admin rights on the namespace; only cluster-admin users are able to update the operator.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Configure IDP and add a user.
2. Install any operator in a specific namespace.
3. Assign project admin permission to the user for the same namespace.
4. Log in as the user and check whether the `Subscription` tab is visible to update the operator.
Actual results:
User is not able to update the operator. Subscription tab is not visible to the user in web console.
Expected results:
The user must get access to update the namespace-scoped operator if they have admin permission for that project.
Additional info:
Tried to reproduce the issue and observed the same behavior in OCP 4.10.20, OCP 4.10.25 and OCP 4.10.34.
Description of problem:
The installer, as used with AWS, performs a get-all-roles during cluster destroy and deletes roles based on a tag. If a customer is using AWS SEA, which denies get-all-roles in the AWS account, the installer fails.
Instead of erroring out, the installer should gracefully handle being denied get-all-roles and move on, so that a denying SCP does not get in the way of a successful cluster destroy on AWS.
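For illustration, a minimal sketch (hypothetical helper names, not the installer's actual destroy code) of skipping roles whose tags cannot be read because an SCP denies iam:GetRole, instead of failing the whole destroy:

package main

import (
    "fmt"

    "github.com/aws/aws-sdk-go/aws/awserr"
)

// isAccessDenied reports whether err is an AWS AccessDenied error, e.g. an SCP
// explicitly denying iam:GetRole on a role the cluster does not own.
func isAccessDenied(err error) bool {
    aerr, ok := err.(awserr.Error)
    return ok && aerr.Code() == "AccessDenied"
}

// tagsForRole is a stand-in for the iam:GetRole / tag lookup the installer performs.
func tagsForRole(roleName string) (map[string]string, error) {
    return nil, awserr.New("AccessDenied", "explicit deny in a service control policy", nil)
}

func main() {
    for _, role := range []string{"PBMMAccel-ConfigRecorderRole-B749E1E6", "rosa-mv9dx3-xls7g-worker-role"} {
        tags, err := tagsForRole(role)
        if isAccessDenied(err) {
            // Cannot inspect this role; skip it and move on instead of aborting the destroy.
            fmt.Printf("skipping role %s: access denied by SCP\n", role)
            continue
        } else if err != nil {
            fmt.Printf("error reading tags for %s: %v\n", role, err)
            continue
        }
        _ = tags // tagged roles would be considered for deletion here
    }
}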
Version-Release number of selected component (if applicable):
[ec2-user@ip-172-16-32-144 ~]$ rosa version
1.2.6
How reproducible:
1. Deploy ROSA STS, private with PrivateLink, with AWS SEA
2. rosa delete cluster --debug
3. Watch the debug logs of the installer to see it try to get-all-roles
4. The installer fails when the SCP from AWS SEA denies the get-all-roles task
Steps to Reproduce: Philip Thomson, would you please fill out the below?
Steps listed above.
Actual results:
time="2022-09-01T00:10:40Z" level=error msg="error after waiting for command completion" error="exit status 4" installID=zp56pxql time="2022-09-01T00:10:40Z" level=error msg="error provisioning cluster" error="exit status 4" installID=zp56pxql time="2022-09-01T00:10:40Z" level=error msg="error running openshift-install, running deprovision to clean up" error="exit status 4" installID=zp56pxql time="2022-09-01T00:12:47Z" level=info msg="copied /installconfig/install-config.yaml to /output/install-config.yaml" installID=55h2cvl5 time="2022-09-01T00:12:47Z" level=info msg="cleaning up resources from previous provision attempt" installID=55h2cvl5 time="2022-09-01T00:12:47Z" level=debug msg="search for matching resources by tag in ca-central-1 matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5 time="2022-09-01T00:12:48Z" level=debug msg="search for matching resources by tag in us-east-1 matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5 time="2022-09-01T00:12:48Z" level=debug msg="search for IAM roles" installID=55h2cvl5 time="2022-09-01T00:12:49Z" level=debug msg="iterating over a page of 64 IAM roles" installID=55h2cvl5 time="2022-09-01T00:12:52Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-ConfigRecorderRole-B749E1E6: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-ConfigRecorderRole-B749E1E6 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 6b4b5144-2f4e-4fde-ba1a-04ed239b84c2" installID=55h2cvl5 time="2022-09-01T00:12:52Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-CWL-Add-Subscription-Filter-9D3CF73C: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-CWL-Add-Subscription-Filter-9D3CF73C with an explicit deny in a service control policy\n\tstatus code: 403, request id: 6152e9c2-9c1c-478b-a5e3-11ff2508684e" installID=55h2cvl5 time="2022-09-01T00:12:52Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-S4CHZ22EC1B2: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-S4CHZ22EC1B2 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 8636f0ff-e984-4f02-870e-52170ab4e7bb" installID=55h2cvl5 time="2022-09-01T00:12:52Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-X9UQK0CYNPPO: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-X9UQK0CYNPPO with an explicit deny in a service control policy\n\tstatus code: 403, request id: 2385a980-dc9b-480f-955a-62ac1aaa6718" installID=55h2cvl5 time="2022-09-01T00:12:53Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomCentralEndpointDep-1H6K6CZ6AEUBO: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role 
PBMMAccel-Mv9dx3Rosa81Ebf-CustomCentralEndpointDep-1H6K6CZ6AEUBO with an explicit deny in a service control policy\n\tstatus code: 403, request id: 02ccef62-14e7-4310-b254-a0731995bd45" installID=55h2cvl5 time="2022-09-01T00:12:53Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomCreateSSMDocument7-1JDO2BN7QTXRH: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomCreateSSMDocument7-1JDO2BN7QTXRH with an explicit deny in a service control policy\n\tstatus code: 403, request id: eca2081d-abd7-4c9b-b531-27ca8758f933" installID=55h2cvl5 time="2022-09-01T00:12:53Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomEBSDefaultEncrypti-19EVAXFRG2BEJ: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomEBSDefaultEncrypti-19EVAXFRG2BEJ with an explicit deny in a service control policy\n\tstatus code: 403, request id: 6bda17e9-83e5-4688-86a0-2f84c77db759" installID=55h2cvl5 time="2022-09-01T00:12:53Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomEc2OperationsB1799-1WASK5J6GUYHO: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomEc2OperationsB1799-1WASK5J6GUYHO with an explicit deny in a service control policy\n\tstatus code: 403, request id: 827afa4a-8bb9-4e1e-af69-d5e8d125003a" installID=55h2cvl5 time="2022-09-01T00:12:53Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomGetDetectorIdRole6-9VGPM8U0HMV7: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomGetDetectorIdRole6-9VGPM8U0HMV7 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 8dcd0480-6f9e-49cb-a0dd-0c5f76107696" installID=55h2cvl5 time="2022-09-01T00:12:53Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomGuardDutyCreatePub-1W03UREYK3KTX: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomGuardDutyCreatePub-1W03UREYK3KTX with an explicit deny in a service control policy\n\tstatus code: 403, request id: 5095aed7-45de-4ca0-8c41-9db9e78ca5a6" installID=55h2cvl5 time="2022-09-01T00:12:53Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMCreateRoleE62B6-1AQL8IBN9938I: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMCreateRoleE62B6-1AQL8IBN9938I with an explicit deny in a service control policy\n\tstatus code: 403, request id: 04f7d0e0-4139-4f74-8f67-8d8a8a41d6b9" installID=55h2cvl5 time="2022-09-01T00:12:53Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMPasswordPolicyC-16TPLHRY1FZ43: AccessDenied: User: 
arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMPasswordPolicyC-16TPLHRY1FZ43 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 115f9514-b78b-42d1-b008-dc3181b61d33" installID=55h2cvl5 time="2022-09-01T00:12:53Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsLogGroup49AC86-1D03LOLE2CARP: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsLogGroup49AC86-1D03LOLE2CARP with an explicit deny in a service control policy\n\tstatus code: 403, request id: 68da4d93-a93e-410a-b3af-961122fe8df0" installID=55h2cvl5 time="2022-09-01T00:12:53Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsMetricFilter7F-DLA5E1PZSFHH: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsMetricFilter7F-DLA5E1PZSFHH with an explicit deny in a service control policy\n\tstatus code: 403, request id: 012221ea-2121-4b04-91f2-26c31c8458b1" installID=55h2cvl5 time="2022-09-01T00:12:53Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieExportConfigR-1QT1WNNWPSL36: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieExportConfigR-1QT1WNNWPSL36 with an explicit deny in a service control policy\n\tstatus code: 403, request id: e6c9328d-a4b9-4e69-8194-a68ed7af6c73" installID=55h2cvl5 time="2022-09-01T00:12:54Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieUpdateSession-1NHBPTB4GOSM8: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieUpdateSession-1NHBPTB4GOSM8 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 214ca7fb-d153-4d0d-9f9c-21b073c5bd35" installID=55h2cvl5 time="2022-09-01T00:12:54Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomResourceCleanupC59-1MSCB57N479UU: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomResourceCleanupC59-1MSCB57N479UU with an explicit deny in a service control policy\n\tstatus code: 403, request id: 63b54e82-e2f6-48d4-bd0f-d2663bbc58bf" installID=55h2cvl5 time="2022-09-01T00:12:54Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomS3PutReplicationRo-FE5Q26BTAG9K: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomS3PutReplicationRo-FE5Q26BTAG9K with an explicit deny in a service control policy\n\tstatus code: 403, request id: d24982b6-df65-4ba2-a3c0-5ac8d23947e1" installID=55h2cvl5 time="2022-09-01T00:12:54Z" level=debug msg="get tags for 
arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomSecurityHubRole660-1UX115B9Q68WX: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomSecurityHubRole660-1UX115B9Q68WX with an explicit deny in a service control policy\n\tstatus code: 403, request id: e2c5737a-5014-4eb5-9150-1dd1939137c0" installID=55h2cvl5 time="2022-09-01T00:12:54Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomSSMUpdateRoleD3D5C-AZ9GBJG6UM4F: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomSSMUpdateRoleD3D5C-AZ9GBJG6UM4F with an explicit deny in a service control policy\n\tstatus code: 403, request id: 7793fa7c-4c8d-4f9f-8f23-d393b85be97c" installID=55h2cvl5 time="2022-09-01T00:12:54Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomVpcDefaultSecurity-HC931RYMVKKC: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomVpcDefaultSecurity-HC931RYMVKKC with an explicit deny in a service control policy\n\tstatus code: 403, request id: bef2c5ab-ef59-4be6-bf1a-2d89fddb90f1" installID=55h2cvl5 time="2022-09-01T00:12:54Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-DefaultBucketReplication-OIM43YBJSMGD: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-DefaultBucketReplication-OIM43YBJSMGD with an explicit deny in a service control policy\n\tstatus code: 403, request id: ff04eb1b-9cf6-4fff-a503-d9292ff17ccd" installID=55h2cvl5 time="2022-09-01T00:12:54Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-PipelineRole: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-PipelineRole with an explicit deny in a service control policy\n\tstatus code: 403, request id: 85e05de8-ba16-4366-bc86-721da651d770" installID=55h2cvl5 time="2022-09-01T00:12:56Z" level=info msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-VPC-FlowLog-519F0B57: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-VPC-FlowLog-519F0B57 with an explicit deny in a service control policy\n\tstatus code: 403, request id: a9d864e4-cfdf-483d-a0d2-9b48a117abc4" installID=55h2cvl5 time="2022-09-01T00:12:56Z" level=debug msg="search for IAM users" installID=55h2cvl5 time="2022-09-01T00:12:56Z" level=debug msg="iterating over a page of 0 IAM users" installID=55h2cvl5 time="2022-09-01T00:12:56Z" level=debug msg="search for IAM instance profiles" installID=55h2cvl5 time="2022-09-01T00:12:56Z" level=info msg="error while finding resources to delete" error="get tags for arn:aws:iam::646284873784:role/PBMMAccel-VPC-FlowLog-519F0B57: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-VPC-FlowLog-519F0B57 with an 
explicit deny in a service control policy\n\tstatus code: 403, request id: a9d864e4-cfdf-483d-a0d2-9b48a117abc4" installID=55h2cvl5 time="2022-09-01T00:12:56Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5 time="2022-09-01T00:12:57Z" level=info msg=Disassociated id=i-03d7570547d32071d installID=55h2cvl5 name=rosa-mv9dx3-xls7g-master-profile role=ROSA-ControlPlane-Role time="2022-09-01T00:12:57Z" level=info msg=Deleted InstanceProfileName=rosa-mv9dx3-xls7g-master-profile arn="arn:aws:iam::646284873784:instance-profile/rosa-mv9dx3-xls7g-master-profile" id=i-03d7570547d32071d installID=55h2cvl5 time="2022-09-01T00:12:57Z" level=debug msg=Terminating id=i-03d7570547d32071d installID=55h2cvl5 time="2022-09-01T00:12:58Z" level=debug msg=Terminating id=i-08bee3857e5265ba4 installID=55h2cvl5 time="2022-09-01T00:12:58Z" level=debug msg=Terminating id=i-00df6e7b34aa65c9b installID=55h2cvl5 time="2022-09-01T00:13:08Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5 time="2022-09-01T00:13:18Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5 time="2022-09-01T00:13:28Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5 time="2022-09-01T00:13:38Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5 time="2022-09-01T00:13:48Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5 time="2022-09-01T00:13:58Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5 time="2022-09-01T00:14:08Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5 time="2022-09-01T00:14:18Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5 time="2022-09-01T00:14:28Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5 time="2022-09-01T00:14:38Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5 time="2022-09-01T00:14:48Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5 time="2022-09-01T00:14:58Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5 time="2022-09-01T00:15:08Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5 time="2022-09-01T00:15:18Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5 time="2022-09-01T00:15:28Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5 time="2022-09-01T00:15:38Z" level=debug 
msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5 time="2022-09-01T00:15:48Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5 time="2022-09-01T00:15:58Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5 time="2022-09-01T00:16:08Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5 time="2022-09-01T00:16:18Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5 time="2022-09-01T00:16:28Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5 time="2022-09-01T00:16:38Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5 time="2022-09-01T00:16:48Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5 time="2022-09-01T00:16:58Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5 time="2022-09-01T00:17:08Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5 time="2022-09-01T00:17:18Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5 time="2022-09-01T00:17:28Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5 time="2022-09-01T00:17:38Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5 time="2022-09-01T00:17:48Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5 time="2022-09-01T00:17:49Z" level=info msg=Deleted id=rosa-mv9dx3-xls7g-sint/2e99b98b94304d80 installID=55h2cvl5 time="2022-09-01T00:17:49Z" level=info msg=Deleted id=eni-0e4ee5cf8f9a8fdd2 installID=55h2cvl5 time="2022-09-01T00:17:50Z" level=debug msg="Revoked ingress permissions" id=sg-03265ad2fae661b8c installID=55h2cvl5 time="2022-09-01T00:17:50Z" level=debug msg="Revoked egress permissions" id=sg-03265ad2fae661b8c installID=55h2cvl5 time="2022-09-01T00:17:50Z" level=debug msg="DependencyViolation: resource sg-03265ad2fae661b8c has a dependent object\n\tstatus code: 400, request id: f7c35709-a23d-49fd-ac6a-f092661f6966" arn="arn:aws:ec2:ca-central-1:646284873784:security-group/sg-03265ad2fae661b8c" installID=55h2cvl5 time="2022-09-01T00:17:51Z" level=info msg=Deleted id=eni-0e592a2768c157360 installID=55h2cvl5 time="2022-09-01T00:17:52Z" level=debug msg="listing AWS hosted zones \"rosa-mv9dx3.0ffs.p1.openshiftapps.com.\" (page 0)" id=Z072427539WBI718F6BCC installID=55h2cvl5 time="2022-09-01T00:17:52Z" level=debug msg="listing AWS hosted zones \"0ffs.p1.openshiftapps.com.\" (page 0)" id=Z072427539WBI718F6BCC installID=55h2cvl5 time="2022-09-01T00:17:53Z" level=info msg=Deleted id=Z072427539WBI718F6BCC installID=55h2cvl5 
time="2022-09-01T00:17:53Z" level=debug msg="Revoked ingress permissions" id=sg-08bfbb32ea92f583e installID=55h2cvl5 time="2022-09-01T00:17:53Z" level=debug msg="Revoked egress permissions" id=sg-08bfbb32ea92f583e installID=55h2cvl5 time="2022-09-01T00:17:54Z" level=info msg=Deleted id=sg-08bfbb32ea92f583e installID=55h2cvl5 time="2022-09-01T00:17:54Z" level=info msg=Deleted id=rosa-mv9dx3-xls7g-aint/635162452c08e059 installID=55h2cvl5 time="2022-09-01T00:17:54Z" level=info msg=Deleted id=eni-049f0174866d87270 installID=55h2cvl5 time="2022-09-01T00:17:54Z" level=debug msg="search for matching resources by tag in ca-central-1 matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5 time="2022-09-01T00:17:55Z" level=debug msg="search for matching resources by tag in us-east-1 matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5 time="2022-09-01T00:17:55Z" level=debug msg="no deletions from us-east-1, removing client" installID=55h2cvl5 time="2022-09-01T00:17:55Z" level=debug msg="search for IAM roles" installID=55h2cvl5 time="2022-09-01T00:17:56Z" level=debug msg="iterating over a page of 64 IAM roles" installID=55h2cvl5 time="2022-09-01T00:17:56Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-ConfigRecorderRole-B749E1E6: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-ConfigRecorderRole-B749E1E6 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 06b804ae-160c-4fa7-92de-fd69adc07db2" installID=55h2cvl5 time="2022-09-01T00:17:56Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-CWL-Add-Subscription-Filter-9D3CF73C: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-CWL-Add-Subscription-Filter-9D3CF73C with an explicit deny in a service control policy\n\tstatus code: 403, request id: 2a5dd4ad-9c3e-40ee-b478-73c79671d744" installID=55h2cvl5 time="2022-09-01T00:17:56Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-S4CHZ22EC1B2: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-S4CHZ22EC1B2 with an explicit deny in a service control policy\n\tstatus code: 403, request id: e61daee8-6d2c-4707-b4c9-c4fdd6b5091c" installID=55h2cvl5 time="2022-09-01T00:17:56Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-X9UQK0CYNPPO: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-X9UQK0CYNPPO with an explicit deny in a service control policy\n\tstatus code: 403, request id: 1b743447-a778-4f9e-8b48-5923fd5c14ce" installID=55h2cvl5 time="2022-09-01T00:17:56Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomCentralEndpointDep-1H6K6CZ6AEUBO: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role 
PBMMAccel-Mv9dx3Rosa81Ebf-CustomCentralEndpointDep-1H6K6CZ6AEUBO with an explicit deny in a service control policy\n\tstatus code: 403, request id: da8c8a42-8e79-48e5-b548-c604cb10d6f4" installID=55h2cvl5 time="2022-09-01T00:17:57Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomCreateSSMDocument7-1JDO2BN7QTXRH: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomCreateSSMDocument7-1JDO2BN7QTXRH with an explicit deny in a service control policy\n\tstatus code: 403, request id: 7d7840e4-a1b4-4ea2-bb83-9ee55882de54" installID=55h2cvl5 time="2022-09-01T00:17:57Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomEBSDefaultEncrypti-19EVAXFRG2BEJ: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomEBSDefaultEncrypti-19EVAXFRG2BEJ with an explicit deny in a service control policy\n\tstatus code: 403, request id: 7f2e04ed-8c49-42e4-b35e-563093a57e5b" installID=55h2cvl5 time="2022-09-01T00:17:57Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomEc2OperationsB1799-1WASK5J6GUYHO: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomEc2OperationsB1799-1WASK5J6GUYHO with an explicit deny in a service control policy\n\tstatus code: 403, request id: cd2b4962-e610-4cc4-92bc-827fe7a49b48" installID=55h2cvl5 time="2022-09-01T00:17:57Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomGetDetectorIdRole6-9VGPM8U0HMV7: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomGetDetectorIdRole6-9VGPM8U0HMV7 with an explicit deny in a service control policy\n\tstatus code: 403, request id: be005a09-f62c-4894-8c82-70c375d379a9" installID=55h2cvl5 time="2022-09-01T00:17:57Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomGuardDutyCreatePub-1W03UREYK3KTX: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomGuardDutyCreatePub-1W03UREYK3KTX with an explicit deny in a service control policy\n\tstatus code: 403, request id: 541d92f4-33ce-4a50-93d8-dcfd2306eeb0" installID=55h2cvl5 time="2022-09-01T00:17:57Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMCreateRoleE62B6-1AQL8IBN9938I: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMCreateRoleE62B6-1AQL8IBN9938I with an explicit deny in a service control policy\n\tstatus code: 403, request id: 6dd81743-94c4-479a-b945-ffb1af763007" installID=55h2cvl5 time="2022-09-01T00:17:57Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMPasswordPolicyC-16TPLHRY1FZ43: AccessDenied: User: 
arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMPasswordPolicyC-16TPLHRY1FZ43 with an explicit deny in a service control policy\n\tstatus code: 403, request id: a269f47b-97bc-4609-b124-d1ef5d997a91" installID=55h2cvl5 time="2022-09-01T00:17:57Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsLogGroup49AC86-1D03LOLE2CARP: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsLogGroup49AC86-1D03LOLE2CARP with an explicit deny in a service control policy\n\tstatus code: 403, request id: 33c3c0a5-e5c9-4125-9400-aafb363c683c" installID=55h2cvl5 time="2022-09-01T00:17:57Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsMetricFilter7F-DLA5E1PZSFHH: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsMetricFilter7F-DLA5E1PZSFHH with an explicit deny in a service control policy\n\tstatus code: 403, request id: 32e87471-6d21-42a7-bfd8-d5323856f94d" installID=55h2cvl5 time="2022-09-01T00:17:57Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieExportConfigR-1QT1WNNWPSL36: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieExportConfigR-1QT1WNNWPSL36 with an explicit deny in a service control policy\n\tstatus code: 403, request id: b2cc6745-0217-44fe-a48b-44e56e889c9e" installID=55h2cvl5 time="2022-09-01T00:17:57Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieUpdateSession-1NHBPTB4GOSM8: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieUpdateSession-1NHBPTB4GOSM8 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 09f81582-6685-4dc9-99f0-ed33565ab4f4" installID=55h2cvl5 time="2022-09-01T00:17:58Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomResourceCleanupC59-1MSCB57N479UU: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomResourceCleanupC59-1MSCB57N479UU with an explicit deny in a service control policy\n\tstatus code: 403, request id: cea9116c-2b54-4caa-9776-83559d27b8f8" installID=55h2cvl5 time="2022-09-01T00:17:58Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomS3PutReplicationRo-FE5Q26BTAG9K: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomS3PutReplicationRo-FE5Q26BTAG9K with an explicit deny in a service control policy\n\tstatus code: 403, request id: 430d7750-c538-42a5-84b5-52bc77ce2d56" installID=55h2cvl5 time="2022-09-01T00:17:58Z" level=debug msg="get tags for 
arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomSecurityHubRole660-1UX115B9Q68WX: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomSecurityHubRole660-1UX115B9Q68WX with an explicit deny in a service control policy\n\tstatus code: 403, request id: 279038e4-f3c9-4700-b590-9a90f9b8d3a2" installID=55h2cvl5 time="2022-09-01T00:17:58Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomSSMUpdateRoleD3D5C-AZ9GBJG6UM4F: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomSSMUpdateRoleD3D5C-AZ9GBJG6UM4F with an explicit deny in a service control policy\n\tstatus code: 403, request id: 5e2f40ae-3dc7-4773-a5cd-40bf9aa36c03" installID=55h2cvl5 time="2022-09-01T00:17:58Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomVpcDefaultSecurity-HC931RYMVKKC: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomVpcDefaultSecurity-HC931RYMVKKC with an explicit deny in a service control policy\n\tstatus code: 403, request id: 92a27a7b-14f5-455b-aa39-3c995806b83e" installID=55h2cvl5 time="2022-09-01T00:17:58Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-DefaultBucketReplication-OIM43YBJSMGD: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-DefaultBucketReplication-OIM43YBJSMGD with an explicit deny in a service control policy\n\tstatus code: 403, request id: 0da4f66c-c6b1-453c-a8c8-dc0399b24bb9" installID=55h2cvl5 time="2022-09-01T00:17:58Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-PipelineRole: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-PipelineRole with an explicit deny in a service control policy\n\tstatus code: 403, request id: f2c94beb-a222-4bad-abe1-8de5786f5e59" installID=55h2cvl5 time="2022-09-01T00:17:58Z" level=info msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-VPC-FlowLog-519F0B57: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-VPC-FlowLog-519F0B57 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 829c3569-b2f2-4b9d-94a0-69644b690066" installID=55h2cvl5 time="2022-09-01T00:17:58Z" level=debug msg="search for IAM users" installID=55h2cvl5 time="2022-09-01T00:17:58Z" level=debug msg="iterating over a page of 0 IAM users" installID=55h2cvl5 time="2022-09-01T00:17:58Z" level=debug msg="search for IAM instance profiles" installID=55h2cvl5 time="2022-09-01T00:17:58Z" level=info msg="error while finding resources to delete" error="get tags for arn:aws:iam::646284873784:role/PBMMAccel-VPC-FlowLog-519F0B57: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-VPC-FlowLog-519F0B57 with an 
explicit deny in a service control policy\n\tstatus code: 403, request id: 829c3569-b2f2-4b9d-94a0-69644b690066" installID=55h2cvl5 time="2022-09-01T00:18:09Z" level=info msg=Deleted id=sg-03265ad2fae661b8c installID=55h2cvl5 time="2022-09-01T00:18:09Z" level=debug msg="search for matching resources by tag in ca-central-1 matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5 time="2022-09-01T00:18:09Z" level=debug msg="no deletions from ca-central-1, removing client" installID=55h2cvl5 time="2022-09-01T00:18:09Z" level=debug msg="search for IAM roles" installID=55h2cvl5 time="2022-09-01T00:18:10Z" level=debug msg="iterating over a page of 64 IAM roles" installID=55h2cvl5 time="2022-09-01T00:18:10Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-ConfigRecorderRole-B749E1E6: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-ConfigRecorderRole-B749E1E6 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 0e8e0bea-b512-469b-a996-8722a0f7fa25" installID=55h2cvl5 time="2022-09-01T00:18:10Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-CWL-Add-Subscription-Filter-9D3CF73C: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-CWL-Add-Subscription-Filter-9D3CF73C with an explicit deny in a service control policy\n\tstatus code: 403, request id: 288456a2-0cd5-46f1-a5d2-6b4006a5dc0e" installID=55h2cvl5 time="2022-09-01T00:18:10Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-S4CHZ22EC1B2: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-S4CHZ22EC1B2 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 321df940-70fc-45e7-8c56-59fe5b89e84f" installID=55h2cvl5 time="2022-09-01T00:18:10Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-X9UQK0CYNPPO: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-X9UQK0CYNPPO with an explicit deny in a service control policy\n\tstatus code: 403, request id: 45bebf36-8bf9-4c78-a80f-c6a5e98b2187" installID=55h2cvl5 time="2022-09-01T00:18:10Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomCentralEndpointDep-1H6K6CZ6AEUBO: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomCentralEndpointDep-1H6K6CZ6AEUBO with an explicit deny in a service control policy\n\tstatus code: 403, request id: eea00ae2-1a72-43f9-9459-a1c003194137" installID=55h2cvl5 time="2022-09-01T00:18:10Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomCreateSSMDocument7-1JDO2BN7QTXRH: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role 
PBMMAccel-Mv9dx3Rosa81Ebf-CustomCreateSSMDocument7-1JDO2BN7QTXRH with an explicit deny in a service control policy\n\tstatus code: 403, request id: 0ef5a102-b764-4e17-999f-d820ebc1ec12" installID=55h2cvl5 time="2022-09-01T00:18:10Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomEBSDefaultEncrypti-19EVAXFRG2BEJ: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomEBSDefaultEncrypti-19EVAXFRG2BEJ with an explicit deny in a service control policy\n\tstatus code: 403, request id: 107d0ccf-94e7-41c4-96cd-450b66a84101" installID=55h2cvl5 time="2022-09-01T00:18:10Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomEc2OperationsB1799-1WASK5J6GUYHO: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomEc2OperationsB1799-1WASK5J6GUYHO with an explicit deny in a service control policy\n\tstatus code: 403, request id: da9bd868-8384-4072-9fb4-e6a66e94d2a1" installID=55h2cvl5 time="2022-09-01T00:18:11Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomGetDetectorIdRole6-9VGPM8U0HMV7: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomGetDetectorIdRole6-9VGPM8U0HMV7 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 74fbf44c-d02d-4072-b038-fa456246b6a8" installID=55h2cvl5 time="2022-09-01T00:18:11Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomGuardDutyCreatePub-1W03UREYK3KTX: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomGuardDutyCreatePub-1W03UREYK3KTX with an explicit deny in a service control policy\n\tstatus code: 403, request id: 365116d6-1467-49c3-8f58-1bc005aa251f" installID=55h2cvl5 time="2022-09-01T00:18:11Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMCreateRoleE62B6-1AQL8IBN9938I: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMCreateRoleE62B6-1AQL8IBN9938I with an explicit deny in a service control policy\n\tstatus code: 403, request id: 20f91de5-cfeb-45e0-bb46-7b66d62cc749" installID=55h2cvl5 time="2022-09-01T00:18:11Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMPasswordPolicyC-16TPLHRY1FZ43: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMPasswordPolicyC-16TPLHRY1FZ43 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 924fa288-f1b9-49b8-b549-a930f6f771ce" installID=55h2cvl5 time="2022-09-01T00:18:11Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsLogGroup49AC86-1D03LOLE2CARP: AccessDenied: User: 
arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsLogGroup49AC86-1D03LOLE2CARP with an explicit deny in a service control policy\n\tstatus code: 403, request id: 4beb233d-40d6-4016-872a-8757af8f98ee" installID=55h2cvl5 time="2022-09-01T00:18:11Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsMetricFilter7F-DLA5E1PZSFHH: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsMetricFilter7F-DLA5E1PZSFHH with an explicit deny in a service control policy\n\tstatus code: 403, request id: 77951f62-e0b4-4a9b-a20c-ea40d6432e84" installID=55h2cvl5 time="2022-09-01T00:18:11Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieExportConfigR-1QT1WNNWPSL36: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieExportConfigR-1QT1WNNWPSL36 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 13ad38c8-89dc-461d-9763-870eec3a6ba1" installID=55h2cvl5 time="2022-09-01T00:18:11Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieUpdateSession-1NHBPTB4GOSM8: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieUpdateSession-1NHBPTB4GOSM8 with an explicit deny in a service control policy\n\tstatus code: 403, request id: a8fe199d-12fb-4141-a944-c7c5516daf25" installID=55h2cvl5 time="2022-09-01T00:18:11Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomResourceCleanupC59-1MSCB57N479UU: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomResourceCleanupC59-1MSCB57N479UU with an explicit deny in a service control policy\n\tstatus code: 403, request id: b487c62f-5ac5-4fa0-b835-f70838b1d178" installID=55h2cvl5 time="2022-09-01T00:18:11Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomS3PutReplicationRo-FE5Q26BTAG9K: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomS3PutReplicationRo-FE5Q26BTAG9K with an explicit deny in a service control policy\n\tstatus code: 403, request id: 97bfcb55-ae1f-4859-9c12-03de09607f79" installID=55h2cvl5 time="2022-09-01T00:18:11Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomSecurityHubRole660-1UX115B9Q68WX: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomSecurityHubRole660-1UX115B9Q68WX with an explicit deny in a service control policy\n\tstatus code: 403, request id: ca1094f6-714e-4042-9134-75f4c6d9d0df" installID=55h2cvl5 time="2022-09-01T00:18:12Z" level=debug msg="get tags for 
arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomSSMUpdateRoleD3D5C-AZ9GBJG6UM4F: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomSSMUpdateRoleD3D5C-AZ9GBJG6UM4F with an explicit deny in a service control policy\n\tstatus code: 403, request id: ca1db477-ee6a-4d03-8b57-52b335b2bbe6" installID=55h2cvl5 time="2022-09-01T00:18:12Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomVpcDefaultSecurity-HC931RYMVKKC: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomVpcDefaultSecurity-HC931RYMVKKC with an explicit deny in a service control policy\n\tstatus code: 403, request id: 1fc32d09-588b-4d80-ad62-748f7fb55efd" installID=55h2cvl5 time="2022-09-01T00:18:12Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-DefaultBucketReplication-OIM43YBJSMGD: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-DefaultBucketReplication-OIM43YBJSMGD with an explicit deny in a service control policy\n\tstatus code: 403, request id: 7d906cc2-eaaa-439b-97e0-503615ce5d43" installID=55h2cvl5 time="2022-09-01T00:18:12Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-PipelineRole: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-PipelineRole with an explicit deny in a service control policy\n\tstatus code: 403, request id: ee6a5647-20b1-4880-932b-bfd70b945077" installID=55h2cvl5 time="2022-09-01T00:18:12Z" level=info msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-VPC-FlowLog-519F0B57: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-VPC-FlowLog-519F0B57 with an explicit deny in a service control policy\n\tstatus code: 403, request id: a424891e-48ab-4ad4-9150-9ef1076dcb9c" installID=55h2cvl5 Repeats the not authroized errors probably 50+ times.
Expected results:
For these errors not to show up during install.
Additional info:
Again, this is only due to ROSA being installed in an AWS SEA environment - https://github.com/aws-samples/aws-secure-environment-accelerator.
"etcdserver: leader changed" causes clients to fail.
This error should never bubble up to clients because the kube-apiserver can always retry this failure mode since it knows the data was not modified. When etcd adjusts timeouts for leader election and heartbeating for slow hardware like Azure, the hardcoded timeouts in the kube-apiserver/etcd fail. See
Simply saying, "oh, it's hardcoded and kube" isn't good enough. We have previously had a storage shim to retry such problems. If all else fails, bringing back the small shim to retry Unavailable etcd errors longer is appropriate to fix all available clients.
Additionally, this etcd capability is being made more widely available and this bug prevents that from working.
This came up a while ago, see https://groups.google.com/u/1/a/redhat.com/g/aos-devel/c/HuOTwtI4a9I/m/nX9mKjeqAAAJ
Basically this MC:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: worker-override
spec:
  kernelType: realtime
  osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b4cc3995d5fc11e3b22140d8f2f91f78834e86a210325cbf0525a62725f8e099
Will degrade the node with
E0301 21:25:09.234001 3306 writer.go:200] Marking Degraded due to: error running rpm-ostree override remove kernel kernel-core kernel-modules kernel-modules-extra --install kernel-rt-core --install kernel-rt-modules --install kernel-rt-modules-extra --install kernel-rt-kvm: error: Could not depsolve transaction; 1 problem detected: Problem: package kernel-modules-core-5.14.0-282.el9.x86_64 requires kernel-uname-r = 5.14.0-282.el9.x86_64, but none of the providers can be installed - conflicting requests : exit status 1
It's kind of annoying here because the packages to remove are now OS version dependent. A while ago I filed https://github.com/coreos/rpm-ostree/issues/2542 which would push the problem down into rpm-ostree, which is in a better situation to deal with it, and that may be the fix...but it's also pushing the problem down there in a way that's going to be maintenance pain (but, we can deal with that).
It's also possible that we may need to explicitly request installation of `kernel-rt-modules-core`...I'll look.
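For illustration, a sketch of the kind of override the MCO would need to run on a RHEL 9 based RHCOS, assuming kernel-modules-core must also join the remove list (and possibly kernel-rt-modules-core the install list, as noted above); the exact package set remains OS-version dependent:
rpm-ostree override remove \
    kernel kernel-core kernel-modules kernel-modules-core kernel-modules-extra \
    --install kernel-rt-core \
    --install kernel-rt-modules \
    --install kernel-rt-modules-extra \
    --install kernel-rt-kvm
# possibly also: --install kernel-rt-modules-core (assumption, see the note above)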
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Ingress-canary Daemon Set does not tolerate Infra taint "NoExecute"
Version-Release number of selected component (if applicable):
OCPv4.9
How reproducible:
Always
Steps to Reproduce:
1. Label and taint the node
$ oc describe node worker-0.cluster49.lab.pnq2.cee.redhat.com | grep infra
Roles: custom,infra,test
node-role.kubernetes.io/infra= <----
Taints: node-role.kubernetes.io/infra=reserved:NoExecute <----
node-role.kubernetes.io/infra=reserved:NoSchedule <----
2. Edit the ingress-canary DaemonSet and add a NoExecute toleration
$ oc get ds -o yaml | grep -i tole -A6
tolerations:
3. The Daemon Set configuration gets overwritten after some time, probably by the managing operator, and the pods are terminated on the infra nodes.
Actual results:
The NoExecute infra taint toleration gets overwritten:
$ oc get ds -o yaml | grep -i tole -A6
tolerations:
Expected results:
The ingress-canary DaemonSet should be able to tolerate the NoExecute infra taint.
Additional info: The same taints as in the product documentation are used (node-role.kubernetes.io/infra).
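For reference, a minimal sketch of the tolerations the ingress-canary DaemonSet would need for the taints shown above (key, value, and effects taken from the node description; how the Ingress Operator should persist them is out of scope here):
tolerations:
- key: node-role.kubernetes.io/infra
  value: reserved
  operator: Equal
  effect: NoExecute
- key: node-role.kubernetes.io/infra
  value: reserved
  operator: Equal
  effect: NoSchedule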
Description of problem:
Under heavy control plane load (bringing up ~200 pods), prometheus/promtail spikes to over 100% CPU, node_exporter goes to ~200% CPU and stays there for 5-10 minutes. Tested on a GCP cluster bot using 2 physical core (4 vCPU) workers. This starves out essential platform functions like OVS from getting any CPU and causes the data plane to go down. Running perf against node_exporter reveals the application is consuming the majority of its CPU trying to list new interfaces being added in sysfs. This looks like it is due to disabling netlink via https://issues.redhat.com/browse/OCPBUGS-8282. This operation grabs the rtnl lock, which can compete with other components on the host that are trying to configure networking.
Version-Release number of selected component (if applicable):
Tested on 4.13 and 4.14 with GCP.
How reproducible:
3/4 times
Steps to Reproduce:
1. Launch gcp with cluster bot
2. Create a deployment with pause containers which will max out pods on the nodes:
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webserver-deployment
  namespace: openshift-ovn-kubernetes
  labels:
    pod-name: server
    app: nginx
    role: webserver
spec:
  replicas: 700
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
        role: webserver
    spec:
      containers:
      - name: webserver1
        image: k8s.gcr.io/pause:3.1
        ports:
        - containerPort: 80
          name: serve-80
          protocol: TCP
3. Watch top cpu output. Wait for node_exporter and prometheus to show very high CPU. If this does not happen, proceed to step 4.
4. Delete the deployment and then recreate it.
5. High and persistent CPU usage should now be observed.
Actual results:
CPU is pegged on the host for several minutes. Terminal is almost unresponsive. Only way to fix it was to delete node_exporter and prometheus DS.
Expected results:
Prometheus and other metrics related applications should: 1. use netlink to avoid grabbing rtnl lock 2. should be cpu limited. Certain required applications in OCP are resource unbounded (like networking data plane) to ensure the node's core functions continue to work. Metrics however should be CPU limited to avoid tooling from locking up a node.
Additional info:
Perf summary (will attach full perf output) 99.94% 0.00% node_exporter node_exporter [.] runtime.goexit.abi0 | ---runtime.goexit.abi0 | --99.33%--github.com/prometheus/node_exporter/collector.NodeCollector.Collect.func2 | --99.33%--github.com/prometheus/node_exporter/collector.NodeCollector.Collect.func1 | --99.33%--github.com/prometheus/node_exporter/collector.execute | |--97.67%--github.com/prometheus/node_exporter/collector.(*netClassCollector).Update | | | --97.67%--github.com/prometheus/node_exporter/collector.(*netClassCollector).netClassSysfsUpdate | | | --97.67%--github.com/prometheus/node_exporter/collector.(*netClassCollector).getNetClassInfo | | | --97.64%--github.com/prometheus/procfs/sysfs.FS.NetClassByIface | | | --97.64%--github.com/prometheus/procfs/sysfs.parseNetClassIface | | | --97.61%--github.com/prometheus/procfs/internal/util.SysReadFile | | | --97.45%--syscall.read | | | --97.45%--syscall.Syscall | | | --97.45%--runtime/internal/syscall.Syscall6 | | | --70.34%--entry_SYSCALL_64_after_hwframe | do_syscall_64 | | | |--39.13%--ksys_read | | | | | |--31.97%--vfs_read
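To illustrate the second point in the expected results above, this is a minimal sketch of the kind of CPU bound being asked for; the container name and the values are assumptions for illustration, not the shipped monitoring configuration:
containers:
- name: node-exporter        # assumed container name
  resources:
    requests:
      cpu: 8m
      memory: 32Mi
    limits:
      cpu: 100m              # illustrative cap so a runaway collector cannot starve the node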
Description of problem:
Since we migrated some of our jobs to OCP 4.14, we are experiencing a lot of flakiness with the "openshift-tests" binary, which panics when trying to retrieve the logs of etcd: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_assisted-test-infra/2212/pull-ci-openshift-assisted-test-infra-master-e2e-metal-assisted/1673615526967906304#1:build-log.txt%3A161-191 Here's the impact on our jobs: https://search.ci.openshift.org/?search=error+reading+pod+logs&maxAge=48h&context=1&type=build-log&name=.*assisted.*&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Version-Release number of selected component (if applicable):
N/A
How reproducible:
Happens from time to time against OCP 4.14
Steps to Reproduce:
1. Provision an OCP cluster 4.14 2. Run the conformance tests on it with "openshift-tests"
Actual results:
The binary "openshift-tests" panics from time to time: [2023-06-27 10:12:07] time="2023-06-27T10:12:07Z" level=error msg="error reading pod logs" error="container \"etcd\" in pod \"etcd-test-infra-cluster-a1729bd4-master-2\" is not available" pod=etcd-test-infra-cluster-a1729bd4-master-2 [2023-06-27 10:12:07] panic: runtime error: invalid memory address or nil pointer dereference [2023-06-27 10:12:07] [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x26eb9b5] [2023-06-27 10:12:07] [2023-06-27 10:12:07] goroutine 1 [running]: [2023-06-27 10:12:07] bufio.(*Scanner).Scan(0xc005954250) [2023-06-27 10:12:07] bufio/scan.go:214 +0x855 [2023-06-27 10:12:07] github.com/openshift/origin/pkg/monitor/intervalcreation.IntervalsFromPodLogs({0x8d91460, 0xc004a43d40}, {0xc8b83c0?, 0xc006138000?, 0xc8b83c0?}, {0x8d91460?, 0xc004a43d40?, 0xc8b83c0?}) [2023-06-27 10:12:07] github.com/openshift/origin/pkg/monitor/intervalcreation/podlogs.go:130 +0x8cd [2023-06-27 10:12:07] github.com/openshift/origin/pkg/monitor/intervalcreation.InsertIntervalsFromCluster({0x8d441e0, 0xc000ffd900}, 0xc0008b4000?, {0xc005f88000?, 0x539, 0x0?}, 0x25e1e39?, {0xc11ecb5d446c4f2c, 0x4fb99e6af, 0xc8b83c0}, ...) [2023-06-27 10:12:07] github.com/openshift/origin/pkg/monitor/intervalcreation/types.go:65 +0x274 [2023-06-27 10:12:07] github.com/openshift/origin/pkg/test/ginkgo.(*MonitorEventsOptions).End(0xc001083050, {0x8d441e0, 0xc000ffd900}, 0x1?, {0x7fff15b2ccde, 0x16}) [2023-06-27 10:12:07] github.com/openshift/origin/pkg/test/ginkgo/options_monitor_events.go:170 +0x225 [2023-06-27 10:12:07] github.com/openshift/origin/pkg/test/ginkgo.(*Options).Run(0xc0013e2000, 0xc00012e380, {0x8126d1e, 0xf}) [2023-06-27 10:12:07] github.com/openshift/origin/pkg/test/ginkgo/cmd_runsuite.go:506 +0x2d9a [2023-06-27 10:12:07] main.newRunCommand.func1.1() [2023-06-27 10:12:07] github.com/openshift/origin/cmd/openshift-tests/openshift-tests.go:330 +0x2d4 [2023-06-27 10:12:07] main.mirrorToFile(0xc0013e2000, 0xc0014cdb30) [2023-06-27 10:12:07] github.com/openshift/origin/cmd/openshift-tests/openshift-tests.go:476 +0x5f2 [2023-06-27 10:12:07] main.newRunCommand.func1(0xc0013e0300?, {0xc000862ea0?, 0x6?, 0x6?}) [2023-06-27 10:12:07] github.com/openshift/origin/cmd/openshift-tests/openshift-tests.go:311 +0x5c [2023-06-27 10:12:07] github.com/spf13/cobra.(*Command).execute(0xc0013e0300, {0xc000862e40, 0x6, 0x6}) [2023-06-27 10:12:07] github.com/spf13/cobra@v1.6.0/command.go:916 +0x862 [2023-06-27 10:12:07] github.com/spf13/cobra.(*Command).ExecuteC(0xc0013e0000) [2023-06-27 10:12:07] github.com/spf13/cobra@v1.6.0/command.go:1040 +0x3bd [2023-06-27 10:12:07] github.com/spf13/cobra.(*Command).Execute(...) [2023-06-27 10:12:07] github.com/spf13/cobra@v1.6.0/command.go:968 [2023-06-27 10:12:07] main.main.func1(0xc00011b300?) [2023-06-27 10:12:07] github.com/openshift/origin/cmd/openshift-tests/openshift-tests.go:96 +0x8a [2023-06-27 10:12:07] main.main() [2023-06-27 10:12:07] github.com/openshift/origin/cmd/openshift-tests/openshift-tests.go:97 +0x516
Expected results:
No panics
Additional info:
The source of the panic has been pin-pointed here: https://github.com/openshift/origin/pull/27772#discussion_r1243600596
Description of problem:
Per the oc set route-backends -h output: Routes may have one or more optional backend services with weights controlling how much traffic flows to each service. [...] **If all weights are zero the route will not send traffic to any backends.** This is no longer the case for a route with a single backend.
Version-Release number of selected component (if applicable):
at least from OCP 4.12 onward
How reproducible:
all the time
Steps to Reproduce:
1. kubectl create -f example/
2. kubectl patch route example -p '{"spec":{"to": {"weight": 0}}}' --type merge
3. curl http://localhost -H "Host: example.local"
Actual results:
curl succeeds
Expected results:
curl fails
Additional info:
https://access.redhat.com/support/cases/#/case/03567697
This is a regression following NE-822. Reverting https://github.com/openshift/router/commit/9656da7d5e2ac0962f3eaf718ad7a8c8b2172cfa makes it work again.
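For reference, a minimal sketch of the single-backend route in question after step 2 of the reproduction (the service name is an assumption taken from the example; other fields omitted):
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: example
spec:
  host: example.local
  to:
    kind: Service
    name: example      # assumed service name from the example/ manifests
    weight: 0          # per the documented behaviour, no traffic should be sent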
Sanitize OWNERS/OWNER_ALIASES in all CSI driver and operator repos.
For driver repos:
1) OWNERS must have `component`:
component: "Storage / Kubernetes External Components"
2) OWNER_ALIASES must include all team members of the Storage team.
For operator repos:
1) OWNERS must have:
component: "Storage / Operators"
If the kubeadmin secret was deleted successfully from the guest cluster but the `SecretHashAnnotation` annotation deletion on the oauthDeployment failed, the annotation will not be reconciled again and will never be removed.
context: https://redhat-internal.slack.com/archives/C01C8502FMM/p1684765042825929
See https://issues.redhat.com//browse/MON-3173 for details.
Having the test failing may be confusing.
We should also make the test clearer.
Description of problem:
GCP XPN installs require the permission `projects/<host-project>/roles/dns.networks.bindPrivateDNSZone` in the host project. This permission is not always provided in organizations. The installer requires this permission in order to create a private DNS zone and bind it to the shared networks. Instead, the installer should be able to create records in a provided private zone that matches the base domain.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
While deploying 3671 SNOs via ACM and ZTP, 19 SNO clusters failed to install because the clusterversion object complained that the cluster operator operator-lifecycle-manager is not available.
Version-Release number of selected component (if applicable):
Hub OCP 4.12.14 SNO Deployed OCP 4.13.0-rc.6 ACM - 2.8.0-DOWNSTREAM-2023-04-30-18-44-29
How reproducible:
19 out of 51 failed clusters, out of 3671 total installs. Roughly 0.5% of installs might experience this; however, it represents ~37% of all install failures.
Steps to Reproduce:
1. 2. 3.
Actual results:
# cat cluster-install-failures | grep OLM | awk '{print $1}' | xargs -I % sh -c "echo -n '% '; oc --kubeconfig /root/hv-vm/kc/%/kubeconfig get clusterversion --no-headers" vm00096 version False True 15h Unable to apply 4.13.0-rc.6: the cluster operator operator-lifecycle-manager is not available vm00334 version False True 19h Unable to apply 4.13.0-rc.6: the cluster operator operator-lifecycle-manager is not available vm00593 version False True 19h Unable to apply 4.13.0-rc.6: the cluster operator operator-lifecycle-manager is not available vm01095 version False True 19h Unable to apply 4.13.0-rc.6: the cluster operator operator-lifecycle-manager is not available vm01192 version False True 19h Unable to apply 4.13.0-rc.6: the cluster operator operator-lifecycle-manager is not available vm01447 version False True 18h Unable to apply 4.13.0-rc.6: the cluster operator operator-lifecycle-manager is not available vm01566 version False True 19h Unable to apply 4.13.0-rc.6: the cluster operator operator-lifecycle-manager is not available vm01707 version False True 17h Unable to apply 4.13.0-rc.6: the cluster operator operator-lifecycle-manager is not available vm01742 version False True 15h Unable to apply 4.13.0-rc.6: the cluster operator operator-lifecycle-manager is not available vm01798 version False True 13h Unable to apply 4.13.0-rc.6: the cluster operator operator-lifecycle-manager is not available vm01810 version False True 19h Unable to apply 4.13.0-rc.6: the cluster operator operator-lifecycle-manager is not available vm02020 version False True 19h Unable to apply 4.13.0-rc.6: the cluster operator operator-lifecycle-manager is not available vm02091 version False True 20h Unable to apply 4.13.0-rc.6: the cluster operator operator-lifecycle-manager is not available vm02363 version False True 13h Unable to apply 4.13.0-rc.6: the cluster operator operator-lifecycle-manager is not available vm02590 version False True 20h Unable to apply 4.13.0-rc.6: the cluster operator operator-lifecycle-manager is not available vm02908 version False True 18h Unable to apply 4.13.0-rc.6: the cluster operator operator-lifecycle-manager is not available vm03253 version False True 14h Unable to apply 4.13.0-rc.6: the cluster operator operator-lifecycle-manager is not available vm03500 version False True 17h Unable to apply 4.13.0-rc.6: the cluster operator operator-lifecycle-manager is not available vm03654 version False True 17h Unable to apply 4.13.0-rc.6: the cluster operator operator-lifecycle-manager is not available
Expected results:
Additional info:
There appear to be two distinguishing failure signatures in the list of cluster operators: every cluster shows that OLM isn't available and is degraded, and more than half of the clusters show no information regarding operator-lifecycle-manager-packageserver.
# cat cluster-install-failures | grep OLM | awk '{print $1}' | xargs -I % sh -c "echo -n '% '; oc --kubeconfig /root/hv-vm/kc/%/kubeconfig get co operator-lifecycle-manager --no-headers" vm00096 operator-lifecycle-manager False True True 15h vm00334 operator-lifecycle-manager False True True 19h vm00593 operator-lifecycle-manager False True True 19h vm01095 operator-lifecycle-manager False True True 19h vm01192 operator-lifecycle-manager False True True 19h vm01447 operator-lifecycle-manager False True True 18h vm01566 operator-lifecycle-manager False True True 19h vm01707 operator-lifecycle-manager False True True 17h vm01742 operator-lifecycle-manager False True True 15h vm01798 operator-lifecycle-manager False True True 13h vm01810 operator-lifecycle-manager False True True 19h vm02020 operator-lifecycle-manager False True True 19h vm02091 operator-lifecycle-manager False True True 20h vm02363 operator-lifecycle-manager False True True 13h vm02590 operator-lifecycle-manager False True True 20h vm02908 operator-lifecycle-manager False True True 18h vm03253 operator-lifecycle-manager False True True 14h vm03500 operator-lifecycle-manager False True True 17h vm03654 operator-lifecycle-manager False True True 17h # cat cluster-install-failures | grep OLM | awk '{print $1}' | xargs -I % sh -c "echo -n '% '; oc --kubeconfig /root/hv-vm/kc/%/kubeconfig get co operator-lifecycle-manager-packageserver --no-headers" vm00096 operator-lifecycle-manager-packageserver vm00334 operator-lifecycle-manager-packageserver False True False 19h vm00593 operator-lifecycle-manager-packageserver False True False 19h vm01095 operator-lifecycle-manager-packageserver vm01192 operator-lifecycle-manager-packageserver vm01447 operator-lifecycle-manager-packageserver vm01566 operator-lifecycle-manager-packageserver False True False 19h vm01707 operator-lifecycle-manager-packageserver vm01742 operator-lifecycle-manager-packageserver False True False 15h vm01798 operator-lifecycle-manager-packageserver vm01810 operator-lifecycle-manager-packageserver vm02020 operator-lifecycle-manager-packageserver vm02091 operator-lifecycle-manager-packageserver False True False 20h vm02363 operator-lifecycle-manager-packageserver False True False 13h vm02590 operator-lifecycle-manager-packageserver False True False 20h vm02908 operator-lifecycle-manager-packageserver False True False 18h vm03253 operator-lifecycle-manager-packageserver vm03500 operator-lifecycle-manager-packageserver vm03654 operator-lifecycle-manager-packageserver
Viewing the pods in the openshift-operator-lifecycle-manager for these clusters shows no packageserver pod:
# cat cluster-install-failures | grep OLM | awk '{print $1}' | xargs -I % sh -c "echo '% '; oc --kubeconfig /root/hv-vm/kc/%/kubeconfig get po -n openshift-operator-lifecycle-manager" vm00096 NAME READY STATUS RESTARTS AGE catalog-operator-94b8bfddc-9rm9j 1/1 Running 1 (15h ago) 15h collect-profiles-28053720-kbsdn 0/1 Completed 0 33m collect-profiles-28053735-dzkf8 0/1 Completed 0 18m collect-profiles-28053750-skvcn 0/1 Completed 0 3m1s olm-operator-66658fffbb-gj294 1/1 Running 0 15h package-server-manager-654759688-bxnwj 1/1 Running 0 15h vm00334 NAME READY STATUS RESTARTS AGE catalog-operator-94b8bfddc-xcw9r 1/1 Running 1 (19h ago) 19h collect-profiles-28053720-ppq6x 0/1 Completed 0 32m collect-profiles-28053735-r2rvw 0/1 Completed 0 18m collect-profiles-28053750-lgb4r 0/1 Completed 0 3m2s olm-operator-66658fffbb-t4nxg 1/1 Running 0 19h package-server-manager-654759688-6n7gp 1/1 Running 0 19h vm00593 NAME READY STATUS RESTARTS AGE catalog-operator-94b8bfddc-rwfwp 1/1 Running 1 (19h ago) 19h collect-profiles-28053720-7p6tq 0/1 Completed 0 33m collect-profiles-28053735-nqzn9 0/1 Completed 0 18m collect-profiles-28053750-zppm6 0/1 Completed 0 3m2s olm-operator-66658fffbb-4gcpv 1/1 Running 0 19h package-server-manager-654759688-rbjdw 1/1 Running 0 19h vm01095 NAME READY STATUS RESTARTS AGE catalog-operator-94b8bfddc-2tp6j 1/1 Running 0 19h collect-profiles-28053720-bnrfz 0/1 Completed 0 33m collect-profiles-28053735-p8bl5 0/1 Completed 0 18m collect-profiles-28053750-mg9nv 0/1 Completed 0 3m2s olm-operator-66658fffbb-cb95l 1/1 Running 0 19h package-server-manager-654759688-2mqdm 1/1 Running 0 19h vm01192 NAME READY STATUS RESTARTS AGE catalog-operator-94b8bfddc-2crgg 1/1 Running 0 19h collect-profiles-28053720-2rknm 0/1 Completed 0 33m collect-profiles-28053735-wc5dn 0/1 Completed 0 18m collect-profiles-28053750-g5bhj 0/1 Completed 0 3m2s olm-operator-66658fffbb-5hlh4 1/1 Running 0 19h package-server-manager-654759688-xfp24 1/1 Running 0 19h vm01447 NAME READY STATUS RESTARTS AGE catalog-operator-94b8bfddc-p8gd4 1/1 Running 0 18h collect-profiles-28053720-kjw4w 0/1 Completed 0 33m collect-profiles-28053735-k7xxp 0/1 Completed 0 17m collect-profiles-28053750-fn5gq 0/1 Completed 0 3m3s olm-operator-66658fffbb-rshjq 1/1 Running 1 (18h ago) 18h package-server-manager-654759688-hrmfd 1/1 Running 0 18h vm01566 NAME READY STATUS RESTARTS AGE catalog-operator-94b8bfddc-gbrnj 1/1 Running 0 19h collect-profiles-28053720-2wdcp 0/1 Completed 0 33m collect-profiles-28053735-t7x5b 0/1 Completed 0 18m collect-profiles-28053750-wdmtt 0/1 Completed 0 3m3s olm-operator-66658fffbb-fsxrx 1/1 Running 0 19h package-server-manager-654759688-4mdz8 1/1 Running 1 (19h ago) 19h vm01707 NAME READY STATUS RESTARTS AGE catalog-operator-94b8bfddc-f2ns6 1/1 Running 0 17h collect-profiles-28053720-72sjt 0/1 Completed 0 33m collect-profiles-28053735-qzgx4 0/1 Completed 0 18m collect-profiles-28053750-mrpbl 0/1 Completed 0 3m3s olm-operator-66658fffbb-jwp2l 1/1 Running 0 17h package-server-manager-654759688-f7bm4 1/1 Running 0 17h vm01742 NAME READY STATUS RESTARTS AGE catalog-operator-94b8bfddc-lhv6f 1/1 Running 1 (15h ago) 15h collect-profiles-28053720-4kqtf 0/1 Completed 0 33m collect-profiles-28053735-hw7kp 0/1 Completed 0 18m collect-profiles-28053750-6ztq2 0/1 Completed 0 3m4s olm-operator-66658fffbb-5sqlc 1/1 Running 0 15h package-server-manager-654759688-n6sms 1/1 Running 0 15h vm01798 NAME READY STATUS RESTARTS AGE catalog-operator-94b8bfddc-kx7nx 1/1 Running 2 (13h ago) 13h collect-profiles-28053720-7vlqq 0/1 
Completed 0 33m collect-profiles-28053735-m8ltn 0/1 Completed 0 18m collect-profiles-28053750-hrfnk 0/1 Completed 0 3m4s olm-operator-66658fffbb-5z74m 1/1 Running 1 (13h ago) 13h package-server-manager-654759688-6jbnz 1/1 Running 0 13h vm01810 NAME READY STATUS RESTARTS AGE catalog-operator-94b8bfddc-v5vr6 1/1 Running 2 (19h ago) 19h collect-profiles-28053720-m26dn 0/1 Completed 0 33m collect-profiles-28053735-64j7f 0/1 Completed 0 18m collect-profiles-28053750-qf69b 0/1 Completed 0 3m4s olm-operator-66658fffbb-gxt2b 1/1 Running 0 19h package-server-manager-654759688-dz6p6 1/1 Running 0 19h vm02020 NAME READY STATUS RESTARTS AGE catalog-operator-94b8bfddc-2qqk6 1/1 Running 0 19h collect-profiles-28053720-5cktx 0/1 Completed 0 33m collect-profiles-28053735-ls6n9 0/1 Completed 0 18m collect-profiles-28053750-bj6gl 0/1 Completed 0 3m4s olm-operator-66658fffbb-zsr4g 1/1 Running 0 19h package-server-manager-654759688-2dnfd 1/1 Running 0 19h vm02091 NAME READY STATUS RESTARTS AGE catalog-operator-94b8bfddc-whftg 1/1 Running 1 (20h ago) 20h collect-profiles-28053720-zqcbs 0/1 Completed 0 33m collect-profiles-28053735-v8lf5 0/1 Completed 0 18m collect-profiles-28053750-rshdd 0/1 Completed 0 3m5s olm-operator-66658fffbb-876ps 1/1 Running 0 20h package-server-manager-654759688-smc8q 1/1 Running 0 20h vm02363 NAME READY STATUS RESTARTS AGE catalog-operator-94b8bfddc-zgn5m 1/1 Running 1 (13h ago) 13h collect-profiles-28053720-dpkqq 0/1 Completed 0 33m collect-profiles-28053735-nfqmf 0/1 Completed 0 18m collect-profiles-28053750-jfhdz 0/1 Completed 0 3m5s olm-operator-66658fffbb-bbrgb 1/1 Running 1 (13h ago) 13h package-server-manager-654759688-7pv96 1/1 Running 0 13h vm02590 NAME READY STATUS RESTARTS AGE catalog-operator-94b8bfddc-v9mvc 1/1 Running 2 (20h ago) 20h collect-profiles-28053720-pfcbd 0/1 Completed 0 33m collect-profiles-28053735-5dxbl 0/1 Completed 0 18m collect-profiles-28053750-95f6g 0/1 Completed 0 3m5s olm-operator-66658fffbb-5knlj 1/1 Running 0 20h package-server-manager-654759688-7qkgb 1/1 Running 0 20h vm02908 NAME READY STATUS RESTARTS AGE catalog-operator-94b8bfddc-cnmjf 1/1 Running 0 18h collect-profiles-28053720-ks6h7 0/1 Completed 0 33m collect-profiles-28053735-r682b 0/1 Completed 0 18m collect-profiles-28053750-9jrx4 0/1 Completed 0 3m5s olm-operator-66658fffbb-7bd2v 1/1 Running 1 (18h ago) 18h package-server-manager-654759688-5r6gq 1/1 Running 0 18h vm03253 NAME READY STATUS RESTARTS AGE catalog-operator-94b8bfddc-8wtgg 1/1 Running 2 (14h ago) 14h collect-profiles-28053720-kwcgk 0/1 Completed 0 33m collect-profiles-28053735-dv5hx 0/1 Completed 0 18m collect-profiles-28053750-8xbmw 0/1 Completed 0 3m6s olm-operator-66658fffbb-f2n9f 1/1 Running 0 14h package-server-manager-654759688-tjlc9 1/1 Running 0 14h vm03500 NAME READY STATUS RESTARTS AGE catalog-operator-94b8bfddc-wdq9b 1/1 Running 0 17h collect-profiles-28053720-jcmwf 0/1 Completed 0 33m collect-profiles-28053735-tjw5j 0/1 Completed 0 18m collect-profiles-28053750-5mjq9 0/1 Completed 0 3m6s olm-operator-66658fffbb-q92bg 1/1 Running 0 17h package-server-manager-654759688-2z656 1/1 Running 0 17h vm03654 NAME READY STATUS RESTARTS AGE catalog-operator-94b8bfddc-vq9wt 1/1 Running 0 17h collect-profiles-28053720-dlknz 0/1 Completed 0 33m collect-profiles-28053735-mshs7 0/1 Completed 0 18m collect-profiles-28053750-86xrc 0/1 Completed 0 3m6s olm-operator-66658fffbb-5qd99 1/1 Running 0 17h
Description of problem:
Kubernetes and other associated dependencies need to be updated to protect against potential vulnerabilities.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-16796. The following is the description of the original issue:
—
Description of problem:
Observation from the CIS v1.4 PDF, 1.1.1 "Ensure that the API server pod specification file permissions are set to 600 or more restrictive":
"Ensure that the API server pod specification file has permissions of 600 or more restrictive. OpenShift 4 deploys two API servers: the OpenShift API server and the Kube API server. The OpenShift API server delegates requests for Kubernetes objects to the Kube API server. The OpenShift API server is managed as a deployment. The pod specification yaml for openshift-apiserver is stored in etcd. The Kube API Server is managed as a static pod. The pod specification file for the kube-apiserver is created on the control plane nodes at /etc/kubernetes/manifests/kube-apiserver-pod.yaml. The kube-apiserver is mounted via hostpath to the kube-apiserver pods via /etc/kubernetes/static-pod-resources/kube-apiserver-pod.yaml with permissions 600."
To conform with the CIS benchmark, the pod specification file for the kube-apiserver, /etc/kubernetes/static-pod-resources/kube-apiserver-pod.yaml, should be updated to 600.
$ for i in $( oc get pods -n openshift-kube-apiserver -l app=openshift-kube-apiserver -o name ); do
    oc exec -n openshift-kube-apiserver $i -- \
      stat -c %a /etc/kubernetes/static-pod-resources/kube-apiserver-pod.yaml
  done
644
644
644
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-07-20-215234
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
The permission of the pod specification file for the kube-apiserver is 644.
Expected results:
The permission of the pod specification file for the kube-apiserver should be updated to 600.
Additional info:
PR: https://github.com/openshift/library-go/commit/19a42d2bae8ba68761cfad72bf764e10d275ad6e
Description of problem:
There is a forcedns dispatcher script, added by the Assisted Installer installation process, that creates /etc/resolv.conf.
This script has no shebang, which caused installation to fail because no resolv.conf was generated.
In order to fix upgrades in already-installed clusters, we need to work around this issue.
Version-Release number of selected component (if applicable):
4.13.0
How reproducible:
Happens every time
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Dockerfile.upi.ci.rhel8 does not work with the following error:
[3/3] STEP 26/32: RUN chown 1000:1000 /output && chmod -R g=u "$HOME/.bluemix/"
chmod: cannot access '/root/.bluemix/': No such file or directory
error: build error: building at STEP "RUN chown 1000:1000 /output && chmod -R g=u "$HOME/.bluemix/"": while running runtime: exit status 1
Version-Release number of selected component (if applicable):
master (and possibly all other branches where the ibmcli tool was introduced)
How reproducible:
always
Steps to Reproduce:
1. Try to use Dockerfile.ci.upi.rhel8 2. 3.
Actual results:
[3/3] STEP 26/32: RUN chown 1000:1000 /output && chmod -R g=u "$HOME/.bluemix/"
chmod: cannot access '/root/.bluemix/': No such file or directory
error: build error: building at STEP "RUN chown 1000:1000 /output && chmod -R g=u "$HOME/.bluemix/"": while running runtime: exit status 1
Expected results:
No failures
Additional info:
We should also change the downloading of the govc image with curl to importing it from the cached container in quay.io, as it is done in Dockerfile.ci.upi
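One possible fix, sketched under the assumption that the directory simply does not exist yet at that build step (creating it before adjusting permissions):
RUN mkdir -p "$HOME/.bluemix/" \
 && chown 1000:1000 /output \
 && chmod -R g=u "$HOME/.bluemix/"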
AWS Local Zone Support for OCP UPI/IPI
Current AWS Based OCP deployment models do not address Local Zones which offer lower latency and geo-proximity to OCP Cluster Consumers.
OCP Install Support for AWS Local Zones will address Customer Segments where low latency and data locality requirements enforce as deal breaker/show-stopper for our sales teams engagements.
Description of problem:
When users try to Duplicate ClusterRoleBinding or Edit ClusterRoleBinding subject in the RHOCP web console, they get the following error: "Error Loading : Name parameter invalid: "system%3Acontroller%3A<name-of-role-ref>": may not contain '%'"
Version-Release number of selected component (if applicable):
Tested in OCP 4.12.18
How reproducible:
Always
Steps to Reproduce:
1. Open the OpenShift web console
2. Select the project: openshift
3. Under User management, click RoleBindings
4. Look for any RoleBinding having a Role ref with the format `system:<name>`
5. At the end of that line, click on the 3 dots, where the options below will be available:
- Duplicate ClusterRoleBinding
- Edit ClusterRoleBinding subject
6. Select/click on either option
Actual results:
After selecting Duplicate ClusterRoleBinding or Edit ClusterRoleBinding subject, the following error appears: Error Loading : Name parameter invalid: "system%3AXXX": may not contain '%'
Expected results:
After selecting Duplicate ClusterRoleBinding or Edit ClusterRoleBinding subject, the correct/expected web page should open.
Additional info:
When duplicating or editing the RoleBinding `registry-registry-role` with Role ref `system:registry`, it works as expected. When duplicating or editing the RoleBinding `system:sdn-readers` with Role ref `system:sdn-reader`, the following error appears: Error Loading : Name parameter invalid: "system%3Asdn-readers": may not contain '%'. Duplicate ClusterRoleBinding or Edit ClusterRoleBinding subject works for only a few of the RoleBindings that have a Role ref of the form system:<name>. Screenshots are attached here: https://drive.google.com/drive/folders/1QHpdensG2gKx0tSv1zkF7Qiyert6eaSg?usp=sharing
Description of problem:
The topology page is crashed
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Visit developer console 2. Topology view 3.
Actual results:
Error message: TypeError Description: e is null Component trace: f@https://console-openshift-console.apps.cl2.cloud.local/static/vendors~app/code-refs/actions~delete-revision~dev-console-add~dev-console-deployImage~dev-console-ed~cf101ec3-chunk-5018ae746e2320e4e737.min.js:26:14244 5363/t.a@https://console-openshift-console.apps.cl2.cloud.local/static/dev-console-topology-chunk-492be609fb2f16849dfa.min.js:1:177913 u@https://console-openshift-console.apps.cl2.cloud.local/static/dev-console-topology-chunk-492be609fb2f16849dfa.min.js:1:275718 8248/t.a<@https://console-openshift-console.apps.cl2.cloud.local/static/dev-console-topology-chunk-492be609fb2f16849dfa.min.js:1:475504 i@https://console-openshift-console.apps.cl2.cloud.local/static/main-chunk-378881319405723c0627.min.js:1:470135 withFallback() 5174/t.default@https://console-openshift-console.apps.cl2.cloud.local/static/dev-console-topology-chunk-492be609fb2f16849dfa.min.js:1:78258 s@https://console-openshift-console.apps.cl2.cloud.local/static/main-chunk-378881319405723c0627.min.js:1:237096 [...] ne<@https://console-openshift-console.apps.cl2.cloud.local/static/main-chunk-378881319405723c0627.min.js:1:1592411 r@https://console-openshift-console.apps.cl2.cloud.local/static/vendors~main-chunk-12b31b866c0a4fea4c58.min.js:36:125397 t@https://console-openshift-console.apps.cl2.cloud.local/static/vendors~main-chunk-12b31b866c0a4fea4c58.min.js:21:58042 t@https://console-openshift-console.apps.cl2.cloud.local/static/vendors~main-chunk-12b31b866c0a4fea4c58.min.js:21:60087 t@https://console-openshift-console.apps.cl2.cloud.local/static/vendors~main-chunk-12b31b866c0a4fea4c58.min.js:21:54647 re@https://console-openshift-console.apps.cl2.cloud.local/static/main-chunk-378881319405723c0627.min.js:1:1592722 t.a@https://console-openshift-console.apps.cl2.cloud.local/static/main-chunk-378881319405723c0627.min.js:1:791129 t.a@https://console-openshift-console.apps.cl2.cloud.local/static/main-chunk-378881319405723c0627.min.js:1:1062384 s@https://console-openshift-console.apps.cl2.cloud.local/static/main-chunk-378881319405723c0627.min.js:1:613567 t.a@https://console-openshift-console.apps.cl2.cloud.local/static/vendors~main-chunk-12b31b866c0a4fea4c58.min.js:141:244663
Expected results:
No error should be there
Additional info:
Cloud Pak Operator is installed
Description of problem:
In ROSA, the user can specify a hostPrefix, but we are currently not passing it to the HostedCluster CR. While trying to fix it, it seems that we are not setting it up correctly on the nodes.
Version-Release number of selected component (if applicable):
4.12.16
How reproducible:
Always
Steps to Reproduce:
1. Create an HC. Inside the spec add:
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 25
2. Deploy the HC. Check its configuration.
Actual results:
oc get network cluster shows the right config (see attachment). However, oc describe node always shows a /24 host prefix. Note that this also happens with the default value of /23. On the node, under podCIDR, I always see something like: PodCIDR: 10.128.1.0/24 PodCIDRs: 10.128.1.0/24
Expected results:
I would expect the pod cidr mask to be reflected in the pod configuration
Additional info:
pod cidr is correctly set
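One way to check what a node actually received (with hostPrefix: 25 the expectation would be a /25 slice rather than the /24 shown above):
oc get node <node-name> -o jsonpath='{.spec.podCIDR}{"\n"}'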
Description of problem:
Running through instructions for a smoke test on 4.14, the DNS record is incorrectly created for the Gateway. It is missing a trailing dot in the dnsName.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Run through the steps in https://github.com/openshift/network-edge-tools/blob/2fd044d110eb737c94c8b86ea878a130cae0d03e/docs/blogs/EnhancedDevPreviewGatewayAPI/GettingStarted.md until the step "oc get dnsrecord -n openshift-ingress"
2. Check the status of the DNS record: "oc get dnsrecord xxx -n openshift-ingress -ojson | jq .status.zones[].conditions"
Actual results:
The status shows error conditions with a message like 'The DNS provider failed to ensure the record: googleapi: Error 400: Invalid value for ''entity.change.additions[*.gwapi.apps.ci-ln-3vxsgxb-72292.origin-ci-int-gce.dev.rhcloud.com][A].name'': ''*.gwapi.apps.ci-ln-3vxsgxb-72292.origin-ci-int-gce.dev.rhcloud.com'', invalid'
Expected results:
The status of the DNS record should show a successful publishing of the record.
Additional info:
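For reference, a sketch of what the generated DNSRecord is expected to carry: an absolute, dot-terminated dnsName, e.g. for the wildcard record from the error message above (other spec fields omitted):
spec:
  dnsName: "*.gwapi.apps.ci-ln-3vxsgxb-72292.origin-ci-int-gce.dev.rhcloud.com."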
Backport to 4.13.z
When the user specifies the 'vendor' hint, it actually checks for the value of the 'model' hint in the vendor field.
Description of problem:
The title on Overview page has changed to "Cluster · Red Hat OpenShift" instead of "Overview · Red Hat OpenShift" that we had starting from 4.11.
Version-Release number of selected component (if applicable):
OCP 4.14
How reproducible:
Install OpenShift 4.14, login to management console and navigate to Home / Overview
Steps to Reproduce:
1. Install OpenShift 4.14
2. Log in to the management console
3. Navigate to Home / Overview
4. Load the HTML DOM and verify the HTML node <title>; the title is also visible when hovering over the opened tab in Chrome or Firefox
Actual results:
Cluster · Red Hat OpenShift HTML node: <title data-telemetry="Cluster" data-react-helmet="data-telemetry" xpath="1">Cluster · Red Hat OpenShift</title>
Expected results:
Overview · Red Hat OpenShift
Additional info:
Starting from 4.11, the title on that page was always "Overview · Red Hat OpenShift". UI tests rely on consistent titles to detect the currently opened web page. It is important to note that the change also affects accessibility, since navigating with text-to-speech is a common accessibility feature.
We'll do another pass of updates in the ironic containers
Description of problem:
Azure managed identity role assignments created using 'ccoctl azure' sub-commands are not cleaned up when running 'ccoctl azure delete'
Version-Release number of selected component (if applicable):
4.14.0
How reproducible:
100%
Steps to Reproduce:
1. Create Azure workload identity infrastructure using 'ccoctl azure create-all'
2. Delete Azure workload identity infrastructure using 'ccoctl azure delete'
3. Observe lingering role assignments in either the OIDC resource group if not deleted OR in the DNS Zone resource group if the OIDC resource group is deleted by providing '--delete-oidc-resource-group'.
Actual results:
Role assignments for managed identities are not deleted following 'ccoctl azure delete'
Expected results:
Role assignments for managed identities are deleted following 'ccoctl azure delete'
Additional info:
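One way to confirm the leftovers described in step 3, assuming the Azure CLI is available and the resource group name is known (placeholder below):
az role assignment list --resource-group <oidc-or-dns-zone-resource-group> -o table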
Description of problem:
Cluster provisioning fails with the message: "Internal error: failed to fetch instance type, this error usually occurs if the region or the instance type is not found". This is likely because OCM uses GCP custom machine types, for example custom-4-16384, and the installer now validates machine types per zone (see the GetMachineTypeWithZones function); the per-zone listings do not include custom machine types. See https://cloud.google.com/compute/docs/instances/creating-instance-with-custom-machine-type#gcloud for more details.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
ocm create cluster cluster001 --provider=gcp --ccs=true --region=us-east1 --service-account-file=token.json --version="4.14.0-0.nightly-2023-08-02-102121-nightly" 2.
Actual results:
Cluster installation fails
Expected results:
Cluster installation succeeds
Additional info:
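For illustration, the kind of machine pool OCM requests; the custom type format custom-<vCPUs>-<memory-MiB> comes from the GCP docs linked above, and the surrounding install-config fields are only a sketch:
compute:
- name: worker
  platform:
    gcp:
      type: custom-4-16384   # custom-<vCPUs>-<memory-in-MiB>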
As a developer, I would like the Getting Started page to use numbered list so that it is easier to point people to specific sections of the document.
As a developer, I would like the Contribute page to be a numbered list so that it is easier to point people to specific line items of the document.
Description of problem:
Library-go contains code for creating token requests that should be reused by all OpenShift components. Because of time-constraints, this code did not make it to `oc` in the past. Fix that to prevent code out-of-sync issues.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
100%
Steps to Reproduce:
1. see if https://github.com/openshift/oc/pull/991 merged
Actual results:
it hasn't merged at the time of writing this bug
Expected results:
it's merged
Additional info:
Description of problem:
When adding a "Git Repository" (a tekton or pipelines Repository) and enter a GitLab or Bitbucket PAC repository the created Repository resource is invalid.
Version-Release number of selected component (if applicable):
411-4.13
How reproducible:
Always
Steps to Reproduce:
Setup a PAC git repo, you can mirror these projects if you want: https://github.com/jerolimov/nodeinfo-pac
For GitHub you need setup
For GitLab:
For Bitbucket:
On a cluster bot instance:
Actual results:
The GitLab created resource looks like this:
apiVersion: pipelinesascode.tekton.dev/v1alpha1
kind: Repository
metadata:
  name: gitlab-nodeinfo-pac
spec:
  git_provider:
    secret:
      key: provider.token
      name: gitlab-nodeinfo-pac-token-gfr66
    url: gitlab.com # missing scheme
    webhook_secret:
      key: webhook.secret
      name: gitlab-nodeinfo-pac-token-gfr66
  url: 'https://gitlab.com/jerolimov/nodeinfo-pac'
The Bitbucket resource looks like this:
apiVersion: pipelinesascode.tekton.dev/v1alpha1
kind: Repository
metadata:
  name: bitbucket-nodeinfo-pac
spec:
  git_provider:
    secret:
      key: provider.token
      name: bitbucket-nodeinfo-pac-token-9pf75
    url: bitbucket.org # missing scheme and invalid API URL!
    webhook_secret: # no webhook URL was entered, see OCPBUGS-7035
      key: webhook.secret
      name: bitbucket-nodeinfo-pac-token-9pf75
  url: 'https://bitbucket.org/jerolimov/nodeinfo-pac'
The pipeline-as-code controller Pod log contains some error messages and no PipelineRun is created.
Expected results:
For GitLab:
apiVersion: pipelinesascode.tekton.dev/v1alpha1
kind: Repository
metadata:
  name: gitlab-nodeinfo-pac
spec:
  git_provider:
    secret:
      key: provider.token
      name: gitlab-nodeinfo-pac-token-gfr66
    url: https://gitlab.com
    webhook_secret:
      key: webhook.secret
      name: gitlab-nodeinfo-pac-token-gfr66
  url: 'https://gitlab.com/jerolimov/nodeinfo-pac'
Bitbucket:
A working example:
apiVersion: pipelinesascode.tekton.dev/v1alpha1
kind: Repository
metadata:
  name: bitbucket-nodeinfo-pac
spec:
  git_provider:
    user: jerolimov
    secret:
      key: provider.token
      name: bitbucket-nodeinfo-pac-token-9pf75
    webhook_secret:
      key: webhook.secret
      name: bitbucket-nodeinfo-pac-token-9pf75
  url: 'https://bitbucket.org/jerolimov/nodeinfo-pac'
A PipelineRun should be created for each push to the git repo.
Additional info:
Description of problem:
The PipelineRun default template name has been updated in the backend in Pipelines Operator 1.10, so we need to update the name in the UI code as well.
Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver/pull/33
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Seeing the literal `Secret {{newImageSecret}} was created.` string in the alert for the created image pull secret in the Container Image flow.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Navigate to the +Add page
2. Open the Container Image form
3. Click on the "Create an Image pull secret" link and create a secret
Actual results:
`Secret {{newImageSecret}} was created.` gets rendered in the alert
Expected results:
`Secret <secret name> was created.` should be rendered in the alert
Additional info:
Description of problem:
https://issues.redhat.com//browse/OCPBUGS-10342 tracked the issue when the number of replicas exceeded the number of hosts. However, it does not detect the case when the number of hosts exceeds the number of replicas as it was not counting the hosts correctly. Fix to detect this case correctly.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Set compute replicas in install-config.yaml
2. Add hosts in agent-config.yaml - 3 with role of master and more than 2 with role of worker.
3. The installation will fail and the following error can be seen in the journal: Jun 12 01:10:57 master-0 start-cluster-installation.sh[3879]: Hosts known and ready for cluster installation (5/3)
Actual results:
No warning regarding the number of configured hosts
Expected results:
A warning about the number of configured hosts not matching the replicas.
Additional info:
Description of problem:
On a hypershift cluster that has public certs for OAuth configured, the console reports a x509 certificate error when attempting to display a token
Version-Release number of selected component (if applicable):
4.12.z
How reproducible:
always
Steps to Reproduce:
1. Create a hosted cluster configured with a letsencrypt certificate for the oauth endpoint. 2. Go to the console of the hosted cluster. Click on the user icon and get token.
Actual results:
The console displays an oauth cert error
Expected results:
The token displays
Additional info:
The hcco reconciles the oauth cert into the console namespace. However, it is only reconciling the self-signed one and not the one that was configured through .spec.configuration.apiserver of the hostedcluster. It needs to detect the actual cert used for oauth and send that one.
Description of the problem:
BE 2.15.x: the API and Ingress VIP values have no validation for network/broadcast IPs (i.e. if the network is 192.168.123.0/24 --> 192.168.123.0 and 192.168.123.255).
How reproducible:
100%
Steps to reproduce:
1. Create cluster with Ingress or API vip with broadcast IP
2.
3.
Actual results:
Expected results:
BE should block those IPs
Description of problem:
Missing workload annotations from deployments. This is in relation to the openshift/platform-operator repo. Missing annotations: on the namespace, `workload.openshift.io/allowed: management`; on the workload, `target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'`. These annotations are required for the admission webhook to modify the resource for workload pinning (see the sketch below). Related enhancements: https://github.com/openshift/enhancements/pull/703 https://github.com/openshift/enhancements/pull/1213
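A minimal sketch of where the two annotations described above are expected, based only on this description; the namespace and deployment names are illustrative:

apiVersion: v1
kind: Namespace
metadata:
  name: openshift-platform-operators          # illustrative name
  annotations:
    workload.openshift.io/allowed: management # allows workload pinning in this namespace
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: platform-operators-controller         # illustrative name
spec:
  template:
    metadata:
      annotations:
        # picked up by the admission webhook to modify the pod for workload pinning
        target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'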
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
KCM crashes when the Topology cache's HasPopulatedHints method attempts concurrent map access. Miciah has started working on the upstream fix and we need to bring the changes into openshift/kubernetes as soon as we can. See https://redhat-internal.slack.com/archives/C01CQA76KMX/p1684876782205129 for more context.
Version-Release number of selected component (if applicable):
How reproducible:
CI 4.14 upgrade jobs run into this problem quite often: https://search.ci.openshift.org/?search=pkg%2Fcontroller%2Fendpointslice%2Ftopologycache%2Ftopologycache.go&maxAge=48h&context=1&type=bug%2Bissue%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Steps to Reproduce:
Actual results:
KCM crashing
Expected results:
KCM not crashing
Additional info:
We are pushing to find a resolution for OCPBUGS-11591 and the SDN team has identified a key message that appears related in the system journald logs:
Apr 12 11:53:51.395838 ci-op-xs3rnrtc-2d4c7-4mhm7-worker-b-dwc7w ovs-vswitchd[1124]: ovs|00002|timeval(urcu4)|WARN|Unreasonably long 109127ms poll interval (0ms user, 0ms system)
We should detect this in origin and create an interval so it can be charted in the timelines, as well as a unit test that fails if detected so we can see where it's happening.
Ovnkube-node container max memory usage was 110 MiB with 4.14.0-0.nightly-2023-05-18-231932 image and now it is 530 MiB with 4.14.0-0.nightly-2023-07-31-181848 image, for the same test (cluster-density-v2 with 800 iterations, churn=false) on 120 node environment. We observed the same pattern in the OVN-IC environment as well.
Note: As churn is false, we are calculating memory usage for only resource creation.
Grafana panel for OVN with 4.14.0-0.nightly-2023-05-18-231932 image -
https://grafana.rdu2.scalelab.redhat.com:3000/dashboard/snapshot/H9pAb07fsPEOFyd5dhKLFP602A7S18uC
Grafana panel for OVN with 4.14.0-0.nightly-2023-07-31-181848 image -
https://grafana.rdu2.scalelab.redhat.com:3000/dashboard/snapshot/8158bJgv3e4P2uiVernbc2E5ypBWFYHt
As the test was successfully run in the CI, we couldn't collect a must-gather. I can provide must-gather and pprof data if needed.
We observed 100 MiB to 550 MiB increase in OVN-IC between 4.14.0-0.nightly-2023-06-12-141936 and 4.14.0-0.nightly-2023-07-30-191504 versions.
OVN-IC 4.14.0-0.nightly-2023-06-12-141936
https://grafana.rdu2.scalelab.redhat.com:3000/dashboard/snapshot/o5SXLdHIL8whsdgaMyXwWamipBP8J2fF
OVN-IC 4.14.0-0.nightly-2023-07-30-191504
https://grafana.rdu2.scalelab.redhat.com:3000/dashboard/snapshot/NMuSQx7YAJ9jokoKMl6Me9StHp33tjwD
So that they can review and approve most observability UI changes that require console code changes.
Description of the problem:
When invoking installation with the assisted-service scripts (make deploy-all), as is done for installation in the PSI env, the pods for assisted-service and assisted-image-service produce warnings about a failing readiness probe:
Readiness probe failed: Get "http://172.28.8.39:8090/ready": dial tcp 172.28.8.39:8090: connect: connection refused
Those warnings are harmless, but they make people think that there is a problem with the running pods (or that they are not ready yet, even though the pods are marked as ready).
How reproducible:
100%
Steps to reproduce:
1. invoke make deploy-all on PSI or other places (for some reason it doesn't reproduce on minikube)
2. inspect the pod's conditions part with oc describe, and look for warnings
Actual results:
Warnings emitted
Expected results:
No warnings should be emitted for the initial setup time of each pod. The fix just requires setting initialDelaySeconds in the readinessProbe configuration, just like we did in the template: https://github.com/openshift/assisted-service/pull/4557
see also: https://github.com/openshift/assisted-service/pull/380#pullrequestreview-490308765
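A minimal sketch of the kind of change described above, assuming the assisted-service Deployment template; the /ready path and port 8090 come from the warning quoted earlier, and the delay value is illustrative:

readinessProbe:
  httpGet:
    path: /ready
    port: 8090
  initialDelaySeconds: 30   # give the service time to start before the first probe
  periodSeconds: 10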
Please review the following PR: https://github.com/openshift/machine-api-provider-gcp/pull/44
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
ODC automatically loads all Camel K Kamelets from openshift-operators namespace in order to display those resources in the event sources/sinks catalog. This is not working when the Camel K operator is installed in another namespace (e.g. in Developer Sandbox the Camel K operator had to be installed in camel-k-operator namespace)
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Display event sources/sinks catalog in ODC on a cluster where Camel K is installed in a namespace other than openshift-operators (e.g. Developer Sandbox)
Steps to Reproduce:
1. Make sure to have a cluster where Knative eventing is available 2. Install Camel K operator in camel-k-operator namespace (e.g. via OLM) 3. Display the event source/sink catalog in ODC
Actual results:
No Kamelets are visible in the catalog
Expected results:
All Kamelets (automatically installed with the operator) should be visible as potential event sources/sinks in the catalog
Additional info:
The Kamelet resources are being watched in two namespaces (the current user namespace and the global operator namespace): https://github.com/openshift/console/blob/master/frontend/packages/knative-plugin/src/hooks/useKameletsData.ts#L12-L28 We should allow configuration of the global namespace or also add the camel-k-operator namespace as a 3rd place to look for installed Kamelets (see the sketch below).
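A rough TypeScript sketch of the namespace handling suggested above; this is not the actual useKameletsData code, and the constant names and Kamelet shape are assumptions:

// Namespaces in which Kamelets could be looked up, per the suggestion above.
const GLOBAL_OPERATOR_NS = 'openshift-operators';
const CAMEL_K_OPERATOR_NS = 'camel-k-operator';

export const kameletNamespaces = (activeNamespace: string): string[] =>
  Array.from(new Set([activeNamespace, GLOBAL_OPERATOR_NS, CAMEL_K_OPERATOR_NS]));

// Merge the per-namespace watch results into a single catalog list.
type Kamelet = { metadata: { name: string; namespace: string } };
export const mergeKamelets = (byNamespace: Kamelet[][]): Kamelet[] => byNamespace.flat();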
This is a clone of issue OCPBUGS-19017. The following is the description of the original issue:
—
dnsmasq isn't starting on okd-scos in the bootstrap VM
logs show it failing with "Operation not permitted"
`useExtensions` is not available in the dynamic plugin SDK, which prevents this functionality being copied to `monitoring-plugin`. `useResolvedExtensions` is available and provides the same functionality so we should use that instead.
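A hedged TypeScript sketch of the switch suggested above; the extension type guard and the component are placeholders, not the real monitoring-plugin code:

import * as React from 'react';
import { useResolvedExtensions } from '@openshift-console/dynamic-plugin-sdk';
import { isDashboardCard } from './extensions'; // hypothetical type guard for the extension type in use

const DashboardCards: React.FC = () => {
  // extensions: the resolved extension objects; resolved: true once all CodeRefs have loaded.
  const [extensions, resolved] = useResolvedExtensions(isDashboardCard);
  if (!resolved) {
    return null;
  }
  return (
    <ul>
      {extensions.map((e) => (
        <li key={e.uid}>{e.type}</li>
      ))}
    </ul>
  );
};

export default DashboardCards;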
For static pod readiness we check /readyz and /healthz endpoints for kube-apiserver. For SNO exclude openshift-apiserver from the health checks using the 'exclude' query parameter
Example:
> oc get --raw '/readyz?verbose&exclude=api-openshift-apiserver-available'
Should we also remove 'oauth-apiserver'?
Description of problem:
No MachineSet is created for workers if replicas == 0
Version-Release number of selected component (if applicable):
4.14.0
How reproducible:
replicas: 0 in install-config for workers
Steps to Reproduce:
1. Deploy a cluster with 0 workers
2. After deployment, list MachineSets
3. Zero can be found
Actual results:
No MachineSet found: No resources found in openshift-machine-api namespace.
Expected results:
A worker MachineSet should have been created like before.
Additional info:
We broke it during CPMS integration.
Please review the following PR: https://github.com/openshift/cloud-provider-nutanix/pull/18
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When installing a cluster on IBM Cloud, the image registry defaults to Removed with no storage configured, after 4.13.0-ec.3. The image registry should use ibmcos object storage on an IPI IBM cluster: https://github.com/openshift/cluster-image-registry-operator/blob/master/pkg/storage/storage.go#L182
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2023-02-27-101545
How reproducible:
always
Steps to Reproduce:
1. Install an IPI cluster on IBM Cloud
2. Check the image registry after a successful install
3.
Actual results:
oc get config.image/cluster -o yaml
spec:
  logLevel: Normal
  managementState: Removed
  observedConfig: null
  operatorLogLevel: Normal
  proxy: {}
  replicas: 1
  requests:
    read:
      maxWaitInQueue: 0s
    write:
      maxWaitInQueue: 0s
  rolloutStrategy: RollingUpdate
  storage: {}
  unsupportedConfigOverrides: null

oc get infrastructure cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  creationTimestamp: "2023-03-02T02:21:06Z"
  generation: 1
  name: cluster
  resourceVersion: "531"
  uid: 8d61a1e2-3852-40a2-bf5d-b7f9c92cda7b
spec:
  cloudConfig:
    key: config
    name: cloud-provider-config
  platformSpec:
    type: IBMCloud
status:
  apiServerInternalURI: https://api-int.wxjibm32.ibmcloud.qe.devcluster.openshift.com:6443
  apiServerURL: https://api.wxjibm32.ibmcloud.qe.devcluster.openshift.com:6443
  controlPlaneTopology: HighlyAvailable
  etcdDiscoveryDomain: ""
  infrastructureName: wxjibm32-lmqh7
  infrastructureTopology: HighlyAvailable
  platform: IBMCloud
  platformStatus:
    ibmcloud:
      cisInstanceCRN: 'crn:v1:bluemix:public:internet-svcs:global:a/fdc2e14cf8bc4d53a67f972dc2e2c861:e8ee6ca1-4b31-4307-8190-e67f6925f83b::'
      location: eu-gb
      providerType: VPC
      resourceGroupName: wxjibm32-lmqh7
    type: IBMCloud
Expected results:
Image registry should use ibmcos object storage on IPI-IBM cluster
Additional info:
Must-gather log https://drive.google.com/file/d/1N-WUOZLRjlXcZI0t2O6MXsxwnsVPDCGQ/view?usp=share_link
Description of the problem:
When patching the platform and leaving UMN unchanged, the logs show "false" instead of nil, making it look as if the cluster is being put into an invalid state (e.g. none + UMN disabled).
time="2023-06-15T09:59:54Z" level=info msg="Platform verification completed, setting platform type to none and user-managed-networking to false" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).validateUpdateCluster" file="/assisted-service/internal/bminventory/inventory.go:1928" cluster_id=468bffe8-ce24-400e-a104-b0aab378eb75 go-id=94310 pkg=Inventory request_id=2fbb74ba-4390-4f27-b6fd-ee11ac1a7895
Steps to reproduce:
1. Create cluster with platform == OCI or vSphere with UMN enabled
2. Patch the cluster with "{"platform": {"type": "none"}}"
Actual results:
Log shows
setting platform type to none and user-managed-networking to false
Expected results:
setting platform type to none and user-managed-networking to nil
aws-ebs-csi-driver-controller-ca ServiceAccount does not include the HCP pull-secret in its imagePullSecrets. Thus, if a HostedCluster is created with a `pullSecret` that contains creds that the management cluster pull secret does not have, the image pull fails.
CI is flaky because tests pull the "openshift/origin-node" image from Docker Hub and get rate-limited:
E0803 20:44:32.429877 2066 kuberuntime_image.go:53] "Failed to pull image" err="rpc error: code = Unknown desc = reading manifest latest in docker.io/openshift/origin-node: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit" image="openshift/origin-node:latest"
This particular failure comes from https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-ingress-operator/929/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator/16871891662673059841687189166267305984. I don't know how to search for this failure using search.ci. I discovered the rate-limiting through Loki: https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%7B%22datasource%22:%22PCEB727DF2F34084E%22,%22queries%22:%5B%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fpull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator%2F1687189166267305984%5C%22%7D%20%7C%20unpack%20%7C~%20%5C%22pull%20rate%20limit%5C%22%22,%22refId%22:%22A%22,%22editorMode%22:%22code%22,%22queryType%22:%22range%22%7D%5D,%22range%22:%7B%22from%22:%221691086303449%22,%22to%22:%221691122303451%22%7D%7D.
This happened on 4.14 CI job.
I have observed this once so far, but it is quite obscure.
1. Post a PR and have bad luck.
2. Check Loki using the following query:
{...} {invoker="openshift-internal-ci/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator/*"} | unpack | systemd_unit="kubelet.service" |~ "pull rate limit"
CI pulls from Docker Hub and fails.
CI passes, or fails on some other test failure. CI should never pull from Docker Hub.
We have been using the "openshift/origin-node" image in multiple tests for years. I have no idea why it is suddenly pulling from Docker Hub, or how we failed to notice that it was pulling from Docker Hub if that's what it was doing all along.
Description of problem:
[CSI Inline Volume admission plugin] When using deployment/statefulset/daemonset workloads with an inline volume, audit logs/warnings are not recorded correctly
Version-Release number of selected component (if applicable):
4.13.0-0.ci.test-2023-03-02-013814-ci-ln-yd4m4st-latest (nightly build also could be reproduced)
How reproducible:
Always
Steps to Reproduce:
1. Enable feature gate to auto install the csi.sharedresource csi driver 2. Add security.openshift.io/csi-ephemeral-volume-profile: privileged to CSIDriver 'csi.sharedresource.openshift.io' # scale down the cvo,cso and shared-resource-csi-driver-operator $ oc scale --replicas=0 deploy/cluster-version-operator -n openshift-cluster-version deployment.apps/cluster-version-operator scaled $oc scale --replicas=0 deploy/cluster-storage-operator -n openshift-cluster-storage-operator deployment.apps/cluster-storage-operator scaled $ oc scale --replicas=0 deploy/shared-resource-csi-driver-operator -n openshift-cluster-csi-drivers deployment.apps/shared-resource-csi-driver-operator scaled # Add security.openshift.io/csi-ephemeral-volume-profile: privileged to CSIDriver $ oc get csidriver/csi.sharedresource.openshift.io -o yaml apiVersion: storage.k8s.io/v1 kind: CSIDriver metadata: annotations: csi.openshift.io/managed: "true" operator.openshift.io/spec-hash: 4fc61ff54015a7e91e07b93ac8e64f46983a59b4b296344948f72187e3318b33 creationTimestamp: "2022-10-26T08:10:23Z" labels: security.openshift.io/csi-ephemeral-volume-profile: privileged 3. Create different workloads with inline volume in a restricted namespace $ oc apply -f examples/simple role.rbac.authorization.k8s.io/shared-resource-my-share-pod created rolebinding.rbac.authorization.k8s.io/shared-resource-my-share-pod created configmap/my-config created sharedconfigmap.sharedresource.openshift.io/my-share-pod created Error from server (Forbidden): error when creating "examples/simple/03-pod.yaml": pods "my-csi-app-pod" is forbidden: admission denied: pod my-csi-app-pod uses an inline volume provided by CSIDriver csi.sharedresource.openshift.io and namespace my-csi-app-namespace has a pod security enforce level that is lower than privileged Error from server (Forbidden): error when creating "examples/simple/04-deployment.yaml": deployments.apps "mydeployment" is forbidden: admission denied: pod uses an inline volume provided by CSIDriver csi.sharedresource.openshift.io and namespace my-csi-app-namespace has a pod security enforce level that is lower than privileged Error from server (Forbidden): error when creating "examples/simple/05-statefulset.yaml": statefulsets.apps "my-sts" is forbidden: admission denied: pod uses an inline volume provided by CSIDriver csi.sharedresource.openshift.io and namespace my-csi-app-namespace has a pod security enforce level that is lower than privileged 4. 
Add enforce: privileged label to the test ns and create different workloads with inline volume again $ oc label ns/my-csi-app-namespace security.openshift.io/scc.podSecurityLabelSync=false pod-security.kubernetes.io/enforce=privileged pod-security.kubernetes.io/audit=restricted pod-security.kubernetes.io/warn=restricted --overwrite namespace/my-csi-app-namespace labeled $ oc apply -f examples/simple role.rbac.authorization.k8s.io/shared-resource-my-share-pod created rolebinding.rbac.authorization.k8s.io/shared-resource-my-share-pod created configmap/my-config created sharedconfigmap.sharedresource.openshift.io/my-share-pod created Warning: pod my-csi-app-pod uses an inline volume provided by CSIDriver csi.sharedresource.openshift.io and namespace my-csi-app-namespace has a pod security warn level that is lower than privileged pod/my-csi-app-pod created Warning: pod uses an inline volume provided by CSIDriver csi.sharedresource.openshift.io and namespace my-csi-app-namespace has a pod security warn level that is lower than privileged deployment.apps/mydeployment created daemonset.apps/my-ds created statefulset.apps/my-sts created $ oc get po NAME READY STATUS RESTARTS AGE my-csi-app-pod 1/1 Running 0 34s my-ds-cw4k7 1/1 Running 0 32s my-ds-sv9vp 1/1 Running 0 32s my-ds-v7f9m 1/1 Running 0 32s my-sts-0 1/1 Running 0 31s mydeployment-664cd95cb4-4s2cd 1/1 Running 0 33s 5. Check the api-server audit logs $ oc adm node-logs ip-10-0-211-240.us-east-2.compute.internal --path=kube-apiserver/audit.log | grep 'uses an inline volume provided by'| tail -1 | jq . | grep 'CSIInlineVolumeSecurity' "storage.openshift.io/CSIInlineVolumeSecurity": "pod uses an inline volume provided by CSIDriver csi.sharedresource.openshift.io and namespace my-csi-app-namespace has a pod security audit level that is lower than privileged"
Actual results:
In steps 3 and 4: for deployment workloads, the pod name in the warning is empty; for statefulset/daemonset workloads, the warning is not displayed at all. In step 5: the pod name in the audit logs is empty.
Expected results:
In steps 3 and 4: for deployment workloads, the pod name in the warning should be present; for statefulset/daemonset workloads, the warning should be displayed. In step 5: the pod name in the audit logs shouldn't be empty; it should record the workload type and the specific pod names.
Additional info:
Testdata: https://github.com/Phaow/csi-driver-shared-resource/tree/test-inlinevolume/examples/simple
Description of problem:
When running a cluster on application credentials, this event appears repeatedly: ns/openshift-machine-api machineset/nhydri0d-f8dcc-kzcwf-worker-0 hmsg/173228e527 - pathological/true reason/ReconcileError could not find information for "ci.m1.xlarge"
Version-Release number of selected component (if applicable):
How reproducible:
Happens in the CI (https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/33330/rehearse-33330-periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.13-e2e-openstack-ovn-serial/1633149670878351360).
Steps to Reproduce:
1. On a living cluster, rotate the OpenStack cloud credentials
2. Invalidate the previous credentials
3. Watch the machine-api events (`oc -n openshift-machine-api get event`). A `Warning` type of issue could not find information for "name-of-the-flavour" will appear.
If the cluster was installed using a password that you can't invalidate:
1. Rotate the cloud credentials to application credentials
2. Restart MAPO (`oc -n openshift-machine-api get pods -o NAME | xargs -r oc -n openshift-machine-api delete`)
3. Rotate cloud credentials again
4. Revoke the first application credentials you set
5. Finally watch the events (`oc -n openshift-machine-api get event`)
The event signals that MAPO wasn't able to update flavour information on the MachineSet status.
Actual results:
Expected results:
No issue detecting the flavour details
Additional info:
Offending code likely around this line: https://github.com/openshift/machine-api-provider-openstack/blob/bcb08a7835c08d20606d75757228fd03fbb20dab/pkg/machineset/controller.go#L116
Currently the assisted installer adds to the ISO a dracut hook that is executed early during the boot process. That hook generates the NetworkManager configuration files that will be used during the boot and also once the machine is installed. But that hook is not guaranteed to run before NetworkManager, and the files it generates may not be loaded by NetworkManager at the right time. We have seen such issues in the recent upgrade from RHEL 8 to RHEL 9 that is part of OpenShift 4.13. The RCHOS team recommends replacing it with a systemd unit that runs before NetworkManager.
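A minimal sketch of what such a systemd unit could look like, assuming an illustrative unit description and script path (this is not the actual assisted-installer change):

[Unit]
Description=Generate NetworkManager configuration from assisted-installer data
Before=NetworkManager.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/local/bin/pre-network-manager-config.sh

[Install]
WantedBy=multi-user.target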
Please review the following PR: https://github.com/openshift/cloud-provider-gcp/pull/29
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/machine-api-provider-azure/pull/53
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When creating a machine with Azure Ultra Disks attached as data disks in an Arm cluster, the machine reports Provisioned, but in the Azure web console the instance has failed with error ZonalAllocationFailed.
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-arm64-2023-03-22-204044
How reproducible:
Always
Steps to Reproduce:
/// Not Needed up to point 6 //// 1. Make sure storagecluster is already present kind: StorageClass apiVersion: storage.k8s.io/v1 metadata: name: ultra-disk-sc provisioner: disk.csi.azure.com # replace with "kubernetes.io/azure-disk" if aks version is less than 1.21 volumeBindingMode: WaitForFirstConsumer # optional, but recommended if you want to wait until the pod that will use this disk is created parameters: skuname: UltraSSD_LRS kind: managed cachingMode: None diskIopsReadWrite: "2000" # minimum value: 2 IOPS/GiB diskMbpsReadWrite: "320" # minimum value: 0.032/GiB 2. Create a new custom secret using the worker-data-secret $ oc -n openshift-machine-api get secret worker-user-data --template='{{index .data.userData | base64decode}}' | jq > userData.txt 3. Edit the userData.txt by adding below part just before the ending '}' and add a comma "storage": { "disks": [ { "device": "/dev/disk/azure/scsi1/lun0", "partitions": [ { "label": "lun0p1", "sizeMiB": 1024, "startMiB": 0 } ] } ], "filesystems": [ { "device": "/dev/disk/by-partlabel/lun0p1", "format": "xfs", "path": "/var/lib/lun0p1" } ] }, "systemd": { "units": [ { "contents": "[Unit]\nBefore=local-fs.target\n[Mount]\nWhere=/var/lib/lun0p1\nWhat=/dev/disk/by-partlabel/lun0p1\nOptions=defaults,pquota\n[Install]\nWantedBy=local-fs.target\n", "enabled": true, "name": "var-lib-lun0p1.mount" } ] } 4. Extract the disabling template value using below $ oc -n openshift-machine-api get secret worker-user-data --template='{{index .data.disableTemplating | base64decode}}' | jq > disableTemplating.txt 5. Merge the two files to create a datasecret file to be used $ oc -n openshift-machine-api create secret generic worker-user-data-x5 --from-file=userData=userData.txt --from-file=disableTemplating=disableTemplating.txt /// Not needed up to here /// 6.modify the new machineset yaml with below datadisk being seperate field as the osDisks dataDisks: - nameSuffix: ultrassd lun: 0 diskSizeGB: 4 # The same issue on the machine status fields is reproducible on x86_64 by setting 65535 to overcome the maximum limits of the Azure accounts we use. cachingType: None deletionPolicy: Delete managedDisk: storageAccountType: UltraSSD_LRS 7. scale up machineset or delete an existing machine to force the reprovisioning.
Actual results:
Machine stuck in Provisoned phase, but check from azure, it failed $ oc get machine -o wide NAME PHASE TYPE REGION ZONE AGE NODE PROVIDERID STATE zhsunaz3231-lds8h-master-0 Running Standard_D8ps_v5 centralus 1 4h15m zhsunaz3231-lds8h-master-0 azure:///subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/zhsunaz3231-lds8h-rg/providers/Microsoft.Compute/virtualMachines/zhsunaz3231-lds8h-master-0 Running zhsunaz3231-lds8h-master-1 Running Standard_D8ps_v5 centralus 2 4h15m zhsunaz3231-lds8h-master-1 azure:///subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/zhsunaz3231-lds8h-rg/providers/Microsoft.Compute/virtualMachines/zhsunaz3231-lds8h-master-1 Running zhsunaz3231-lds8h-master-2 Running Standard_D8ps_v5 centralus 3 4h15m zhsunaz3231-lds8h-master-2 azure:///subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/zhsunaz3231-lds8h-rg/providers/Microsoft.Compute/virtualMachines/zhsunaz3231-lds8h-master-2 Running zhsunaz3231-lds8h-worker-centralus1-sfhs7 Provisioned Standard_D4ps_v5 centralus 1 3m23s azure:///subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/zhsunaz3231-lds8h-rg/providers/Microsoft.Compute/virtualMachines/zhsunaz3231-lds8h-worker-centralus1-sfhs7 Creating $ oc get machine zhsunaz3231-lds8h-worker-centralus1-sfhs7 -o yaml - lastTransitionTime: "2023-03-23T06:07:32Z" message: 'Failed to check if machine exists: vm for machine zhsunaz3231-lds8h-worker-centralus1-sfhs7 exists, but has unexpected ''Failed'' provisioning state' reason: ErrorCheckingProvider status: Unknown type: InstanceExists - lastTransitionTime: "2023-03-23T06:07:05Z" status: "True" type: Terminable lastUpdated: "2023-03-23T06:07:32Z" phase: Provisioned
Expected results:
Machine should be failed if failed in azure
Additional info:
must-gather: https://drive.google.com/file/d/1z1gyJg4NBT8JK2-aGvQCruJidDHs0DV6/view?usp=sharing
Tests were temporarily disabled by https://issues.redhat.com//browse/OCPBUGS-14964
All Alertmanager config page UI tests should be running again in CI.
Description of the problem:
Staging: the ignition override test was passing successfully before; it looks like in the latest code the returned API exception code changed to 500 (internal server error).
Before that, we got a 400 API error code.
(Pdb++) cluster.patch_discovery_ignition(ignition=ignition_override) 'image_type': None, 'kernel_arguments': None, 'proxy': None, 'pull_secret': None, 'ssh_authorized_key': None, 'static_network_config': None} (/home/benny/assisted-test-infra/src/service_client/assisted_service_api.py:169) *** assisted_service_client.rest.ApiException: (500) Reason: Internal Server Error HTTP response headers: HTTPHeaderDict({'content-type': 'application/json', 'vary': 'Accept-Encoding,Origin', 'date': 'Sun, 11 Jun 2023 04:26:53 GMT', 'content-length': '141', 'x-envoy-upstream-service-time': '1538', 'server': 'envoy', 'set-cookie': 'bd0de3dae0f495ebdb32e3693e2b9100=de3a34d29f1e78d0c404b6c5e84b502b; path=/; HttpOnly; Secure; SameSite=None'}) HTTP response body: {"code":"500","href":"","id":500,"kind":"Error","reason":"The ignition archive size (365 KiB) is over the maximum allowable size (256 KiB)"} Traceback (most recent call last): File "/home/benny/assisted-test-infra/src/assisted_test_infra/test_infra/helper_classes/cluster.py", line 501, in patch_discovery_ignition self._infra_env.patch_discovery_ignition(ignition_info=ignition) File "/home/benny/assisted-test-infra/src/assisted_test_infra/test_infra/helper_classes/infra_env.py", line 116, in patch_discovery_ignition self.api_client.patch_discovery_ignition(infra_env_id=self.id, ignition_info=ignition_info) File "/home/benny/assisted-test-infra/src/service_client/assisted_service_api.py", line 407, in patch_discovery_ignition self.update_infra_env(infra_env_id=infra_env_id, infra_env_update_params=infra_env_update_params) File "/home/benny/assisted-test-infra/src/service_client/assisted_service_api.py", line 170, in update_infra_env self.client.update_infra_env(infra_env_id=infra_env_id, infra_env_update_params=infra_env_update_params) File "/root/.pyenv/versions/3.11.0/lib/python3.11/site-packages/assisted_service_client/api/installer_api.py", line 1696, in update_infra_env (data) = self.update_infra_env_with_http_info(infra_env_id, infra_env_update_params, **kwargs) # noqa: E501 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/root/.pyenv/versions/3.11.0/lib/python3.11/site-packages/assisted_service_client/api/installer_api.py", line 1767, in update_infra_env_with_http_info return self.api_client.call_api( ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/root/.pyenv/versions/3.11.0/lib/python3.11/site-packages/assisted_service_client/api_client.py", line 325, in call_api return self.__call_api(resource_path, method, ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/root/.pyenv/versions/3.11.0/lib/python3.11/site-packages/assisted_service_client/api_client.py", line 157, in __call_api response_data = self.request( ^^^^^^^^^^^^^ File "/root/.pyenv/versions/3.11.0/lib/python3.11/site-packages/assisted_service_client/api_client.py", line 383, in request return self.rest_client.PATCH(url, ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/root/.pyenv/versions/3.11.0/lib/python3.11/site-packages/assisted_service_client/rest.py", line 289, in PATCH return self.request("PATCH", url, ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/root/.pyenv/versions/3.11.0/lib/python3.11/site-packages/assisted_service_client/rest.py", line 228, in request raise ApiException(http_resp=r) (Pdb++)
How reproducible:
Always
Steps to reproduce:
Run test:
test_discovery_ignition_exceed_size_limit
Actual results:
Returns error 500
Expected results:
error 400
Please review the following PR: https://github.com/openshift/telemeter/pull/452
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The upgrade to 4.14.0-ec.2 from 4.14.0-ec.1 was blocked by the error message on the UI: Could not update rolebinding "openshift-monitoring/cluster-monitoring-operator-techpreview-only" (531 of 993): the object is invalid, possibly due to local cluster configuration
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Unblocked by oc --context build02 delete rolebinding cluster-monitoring-operator-techpreview-only -n openshift-monitoring --as system:admin rolebinding.rbac.authorization.k8s.io "cluster-monitoring-operator-techpreview-only" deleted
Description of problem:
Some of the components in the Console Dynamic Plugin SDK take the `GroupVersionKind` type, which is a string, for the `groupVersionKind` prop; instead they should use the new `K8sGroupVersionKind` object.
Version-Release number of selected component (if applicable):
How reproducible:
always
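For illustration only, a sketch of passing the object form of the prop to one SDK component; this assumes the `ResourceLink` component and the `K8sGroupVersionKind` type exported by the dynamic plugin SDK, with illustrative resource names:

import * as React from 'react';
import { ResourceLink, K8sGroupVersionKind } from '@openshift-console/dynamic-plugin-sdk';

// Object form instead of a kind / "group~version~kind" string.
const deploymentGVK: K8sGroupVersionKind = { group: 'apps', version: 'v1', kind: 'Deployment' };

const Example: React.FC = () => (
  <ResourceLink groupVersionKind={deploymentGVK} name="my-deployment" namespace="default" />
);

export default Example;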
Please review the following PR: https://github.com/openshift/kubernetes-kube-storage-version-migrator/pull/192
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The agent-config-template creation command gives no INFO log in the output; however, it generates the file.
Version-Release number of selected component (if applicable):
v4.13
How reproducible:
$ openshift-install agent create agent-config-template --dir=./foo
Steps to Reproduce:
1. 2. 3.
Actual results:
$ openshift-install agent create agent-config-template --dir=./foo INFO
Expected results:
Additional info:
$ openshift-install agent create agent-config-template --dir=./foo INFO Created Agent Config Template in . directory
Description of problem:
On the openshift/console master branch, a devfile import fails by default. I have noticed that when a repository URL has a .git extension, the pod fails due to a bug where the container image is trying to pull from Docker Hub rather than the OpenShift image registry. For example, the container image is Image: devfile-sample-code-with-quarkus.git:latest but the image from the imagestreamtag is image-registry.openshift-image-registry.svc:5000/maysun/devfile-sample-code-with-quarkus.git@sha256:e6aa9d29be48b33024eb271665d11a7557c9f140c9bd58aeb19fe4570fffb421. A pod describe shows the expected error "Failed to pull image "devfile-sample-code-with-quarkus.git:latest": rpc error: code = Unknown desc = reading manifest latest in docker.io/library/devfile-sample-code-with-quarkus.git: requested access to the resource is denied". However, during import, if you were to remove the .git extension from the repository link, the import is successful. I only see this on the master branch and it seems to be fine on my local crc, which is on OpenShift version 4.13.0.
Version-Release number of selected component (if applicable):
4.13.z
How reproducible:
Always
Steps to Reproduce:
1. Build from openshift/console master
2. Import Devfile sample
3. If repo has a .git extension, pod fails with the wrong image
Actual results:
POD describe: Failed to pull image "devfile-sample-code-with-quarkus.git:latest": rpc error: code = Unknown desc = reading manifest latest in docker.io/library/devfile-sample-code-with-quarkus.git: requested access to the resource is denied
Expected results:
Successful running pod
Additional info:
Fine on OpenShift 4.13.0, tested on local crc:
$ crc version
WARN A new version (2.23.0) has been published on https://developers.redhat.com/content-gateway/file/pub/openshift-v4/clients/crc/2.23.0/crc-macos-installer.pkg
CRC version: 2.20.0+f3a947
OpenShift version: 4.13.0
Podman version: 4.4.4
This is a clone of issue OCPBUGS-5969. The following is the description of the original issue:
—
Description of problem:
Nutanix machine without enough memory stuck in Provisioning and machineset scale/delete cannot work
Version-Release number of selected component (if applicable):
Server Version: 4.12.0 4.13.0-0.nightly-2023-01-17-152326
How reproducible:
Always
Steps to Reproduce:
1. Install Nutanix Cluster Template https://gitlab.cee.redhat.com/aosqe/flexy-templates/-/tree/master/functionality-testing/aos-4_12/ipi-on-nutanix//versioned-installer with
   master_num_memory: 32768
   worker_num_memory: 16384
   networkType: "OVNKubernetes"
   installer_payload_image: quay.io/openshift-release-dev/ocp-release:4.12.0-x86_64
2.
3. Scale up the cluster worker machineset from 2 replicas to 40 replicas
4. Install an Infra machineset with 3 replicas, and a Workload machineset with 1 replica. Refer to this doc https://docs.openshift.com/container-platform/4.11/machine_management/creating-infrastructure-machinesets.html#machineset-yaml-nutanix_creating-infrastructure-machinesets and configure the following resources:
   VCPU=16
   MEMORYMB=65536
   MEMORYSIZE=64Gi
Actual results:
1. The new infra machines stuck in 'Provisioning' status for about 3 hours. % oc get machines -A | grep Prov openshift-machine-api qili-nut-big-jh468-infra-48mdt Provisioning 175m openshift-machine-api qili-nut-big-jh468-infra-jnznv Provisioning 175m openshift-machine-api qili-nut-big-jh468-infra-xp7xb Provisioning 175m 2. Checking the Nutanix web console, I found infra machine 'qili-nut-big-jh468-infra-jnznv' had the following msg " No host has enough available memory for VM qili-nut-big-jh468-infra-48mdt (8d7eb6d6-a71e-4943-943a-397596f30db2) that uses 4 vCPUs and 65536MB of memory. You could try downsizing the VM, increasing host memory, power off some VMs, or moving the VM to a different host. Maximum allowable VM size is approximately 17921 MB " infra machine 'qili-nut-big-jh468-infra-jnznv' is not round infra machine 'qili-nut-big-jh468-infra-xp7xb' is in green without warning. But In must gather I found some error: 03:23:49openshift-machine-apinutanixcontrollerqili-nut-big-jh468-infra-xp7xbFailedCreateqili-nut-big-jh468-infra-xp7xb: reconciler failed to Create machine: failed to update machine with vm state: qili-nut-big-jh468-infra-xp7xb: failed to get node qili-nut-big-jh468-infra-xp7xb: Node "qili-nut-big-jh468-infra-xp7xb" not found 3. Scale down the worker machineset from 40 replicas to 30 replicas can not work. Still have 40 Running worker machines and 40 Ready nodes after about 3 hours. % oc get machinesets -A NAMESPACE NAME DESIRED CURRENT READY AVAILABLE AGE openshift-machine-api qili-nut-big-jh468-infra 3 3 176m openshift-machine-api qili-nut-big-jh468-worker 30 30 30 30 5h1m openshift-machine-api qili-nut-big-jh468-workload 1 1 176m % oc get machines -A | grep worker| grep Running -c 40 % oc get nodes | grep worker | grep Ready -c 40 4. I delete the infra machineset, but the machines still in Provisioning status and won't get deleted % oc delete machineset -n openshift-machine-api qili-nut-big-jh468-infra machineset.machine.openshift.io "qili-nut-big-jh468-infra" deleted % oc get machinesets -A NAMESPACE NAME DESIRED CURRENT READY AVAILABLE AGE openshift-machine-api qili-nut-big-jh468-worker 30 30 30 30 5h26m openshift-machine-api qili-nut-big-jh468-workload 1 1 3h21m % oc get machines -A | grep -v Running NAMESPACE NAME PHASE TYPE REGION ZONE AGE openshift-machine-api qili-nut-big-jh468-infra-48mdt Provisioning 3h22m openshift-machine-api qili-nut-big-jh468-infra-jnznv Provisioning 3h22m openshift-machine-api qili-nut-big-jh468-infra-xp7xb Provisioning 3h22m openshift-machine-api qili-nut-big-jh468-workload-qdkvd 3h22m
Expected results:
The new infra machines should be either Running or Failed. Cluster worker machineset scale up and scale down should not be impacted.
Additional info:
must-gather download url will be added to the comment.
Description of problem:
On an SNO node one of the CatalogSources gets deleted after multiple reboots.
In the initial stage we have 2 catalogsources:
$ oc get catsrc -A
NAMESPACE NAME DISPLAY TYPE PUBLISHER AGE
openshift-marketplace certified-operators Intel SRIOV-FEC Operator grpc Red Hat 20h
openshift-marketplace redhat-operators Red Hat Operators Catalog grpc Red Hat 18h
After running several node reboots, one of the catalogsources no longer shows up:
$ oc get catsrc -A
NAMESPACE NAME DISPLAY TYPE PUBLISHER AGE
openshift-marketplace certified-operators Intel SRIOV-FEC Operator grpc Red Hat 21h
Version-Release number of selected component (if applicable):
4.11.0-fc.3
How reproducible:
Inconsistent but reproducible
Steps to Reproduce:
1. Deploy and configure SNO node via ZTP process. Configuration sets up 2 CatalogSources in a restricted environment for redhat-operators and certified-operators
2. Reboot the node via `sudo reboot` several times
3. Check catalogsources
Actual results:
$ oc get catsrc -A
NAMESPACE NAME DISPLAY TYPE PUBLISHER AGE
openshift-marketplace certified-operators Intel SRIOV-FEC Operator grpc Red Hat 22h
Expected results:
All catalogsources created initially are still present.
Additional info:
Attaching must-gather.
Description of problem:
Users cannot install single-node-openshift if the hostname contains the word etcd
Version-Release number of selected component (if applicable):
Probably since 4.8
How reproducible:
100%
Steps to Reproduce:
1. Install SNO with either Assisted or BIP 2. Make sure node hostname is etcd-1 (e.g. via DHCP hostname)
Actual results:
Bootstrap phase never ends
Expected results:
Bootstrap phase should complete successfully
Additional info:
This code is the likely culprit - it uses a naive way to check if etcd is running, accidentally capturing the node name (which contains etcd) in the crictl output as "evidence" that etcd is still running, so it never completes.
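The described failure mode, sketched in shell for illustration only; this is not the actual bootstrap code, and the --name filter is just one possible way to tighten the check:

# Naive check: any occurrence of "etcd" in the crictl output counts as "etcd still running",
# so the node name "etcd-1" appearing in pod names keeps this loop spinning forever.
while sudo crictl ps -a | grep -q etcd; do
  sleep 5
done

# A stricter check would match only the container name column, e.g.:
sudo crictl ps -a --name '^etcd$' -q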
See OCPBUGS-15826 (aka AITRIAGE-7677)
Description of problem:
CheckNodePerf is running on non-master nodes when the worker role label is not present.
Version-Release number of selected component (if applicable):
How reproducible:
In a VMware cluster, create an infra MCP and label a node as role:infra. vsphere-problem-detector-operator will produce CheckNodePerf alerts and logs like "CheckNodePerf: xxxxxx failed: master node has disk latency of greater than 100ms". https://docs.openshift.com/container-platform/4.10/machine_management/creating-infrastructure-machinesets.html#creating-infra-machines_creating-infrastructure-machinesets
Steps to Reproduce:
1. 2. 3.
Actual results:
CheckNodePerf: xxxxx failed: master node has disk latency of greater than 100ms
Expected results:
no log entry, and no alert
Additional info:
The code only considers the worker and master labels, and has very complex nesting of conditions: https://github.com/openshift/vsphere-problem-detector/blob/ca408db88a70cfa5aefa3128dff971a555994c29/pkg/check/node_perf.go#L133-L143
This will allow the installer to depend on just the client/api/models modules, and not pull in all of the dependencies of the service (such as libnmstate).
Regular sync with upstream source on metal3
Description of problem:
When deploying a disconnected cluster with the installer, the image-registry operator will fail to deploy because it cannot reach the COS endpoint.
Version-Release number of selected component (if applicable):
How reproducible:
Easily
Steps to Reproduce:
1. Deploy a disconnected cluster with the installer 2. Watch the image-registry operator, it will fail to deploy
Actual results:
image-registry operator doesn't deploy because the COS endpoint is unreachable.
Expected results:
image-registry operator should deploy
Additional info:
Fix identified.
Sanitize OWNERS/OWNER_ALIASES:
1) OWNERS must have:
component: "Storage / Kubernetes External Components"
2) OWNER_ALIASES must have all team members of Storage team.
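For illustration, the files described above could take roughly this shape; the alias name and members are placeholders, not the real team list:

# OWNERS
component: "Storage / Kubernetes External Components"
approvers:
  - storage-approvers   # alias defined in the aliases file
reviewers:
  - storage-approvers

# OWNERS_ALIASES
aliases:
  storage-approvers:
    - member-one
    - member-two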
This is a clone of issue OCPBUGS-18386. The following is the description of the original issue:
—
How reproducible:
Always
Steps to Reproduce:
1. The Kubernetes API introduces a new Pod template parameter (`ephemeral`)
2. This parameter is not in the allowed list of the default SCCs
3. Customers are not allowed to edit the default SCCs, nor do we have a mechanism in place to update the built-in SCCs AFAIK
4. Users of existing clusters cannot use the new parameter without creating manual SCCs and assigning those SCCs to service accounts themselves, which looks clunky. This is documented in https://access.redhat.com/articles/6967808
Actual results:
Users of existing clusters cannot use ephemeral volumes after an upgrade
Expected results:
Users of existing clusters *can* use ephemeral volumes after an upgrade
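For illustration, the manual workaround referenced above amounts to a custom SCC that allows the new volume type; the name is illustrative and the remaining fields would be copied from the built-in restricted SCC:

apiVersion: security.openshift.io/v1
kind: SecurityContextConstraints
metadata:
  name: restricted-with-ephemeral   # illustrative name
# ...other fields copied from the built-in restricted SCC...
volumes:
  - configMap
  - downwardAPI
  - emptyDir
  - persistentVolumeClaim
  - projected
  - secret
  - ephemeral   # the new generic ephemeral volume type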
Current status
Description of problem:
Deployment of a standard masters+workers cluster using 4.13.0-rc.6 does not configure the cgroup structure according to OCPNODE-1539
Version-Release number of selected component (if applicable):
OCP 4.13.0-rc.6
How reproducible:
Always
Steps to Reproduce:
1. Deploy the cluster
2. Check for presence of /sys/fs/cgroup/cpuset/system*
3. Check the status of cpu balancing of the root cpuset cgroup (should be disabled); see the sketch after these steps
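A sketch of how these checks could be run from a node debug shell, assuming the cgroup v1 paths mentioned above:

# Expected to exist per OCPNODE-1539 (system cpuset split out of the root cgroup)
ls -d /sys/fs/cgroup/cpuset/system*

# Expected to be 0 (CPU load balancing disabled) on the root cpuset cgroup
cat /sys/fs/cgroup/cpuset/cpuset.sched_load_balance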
Actual results:
No system cpuset exists and all services are still present in the root cgroup with cpu balancing enabled.
Expected results:
Additional info:
The code has a bug we missed. It is nested under the Workload partitioning check on line https://github.com/haircommander/cluster-node-tuning-operator/blob/123e26df30c66fd5c9836726bd3e4791dfd82309/pkg/performanceprofile/controller/performanceprofile/components/machineconfig/machineconfig.go#L251
This is a clone of issue OCPBUGS-18999. The following is the description of the original issue:
—
Description of problem:
Image pulls fail with http status 504, gateway timeout until image registry pods are restarted.
Version-Release number of selected component (if applicable):
4.13.12
How reproducible:
Intermittent
Steps to Reproduce:
1. 2. 3.
Actual results:
Images can't be pulled:
podman pull registry.ci.openshift.org/ci/applyconfig:latest
Trying to pull registry.ci.openshift.org/ci/applyconfig:latest...
Getting image source signatures
Error: reading signatures: downloading signatures for sha256:83c1b636069c3302f5ba5075ceeca5c4a271767900fee06b919efc3c8fa14984 in registry.ci.openshift.org/ci/applyconfig: received unexpected HTTP status: 504 Gateway Time-out
Image registry pods contain errors:
time="2023-09-01T02:25:39.596485238Z" level=warning msg="error authorizing context: access denied" go.version="go1.19.10 X:strictfipsruntime" http.request.host=registry.ci.openshift.org http.request.id=3e805818-515d-443f-8d9b-04667986611d http.request.method=GET http.request.remoteaddr=18.218.67.82 http.request.uri="/v2/ocp/4-dev-preview/manifests/sha256:caf073ce29232978c331d421c06ca5c2736ce5461962775fdd760b05fb2496a0" http.request.useragent="containers/5.24.1 (github.com/containers/image)" vars.name=ocp/4-dev-preview vars.reference="sha256:caf073ce29232978c331d421c06ca5c2736ce5461962775fdd760b05fb2496a0"
Expected results:
Image registry does not return gateway timeouts
Additional info:
Must gather(s) attached, additional information in linked OHSS ticket.
Please review the following PR: https://github.com/openshift/router/pull/455
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Unit test failing:
=== RUN   TestNewAppRunAll/app_generation_using_context_dir
    newapp_test.go:907: app generation using context dir: Error mismatch! Expected <nil>, got supplied context directory '2.0/test/rack-test-app' does not exist in 'https://github.com/openshift/sti-ruby'
--- FAIL: TestNewAppRunAll/app_generation_using_context_dir (0.61s)
Version-Release number of selected component (if applicable):
How reproducible:
100
Steps to Reproduce:
see for example https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_oc/1376/pull-ci-openshift-oc-master-images/1638172620648091648
Actual results:
unit tests fail
Expected results:
TestNewAppRunAll unit test should pass
Additional info:
Please review the following PR: https://github.com/openshift/prometheus-alertmanager/pull/70
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
This Jira is filed to track upstream issue (fix and backport) https://github.com/kubernetes-sigs/azuredisk-csi-driver/issues/1893
Version-Release number of selected component (if applicable):
4.14
Description of problem:
An un-privileged user with cluster-readers role cannot view NetworkAttachmentDefinition resource.
Version-Release number of selected component (if applicable):
oc Version: 4.10.0-202203141248.p0.g6db43e2.assembly.stream-6db43e2
OCP Version: 4.10.4
Kubernetes Version: v1.23.3+e419edf
ose-multus-cni:v4.1.0-7.155662231
How reproducible:
100%
Steps to Reproduce:
1. In an OCP cluster with multus installed - search which roles can view ("get") NetworkAttachmentDefinition resource, and see if "cluster-readers" role is part of this list, by running:
$ oc adm policy who-can get network-attachment-definitions | grep "cluster-reader"
Actual results:
Empty output
Expected results:
Non-empty output with "cluster-readers" in it, e.g. when running the same command for the Namespace resource:
$ oc adm policy who-can get namespace | grep "cluster-reader"
system:cluster-readers
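A sketch of the kind of fix commonly used for this, assuming the cluster-reader aggregation label; the ClusterRole name is illustrative:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: net-attach-def-cluster-reader   # illustrative name
  labels:
    # aggregates these rules into the cluster-reader role
    rbac.authorization.k8s.io/aggregate-to-cluster-reader: "true"
rules:
  - apiGroups: ["k8s.cni.cncf.io"]
    resources: ["network-attachment-definitions"]
    verbs: ["get", "list", "watch"]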
Description of problem:
After upgrading from OpenShift 4.13 to 4.14 with Kuryr network type, the network operator shows as Degraded and the cluster version reports that it's unable to apply the 4.14 update. The issue seems to be related to mtu settings, as indicated by the message: "Not applying unsafe configuration change: invalid configuration: [cannot change mtu for the Pods Network]."
Version-Release number of selected component (if applicable):
Upgrading from 4.13 to 4.14 4.14.0-0.nightly-2023-09-15-233408 Kuryr network type RHOS-17.1-RHEL-9-20230907.n.1
How reproducible:
Consistently reproducible on attempting to upgrade from 4.13 to 4.14.
Steps to Reproduce:
1. Install OpenShift version 4.13 on OpenStack.
2. Initiate an upgrade to OpenShift version 4.14.
Actual results:
The network operator shows as Degraded with the message:
network 4.13.13 True False True 13h Not applying unsafe configuration change: invalid configuration: [cannot change mtu for the Pods Network]. Use 'oc edit network.operator.openshift.io cluster' to undo the change.
Additionally, "oc get clusterversions" shows:
Unable to apply 4.14.0-0.nightly-2023-09-15-233408: wait has exceeded 40 minutes for these operators: network
Expected results:
The upgrade should complete successfully without any operator being degraded.
Additional info:
Some components remain at version 4.13.13 despite the upgrade attempt. Specifically, the dns, machine-config, and network operators are still at version 4.13.13. : $ oc get co NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE authentication 4.14.0-0.nightly-2023-09-15-233408 True False False 13h baremetal 4.14.0-0.nightly-2023-09-15-233408 True False False 13h cloud-controller-manager 4.14.0-0.nightly-2023-09-15-233408 True False False 13h cloud-credential 4.14.0-0.nightly-2023-09-15-233408 True False False 13h cluster-autoscaler 4.14.0-0.nightly-2023-09-15-233408 True False False 13h config-operator 4.14.0-0.nightly-2023-09-15-233408 True False False 13h console 4.14.0-0.nightly-2023-09-15-233408 True False False 13h control-plane-machine-set 4.14.0-0.nightly-2023-09-15-233408 True False False 13h csi-snapshot-controller 4.14.0-0.nightly-2023-09-15-233408 True False False 13h dns 4.13.13 True False False 13h etcd 4.14.0-0.nightly-2023-09-15-233408 True False False 13h image-registry 4.14.0-0.nightly-2023-09-15-233408 True False False 13h ingress 4.14.0-0.nightly-2023-09-15-233408 True False False 13h insights 4.14.0-0.nightly-2023-09-15-233408 True False False 13h kube-apiserver 4.14.0-0.nightly-2023-09-15-233408 True False False 13h kube-controller-manager 4.14.0-0.nightly-2023-09-15-233408 True False False 13h kube-scheduler 4.14.0-0.nightly-2023-09-15-233408 True False False 13h kube-storage-version-migrator 4.14.0-0.nightly-2023-09-15-233408 True False False 13h machine-api 4.14.0-0.nightly-2023-09-15-233408 True False False 13h machine-approver 4.14.0-0.nightly-2023-09-15-233408 True False False 13h machine-config 4.13.13 True False False 13h marketplace 4.14.0-0.nightly-2023-09-15-233408 True False False 13h monitoring 4.14.0-0.nightly-2023-09-15-233408 True False False 13h network 4.13.13 True False True 13h Not applying unsafe configuration change: invalid configuration: [cannot change mtu for the Pods Network]. Use 'oc edit network.operator.openshift.io cluster' to undo the change. node-tuning 4.14.0-0.nightly-2023-09-15-233408 True False False 12h openshift-apiserver 4.14.0-0.nightly-2023-09-15-233408 True False False 13h openshift-controller-manager 4.14.0-0.nightly-2023-09-15-233408 True False False 13h openshift-samples 4.14.0-0.nightly-2023-09-15-233408 True False False 12h operator-lifecycle-manager 4.14.0-0.nightly-2023-09-15-233408 True False False 13h operator-lifecycle-manager-catalog 4.14.0-0.nightly-2023-09-15-233408 True False False 13h operator-lifecycle-manager-packageserver 4.14.0-0.nightly-2023-09-15-233408 True False False 12h service-ca 4.14.0-0.nightly-2023-09-15-233408 True False False 13h storage 4.14.0-0.nightly-2023-09-15-233408 True False False 13h
Description of problem:
Updating the k* version to v0.27.2 in cluster samples operator for OCP 4.14 release
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
I get a synchronization error in a fully disconnected environment when I synchronize twice with the target mirror and there is no change/diff between the first and the second synchronization. The first synchronization works; on the second synchronization there is an error and exit code -1.
This case occurs when you want to synchronize your disconnected registry regularly and there is no change between two synchronizations.
This case is presented hereafter:
https://docs.openshift.com/container-platform/4.11/installing/disconnected_install/installing-mirroring-disconnected.html#oc-mirror-differential-updates_installing-mirroring-disconnected
In documentation we have:
"Like this, the desired mirror content can be declared in the imageset configuration file statically while the mirror jobs are executed regularly, for example as part of a cron job. This way, the mirror can be kept up to date in an automated fashion."
The main question is how to synchronize a fully disconnected registry regularly (with no change between synchronizations) without returning an error.
Version-Release number of selected component (if applicable):
oc-mirror 4.11
How reproducible:
Follow https://docs.openshift.com/container-platform/4.11/installing/disconnected_install/installing-mirroring-disconnected.html#mirroring-image-set-full and synchronize twice with the target mirror.
Steps to Reproduce:
1. oc-mirror --from=output-dir/mirror_seq1_000000.tar docker://quay-server.example.com/foo --dest-skip-tls
2. oc-mirror --from=output-dir/mirror_seq1_000000.tar docker://quay-server.example.com/foo --dest-skip-tls
Actual results:
oc-mirror --from=output-dir/mirror_seq1_000000.tar docker://quay-server.example.com/foo --dest-skip-tls
Checking push permissions for quay-server.example.com
Publishing image set from archive "output-dir/mirror_seq1_000000.tar" to registry "quay-server.example.com"
error: error during publishing, expecting imageset with prefix mirror_seq2: invalid mirror sequence order, want 2, got 1
=> return -1
Expected results:
oc-mirror --from=output-dir/mirror_seq1_000000.tar docker://quay-server.example.com/foo --dest-skip-tls
...
No diff from last synchronization, nothing to do
=> return 0
Additional info:
The error is triggered in pkg/cli/mirror/sequence.go:
+ default:
+   // Complete metadata checks
+   // UUID mismatch will now be seen as a new workspace.
+   klog.V(3).Info("Checking metadata sequence number")
+   currRun := current.PastMirror
+   incomingRun := incoming.PastMirror
+   if incomingRun.Sequence != (currRun.Sequence + 1) {
+     return &ErrInvalidSequence{currRun.Sequence + 1, incomingRun.Sequence}
+   }
The error handling in ./pkg/cli/mirror/mirror.go could treat this case as a warning ("no difference since last synchronization") and return 0 instead of -1.
}
case diskToMirror:
    dir, err := o.createResultsDir()
    if err != nil {
        return err
    }
    o.OutputDir = dir
    // Publish from disk to registry
    // this takes care of syncing the metadata to the
    // registry backends.
    mapping, err = o.Publish(ctx)
    if err != nil {
        serr := &ErrInvalidSequence{}
        if errors.As(err, &serr) {
            return fmt.Errorf("error during publishing, expecting imageset with prefix mirror_seq%d: %v", serr.wantSeq, err)
        }
        return err
    }
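A minimal, self-contained sketch of the suggested behavior, assuming the sequence numbers are available as integers; the ErrInvalidSequence type here mirrors the snippet above, but the no-op handling is illustrative, not oc-mirror's actual code:
```
package main

import "fmt"

// ErrInvalidSequence mirrors the error type used by oc-mirror's sequence
// check; the field names follow the snippet above but are illustrative here.
type ErrInvalidSequence struct {
	wantSeq, gotSeq int
}

func (e *ErrInvalidSequence) Error() string {
	return fmt.Sprintf("invalid mirror sequence order, want %d, got %d", e.wantSeq, e.gotSeq)
}

// checkSequence sketches the proposed behavior: republishing the imageset that
// matches the already-published sequence is a no-op, while a real gap in the
// sequence is still an error.
func checkSequence(current, incoming int) error {
	switch {
	case incoming == current:
		fmt.Println("no diff from last synchronization, nothing to do")
		return nil
	case incoming != current+1:
		return &ErrInvalidSequence{current + 1, incoming}
	}
	return nil
}

func main() {
	fmt.Println(checkSequence(1, 1)) // <nil>: re-publishing mirror_seq1 exits 0
	fmt.Println(checkSequence(1, 3)) // invalid mirror sequence order, want 2, got 3
}
```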
Description of problem:
OSSM Daily builds were updated to no longer support the spec.techPreview.controlPlaneMode field and OSSM will not create a SMCP as a result. The field needs to be updated to spec.mode. Gateway API enhanced dev preview is currently broken (currently using latest 2.4 daily build because 2.4 is unreleased). This should be resolved before OSSM 2.4 is GA.
Version-Release number of selected component (if applicable):
4.13
How reproducible:
100%
Steps to Reproduce:
1. Follow instructions in http://pastebin.test.redhat.com/1092754
Actual results:
CIO fails to create a SMCP "error": "failed to create ServiceMeshControlPlane openshift-ingress/openshift-gateway: admission webhook \"smcp.validation.maistra.io\" denied the request: the spec.techPreview.controlPlaneMode field is not supported in version 2.4+; use spec.mode"
Expected results:
CIO is able to create a SMCP
Additional info:
Description of the problem:
e2e-metal-assisted-day2-arm-workers-periodic job fails to install the day2 ARM worker because the service marks the setup as incompatible:
time="2023-04-04T12:03:37Z" level=error msg="cannot use arm64 architecture because it's not compatible on version of OpenShift" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).handlerClusterInfoOnRegisterInfraEnv" file="/assisted-service/internal/bminventory/inventory.go:4466" pkg=Inventory time="2023-04-04T12:03:37Z" level=error msg="Failed to register InfraEnv test-infra-infra-env-fd527e12 with id 3e21770d-d607-431c-967c-5f632bec0cfb. Error: cannot use arm64 architecture because it's not compatible on version of OpenShift" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).RegisterInfraEnvInternal.func1" file="/assisted-service/internal/bminventory/inventory.go:4528" cluster_id=3e21770d-d607-431c-967c-5f632bec0cfb go-id=235 pkg=Inventory request_id=f8dd7eeb-efa7-4828-a8c5-e1486a8bc1d2
How reproducible:
Run the job e2e-metal-assisted-day2-arm-workers which:
Steps to reproduce:
1.
2.
3.
Actual results:
The job fails to add the day2 worker and the assisted service log shows:
"Error: cannot use arm64 architecture because it's not compatible on version of OpenShift"
Expected results:
The installation of the day2 ARM worker succeeds without errors.
Elior Erez, I am assigning this ticket to you as it looks like it is linked to the feature support code; can you have a look?
Description of problem:
PRs were previously merged to add SC2S support via AWS SDK here: https://github.com/openshift/installer/pull/5710 https://github.com/openshift/installer/pull/5597 https://github.com/openshift/cluster-ingress-operator/pull/703 However, further updates to add support for SC2S region (us-isob-east-1) and new TC2S region (us-iso-west-1) are still required.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
always
Steps to Reproduce:
1. Try to deploy a cluster on us-isob-east-1 or us-iso-west-1 2. 3.
Actual results:
Regions are not supported
Expected results:
Additional info:
Both TC2S and SC2S support ALIAS records now.
Description of problem:
For unknown reasons, the management cluster AWS endpoint service sometimes has an active connection leftover. This blocks the uninstallation, as the AWS endpoint service cannot be deleted before this connection is rejected.
Version-Release number of selected component (if applicable):
4.12.z,4.13.z,4.14.z
How reproducible:
Irregular
Steps to Reproduce:
1. 2. 3.
Actual results:
AWSEndpointService cannot be deleted by the hypershift operator, the uninstallation is stuck
Expected results:
There are no leftover active AWSEndpoint connections when deleting the AWSEndpointService and it can be deleted properly. OR Hypershift operator rejects active endpoint connections when trying to delete AWSEndpointServices from the management cluster aws account
Additional info:
Added mustgathers in comment.
Description of problem:
In the Konnectivity SOCKS proxy, the current default is to proxy cloud endpoint traffic: https://github.com/openshift/hypershift/blob/main/konnectivity-socks5-proxy/main.go#L61 Because of this, after this change: https://github.com/openshift/hypershift/commit/0c52476957f5658cfd156656938ae1d08784b202 the OAuth server began proxying IAM traffic instead of sending it directly. This causes a regression in Satellite environments running with an HTTP_PROXY server. The original network traffic path needs to be restored.
Version-Release number of selected component (if applicable):
4.13 4.12
How reproducible:
100%
Steps to Reproduce:
1. Setup HTTP_PROXY IBM Cloud Satellite environment 2. In the oauth-server pod run a curl against iam (curl -v https://iam.cloud.ibm.com) 3. It will log it is using proxy
Actual results:
It is using proxy
Expected results:
It should send traffic directly (as it does in 4.11 and 4.10)
Additional info:
This is a clone of issue OCPBUGS-18830. The following is the description of the original issue:
—
Description of problem:
Failed to install cluster on SC2S region as: level=error msg=Error: reading Security Group (sg-0b0cd054dd599602f) Rules: UnsupportedOperation: The functionality you requested is not available in this region.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-09-11-201102
How reproducible:
Always
Steps to Reproduce:
1. Create an OCP cluster on SC2S
Actual results:
Install fail: level=error msg=Error: reading Security Group (sg-0b0cd054dd599602f) Rules: UnsupportedOperation: The functionality you requested is not available in this region.
Expected results:
Install succeed.
Additional info:
* C2S region is not affected
Description of problem:
When you migrate a HostedCluster, the AWSEndpointService from the old management cluster conflicts with the one on the new management cluster. The AWSPrivateLink controller does not perform any validation when this happens; this validation is needed to make Disaster Recovery HostedCluster migration work. The issue shows up when the nodes of the HostedCluster cannot join the new management cluster because the AWSEndpointServiceName still points to the old one.
Version-Release number of selected component (if applicable):
4.12 4.13 4.14
How reproducible:
Follow the migration procedure from the upstream documentation and the nodes in the destination HostedCluster will remain in NotReady state.
Steps to Reproduce:
1. Setup a management cluster with the 4.12-13-14/main version of the HyperShift operator.
2. Run the in-place node DR Migrate E2E test from this PR https://github.com/openshift/hypershift/pull/2138:
bin/test-e2e \
  -test.v \
  -test.timeout=2h10m \
  -test.run=TestInPlaceUpgradeNodePool \
  --e2e.aws-credentials-file=$HOME/.aws/credentials \
  --e2e.aws-region=us-west-1 \
  --e2e.aws-zones=us-west-1a \
  --e2e.pull-secret-file=$HOME/.pull-secret \
  --e2e.base-domain=www.mydomain.com \
  --e2e.latest-release-image="registry.ci.openshift.org/ocp/release:4.13.0-0.nightly-2023-03-17-063546" \
  --e2e.previous-release-image="registry.ci.openshift.org/ocp/release:4.13.0-0.nightly-2023-03-17-063546" \
  --e2e.skip-api-budget \
  --e2e.aws-endpoint-access=PublicAndPrivate
Actual results:
The nodes stay in NotReady state
Expected results:
The nodes should join the migrated HostedCluster
Additional info:
Description of problem:
When forcing a reboot of a BMH with the annotation reboot.metal3.io: '{"force": true}' with a new preprovisioningimage URL the host never reboots.
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2023-05-03-150228
How reproducible:
100%
Steps to Reproduce:
1. Create a BMH and stall the provisioning process at "provisioning"
2. Set a new URL in the preprovisioningimage
3. Set the force reboot annotation on the BMH (reboot.metal3.io: '{"force": true}')
Actual results:
Host does not reboot and the annotation remains on the BMH
Expected results:
Host reboots into the new image
Additional info:
This was reproduced using assisted installer (MCE central infrastructure management)
This is a ticket created based on a GitHub comment from a random user.
Description of the problem:
See GitHub comment
How reproducible:
Unknown
Steps to reproduce:
1. See GitHub comment
Actual results:
DNS wildcard validation failure is a false positive
Expected results:
DNS wildcard validation should probably avoid domain-search
Description of problem:
During cluster installation if the host systems had multiple dual-stack interfaces configured via install-config.yaml, the installation will fail. Notably, when a single-stack ipv4 installation is attempted with multiple interfaces it is successful. Additionally, when a dual-stack installation is attempted with only a single interface it is successful.
Version-Release number of selected component (if applicable):
Reproduced on 4.12.1 and 4.12.7
How reproducible:
100%
Steps to Reproduce:
1. Assign an IPv4 and an IPv6 address to both the apiVIPs and ingressVIPs parameters in the install-config.yaml
2. Configure all hosts with at least two interfaces in the install-config.yaml
3. Assign an IPv4 and an IPv6 address to each interface in the install-config.yaml
4. Begin cluster installation and wait for failure
Actual results:
Failed cluster installation
Expected results:
Successful cluster installation
Additional info:
The cli option --logtostderr was removed in prometheus-adapter v0.11. CMO uses this argument and this currently blocks the update to v0.11: https://github.com/openshift/k8s-prometheus-adapter/pull/72
IIUC we can simply drop this argument.
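A minimal sketch of dropping the flag when building the container args; the helper function is illustrative and not CMO's actual code:
```
package main

import (
	"fmt"
	"strings"
)

// dropLogToStderr removes the flag that prometheus-adapter v0.11 no longer
// understands, leaving every other argument untouched.
func dropLogToStderr(args []string) []string {
	out := make([]string, 0, len(args))
	for _, a := range args {
		if a == "--logtostderr" || strings.HasPrefix(a, "--logtostderr=") {
			continue
		}
		out = append(out, a)
	}
	return out
}

func main() {
	fmt.Println(dropLogToStderr([]string{"--secure-port=6443", "--logtostderr=true"}))
	// Output: [--secure-port=6443]
}
```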
Description of problem:
SNO installation does not finish because machine-config is waiting for a non-existing machine config.

$ oc get co machine-config
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config             True        True          True       14h     Unable to apply 4.14.0-0.nightly-2023-08-23-075058: error during syncRequiredMachineConfigPools: [context deadline exceeded, failed to update clusteroperator: [client rate limiter Wait returned an error: context deadline exceeded, error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 1, ready 0, updated: 0, unavailable: 1)]]

$ oc -n openshift-machine-config-operator logs machine-config-daemon-2stpc --tail 5
Defaulted container "machine-config-daemon" out of: machine-config-daemon, kube-rbac-proxy
I0824 07:39:12.117508 22874 daemon.go:1370] In bootstrap mode
E0824 07:39:12.117525 22874 writer.go:226] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-231b9341930d0616544ad05989a5c1b8" not found
W0824 07:40:12.131400 22874 daemon.go:1630] Failed to persist NIC names: open /etc/systemd/network: no such file or directory
I0824 07:40:12.131417 22874 daemon.go:1370] In bootstrap mode
E0824 07:40:12.131429 22874 writer.go:226] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-231b9341930d0616544ad05989a5c1b8" not found
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-08-23-075058
How reproducible:
100%
Steps to Reproduce:
1. Deploy SNO with Telco DU profile 2. Wait for installation to finish
Actual results:
Installation doesn't complete due to master MCP being degraded waiting for a non-existing machineconfig.
Expected results:
Installation succeeds.
Additional info:
Attaching sosreport and must-gather
This is a clone of issue OCPBUGS-18113. The following is the description of the original issue:
—
Description of problem:
When the installer generates a CPMS, it should only add the `failureDomains` field when there is more than one failure domain. When there is only one failure domain, the fields from the failure domain, e.g. the zone, should be injected directly into the provider spec and the failure domain should be omitted. By doing this, we avoid having to care about failure-domain injection logic for single-zone clusters, potentially avoiding bugs (such as some we have seen recently). IIRC we already did this for OpenStack, but AWS, Azure and GCP may not be affected.
Version-Release number of selected component (if applicable):
How reproducible:
Can be demonstrated on Azure on the westus region which has no AZs available. Currently the installer creates the following, which we can omit entirely:
```
failureDomains:
  platform: Azure
  azure:
  - zone: ""
```
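A minimal sketch of the behavior described above, with illustrative stand-in types rather than the installer's real CPMS and provider-spec structures:
```
package main

import "fmt"

// failureDomain and providerSpec are stand-ins for the installer's real types.
type failureDomain struct{ Zone string }

type providerSpec struct{ Zone string }

// applyFailureDomains returns the failureDomains stanza to publish in the
// CPMS; with a single domain it injects the zone into the provider spec and
// returns nil so the stanza is omitted entirely.
func applyFailureDomains(domains []failureDomain, spec *providerSpec) []failureDomain {
	if len(domains) == 1 {
		spec.Zone = domains[0].Zone
		return nil
	}
	return domains
}

func main() {
	spec := &providerSpec{}
	out := applyFailureDomains([]failureDomain{{Zone: ""}}, spec)
	fmt.Printf("failureDomains: %v, zone in provider spec: %q\n", out, spec.Zone)
}
```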
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Apart from the default SC, we should check whether non-default SCs created on the vSphere platform use a datastore to which OCP has access and the necessary permissions.
This will avoid hard-to-debug errors in cases where a customer creates an additional SC but forgets to grant the necessary permissions on the newer datastore.
Description of problem:
When creating a Sample Devfile from the Samples Page, the corresponding Topology icon for the app is not set. This issue is not observed when we create a BuilderImage from the Samples page.
Version-Release number of selected component (if applicable):
4.13
How reproducible:
Always
Steps to Reproduce:
1. Create a Sample Devfile App from the Samples Page 2. Go to the Topology Page and check the icon of the app created.
Actual results:
The generic OpenShift logo is displayed
Expected results:
Need to show the corresponding app icon (Golang, Quarkus, etc.)
Additional info:
In the case of creating a sample from a BuilderImage, the icon gets set properly according to the BuilderImage used. Current label: app.openshift.io/runtime=dotnet-basic. Change to: app.openshift.io/runtime=dotnet
Please review the following PR: https://github.com/openshift/configmap-reload/pull/51
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
2023-08-29 15:43:27.066 1 ERROR ironic.api.method [None req-00977b71-1b61-4452-8f6c-a43a47b1e92e - - - - - -] Server-side error: "<Future at 0x7fe7b2b86250 state=finished raised OperationalError>". Detail: Traceback (most recent call last): File "/usr/lib64/python3.9/site-packages/sqlalchemy/engine/base.py", line 1089, in _commit_impl self.engine.dialect.do_commit(self.connection) File "/usr/lib64/python3.9/site-packages/sqlalchemy/engine/default.py", line 686, in do_commit dbapi_connection.commit() sqlite3.OperationalError: database is locked
Description of problem:
Install issues for 4.14 and 4.15 where we lose contact with the kubelet on master nodes.
This search shows it's happening on about 35% of Azure SDN 4.14 jobs over the past week at least. There are no OVN hits.
1703590387039342592/artifacts/e2e-azure-sdn-upgrade/gather-extra/artifacts/nodes.json
{ "lastHeartbeatTime": "2023-09-18T02:33:11Z", "lastTransitionTime": "2023-09-18T02:35:39Z", "message": "Kubelet stopped posting node status.", "reason": "NodeStatusUnknown", "status": "Unknown", "type": "Ready" }
4.14 is interesting as it is a minor upgrade from 4.13 and we see the install failures with a master node dropping out.
Build log shows
[36mINFO[0m[2023-09-18T02:03:03Z] Using explicitly provided pull-spec for release initial (registry.ci.openshift.org/ocp/release:4.13.0-0.ci-2023-09-17-050449)
ipi-azure-conf shows region centralus (not the single zone westus)
get ocp version: 4.13 /output Azure region: centralus
oc_cmds/nodes shows master-1 not ready
ci-op-82xkimh8-0dd98-9g9wh-master-1 NotReady control-plane,master 82m v1.26.7+c7ee51f 10.0.0.6 <none> Red Hat Enterprise Linux CoreOS 413.92.202309141211-0 (Plow)
ci-op-82xkimh8-0dd98-9g9wh-master-1-boot.log shows ignition
install log shows we have lost contact
time="2023-09-18T03:15:33Z" level=error msg="Cluster operator kube-apiserver Degraded is True with GuardController_SyncError::NodeController_MasterNodesReady: GuardControllerDegraded: [Missing operand on node ci-op-82xkimh8-0dd98-9g9wh-master-0, Missing operand on node ci-op-82xkimh8-0dd98-9g9wh-master-2]\nNodeControllerDegraded: The master nodes not ready: node \"ci-op-82xkimh8-0dd98-9g9wh-master-1\" not ready since 2023-09-18 02:35:39 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)"
4.15 4.15.0-0.ci-2023-09-17-172341 and 4.14 4.14.0-0.ci-2023-09-18-020137
Version-Release number of selected component (if applicable):
How reproducible:
We are seeing this on a high number of failed payloads for 4.14 && 4.15. Additional recent failures
4.14.0-0.ci-2023-09-17-012321
aggregated-azure-sdn-upgrade-4.14-minor shows failures like "Passed 5 times, failed 0 times, skipped 0 times: we require at least 6 attempts to have a chance at success", indicating that only 5 of the 10 runs were valid.
Checking install logs shows we have lost master-2
time="2023-09-17T02:44:22Z" level=error msg="Cluster operator kube-apiserver Degraded is True with GuardController_SyncError::NodeController_MasterNodesReady: GuardControllerDegraded: [Missing operand on node ci-op-crj5cf00-0dd98-p5snd-master-1, Missing operand on node ci-op-crj5cf00-0dd98-p5snd-master-0]\nNodeControllerDegraded: The master nodes not ready: node \"ci-op-crj5cf00-0dd98-p5snd-master-2\" not ready since 2023-09-17 02:01:49 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)"
oc_cmds/nodes also shows master-2 not ready
4.15.0-0.nightly-2023-09-17-113421 install analysis failed due to azure tech preview oc_cmds/nodes shows master-1 not ready
4.15.0-0.ci-2023-09-17-112341 aggregated-azure-sdn-upgrade-4.15-minor only 5 of 10 runs are valid sample oc_cmds/nodes shows master-0 not ready
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
When using the k8sResourcePrefix x-descriptor with custom resource kinds, the form-view dropdown selection currently doesn't accept the initial user selection, requiring the user to make their selection twice. Also, if the configuration panel contains multiple custom resource dropdowns, then each previous dropdown selection on the panel is also cleared each time the user configures another custom resource dropdown, requiring the user to reconfigure each previous selection. Here's an example of my configuration:
specDescriptors:
  - displayName: Collection
    path: collection
    x-descriptors:
      - >-
        urn:alm:descriptor:io.kubernetes:abc.zzz.com:v1beta1:Collection
  - displayName: Endpoints
    path: 'mapping[0].endpoints[0].name'
    x-descriptors:
      - >-
        urn:alm:descriptor:io.kubernetes:abc.zzz.com:v1beta1:Endpoint
  - displayName: Requested Credential Secret
    path: 'mapping[0].endpoints[0].credentialName'
    x-descriptors:
      - 'urn:alm:descriptor:io.kubernetes:Secret'
  - displayName: Namespaces
    path: 'mapping[0].namespace'
    x-descriptors:
      - 'urn:alm:descriptor:io.kubernetes:Namespace'
With this configuration, when a user wants to select a Collection or Endpoint from the form-view dropdown, the user is forced to make their selection twice before the selection is accepted in the dropdown. Also, if the user does configure the Collection dropdown and then decides to configure the Endpoint dropdown, once the Endpoint selection is made, the Collection dropdown is cleared.
Version-Release number of selected component (if applicable):
4.8
How reproducible:
Always
Steps to Reproduce:
1. Create a new project: oc new-project descriptor-test
2. Create the resources in this gist: oc create -f https://gist.github.com/TheRealJon/99aa89c4af87c4b68cd92a544cd7c08e/raw/a633ad172ff071232620913d16ebe929430fd77a/reproducer.yaml
3. In the admin console, go to the installed operators page in project 'descriptor-test'
4. Select Mock Operator from the list
5. Select "Create instance" in the Mock Resource provided API card
6. Scroll to field-1
7. Select 'example-1' from the dropdown
Actual results:
Selection is not retained on the first click.
Expected results:
The selection should be retained on the first click.
Additional info:
In addition to this behavior, if a form has multiple k8sResourcePrefix dropdown fields, they all get cleared when attempting to select an item from one of them.
Description of problem:
The kube apiserver manages the endpoints resource of the default/kubernetes service so that pods can access the kube apiserver. It does this via the --advertise-address flag and the container port for the kube apiserver pod. Currently the HCCO overwrites the endpoints resource with another port. This conflicts with what the KAS manages; it should not do that.
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Create an AWS publicAndPrivate cluster with DNS hostnames and a Route publishing strategy for the apiserver.
Actual results:
The HCCO overwrites the default/kubernetes endpoints resource in the guest cluster.
Expected results:
The HCCO does not overwrite the default/kubernetes endpoints resource
Additional info:
Description of problem:
When a cluster has an abnormal operator status, running `oc adm must-gather` exits with code 1.
Version-Release number of selected component (if applicable):
4.12/4.13
Actual results:
[must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-gfcpc deleted

Reprinting Cluster State:
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:
ClusterID: 0ba6ca81-e6d8-4d15-b345-70f81bd5a005
ClusterVersion: Stable at "4.13.0-0.nightly-2023-04-01-062001"
ClusterOperators:
  clusteroperator/cloud-credential is not upgradeable because Upgradeable annotation cloudcredential.openshift.io/upgradeable-to on cloudcredential.operator.openshift.io/cluster object needs updating before upgrade. See Manually Creating IAM documentation for instructions on preparing a cluster for upgrade.
  clusteroperator/ingress is progressing: ingresscontroller "test-34166" is progressing: IngressControllerProgressing: One or more status conditions indicate progressing: DeploymentRollingOut=True (DeploymentRollingOut: Waiting for router deployment rollout to finish: 0 of 1 updated replica(s) are available...).
Not all ingress controllers are available.

STDERR:
error: yaml: line 7: did not find expected key
[08:06:46] INFO> Exit Status: 1
Expected results:
Abnormal status of any of the operators should not affect must-gather's exit code.
Additional info:
Description of problem:
Alibaba clusters were never declared GA; they are still in TechPreview. We do not allow upgrades of TechPreview clusters between minor streams (e.g. 4.12 to 4.13). To allow a future deprecation and removal of the platform, we will prevent upgrades past 4.13.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a manual clone of https://issues.redhat.com/browse/OCPBUGS-18902 for backporting purposes.
In this recently merged PR, a number of API calls do not use caches, causing excessive API calls.
Done when:
- Change all Get() calls to use listers (see the sketch below)
- API call metric should decrease
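A minimal sketch of the lister-based pattern using client-go shared informers; the namespace and ConfigMap name are illustrative, not the operator's real resources:
```
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// A shared informer factory caches objects; reads hit the cache, not the API server.
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	cmLister := factory.Core().V1().ConfigMaps().Lister()

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	// Instead of client.CoreV1().ConfigMaps("openshift-config").Get(ctx, "foo", ...),
	// read from the lister-backed cache ("openshift-config"/"foo" are illustrative).
	cm, err := cmLister.ConfigMaps("openshift-config").Get("foo")
	if err != nil {
		fmt.Println("not in cache:", err)
		return
	}
	fmt.Println("resourceVersion from cache:", cm.ResourceVersion)
}
```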
When a HostedCluster is configured as `Private`, annotate the necessary hosted CP components (API and OAuth) so that External DNS can still create public DNS records (pointing to private IP resources).
The External DNS record should be pointing to the resource for the PrivateLink VPC Endpoint. "We need to specify the IP of the A record. We can do that with a cluster IP service."
Context: https://redhat-internal.slack.com/archives/C01C8502FMM/p1675432805760719
aws-ebs-csi-driver-operator ServiceAccount does not include the HCP pull-secret in its imagePullSecrets. Thus, if a HostedCluster is created with a `pullSecret` that contains creds that the management cluster pull secret does not have, the image pull fails.
Description of problem:
Quoting Joel: In 4.14 there's been an effort to make Machine API optional; anything that relies on the CRD needs to be able to detect that the CRD is not installed and then not error should that be the case. You should be able to use a discovery client to determine if the API group is installed or not. We have several controllers and informers that depend on the machine API being at least available to list and sync caches with. When the API is not installed at all, the depending controllers are blocked forever and eventually get killed by the liveness probe. That causes hot restart loops that cause installations to fail.
https://redhat-internal.slack.com/archives/C027U68LP/p1690436286860899
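A minimal sketch of the discovery-based check described above; the function name and wiring are illustrative, not the actual operator code:
```
package main

import (
	"fmt"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
)

// machineAPIAvailable reports whether the machine.openshift.io API group is
// served, so machine-dependent informers can be skipped when it is not.
func machineAPIAvailable(cfg *rest.Config) (bool, error) {
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		return false, err
	}
	groups, err := dc.ServerGroups()
	if err != nil {
		return false, err
	}
	for _, g := range groups.Groups {
		if g.Name == "machine.openshift.io" {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	ok, err := machineAPIAvailable(cfg)
	if err != nil {
		panic(err)
	}
	if !ok {
		fmt.Println("Machine API not installed; skipping machine informers")
		return
	}
	// ... start the machine-dependent controllers and informers here ...
}
```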
Version-Release number of selected component (if applicable):
4.14
How reproducible:
always
Steps to Reproduce:
1. install a machineAPI=false cluster 2. ??? 3. watch it fail
Tracker issue for bootimage bump in 4.14. This issue should block issues which need a bootimage bump to fix.
The previous bump was OCPBUGS-16776.
Description of problem:
CPMS creates two replacement machines when deleting a master machine on vSphere. Sorry, I have to revisit https://issues.redhat.com/browse/OCPBUGS-4297 as I see all the related PRs are merged, but I hit this twice on the ipi-on-vsphere/versioned-installer-vmc7-ovn-winc-thin_pvc-ci template cluster and once on an ipi-on-vsphere/versioned-installer-vmc7-ovn template cluster today.
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2023-02-13-235211
How reproducible:
Three times
Steps to Reproduce:
1. On this template cluster ipi-on-vsphere/versioned-installer-vmc7-ovn-winc-thin_pvc-ci, the first time I met this is after update all the 3 master machines using RollingUpdate strategy, then I delete a master machine. But seems the redundant machine was automatically deleted, because there was only one replacement machine when I revisit it. liuhuali@Lius-MacBook-Pro huali-test % oc get machine NAME PHASE TYPE REGION ZONE AGE huliu-vs15b-75tr7-master-djlxv-2 Running 47m huliu-vs15b-75tr7-master-h76sp-1 Running 58m huliu-vs15b-75tr7-master-wtzb7-0 Running 70m huliu-vs15b-75tr7-worker-gzsp9 Running 4h43m huliu-vs15b-75tr7-worker-vcqqh Running 4h43m winworker-4cltm Running 4h19m winworker-qd4c4 Running 4h19m liuhuali@Lius-MacBook-Pro huali-test % oc delete machine huliu-vs15b-75tr7-master-djlxv-2 machine.machine.openshift.io "huliu-vs15b-75tr7-master-djlxv-2" deleted ^C liuhuali@Lius-MacBook-Pro huali-test % oc get machine NAME PHASE TYPE REGION ZONE AGE huliu-vs15b-75tr7-master-bzd4h-2 Provisioning 34s huliu-vs15b-75tr7-master-djlxv-2 Deleting 48m huliu-vs15b-75tr7-master-gzhlk-2 Provisioning 35s huliu-vs15b-75tr7-master-h76sp-1 Running 59m huliu-vs15b-75tr7-master-wtzb7-0 Running 70m huliu-vs15b-75tr7-worker-gzsp9 Running 4h44m huliu-vs15b-75tr7-worker-vcqqh Running 4h44m winworker-4cltm Running 4h20m winworker-qd4c4 Running 4h20m liuhuali@Lius-MacBook-Pro huali-test % oc get machine NAME PHASE TYPE REGION ZONE AGE huliu-vs15b-75tr7-master-bzd4h-2 Running 38m huliu-vs15b-75tr7-master-h76sp-1 Running 97m huliu-vs15b-75tr7-master-wtzb7-0 Running 108m huliu-vs15b-75tr7-worker-gzsp9 Running 5h22m huliu-vs15b-75tr7-worker-vcqqh Running 5h22m winworker-4cltm Running 4h57m winworker-qd4c4 Running 4h57m 2.Then I change the strategy to OnDelete, and after update all the 3 master machines using OnDelete strategy, then I delete a master machine. 
liuhuali@Lius-MacBook-Pro huali-test % oc get machine NAME PHASE TYPE REGION ZONE AGE huliu-vs15b-75tr7-master-hzhgq-0 Running 137m huliu-vs15b-75tr7-master-kj9zf-2 Running 89m huliu-vs15b-75tr7-master-kz6cx-1 Running 59m huliu-vs15b-75tr7-worker-gzsp9 Running 7h46m huliu-vs15b-75tr7-worker-vcqqh Running 7h46m winworker-4cltm Running 7h21m winworker-qd4c4 Running 7h21m liuhuali@Lius-MacBook-Pro huali-test % oc delete machine huliu-vs15b-75tr7-master-hzhgq-0 machine.machine.openshift.io "huliu-vs15b-75tr7-master-hzhgq-0" deleted ^C liuhuali@Lius-MacBook-Pro huali-test % oc get machine NAME PHASE TYPE REGION ZONE AGE huliu-vs15b-75tr7-master-hzhgq-0 Deleting 138m huliu-vs15b-75tr7-master-kb687-0 Provisioning 26s huliu-vs15b-75tr7-master-kj9zf-2 Running 90m huliu-vs15b-75tr7-master-kz6cx-1 Running 60m huliu-vs15b-75tr7-master-qn6kq-0 Provisioning 26s huliu-vs15b-75tr7-worker-gzsp9 Running 7h47m huliu-vs15b-75tr7-worker-vcqqh Running 7h47m winworker-4cltm Running 7h22m winworker-qd4c4 Running 7h22m liuhuali@Lius-MacBook-Pro huali-test % oc get machine NAME PHASE TYPE REGION ZONE AGE huliu-vs15b-75tr7-master-kb687-0 Running 154m huliu-vs15b-75tr7-master-kj9zf-2 Running 4h5m huliu-vs15b-75tr7-master-kz6cx-1 Running 3h34m huliu-vs15b-75tr7-master-qn6kq-0 Running 154m huliu-vs15b-75tr7-worker-gzsp9 Running 10h huliu-vs15b-75tr7-worker-vcqqh Running 10h winworker-4cltm Running 9h winworker-qd4c4 Running 9h liuhuali@Lius-MacBook-Pro huali-test % oc get co NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE authentication 4.13.0-0.nightly-2023-02-13-235211 True False False 5h13m baremetal 4.13.0-0.nightly-2023-02-13-235211 True False False 10h cloud-controller-manager 4.13.0-0.nightly-2023-02-13-235211 True False False 10h cloud-credential 4.13.0-0.nightly-2023-02-13-235211 True False False 10h cluster-autoscaler 4.13.0-0.nightly-2023-02-13-235211 True False False 10h config-operator 4.13.0-0.nightly-2023-02-13-235211 True False False 10h console 4.13.0-0.nightly-2023-02-13-235211 True False False 145m control-plane-machine-set 4.13.0-0.nightly-2023-02-13-235211 True False True 10h Observed 1 updated machine(s) in excess for index 0 csi-snapshot-controller 4.13.0-0.nightly-2023-02-13-235211 True False False 10h dns 4.13.0-0.nightly-2023-02-13-235211 True False False 10h etcd 4.13.0-0.nightly-2023-02-13-235211 True False False 10h image-registry 4.13.0-0.nightly-2023-02-13-235211 True False False 9h ingress 4.13.0-0.nightly-2023-02-13-235211 True False False 10h insights 4.13.0-0.nightly-2023-02-13-235211 True False False 10h kube-apiserver 4.13.0-0.nightly-2023-02-13-235211 True False False 10h kube-controller-manager 4.13.0-0.nightly-2023-02-13-235211 True False False 10h kube-scheduler 4.13.0-0.nightly-2023-02-13-235211 True False False 10h kube-storage-version-migrator 4.13.0-0.nightly-2023-02-13-235211 True False False 6h18m machine-api 4.13.0-0.nightly-2023-02-13-235211 True False False 10h machine-approver 4.13.0-0.nightly-2023-02-13-235211 True False False 10h machine-config 4.13.0-0.nightly-2023-02-13-235211 True False False 3h59m marketplace 4.13.0-0.nightly-2023-02-13-235211 True False False 10h monitoring 4.13.0-0.nightly-2023-02-13-235211 True False False 10h network 4.13.0-0.nightly-2023-02-13-235211 True False False 10h node-tuning 4.13.0-0.nightly-2023-02-13-235211 True False False 10h openshift-apiserver 4.13.0-0.nightly-2023-02-13-235211 True False False 145m openshift-controller-manager 4.13.0-0.nightly-2023-02-13-235211 True False False 10h openshift-samples 
4.13.0-0.nightly-2023-02-13-235211 True False False 10h operator-lifecycle-manager 4.13.0-0.nightly-2023-02-13-235211 True False False 10h operator-lifecycle-manager-catalog 4.13.0-0.nightly-2023-02-13-235211 True False False 10h operator-lifecycle-manager-packageserver 4.13.0-0.nightly-2023-02-13-235211 True False False 6h7m service-ca 4.13.0-0.nightly-2023-02-13-235211 True False False 10h storage 4.13.0-0.nightly-2023-02-13-235211 True False False 3h57m liuhuali@Lius-MacBook-Pro huali-test % 3.On ipi-on-vsphere/versioned-installer-vmc7-ovn template cluster, after update all the 3 master machines using RollingUpdate strategy, no issue, then delete a master machine, no issue, then change the strategy to OnDelete, and replace the master machines one by one, when I delete the last one, two replace machines created. liuhuali@Lius-MacBook-Pro huali-test % oc get co NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE authentication 4.13.0-0.nightly-2023-02-13-235211 True False False 73m baremetal 4.13.0-0.nightly-2023-02-13-235211 True False False 9h cloud-controller-manager 4.13.0-0.nightly-2023-02-13-235211 True False False 9h cloud-credential 4.13.0-0.nightly-2023-02-13-235211 True False False 9h cluster-autoscaler 4.13.0-0.nightly-2023-02-13-235211 True False False 9h config-operator 4.13.0-0.nightly-2023-02-13-235211 True False False 9h console 4.13.0-0.nightly-2023-02-13-235211 True False False 129m control-plane-machine-set 4.13.0-0.nightly-2023-02-13-235211 True True False 9h Observed 1 replica(s) in need of update csi-snapshot-controller 4.13.0-0.nightly-2023-02-13-235211 True False False 9h dns 4.13.0-0.nightly-2023-02-13-235211 True False False 9h etcd 4.13.0-0.nightly-2023-02-13-235211 True False False 9h image-registry 4.13.0-0.nightly-2023-02-13-235211 True False False 8h ingress 4.13.0-0.nightly-2023-02-13-235211 True False False 8h insights 4.13.0-0.nightly-2023-02-13-235211 True False False 8h kube-apiserver 4.13.0-0.nightly-2023-02-13-235211 True False False 9h kube-controller-manager 4.13.0-0.nightly-2023-02-13-235211 True False False 9h kube-scheduler 4.13.0-0.nightly-2023-02-13-235211 True False False 9h kube-storage-version-migrator 4.13.0-0.nightly-2023-02-13-235211 True False False 3h22m machine-api 4.13.0-0.nightly-2023-02-13-235211 True False False 9h machine-approver 4.13.0-0.nightly-2023-02-13-235211 True False False 9h machine-config 4.13.0-0.nightly-2023-02-13-235211 True False False 9h marketplace 4.13.0-0.nightly-2023-02-13-235211 True False False 9h monitoring 4.13.0-0.nightly-2023-02-13-235211 True False False 8h network 4.13.0-0.nightly-2023-02-13-235211 True False False 9h node-tuning 4.13.0-0.nightly-2023-02-13-235211 True False False 9h openshift-apiserver 4.13.0-0.nightly-2023-02-13-235211 True False False 9h openshift-controller-manager 4.13.0-0.nightly-2023-02-13-235211 True False False 9h openshift-samples 4.13.0-0.nightly-2023-02-13-235211 True False False 9h operator-lifecycle-manager 4.13.0-0.nightly-2023-02-13-235211 True False False 9h operator-lifecycle-manager-catalog 4.13.0-0.nightly-2023-02-13-235211 True False False 9h operator-lifecycle-manager-packageserver 4.13.0-0.nightly-2023-02-13-235211 True False False 46m service-ca 4.13.0-0.nightly-2023-02-13-235211 True False False 9h storage 4.13.0-0.nightly-2023-02-13-235211 True False False 77m liuhuali@Lius-MacBook-Pro huali-test % oc get machine NAME PHASE TYPE REGION ZONE AGE huliu-vs15a-kjm6h-master-55s4l-1 Running 84m huliu-vs15a-kjm6h-master-ppc55-2 Running 3h4m 
huliu-vs15a-kjm6h-master-rqb52-0 Running 53m huliu-vs15a-kjm6h-worker-6nbz7 Running 9h huliu-vs15a-kjm6h-worker-g84xg Running 9h liuhuali@Lius-MacBook-Pro huali-test % oc delete machine huliu-vs15a-kjm6h-master-ppc55-2 machine.machine.openshift.io "huliu-vs15a-kjm6h-master-ppc55-2" deleted ^C liuhuali@Lius-MacBook-Pro huali-test % oc get machine NAME PHASE TYPE REGION ZONE AGE huliu-vs15a-kjm6h-master-55s4l-1 Running 85m huliu-vs15a-kjm6h-master-cvwzz-2 Provisioning 27s huliu-vs15a-kjm6h-master-ppc55-2 Deleting 3h5m huliu-vs15a-kjm6h-master-qp9m5-2 Provisioning 27s huliu-vs15a-kjm6h-master-rqb52-0 Running 54m huliu-vs15a-kjm6h-worker-6nbz7 Running 9h huliu-vs15a-kjm6h-worker-g84xg Running 9h liuhuali@Lius-MacBook-Pro huali-test % oc get machine NAME PHASE TYPE REGION ZONE AGE huliu-vs15a-kjm6h-master-55s4l-1 Running 163m huliu-vs15a-kjm6h-master-cvwzz-2 Running 79m huliu-vs15a-kjm6h-master-qp9m5-2 Running 79m huliu-vs15a-kjm6h-master-rqb52-0 Running 133m huliu-vs15a-kjm6h-worker-6nbz7 Running 10h huliu-vs15a-kjm6h-worker-g84xg Running 10h liuhuali@Lius-MacBook-Pro huali-test %
Actual results:
CPMS creates two replacement machines when deleting a master machine, and the two replacement machines remain there for a long time.
Expected results:
CPMS should create only one replacement machine when deleting a master machine, or quickly delete the redundant machine.
Additional info:
Must-gather: https://drive.google.com/file/d/1aCyFn9okNxRz7nE3Yt_8g6Kx7sPSGCg2/view?usp=sharing for ipi-on-vsphere/versioned-installer-vmc7-ovn-winc-thin_pvc-ci template cluster https://drive.google.com/file/d/1i0fWSP0-HqfdV5E0wcNevognLUQKecvl/view?usp=sharing for ipi-on-vsphere/versioned-installer-vmc7-ovn template cluster
This is a clone of issue OCPBUGS-19494. The following is the description of the original issue:
—
Description of problem:
The ipsec container kills pluto even if it was started by systemd.
Version-Release number of selected component (if applicable):
on any 4.14 nightly
How reproducible:
every time
Steps to Reproduce:
1. enable N-S ipsec 2. enable E-W IPsec 3. kill/stop/delete one of the ipsec-host pods
Actual results:
pluto is killed on that host
Expected results:
pluto keeps running
Additional info:
https://github.com/yuvalk/cluster-network-operator/blob/37d1cc72f4f6cd999046bd487a705e6da31301a5/bindata/network/ovn-kubernetes/common/ipsec-host.yaml#L235 this should be removed
Description of problem:
According to PR https://github.com/openshift/cluster-monitoring-operator/pull/1824, the startupProbe for both the UWM Prometheus and the platform Prometheus should be 1 hour, but the startupProbe for the UWM Prometheus is still 15 minutes after enabling UWM (failureThreshold 60 x periodSeconds 15s = 15 minutes). The platform Prometheus does not have this issue; its startupProbe is increased to 1 hour (failureThreshold 240 x 15s = 1 hour).
$ oc -n openshift-user-workload-monitoring get pod prometheus-user-workload-0 -oyaml | grep startupProbe -A20
startupProbe:
  exec:
    command:
    - sh
    - -c
    - if [ -x "$(command -v curl)" ]; then exec curl --fail http://localhost:9090/-/ready; elif [ -x "$(command -v wget)" ]; then exec wget -q -O /dev/null http://localhost:9090/-/ready; else exit 1; fi
  failureThreshold: 60
  periodSeconds: 15
  successThreshold: 1
  timeoutSeconds: 3
...
$ oc -n openshift-monitoring get pod prometheus-k8s-0 -oyaml | grep startupProbe -A20
startupProbe:
  exec:
    command:
    - sh
    - -c
    - if [ -x "$(command -v curl)" ]; then exec curl --fail http://localhost:9090/-/ready; elif [ -x "$(command -v wget)" ]; then exec wget -q -O /dev/null http://localhost:9090/-/ready; else exit 1; fi
  failureThreshold: 240
  periodSeconds: 15
  successThreshold: 1
  timeoutSeconds: 3
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2023-03-19-052243
How reproducible:
always
Steps to Reproduce:
1. enable UWM, check startupProbe for UWM prometheus/platform prometheus 2. 3.
Actual results:
startupProbe for UWM prometheus is still 15m
Expected results:
startupProbe for UWM prometheus should be 1 hour
Additional info:
Since the startupProbe for the platform Prometheus is already increased to 1 hour and there is no similar bug for the UWM Prometheus, resolving this issue as Won't Fix is OK.
When ProjectID is not set, TenantID might be ignored in MAPO.
Context: When setting additional networks in Machine templates, networks can be identified by the means of a filter. The network filter has both TenantID and ProjectID as fields. TenantID was ignored.
Steps to reproduce:
Create a Machine or a MachineSet with a template containing a Network filter that sets a TenantID.
```
networks:
```
One cheap way of testing this could be to pass a valid network ID and set a bogus tenantID. If the machine gets associated with the network, then tenantID has been ignored and the bug is present. If instead MAPO errors, then it means that it has taken tenantID into consideration.
Description of problem:
This Jira is filed to track upstream issue (fix and backport) https://github.com/kubernetes-sigs/azurefile-csi-driver/issues/1308
Version-Release number of selected component (if applicable):
4.14
Description of problem:
[Hypershift] default KAS PSA config should be consistent with OCP enforce: privileged
Version-Release number of selected component (if applicable):
Cluster version is 4.14.0-0.nightly-2023-10-08-220853
How reproducible:
Always
Steps to Reproduce:
1. Install OCP cluster and hypershift operator 2. Create hosted cluster 3. Check the default kas config of the hosted cluster
Actual results:
The hosted cluster default kas PSA config enforce is 'restricted':
$ jq '.admission.pluginConfig.PodSecurity' < `oc extract cm/kas-config -n clusters-9cb7724d8bdd0c16a113 --confirm`
{
  "location": "",
  "configuration": {
    "kind": "PodSecurityConfiguration",
    "apiVersion": "pod-security.admission.config.k8s.io/v1beta1",
    "defaults": {
      "enforce": "restricted",
      "enforce-version": "latest",
      "audit": "restricted",
      "audit-version": "latest",
      "warn": "restricted",
      "warn-version": "latest"
    },
    "exemptions": {
      "usernames": [
        "system:serviceaccount:openshift-infra:build-controller"
      ]
    }
  }
}
Expected results:
The hosted cluster default kas PSA config enforce should be 'privileged' in https://github.com/openshift/hypershift/blob/release-4.13/control-plane-operator/controllers/hostedcontrolplane/kas/config.go#L93
Additional info:
References: OCPBUGS-8710
Description of problem:
oauth user:check-access scoped tokens cannot be used to check access as intended. SelfSubjectAccessReviews from such a scoped token always report allowed: false, denied: true, unless the SelfSubjectAccessReview is checking access for the ability to create SelfSubjectAccessReviews. This does not seem like the intended behavior per the documentation: https://docs.openshift.com/container-platform/4.12/authentication/tokens-scoping.html oauth user:check-access scoped tokens only have authorization for SelfSubjectAccessReview; this is as intended and appears to be enforced by the scope authorizer. However, the authorizer used by SelfSubjectAccessReview includes this filter, meaning the returned response is useless (you can only check access to SelfSubjectAccessReview itself instead of using the token to check the RBAC access of the parent user the token is scoped from). https://github.com/openshift/kubernetes/blob/master/openshift-kube-apiserver/authorization/scopeauthorizer/authorizer.go https://github.com/openshift/kubernetes/blob/master/pkg/registry/authorization/selfsubjectaccessreview/rest.go
Version-Release number of selected component (if applicable):
How reproducible:
Create user:check-access scoped token. Token must not have user:full scope. Use the token to do a SelfSubjectAccessReview.
Steps to Reproduce:
1. Create user:check-access scoped token. Must not have user:full scope.
2. Use the token to do a SelfSubjectAccessReview against a resource the parent user has access to.
3. Observe the status response is allowed: false, denied: true.
Actual results:
Unable to check user access with a user:check-access scoped token.
Expected results:
Ability to check user access with a user:check-access scoped token, without user:full scope which would give the token full access and abilities of the parent user.
Additional info:
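A minimal reproduction sketch using client-go, assuming a kubeconfig whose bearer token carries only the user:check-access scope; the kubeconfig path and the resource attributes are illustrative:
```
package main

import (
	"context"
	"fmt"

	authorizationv1 "k8s.io/api/authorization/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Kubeconfig whose bearer token only carries the user:check-access scope (illustrative path).
	cfg, err := clientcmd.BuildConfigFromFlags("", "/tmp/check-access-kubeconfig")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	ssar := &authorizationv1.SelfSubjectAccessReview{
		Spec: authorizationv1.SelfSubjectAccessReviewSpec{
			ResourceAttributes: &authorizationv1.ResourceAttributes{
				Namespace: "default",
				Verb:      "get",
				Resource:  "pods",
			},
		},
	}
	resp, err := client.AuthorizationV1().SelfSubjectAccessReviews().Create(context.TODO(), ssar, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	// With the behavior described above this prints allowed=false denied=true,
	// even when the parent user is allowed to get pods in the namespace.
	fmt.Printf("allowed=%v denied=%v reason=%q\n", resp.Status.Allowed, resp.Status.Denied, resp.Status.Reason)
}
```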
Some tests may cause unexpected reboots of nodes. On HA setups this is checked by the "should report ready nodes the entire duration of the test run" test, which ensures the Prometheus metric for node readiness didn't flip.
On SNO however we can't use the metrics, as Prometheus will go down along with the node and the node will become ready again before Prometheus/kube-state-metrics is up again. For SNO we have to check that the node has the expected number of reboots: the number of "rendered-master"/"rendered-worker" MCs + 1.
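A minimal sketch of that check, assuming we can list the rendered MachineConfig names and the observed reboot count separately; the helper and inputs are illustrative, not the actual test code:
```
package main

import (
	"fmt"
	"strings"
)

// expectedReboots returns the number of boots the SNO node should have seen:
// one per rendered-master/rendered-worker MachineConfig, plus the initial one.
func expectedReboots(machineConfigNames []string) int {
	count := 0
	for _, name := range machineConfigNames {
		if strings.HasPrefix(name, "rendered-master") || strings.HasPrefix(name, "rendered-worker") {
			count++
		}
	}
	return count + 1
}

func main() {
	names := []string{"rendered-master-abc123", "rendered-worker-def456", "00-master"}
	fmt.Println(expectedReboots(names)) // 3
}
```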
This is a clone of issue OCPBUGS-18906. The following is the description of the original issue:
—
Using packages from k8s.io/kubernetes is not supported: https://github.com/kubernetes/kubernetes/issues/79384#issuecomment-505627280
This came about in this slack thread: https://redhat-internal.slack.com/archives/C02CZNQHGN8/p1694210392218409?thread_ts=1694207119.447459&cid=C02CZNQHGN8
Description of problem:
The MCDaemon has a codepath for "pivot", used in older versions and then in solution articles to initiate a direct pivot to an ostree version, mostly when things fail. As of 4.12 this codepath should no longer work, because we switched to the new-format OSImage, so we should fully deprecate it. This is likely where it fails: https://github.com/openshift/machine-config-operator/blob/ecc6bf3dc21eb33baf56692ba7d54f9a3b9be1d1/pkg/daemon/rpm-ostree.go#L248
Version-Release number of selected component (if applicable):
4.12+
How reproducible:
Not sure but should be 100%
Steps to Reproduce:
1. Follow https://access.redhat.com/solutions/5598401 2. 3.
Actual results:
fails
Expected results:
MCD telling you pivot is deprecated
Additional info:
Description of problem:
The secret generated by CCO in STS mode is different from the one created by ccoctl on the command line.
ccoctl generates:
[default]
sts_regional_endpoints = regional
role_arn = arn:aws:iam::269733383066:role/jsafrane-1-5h8rm-openshift-cluster-csi-drivers-aws-efs-cloud-cre
web_identity_token_file = /var/run/secrets/openshift/serviceaccount/token
CCO generates:
sts_regional_endpoints = regional
role_arn = arn:aws:iam::269733383066:role/jsafrane-1-5h8rm-openshift-cluster-csi-drivers-aws-efs-cloud-cre
web_identity_token_file = /var/run/secrets/openshift/serviceaccount/token
IMO these two should be the same. AWS EFS CSI driver does not work without "[default]" at the beginning.
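A minimal sketch of emitting the credentials content with the "[default]" profile header; the builder function is hypothetical for illustration, not CCO's actual code:
```
package main

import "fmt"

// stsCredentialsFile renders the AWS shared-credentials content with the
// "[default]" profile header that ccoctl emits and the EFS CSI driver expects.
func stsCredentialsFile(roleARN, tokenPath string) string {
	return fmt.Sprintf(
		"[default]\nsts_regional_endpoints = regional\nrole_arn = %s\nweb_identity_token_file = %s\n",
		roleARN, tokenPath)
}

func main() {
	fmt.Print(stsCredentialsFile(
		"arn:aws:iam::123456789012:role/example-role", // illustrative ARN
		"/var/run/secrets/openshift/serviceaccount/token",
	))
}
```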
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-07-11-092038
How reproducible:
Always
Steps to Reproduce:
1. Create a Manual mode, STS cluster in AWS.
2. Create a CredentialsRequest which provides .spec.cloudTokenPath and .spec.providerSpec.stsIAMRoleARN.
3. Observe that a secret is created by CCO in the target namespace specified by the CredentialsRequest.
Actual results:
The secret does not have [default] in the `data` content.
Expected results:
Background
When we run our agent we set the proxy environment variables as can be seen here
When the user SSHs into the host, the shell does not have those environment variables set.
Issue
This means that when the user is trying to debug network connectivity (for example, in day-2 users often SSH to see why they can't reach the day-1 cluster's API), they will usually try to run curl to see whether they can reach the URL themselves, but it might behave differently than the agent because the shell, by default, doesn't use the proxy settings.
Solution
Set the default environment variables (through .profile) of the core and root shells to include the same proxy environment variables as the agent, so that when the user logs into the host to run commands, they would have the same proxy settings as the ones the agent has.
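A minimal sketch of what the solution could look like, assuming the agent knows its proxy values and writes a profile.d snippet; the file path, function name and values are illustrative:
```
package main

import (
	"fmt"
	"os"
)

// writeProxyProfile drops a profile.d snippet so interactive shells pick up
// the same proxy settings the agent runs with.
func writeProxyProfile(httpProxy, httpsProxy, noProxy string) error {
	content := fmt.Sprintf(
		"export HTTP_PROXY=%q\nexport HTTPS_PROXY=%q\nexport NO_PROXY=%q\n"+
			"export http_proxy=%q\nexport https_proxy=%q\nexport no_proxy=%q\n",
		httpProxy, httpsProxy, noProxy, httpProxy, httpsProxy, noProxy)
	return os.WriteFile("/etc/profile.d/99-agent-proxy.sh", []byte(content), 0o644)
}

func main() {
	// Illustrative values; in practice these come from the agent's own environment.
	if err := writeProxyProfile(
		"http://proxy.example.com:3128",
		"http://proxy.example.com:3128",
		".cluster.local,localhost,127.0.0.1",
	); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```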
Example
One example where we ran into this issue is when a customer forgot to set the correct noProxy settings in the UI during day-2, and so the agent was complaining about not being able to reach the day-1 API server (as the API server is unreachable through the proxy), but when we SSHd into the host and tried to curl, everything seemed to be working fine. Only after we ran tcpdump to see the difference in requests that we noticed the agent was routing requests through the proxy but curl wasn't, because the shell didn't have the proxy settings by default. If the shell had the correct proxy settings, it would've been easier to troubleshoot the problem.
Description of problem:
The NS autolabeler should adjust the PSS namespace labels such that a previously permitted workload (based on the SCCs it has access to) can still run.
The autolabeler requires the RoleBinding's .subjects[].namespace to be set when .subjects[].kind is ServiceAccount even though this is not required by the RBAC system to successfully bind the SA to a Role
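A minimal sketch of the namespace defaulting the autolabeler could apply, mirroring how RBAC resolves a ServiceAccount subject with an empty namespace; the helper is illustrative, not the controller's actual code:
```
package main

import (
	"fmt"

	rbacv1 "k8s.io/api/rbac/v1"
)

// serviceAccountNamespace resolves the namespace of a ServiceAccount subject
// the same way RBAC does: an empty namespace means the RoleBinding's own.
func serviceAccountNamespace(rb rbacv1.RoleBinding, s rbacv1.Subject) string {
	if s.Kind != rbacv1.ServiceAccountKind {
		return s.Namespace
	}
	if s.Namespace != "" {
		return s.Namespace
	}
	return rb.Namespace
}

func main() {
	rb := rbacv1.RoleBinding{}
	rb.Namespace = "test"
	subj := rbacv1.Subject{Kind: rbacv1.ServiceAccountKind, Name: "mysa"}
	fmt.Println(serviceAccountNamespace(rb, subj)) // "test"
}
```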
Version-Release number of selected component (if applicable):
$ oc version
Client Version: 4.7.0-0.ci-2021-05-21-142747
Server Version: 4.12.0-0.nightly-2022-08-15-150248
Kubernetes Version: v1.24.0+da80cd0
How reproducible: 100%
Steps to Reproduce:
---
apiVersion: v1
kind: Namespace
metadata:
  name: test
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: mysa
  namespace: test
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: myrole
  namespace: test
rules:
- apiGroups:
  - security.openshift.io
  resourceNames:
  - privileged
  resources:
  - securitycontextconstraints
  verbs:
  - use
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: myrb
  namespace: test
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: myrole
subjects:
- kind: ServiceAccount
  name: mysa
  #namespace: test # This is required for the autolabeler
---
kind: Job
apiVersion: batch/v1
metadata:
  name: myjob
  namespace: test
spec:
  template:
    spec:
      containers:
      - name: ubi
        image: registry.access.redhat.com/ubi8
        command: ["/bin/bash", "-c"]
        args: ["whoami; sleep infinity"]
      restartPolicy: Never
      securityContext:
        runAsUser: 0
      serviceAccount: mysa
      terminationGracePeriodSeconds: 2
Actual results:
Applying the manifest above, the Job's pod will not start:
$ kubectl -n test describe job/myjob...Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreate 20s job-controller Error creating: pods "myjob-zxcvv" is forbidden: violates PodSecurity "restricted:v1.24": allowPrivilegeEscalation != false (container "ubi" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "ubi" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "ubi" must set securityContext.runAsNonRoot=true), runAsUser=0 (pod must not set runAsUser=0), seccompProfile (pod or container "ubi" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
Warning FailedCreate 20s job-controller Error creating: pods "myjob-fkb9x" is forbidden: violates PodSecurity "restricted:v1.24": allowPrivilegeEscalation != false (container "ubi" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "ubi" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "ubi" must set securityContext.runAsNonRoot=true), runAsUser=0 (pod must not set runAsUser=0), seccompProfile (pod or container "ubi" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
Warning FailedCreate 10s job-controller Error creating: pods "myjob-5klpc" is forbidden: violates PodSecurity "restricted:v1.24": allowPrivilegeEscalation != false (container "ubi" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "ubi" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "ubi" must set securityContext.runAsNonRoot=true), runAsUser=0 (pod must not set runAsUser=0), seccompProfile (pod or container "ubi" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
Uncommenting the "namespace" field in the RoleBinding will allow it to start as the autolabeler will adjust the Namespace labels.
However, the namespace field isn't actually required by the RBAC system. Instead of using the autolabeler, the pod can be allowed to run by (w/o uncommenting the field):
$ kubectl label ns/test security.openshift.io/scc.podSecurityLabelSync=false
namespace/test labeled
$ kubectl label ns/test pod-security.kubernetes.io/enforce=privileged --overwrite
namespace/test labeled
We now see that the pod is running as root and has access to the privileged scc:
$ kubectl -n test get po -oyaml
apiVersion: v1
items:
- apiVersion: v1
kind: Pod
metadata:
annotations:
k8s.ovn.org/pod-networks: '{"default":{"ip_addresses":["10.129.2.18/23"],"mac_address":"0a:58:0a:81:02:12","gateway_ips":["10.129.2.1"],"ip_address":"10.129.2.18/23","gateway_ip":"10.129.2.1"}}'
k8s.v1.cni.cncf.io/network-status: |-
[{
"name": "ovn-kubernetes",
"interface": "eth0",
"ips": [
"10.129.2.18"
],
"mac": "0a:58:0a:81:02:12",
"default": true,
"dns": {}
}]
k8s.v1.cni.cncf.io/networks-status: |-
[{
"name": "ovn-kubernetes",
"interface": "eth0",
"ips": [
"10.129.2.18"
],
"mac": "0a:58:0a:81:02:12",
"default": true,
"dns": {}
}]
openshift.io/scc: privileged
creationTimestamp: "2022-08-16T13:08:24Z"
generateName: myjob-
labels:
controller-uid: 1867dbe6-73b2-44ea-a324-45c9273107b8
job-name: myjob
name: myjob-rwjmv
namespace: test
ownerReferences:
- apiVersion: batch/v1
blockOwnerDeletion: true
controller: true
kind: Job
name: myjob
uid: 1867dbe6-73b2-44ea-a324-45c9273107b8
resourceVersion: "36418"
uid: 39f18dea-31d4-4783-85b5-8ae6a8bec1f4
spec:
containers:
- args:
- whoami; sleep infinity
command:
- /bin/bash
- -c
image: registry.access.redhat.com/ubi8
imagePullPolicy: Always
name: ubi
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-6f2h6
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
imagePullSecrets:
- name: mysa-dockercfg-mvmtn
nodeName: ip-10-0-140-172.ec2.internal
preemptionPolicy: PreemptLowerPriority
priority: 0
restartPolicy: Never
schedulerName: default-scheduler
securityContext:
runAsUser: 0
serviceAccount: mysa
serviceAccountName: mysa
terminationGracePeriodSeconds: 2
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- name: kube-api-access-6f2h6
projected:
defaultMode: 420
sources:
- serviceAccountToken:
expirationSeconds: 3607
path: token
- configMap:
items:
- key: ca.crt
path: ca.crt
name: kube-root-ca.crt
- downwardAPI:
items:
- fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
path: namespace
- configMap:
items:
- key: service-ca.crt
path: service-ca.crt
name: openshift-service-ca.crt
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2022-08-16T13:08:24Z"
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: "2022-08-16T13:08:28Z"
status: "True"
type: Ready
- lastProbeTime: null
lastTransitionTime: "2022-08-16T13:08:28Z"
status: "True"
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: "2022-08-16T13:08:24Z"
status: "True"
type: PodScheduled
containerStatuses:
- containerID: cri-o://8fd1c3a5ee565a1089e4e6032bd04bceabb5ab3946c34a2bb55d3ee696baa007
image: registry.access.redhat.com/ubi8:latest
imageID: registry.access.redhat.com/ubi8@sha256:08e221b041a95e6840b208c618ae56c27e3429c3dad637ece01c9b471cc8fac6
lastState: {}
name: ubi
ready: true
restartCount: 0
started: true
state:
running:
startedAt: "2022-08-16T13:08:28Z"
hostIP: 10.0.140.172
phase: Running
podIP: 10.129.2.18
podIPs:
- ip: 10.129.2.18
qosClass: BestEffort
startTime: "2022-08-16T13:08:24Z"
kind: List
metadata:
resourceVersion: ""
$ kubectl -n test logs job/myjob
root
Expected results:
The autolabeler should properly follow the RoleBinding back to the SCC
Additional info:
Description of problem:
While updating a cluster to 4.12.11, which contains the fix for [OCPBUGS-7999|https://issues.redhat.com/browse/OCPBUGS-7999] (the 4.12.z backport of [OCPBUGS-2783|https://issues.redhat.com/browse/OCPBUGS-2783]), the older {{{Custom|Default}RouteSync{Degraded|Progressing}}} conditions are not cleaned up as they should be per the [OCPBUGS-2783|https://issues.redhat.com/browse/OCPBUGS-2783] resolution, while the newer ones are added. Because of this, an upgrade to 4.12.11 (or higher, until this bug is fixed) can hit a problem very similar to the one that led to [OCPBUGS-2783|https://issues.redhat.com/browse/OCPBUGS-2783] in the first place. So we need to do a proper cleanup of the older conditions.
Version-Release number of selected component (if applicable):
4.12.11 and higher
How reproducible:
Always, as far as the wrong conditions are concerned. It only leads to issues if one of the wrong conditions was in an unhealthy state.
Steps to Reproduce:
1. Upgrade 2. 3.
Actual results:
Both new (and correct) conditions plus older (and wrong) conditions.
Expected results:
Both new (and correct) conditions only.
Additional info:
The problem seems to be that the stale conditions controller is created[1] with a list containing {{CustomRouteSync}} and {{DefaultRouteSync}}, while that list should contain {{CustomRouteSyncDegraded}}, {{CustomRouteSyncProgressing}}, {{DefaultRouteSyncDegraded}} and {{DefaultRouteSyncProgressing}}. From reading the controller source code a bit, it does not match prefixes but performs a literal comparison. [1] - https://github.com/openshift/console-operator/blob/0b54727/pkg/console/starter/starter.go#L403-L404
Description of problem:
During the creation of a new HostedCluster, the control-plane-operator reports several lines of logs like
{"level":"error","ts":"2023-05-04T05:24:03Z","msg":"failed to remove service ca annotation and secret: %w","controller":"hostedcontrolplane","controllerGroup":"hypershift.openshift.io","controllerKind":"HostedControlPlane","hostedControlPlane":{"name":"demo-02","namespace":"clusters-demo-02"},"namespace":"clusters-demo-02","name":"demo-02","reconcileID":"5ffe0a7f-94ce-4745-b89d-4d5168cabe8d","error":"failed to get service: Service \"node-tuning-operator\" not found","stacktrace":"github.com/openshift/hypershift/control-plane-operator/controllers/hostedcontrolplane.(*HostedControlPlaneReconciler).reconcile\n\t/hypershift/control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go:929\ngithub.com/openshift/hypershift/control-plane-operator/controllers/hostedcontrolplane.(*HostedControlPlaneReconciler).update\n\t/hypershift/control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go:830\ngithub.com/openshift/hypershift/control-plane-operator/controllers/hostedcontrolplane.(*HostedControlPlaneReconciler).Reconcile\n\t/hypershift/control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go:677\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234"}
Until the Service / Secret are created.
Version-Release number of selected component (if applicable):
Management cluster: 4.14.0-nightly Hosted Cluster: 4.13.0 or 4.14.0-nightly
How reproducible:
Always
Steps to Reproduce:
1. Create a hosted cluster
Actual results:
HostedCluster is created but there are several unnecessary "error" logs in the control-plane-operator
Expected results:
No error logs from control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go:removeServiceCAAnnotationAndSecret() during normal cluster creation
Additional info:
Marko Luksa reported that Multus is missing the '/etc/cni/multus/net.d' mount in OCP 4.14; here are the reproduction steps (verified by the Multus team):
Our original reproducer would be too complex, so I had to write a simple one for you:
Use a 4.14 OpenShift cluster
Create the CNI plugin installer DaemonSet in namespace test:
oc apply -f https://gist.githubusercontent.com/luksa/c4d444e918124604839c424339c29a62/raw/1454bd389138980ea3f93bcfaf6026d4821e3543/noop-cni-plugin-installer.yaml
Create the test Deployment:
oc apply -f https://gist.githubusercontent.com/luksa/4c7c144ef88b1b0d8f772d6eacdeec14/raw/06b161fdb8c71406f4531d35550bd507a6a25200/test-deployment.yaml
Describe the test pod:
oc -n test describe po test
The last event shows the following:
ERRORED: error configuring pod [test/test-6cf67dcfb6-hgszq] networking: Multus: [test/test-6cf67dcfb6-hgszq/3e8a6f0d-ce84-4885-a7a7-43506669339f]: error loading k8s delegates k8s args: TryLoadPodDelegates: error in getting k8s network for pod: GetNetworkDelegates: failed getting the delegate: GetCNIConfig: err in GetCNIConfigFromFile: No networks found in /etc/cni/multus/net.d
The same reproducer runs fine on OCP 4.13
Description of problem:
The current version of openshift/router vendors Kubernetes 1.26 packages. OpenShift 4.14 is based on Kubernetes 1.27.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Check https://github.com/openshift/router/blob/release-4.14/go.mod
Actual results:
Kubernetes packages (k8s.io/api, k8s.io/apimachinery, k8s.io/apiserver, and k8s.io/client-go) are at version v0.26
Expected results:
Kubernetes packages are at version v0.27.0 or later.
Additional info:
Using old Kubernetes API and client packages brings risk of API compatibility issues.
Description of problem:
I attempted to install a BM SNO with the agent based installer. In the install_config, I disabled all supported capabilities except marketplace.
Install_config snippet:
capabilities:
  baselineCapabilitySet: None
  additionalEnabledCapabilities:
  - marketplace
The system installed fine but the capabilities config was not passed down to the cluster.
clusterversion:
status:
  availableUpdates: null
  capabilities:
    enabledCapabilities:
    - CSISnapshot
    - Console
    - Insights
    - Storage
    - baremetal
    - marketplace
    - openshift-samples
    knownCapabilities:
    - CSISnapshot
    - Console
    - Insights
    - Storage
    - baremetal
    - marketplace
    - openshift-samples
oc -n kube-system get configmap cluster-config-v1 -o yaml
apiVersion: v1
data:
  install-config: |
    additionalTrustBundlePolicy: Proxyonly
    apiVersion: v1
    baseDomain: ptp.lab.eng.bos.redhat.com
    bootstrapInPlace:
      installationDisk: /dev/disk/by-id/wwn-0x62cea7f04d10350026c6f2ec315557a0
    compute:
    - architecture: amd64
      hyperthreading: Enabled
      name: worker
      platform: {}
      replicas: 0
    controlPlane:
      architecture: amd64
      hyperthreading: Enabled
      name: master
      platform: {}
      replicas: 1
    metadata:
      creationTimestamp: null
      name: cnfde8
    networking:
      clusterNetwork:
      - cidr: 10.128.0.0/14
        hostPrefix: 23
      machineNetwork:
      - cidr: 10.16.231.0/24
      networkType: OVNKubernetes
      serviceNetwork:
      - 172.30.0.0/16
    platform:
      none: {}
    publish: External
    pullSecret: ""
Version-Release number of selected component (if applicable):
4.12.0-rc.5
How reproducible:
100%
Steps to Reproduce:
1. Install SNO with agent based installer as described above 2. 3.
Actual results:
Capabilities installed
Expected results:
Capabilities not installed
Additional info:
Description of problem:
When trying to import the Helm chart "httpd-imagestreams", the "Create Helm Release" page shows an info alert that the form isn't available because there isn't a schema for this Helm chart. However, the YAML view is also not visible.
Info Alert:
Form view is disabled for this chart because the schema is not available
Version-Release number of selected component (if applicable):
4.9-4.14 (current master)
How reproducible:
Always
Steps to Reproduce:
Actual results:
Expected results:
Additional info:
The chart yaml is available here and doesn't contain a schema (at the moment).
Description of problem:
machine-config-operator will fail on clusters deployed with IPI on Power Virtual Server with the following error: Cluster not available for []: ControllerConfig.machineconfiguration.openshift.io "machine-config-controller" is invalid: spec.infra.status.platformStatus.powervs.resourceGroup: Invalid value: "": spec.infra.status.platformStatus.powervs.resourceGroup in body should match '^[a-zA-Z0-9-_
Version-Release number of selected component (if applicable):
4.14 and 4.13
How reproducible:
100%
Steps to Reproduce:
1. Deploy with openshift-installer to Power VS 2. Wait for masters to start deploying 3. Error will appear for the machine-config CO
Actual results:
MCO fails
Expected results:
MCO should come up
Additional info:
Fix has been identified
Description of problem:
Pipelines Creation YAML form is not allowing v1beta1 YAMLs get created
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Open the Pipelines Creation YAML form 2. Paste the following YAML 3. Submit the form
Actual results:
The form does not submit, stating a version mismatch: expects v1, got v1beta1.
Expected results:
The YAML form must support creating both versions.
Additional info:
The issue is not observed when the "Import from YAML" Form is used.
Attachment: https://drive.google.com/file/d/1B_sAuGREgmX800JXGmrL30iByowfHzs7/view?usp=sharing
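For illustration, a minimal v1beta1 Pipeline like the following (a hypothetical example, not the attached YAML) is enough to hit the version check in the creation form:
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: example-pipeline
spec:
  tasks:
  - name: echo
    taskSpec:
      steps:
      - name: echo
        image: registry.access.redhat.com/ubi8/ubi-minimal
        script: |
          echo "hello from v1beta1"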
https://github.com/kubernetes/klog is the favored fork of glog; it resolves a number of issues that glog has in containerized environments.
Description of problem:
The TRT ComponentReadiness tool shows what looks like a regression (https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?arch=amd64&baseEndTime=2023-05-16%2023%3A59%3A59&baseRelease=4.13&baseStartTime=2023-04-16%2000%3A00%3A00&capability=Other&component=Monitoring&confidence=95&environment=ovn%20no-upgrade%20amd64%20aws%20hypershift&excludeArches=heterogeneous%2Carm64%2Cppc64le%2Cs390x&groupBy=cloud%2Carch%2Cnetwork&ignoreDisruption=true&ignoreMissing=false&minFail=3&network=ovn&pity=5&platform=aws&sampleEndTime=2023-07-20%2023%3A59%3A59&sampleRelease=4.14&sampleStartTime=2023-07-13%2000%3A00%3A00&testId=openshift-tests%3A79898d2e28b78374d89e10b38f88107b&testName=%5Bsig-instrumentation%5D%20Prometheus%20%5Bapigroup%3Aimage.openshift.io%5D%20when%20installed%20on%20the%20cluster%20should%20report%20telemetry%20%5BLate%5D%20%5BSkipped%3ADisconnected%5D%20%5BSuite%3Aopenshift%2Fconformance%2Fparallel%5D&upgrade=no-upgrade&variant=hypershift) in the "[sig-instrumentation] Prometheus [apigroup:image.openshift.io] when installed on the cluster should report telemetry [Late] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]" test. In the ComponentReadiness link above, you can see the sample runs (linked with red "F").
Version-Release number of selected component (if applicable):
4.14
How reproducible:
The pass rate in 4.13 is 100% vs. 81% in 4.14
Steps to Reproduce:
1. The query above focuses on "periodic-ci-openshift-hypershift-release-4.14-periodics-e2e-aws-ovn-conformance" jobs and the specific test mentioned. You can see the failures by clicking on the red "F"s 2. 3.
Actual results:
The failures look like: { fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:365]: Unexpected error: <errors.aggregate | len:2, cap:2>: [promQL query returned unexpected results: metricsclient_request_send{client="federate_to",job="telemeter-client",status_code="200"} >= 1 [], promQL query returned unexpected results: federate_samples{job="telemeter-client"} >= 10 []] [ <*errors.errorString | 0xc0017611b0>{ s: "promQL query returned unexpected results:\nmetricsclient_request_send{client=\"federate_to\",job=\"telemeter-client\",status_code=\"200\"} >= 1\n[]", }, <*errors.errorString | 0xc00203d380>{ s: "promQL query returned unexpected results:\nfederate_samples{job=\"telemeter-client\"} >= 10\n[]", }, ]
Expected results:
Query should succeed
Additional info:
I set the severity to Major because this looks like a regression from where it was in the 5 weeks before 4.13 went GA.
Description of the problem:
When an ICSP is provided in the install config for caching images locally while also using the SaaS, the cluster fails to prepare for installation because oc adm release extract tries to use the ICSP from the install config.
How reproducible:
100% on a fresh deploy, but 0% if the installer cache is already warmed up.
Steps to reproduce:
1. Deploy fresh replicas to the SaaS environment
2. Create a cluster
3. Override the install config and add ICSP content for a registry that is inaccessible from the SaaS (see the sketch after these steps)
4. Install cluster
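The override in step 3 amounts to adding an imageContentSources stanza along these lines to the install config (a sketch; the mirror registry host is a placeholder that is unreachable from the SaaS):
imageContentSources:
- mirrors:
  - registry.example.internal:5000/ocp4/openshift4
  source: quay.io/openshift-release-dev/ocp-release
- mirrors:
  - registry.example.internal:5000/ocp4/openshift4
  source: quay.io/openshift-release-dev/ocp-v4.0-art-dev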
Actual results:
Cluster fails to prepare with an error like:
Failed to prepare the installation due to an unexpected error: failed generating install config for cluster f3e55b14-297d-453b-8ef4-953caebefc67: failed to get installer path: command 'oc adm release extract --command=openshift-install --to=/data/install-config-generate/installercache/quay.io/openshift-release-dev/ocp-release:4.13.0-x86_64 --insecure=false --icsp-file=/tmp/icsp-file1525063401 quay.io/openshift-release-dev/ocp-release:4.13.0-x86_64 --registry-config=/tmp/registry-config882468533' exited with non-zero exit code 1: warning: --icsp-file only applies to images referenced by digest and will be ignored for tags error: unable to read image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:81be8aec46465412abbef5f1ec252ee4a17b043e82d31feac13d25a8a215a2c9: unauthorized: access to the requested resource is not authorized . Please retry later
Expected results:
Installer image is pulled successfully.
Additional Information
This seems to have been introduced in https://github.com/openshift/assisted-service/pull/4115 when we started pulling ICSP information from the install config.
Description of problem:
Cluster Network Operator managed component multus-admission-controller does not conform to Hypershift control plane expectations. When CNO is managed by Hypershift, multus-admission-controller and other CNO-managed deployments should run with a non-root security context. If Hypershift runs the control plane on a Kubernetes (as opposed to OpenShift) management cluster, it adds a pod security context with a runAsUser element to its managed deployments, including CNO. In that case CNO should do the same and set the security context for its managed deployments, such as multus-admission-controller, to meet Hypershift security rules.
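Concretely, the expectation is that CNO sets something like the following on the pod template of its managed deployments when running under Hypershift on a Kubernetes management cluster (a minimal sketch; the UID value is illustrative):
spec:
  template:
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1001   # illustrative UID; the actual value comes from the management cluster's policy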
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create an OCP cluster using Hypershift with a Kubernetes management cluster 2. Check the pod security context of multus-admission-controller
Actual results:
no pod security context is set on multus-admission-controller
Expected results:
pod security context is set with runAsUser: xxxx
Additional info:
Corresponding CNO change
Description of problem:
Component Readiness is showing a regression in 4.14 compared to 4.13 in the rt variant of test Cluster resource quota should control resource limits across namespaces. Example
{ fail [github.com/openshift/origin/test/extended/quota/clusterquota.go:107]: unexpected error: timed out waiting for the condition Ginkgo exit error 1: exit with code 1}
Looker studio graph (scroll down to see) shows the regression started around May 24th.
Version-Release number of selected component (if applicable):
How reproducible:
4.13 Sippy shows 100% success rate vs. 4.14 which is down to about 91%
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Historical pass rate was 100%
Additional info:
Description of problem:
Same for OCP 4.14.
In OCP 4.13, when trying to reach the Prometheus UI via port-forward, e.g. `oc port-forward prometheus-k8s-0`, the UI URL ($HOST:9090/graph) returns `Error opening React index.html: open web/ui/static/react/index.html: no such file or directory`.
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2023-01-24-061922
How reproducible:
100%
Steps to Reproduce:
1. oc -n openshift-monitoring port-forward prometheus-k8s-0 9090:9090 --address='0.0.0.0' 2. curl http://localhost:9090/graph
Actual results:
Error opening React index.html: open web/ui/static/react/index.html: no such file or directory
Expected results:
Prometheus UI is loaded
Additional info:
The UI loads fine when following the same steps in 4.12.
Removes the version check on reconciling the image content type policy since that is not needed in release image versions greater than 4.13.
Description of problem:
Visiting the global configurations page returns an error after 'Red Hat OpenShift Serverless' is installed; the error persists even after the operator is uninstalled.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-06-06-212044
How reproducible:
Always
Steps to Reproduce:
1. Subscribe 'Red Hat OpenShift Serverless' from OperatorHub, wait for the operator to be successfully installed 2. Visit Administration -> Cluster Settings -> Configurations tab
Actual results:
react_devtools_backend_compact.js:2367 unhandled promise rejection: TypeError: Cannot read properties of undefined (reading 'apiGroup') at r (main-chunk-e70ea3b3d562514df486.min.js:1:1) at main-chunk-e70ea3b3d562514df486.min.js:1:1 at Array.map (<anonymous>) at main-chunk-e70ea3b3d562514df486.min.js:1:1 overrideMethod @ react_devtools_backend_compact.js:2367 window.onunhandledrejection @ main-chunk-e70ea3b3d562514df486.min.js:1 main-chunk-e70ea3b3d562514df486.min.js:1 Uncaught (in promise) TypeError: Cannot read properties of undefined (reading 'apiGroup') at r (main-chunk-e70ea3b3d562514df486.min.js:1:1) at main-chunk-e70ea3b3d562514df486.min.js:1:1 at Array.map (<anonymous>) at main-chunk-e70ea3b3d562514df486.min.js:1:1
Expected results:
no errors
Additional info:
This is a clone of issue OCPBUGS-19512. The following is the description of the original issue:
—
OCPBUGS-5469 and backports began prioritizing later target releases, but we still wait 10m between different PromQL evaluations while evaluating conditional update risks. This ticket is tracking work to speed up cache warming, and allows changes that are too invasive to be worth backporting.
Definition of done:
Acceptance Criteria:
Description of problem:
In an STS cluster with the TechPreviewNoUpgrade featureset enabled, CCO ignores CRs whose .spec.providerSpec.stsIAMRoleARN is unset. While the CR controller does not provision a Secret for the aforementioned type of CRs, it still sets .status.provisioned to true for them.
Steps to Reproduce:
1. Create an STS cluster, enable the feature set. 2. Create a dummy CR like the following: fxie-mac:cloud-credential-operator fxie$ cat cr2.yaml apiVersion: cloudcredential.openshift.io/v1 kind: CredentialsRequest metadata: name: test-cr-2 namespace: openshift-cloud-credential-operator spec: providerSpec: apiVersion: cloudcredential.openshift.io/v1 kind: AWSProviderSpec statementEntries: - action: - ec2:CreateTags effect: Allow resource: '*' secretRef: name: test-secret-2 namespace: default serviceAccountNames: - default 3. Check CR.status fxie-mac:cloud-credential-operator fxie$ oc get credentialsrequest test-cr-2 -n openshift-cloud-credential-operator -o yaml apiVersion: cloudcredential.openshift.io/v1 kind: CredentialsRequest metadata: creationTimestamp: "2023-07-24T09:21:44Z" finalizers: - cloudcredential.openshift.io/deprovision generation: 1 name: test-cr-2 namespace: openshift-cloud-credential-operator resourceVersion: "180154" uid: 34b36cac-3fca-4fa5-a003-a9b64c5fbf00 spec: providerSpec: apiVersion: cloudcredential.openshift.io/v1 kind: AWSProviderSpec statementEntries: - action: - ec2:CreateTags effect: Allow resource: '*' secretRef: name: test-secret-2 namespace: default serviceAccountNames: - default status: lastSyncGeneration: 0 lastSyncTimestamp: "2023-07-24T09:39:40Z" provisioned: true
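For contrast, a CredentialsRequest intended for STS mode carries the role ARN in its provider spec, roughly like this (a minimal sketch; the ARN and secret name are placeholders):
apiVersion: cloudcredential.openshift.io/v1
kind: CredentialsRequest
metadata:
  name: test-cr-sts
  namespace: openshift-cloud-credential-operator
spec:
  providerSpec:
    apiVersion: cloudcredential.openshift.io/v1
    kind: AWSProviderSpec
    stsIAMRoleARN: arn:aws:iam::123456789012:role/example-role   # placeholder
    statementEntries:
    - action:
      - ec2:CreateTags
      effect: Allow
      resource: '*'
  secretRef:
    name: test-secret-sts
    namespace: default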
Description of problem:
After the private cluster was destroyed, the cluster's DNS records were left behind.
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2023-02-26-022418 4.13.0-0.nightly-2023-02-26-081527
How reproducible:
always
Steps to Reproduce:
1.create a private cluster 2.destroy the cluster 3.check the dns record $ibmcloud dns zones | grep private-ibmcloud.qe.devcluster.openshift.com (base_domain) 3c7af30d-cc2c-4abc-94e1-3bcb36e01a9b private-ibmcloud.qe.devcluster.openshift.com PENDING_NETWORK_ADD $zone_id=3c7af30d-cc2c-4abc-94e1-3bcb36e01a9b $ibmcloud dns resource-records $zone_id CNAME:520c532f-ca61-40eb-a04e-1a2569c14a0b api-int.ci-op-wkb4fgd6-eef7e.private-ibmcloud.qe.devcluster.openshift.com CNAME 60 10a7a6c7-jp-tok.lb.appdomain.cloud CNAME:751cf3ce-06fc-4daf-8a44-bf1a8540dc60 api.ci-op-wkb4fgd6-eef7e.private-ibmcloud.qe.devcluster.openshift.com CNAME 60 10a7a6c7-jp-tok.lb.appdomain.cloud CNAME:dea469e3-01cd-462f-85e3-0c1e6423b107 *.apps.ci-op-wkb4fgd6-eef7e.private-ibmcloud.qe.devcluster.openshift.com CNAME 120 395ec2b3-jp-tok.lb.appdomain.cloud
Actual results:
The DNS records of the cluster were left behind.
Expected results:
All DNS records created by the installer are deleted after the cluster is destroyed.
Additional info:
This blocks creating private clusters later, because the maximum limit of 5 wildcard records is easily reached (a QE account limitation). Checking the ingress-operator log of the failed cluster shows the error: "createOrUpdateDNSRecord: failed to create the dns record: Reached the maximum limit of 5 wildcard records."
It is caused by the power off routine, which initialises last_error to None. The field is later restored, but BMO manages to observe and record the wrong value.
This issue is not trivial to reproduce in the product. You need OCPBUGS-2471 to land first, then you need to trigger the cleaning failure several times. I used direct access to Ironic via CLI to abort cleaning (`baremetal node abort <node name>`) during deprovisioning. After a few attempts you can observe the following in the BMH's status:
status:
errorCount: 2
errorMessage: 'Cleaning failed: '
errorType: provisioning error
The empty message after the colon is a sign of this bug.
Description of the problem:
If an interface name is over 15 characters long, NetworkManager refuses to bring the interface up.
How reproducible:
Depends on the system interface names
Steps to reproduce:
1. Create a cluster with static networking (a vlan with a large id works best)
2. Boot a host with the discovery ISO
Actual results:
Host interface does not come up if the resulting interface name is over 15 characters
Expected results:
Interfaces should always come up
Additional info:
Slack thread: https://redhat-internal.slack.com/archives/CUPJTHQ5P/p1689956128746919?thread_ts=1689774706.220319&cid=CUPJTHQ5P
Attached a screenshot of the log stating the connection name is too long.
This happens because our script to apply static networking on a host uses the host interface name and appends the extension nmstate added for the interface.
In this case the interface name was enp94s0f0np0 with a vlan id of 2507. This meant that the resulting interface name was enp94s0f0np0.2507 (17 characters).
When configuring this interface manually as a workaround the user stated that the interface name (not the vlan id) was truncated to accommodate the length limit.
So in this case the valid interface created by nmcli was "enp94s0f0n.2507"; we should attempt to replicate this behavior.
Also attached a screenshot of the working interface.
Description of problem:
A 'Show tooltips' toggle has been added to the resource YAML page, but the checkbox icon is not aligned with the other icons.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-05-23-103225
How reproducible:
Always
Steps to Reproduce:
1. goes to any resource YAML page, check 'Show tooltips' icon position 2. 3.
Actual results:
1. the checkbox is a little above other icons, see screenshot https://drive.google.com/file/d/10wKeRaaE76GBXBph93wAkFCWYGrEKcA9/view?usp=share_link
Expected results:
1. all icons should be aligned
Additional info:
Tracker issue for bootimage bump in 4.14. This issue should block issues which need a bootimage bump to fix.
The previous bump was OCPBUGS-13960.
1. Proposed title of this feature request
Support new Azure LoadBalancer 100min idle TCP timeout
2. What is the nature and description of the request ?
When provisioning a service of type LoadBalancer for an OCP cluster on Azure, it is possible to customize the TCP idle timeout in minutes using the LoadBalancer annotation 'service.beta.kubernetes.io/azure-load-balancer-tcp-idle-timeout'.
Currently, the min and max values are hardcoded to 4 and 30 respectively, in both the legacy Azure cloud provider implementation and cloud-provider-azure.
Recently Azure upgraded its implementation to support a maximum idle timeout of 100 minutes; the corresponding documentation ("Configure TCP reset and idle timeout for Azure Load Balancer") should be updated soon. It is now possible to set an idle timeout of more than 30 minutes manually in the Azure portal or with the Azure CLI, but not from a Kubernetes load balancer, as the max value is still 30 minutes in the Kubernetes code.
Error message returned is
`Warning SyncLoadBalancerFailed 2s (x3 over 18s) service-controller Error syncing load balancer: failed to ensure load balancer: idle timeout value must be a whole number representing minutes between 4 and 30`
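For reference, the annotation in question is set on the Service like this (a minimal sketch; the names and the 60-minute value are illustrative, and anything above 30 is currently rejected as shown above):
apiVersion: v1
kind: Service
metadata:
  name: example-lb
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-tcp-idle-timeout: "60"
spec:
  type: LoadBalancer
  selector:
    app: example
  ports:
  - port: 443
    targetPort: 8443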
3. Why does the customer need this? (List the business requirements here)
The customer is migrating workloads from an on-premise datacenter to Azure. Using an idle timeout of more than 30 minutes is critical for migrating some of the customer's links to Azure, and this is blocking the migration until it is supported by OpenShift.
4. List any affected packages or components.
Azure cloud controller
Seeing segfault failures related to HAProxy on multiple platforms that begin around the same time as the [HAProxy bump|http://example.com] like:
{ nodes/ci-op-5s09hi2q-0dd98-rwds8-worker-centralus1-8nkx5/journal.gz:Apr 10 06:21:54.317971 ci-op-5s09hi2q-0dd98-rwds8-worker-centralus1-8nkx5 kernel: haproxy[302399]: segfault at 0 ip 0000556eadddafd0 sp 00007fff0cceed50 error 4 in haproxy[556eadc00000+2a3000]}
release-master-ci-4.14-upgrade-from-stable-4.13-e2e-azure-sdn-upgrade/1645265104259780608
periodic-ci-openshift-release-master-ci-4.14-e2e-gcp-ovn-upgrade/1645265114720374784
Description of problem:
The IPv6 VIP does not seem to be present in the keepalived.conf.
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  - cidr: fd65:10:128::/56
    hostPrefix: 64
  machineNetwork:
  - cidr: 192.168.110.0/23
  - cidr: fd65:a1a8:60ad::/112
  networkType: OVNKubernetes
  serviceNetwork:
  - 172.30.0.0/16
  - fd65:172:16::/112
platform:
  vsphere:
    apiVIPs:
    - 192.168.110.116
    - fd65:a1a8:60ad:271c::1116
    ingressVIPs:
    - 192.168.110.117
    - fd65:a1a8:60ad:271c::1117
    vcenters:
    - datacenters:
      - IBMCloud
      server: ibmvcenter.vmc-ci.devcluster.openshift.com
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2023-04-21-084440
How reproducible:
Frequently.
2 failures out of 3 attempts.
Steps to Reproduce:
1. Install vSphere dual-stack with dual VIPs, see above config
2. Check keepalived.conf:
for f in $(oc get pods -n openshift-vsphere-infra -l app=vsphere-infra-vrrp --no-headers -o custom-columns=N:.metadata.name ) ; do oc -n openshift-vsphere-infra exec -c keepalived $f -- cat /etc/keepalived/keepalived.conf | tee $f-keepalived.conf ; done
Actual results:
IPv6 VIP is not in keepalived.conf
Expected results:
vrrp_instance rbrattai_INGRESS_1 {
    state BACKUP
    interface br-ex
    virtual_router_id 129
    priority 20
    advert_int 1
    unicast_src_ip fd65:a1a8:60ad:271c::cc
    unicast_peer {
        fd65:a1a8:60ad:271c:9af:16a9:cb4f:d75c
        fd65:a1a8:60ad:271c:86ec:8104:1bc2:ab12
        fd65:a1a8:60ad:271c:5f93:c9cf:95f:9a6d
        fd65:a1a8:60ad:271c:bb4:de9e:6d58:89e7
        fd65:a1a8:60ad:271c:3072:2921:890:9263
    }
    ...
    virtual_ipaddress {
        fd65:a1a8:60ad:271c::1117/128
    }
    ...
}
Additional info:
See OPNET-207
TestAlertmanagerUWMSecrets is one of the tests that time out; see https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1971/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic-operator/1661649123104788480. Apparently it takes longer for the UWM Alertmanager to become ready.
Description of problem:
It seems that we don't correctly update the network data secret version in the PreprovisioningImage, resulting in BMO assuming that the image is still stale, while the image-customization-controller assumes it's done. As a result, the host is stuck in inspecting.
How reproducible:
What I think I did was add a network data secret to a host that already had a PreprovisioningImage created. I need to check if I can repeat it.
Actual results:
Host in inspecting, BMO logs show
{"level":"info","ts":"2023-05-11T11:52:52.348Z","logger":"controllers.BareMetalHost","msg":"network data in pre-provisioning image is out of date","baremetalhost":"openshift-machine-api/oste st-extraworker-0","provisioningState":"inspecting","latestVersion":"9055823","currentVersion":"9055820"}
Indeed, the image has the old version:
status:
  architecture: x86_64
  conditions:
  - lastTransitionTime: "2023-05-11T11:27:51Z"
    message: Generated image
    observedGeneration: 1
    reason: ImageSuccess
    status: "True"
    type: Ready
  - lastTransitionTime: "2023-05-11T11:27:51Z"
    message: ""
    observedGeneration: 1
    reason: ImageSuccess
    status: "False"
    type: Error
  format: iso
  imageUrl: http://metal3-image-customization-service.openshift-machine-api.svc.cluster.local/231b39d5-1b83-484c-9096-aa87c56a222a
  networkData:
    name: ostest-extraworker-0-network-config-secret
    version: "9055820"
What I find puzzling is that we even have two versions of the secret. I only created it once.
Description of problem:
Unable to set protectKernelDefaults from "true" to "false" in kubelet.conf on the nodes in RHOCP4.13 although this was possible in RHOCP4.12.
Version-Release number of selected component (if applicable):
Red Hat OpenShift Container Platform Version Number: 4 Release Number: 13 Kubernetes Version: v1.26.3+b404935 Docker Version: N/A Related Package Version: - cri-o-1.26.3-3.rhaos4.13.git641290e.el9.x86_64 Related Middleware/Application: none Underlying RHEL Release Number: Red Hat Enterprise Linux CoreOS release 4.13 Underlying RHEL Architecture: x86_64 Underlying RHEL Kernel Version: 5.14.0-284.13.1.el9_2.x86_64 Drivers or hardware or architecture dependency: none
How reproducible:
always
Steps to Reproduce:
1. Deploy OCP cluster using RHCOS 2. Set protectKernelDefaults as true using the document [1]
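The field is typically set through a KubeletConfig like the following (a minimal sketch; the pool selector label is an assumption):
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: set-protect-kernel-defaults
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""   # assumed worker pool label
  kubeletConfig:
    protectKernelDefaults: false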
Actual results:
protectKernelDefaults can't be set.
Expected results:
protectKernelDefaults can be set.
Additional info:
protectKernelDefaults in NOT set in kubelet.conf --- # oc debug node/ocp4-worker1 # chroot /host # cat /etc/kubernetes/kubelet.conf ... "protectKernelDefaults": true, <- NOT modified. Moreover, the format is changed to json. ... --- Also "protectKernelDefaults: false" does not seem to be set into the machineConfig created by kubeletConfig Kind. See below: --- # oc get mc 99-worker-generated-kubelet -o yaml ... storage: files: - contents: compression: "" source: data:text/plain;charset=utf-8;base64, [The contents of kubelet.conf encoded with base64] mode: 420 overwrite: true path: /etc/kubernetes/kubelet.conf // Write [The contents of kubelet.conf encoded with base64] to the file. # vim kubelet.conf // Decode [The contents of kubelet.conf encoded with base64] # cat kubelet.conf | base64 -d ... "protectKernelDefaults": true, <- "protectKernelDefaults: false" is not set. ---- [1] https://access.redhat.com/solutions/6974438
Sanitize OWNERS/OWNER_ALIASES:
1) OWNERS must have:
component: "Storage / Kubernetes External Components"
2) OWNER_ALIASES must have all team members of Storage team.
Description of the problem:
After successfully creating a hosted cluster using the CAPI agent provider with 6 worker nodes (on two different subnets), I attempted to scale down the nodepool to 0 replicas.
2 agents returned to the InfraEnv in "known-unbound" state, but the other 4 are still bound to the cluster, and their related Machine CRs are stuck in the Deleting phase.
$ oc get machines.cluster.x-k8s.io -n clusters-hosted-1
NAME                        CLUSTER          NODENAME            PROVIDERID                                     PHASE      AGE   VERSION
hosted-1-6655884866-dr4mv   hosted-1-vhc4f   hosted-rwn-1-1      agent://4cc93549-45cd-42a9-8c61-5d72b802ebe5   Deleting   94m   4.14.0-ec.3
hosted-1-6655884866-fkfjf   hosted-1-vhc4f   hosted-worker-1-0   agent://324afeeb-1af1-45d9-a2ba-f1101ffb6a6b   Deleting   94m   4.14.0-ec.3
hosted-1-6655884866-nzflz   hosted-1-vhc4f   hosted-rwn-1-2      agent://50b12199-7e95-4b3a-a5ce-d4aa0fa7909e   Deleting   94m   4.14.0-ec.3
hosted-1-6655884866-pc67l   hosted-1-vhc4f   hosted-worker-1-2   agent://284eb9e6-4375-4e59-9a11-a0a3131aa08b   Deleting   94m   4.14.0-ec.3
In the capi-provider pod logs I have the following:
time="2023-07-25T15:23:27Z" level=error msg="failed to add finalizer agentmachine.agent-install.openshift.io/deprovision to resource hosted-1-2ntnh clusters-hosted-1" func="github.com/openshift/cluster-api-provider-agent/controllers.(*AgentMachineReconciler).handleDeletionHook" file="/remote-source/app/controllers/agentmachine_controller.go:206" agent_machine=hosted-1-2ntnh agent_machine_namespace=clusters-hosted-1 error="Operation cannot be fulfilled on agentmachines.capi-provider.agent-install.openshift.io \"hosted-1-2ntnh\": StorageError: invalid object, Code: 4, Key: /kubernetes.io/capi-provider.agent-install.openshift.io/agentmachines/clusters-hosted-1/hosted-1-2ntnh, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 75febba6-8e98-4fca-861f-e83c467a3368, UID in object meta: "
and
time="2023-07-25T15:23:50Z" level=error msg="Failed to get agentMachine clusters-hosted-1/hosted-1-l4pp7" func="github.com/openshift/cluster-api-provider-agent/controllers.(*AgentMachineReconciler).Reconcile" file="/remote-source/app/controllers/agentmachine_controller.go:95" agent_machine=hosted-1-l4pp7 agent_machine_namespace=clusters-hosted-1 error="AgentMachine.capi-provider.agent-install.openshift.io \"hosted-1-l4pp7\" not found"
Actual results:
4 out of 6 agents are still bound to cluster
Expected results:
The nodepool is scaled to 0 replicas
Description of problem:
After customizing the routes for Console and Downloads, the `Downloads` route is not updated within `https://custom-console-route/command-line-tools` and still points to the old/default downloads route.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Customize Console and Downloads routes. 2. Access the web-console using custom console route. 3. Go to Command-line-tools. 4. Try to access the downloads urls.
Actual results:
When accessing the download URLs, they point to the default/old downloads route.
Expected results:
When accessing the download URLs, they should point to the custom downloads route.
Additional info:
Description of problem:
As discovered in https://bugzilla.redhat.com/show_bug.cgi?id=2111632 the dispatcher scripts don't have permission to set the hostname directly. We need to use systemd-run to get them into an appropriate SELinux context.
I doubt the static DHCP scripts are still being used intentionally since we have proper static IP support now, but the fix is pretty trivial, so we should go ahead and do it, as the feature is technically still supported.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
CR.status.lastSyncGeneration is not updated in STS mode (AWS).
Steps to Reproduce:
See https://issues.redhat.com/browse/OCPBUGS-16684.
Description of problem:
On Azure, when the vmSize or location field is dropped from the CPMS providerSpec, a master machine ends up in a creating/deleting loop.
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-10-25-210451
How reproducible:
always
Steps to Reproduce:
1. Create an Azure cluster with a CPMS 2. Activate the CPMS 3. Drop the vmsize field from the providerSpec
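The fields in question live under the CPMS provider spec, roughly here (a trimmed sketch of the relevant part of the cluster CPMS):
spec:
  template:
    machines_v1beta1_machine_openshift_io:
      spec:
        providerSpec:
          value:
            apiVersion: machine.openshift.io/v1beta1
            kind: AzureMachineProviderSpec
            location: eastus          # dropping this field, or vmSize below, triggers the loop
            vmSize: Standard_D8s_v3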
Actual results:
New machine is created, deleted, created, deleted ... $ oc get machine NAME PHASE TYPE REGION ZONE AGE zhsuncpms1-7svhz-master-0 Running Standard_D8s_v3 eastus 2 3h21m zhsuncpms1-7svhz-master-1 Running Standard_D8s_v3 eastus 3 3h21m zhsuncpms1-7svhz-master-2 Running Standard_D8s_v3 eastus 1 3h21m zhsuncpms1-7svhz-master-l489k-0 Deleting 0s zhsuncpms1-7svhz-worker-eastus1-6vsl4 Running Standard_D4s_v3 eastus 1 3h16m zhsuncpms1-7svhz-worker-eastus2-dpvp9 Running Standard_D4s_v3 eastus 2 3h16m zhsuncpms1-7svhz-worker-eastus3-sg7dx Running Standard_D4s_v3 eastus 3 19m $ oc get machine NAME PHASE TYPE REGION ZONE AGE zhsuncpms1-7svhz-master-0 Running Standard_D8s_v3 eastus 2 3h26m zhsuncpms1-7svhz-master-1 Running Standard_D8s_v3 eastus 3 3h26m zhsuncpms1-7svhz-master-2 Running Standard_D8s_v3 eastus 1 3h26m zhsuncpms1-7svhz-master-wmnfq-0 1s zhsuncpms1-7svhz-worker-eastus1-6vsl4 Running Standard_D4s_v3 eastus 1 3h21m zhsuncpms1-7svhz-worker-eastus2-dpvp9 Running Standard_D4s_v3 eastus 2 3h21m zhsuncpms1-7svhz-worker-eastus3-sg7dx Running Standard_D4s_v3 eastus 3 24m $ oc get controlplanemachineset NAME DESIRED CURRENT READY UPDATED UNAVAILABLE STATE AGE cluster 3 4 3 Active 25m $ oc get co control-plane-machine-set NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE control-plane-machine-set 4.12.0-0.nightly-2022-10-25-210451 True True False 4h38m Observed 3 replica(s) in need of update
Expected results:
Errors are logged and no machine is created, or the new machine is created successfully.
Additional info:
Drop vmSize, we can create new machine, seems default value is Standard_D4s_v3, but don't allow update. $ oc get machine NAME PHASE TYPE REGION ZONE AGE zhsunazure11-cdbs8-master-0 Running Standard_D8s_v3 eastus 2 4h7m zhsunazure11-cdbs8-master-000 Provisioned Standard_D4s_v3 eastus 2 48s zhsunazure11-cdbs8-master-1 Running Standard_D8s_v3 eastus 3 4h7m zhsunazure11-cdbs8-master-2 Running Standard_D8s_v3 eastus 1 4h7m zhsunazure11-cdbs8-worker-eastus1-5v66l Running Standard_D4s_v3 eastus 1 4h1m zhsunazure11-cdbs8-worker-eastus1-test Running Standard_D4s_v3 eastus 1 7m45s zhsunazure11-cdbs8-worker-eastus2-hm9bm Running Standard_D4s_v3 eastus 2 4h1m zhsunazure11-cdbs8-worker-eastus3-7j9kf Running Standard_D4s_v3 eastus 3 4h1m $ oc edit machineset zhsuncpms1-7svhz-worker-eastus3 error: machinesets.machine.openshift.io "zhsuncpms1-7svhz-worker-eastus3" could not be patched: admission webhook "validation.machineset.machine.openshift.io" denied the request: providerSpec.vmSize: Required value: vmSize should be set to one of the supported Azure VM sizes
Description of problem:
A leftover comment in CPMSO tests is causing a linting issue.
Version-Release number of selected component (if applicable):
4.13.z, 4.14.0
How reproducible:
Always
Steps to Reproduce:
1. make lint 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
When using a disconnected env and the OPENSHIFT_INSTALL_RELEASE_IMAGE_MIRROR env var is specified, the create-cluster-and-infraenv service fails[*]. The issue seems to happen due to a missing registries.conf in the assisted-service container, which is required for pulling the image. [*] create-cluster-and-infraenv[2784]: level=fatal msg="Failed to register cluster with assisted-service: command 'oc adm release info -o template --template '{{.metadata.version}}' --insecure=true quay.io/openshift-release-dev/ocp-release@sha256:3c050cb52fdd3e65c518d4999d238ec026ef724503f275377fee6bf0d33093ab --registry-config=/tmp/registry-config1560177852' exited with non-zero exit code 1: \nerror: unable to read image quay.io/openshift-release-dev/ocp-release@sha256:3c050cb52fdd3e65c518d4999d238ec026ef724503f275377fee6bf0d33093ab: Get "http://quay.io/v2/\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)\n"
Version-Release number of selected component (if applicable):
4.14
How reproducible:
100%
Steps to Reproduce:
1. Add registries.conf with a mirror config set to a local registry (e.g. use imageContentSources in the install-config) 2. Ensure that a custom release image mirror that refers to the registry is set in the OPENSHIFT_INSTALL_RELEASE_IMAGE_MIRROR env var. 3. Boot the machine in a disconnected env.
Actual results:
The create-cluster-and-infraenv service fails to pull the release image.
Expected results:
create-cluster-and-infraenv service should finish successfully.
Additional info:
Pushed a PR to the installer for propagating registries.conf: https://github.com/openshift/installer/pull/7332 We have a workaround in the appliance by overriding the service: https://github.com/openshift/appliance/pull/94/
Please review the following PR: https://github.com/openshift/operator-framework-olm/pull/470
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Updating the availability requirement is disabled on the Edit PDB page; also, when the user tries to edit it, the current value is cleared, so the user has no idea what the current setting is.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-04-03-211601
How reproducible:
Always
Steps to Reproduce:
1. Goes to deployment page -> Actions -> Add PodDisruptionBudget 2. on 'Create PodDisruptionBudge' page, set following fields and hit 'Create' Name: example-pdb Availability requirement: maxUnavailable: 2 3. Make sure pdb/example-pdb is successfully created $ oc get pdb NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE example-pdb N/A 2 2 99s 4. Goes to deployment page again, Actions -> Edit PodDisruptionBudget
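The PDB created in step 2 corresponds to roughly this object (a sketch; the selector is whatever the form derives from the deployment and is assumed here):
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-pdb
spec:
  maxUnavailable: 2
  selector:
    matchLabels:
      app: example   # assumed; the form fills this in from the deployment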
Actual results:
The 'Availability requirement' value is disabled from editing by default; when the user clicks 'maxUnavailable', the value is set to empty (the user has no idea what the original value was).
Expected results:
When editing a PDB, the form should be loaded with the current value, and the user should have permission to update the values by default.
Additional info:
Description of problem:
[AWS EBS CSI Driver Operator] should not update the default storageclass annotation back after customers remove the default storageclass annotation
Version-Release number of selected component (if applicable):
Server Version: 4.14.0-0.nightly-2023-06-08-102710
How reproducible:
Always
Steps to Reproduce:
1. Install an AWS OpenShift cluster 2. Create 6 extra storage classes (any SC is OK) 3. Overwrite all the SCs with storageclass.kubernetes.io/is-default-class=false and check that all the SCs are set as non-default 4. Overwrite all the SCs with storageclass.kubernetes.io/is-default-class=true 5. Loop steps 4-5 several times
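The annotation flip in steps 3 and 4 can be done with a patch like this (a sketch; gp3-csi is the usual AWS EBS CSI default storage class name and is an assumption here):
$ oc patch storageclass gp3-csi -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "false"}}}'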
Actual results:
Overwriting all the SCs with storageclass.kubernetes.io/is-default-class=false is sometimes reverted by the driver operator
Expected results:
Overwriting all the SCs with storageclass.kubernetes.io/is-default-class=false should always succeed
Additional info:
Description of problem:
This is a clone of the doc issue OCPBUGS-9162.
The Import JAR files option doesn't work if the Cluster Samples Operator is not installed. This is a common issue in disconnected clusters, where the Cluster Samples Operator is disabled by default. Users should not see the JAR import option if it's not working correctly.
Version-Release number of selected component (if applicable):
4.9+
How reproducible:
Always, when the samples operator is not installed
Steps to Reproduce:
Actual results:
Import doesn't work
Expected results:
The Import JAR file option should not be shown (or should be disabled) if no "Java" builder image (ImageStream in the openshift namespace) is available
Additional info:
This is a clone of issue OCPBUGS-18485. The following is the description of the original issue:
—
Description of problem:
In the developer console, go to "Observe -> openshift-monitoring -> Alerts" and silence the Watchdog alert. At first the alert state is Silenced in the Alerts tab, but it quickly changes to Firing (the alert is actually silenced); see the attached screenshot.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-09-02-132842
How reproducible:
always
Steps to Reproduce:
1. silence alert in the dev console, and check alert state in Alerts tab 2. 3.
Actual results:
alert state is changed from Silenced to Firing quickly
Expected results:
state should be Silenced
This is a clone of issue OCPBUGS-18788. The following is the description of the original issue:
—
Description of problem:
metal3-baremetal-operator-7ccb58f44b-xlnnd pod failed to start on the SNO baremetal dualstack cluster: Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal Scheduled 34m default-scheduler Successfully assigned openshift-machine-api/metal3-baremetal-operator-7ccb58f44b-xlnnd to sno.ecoresno.lab.eng.tlv2.redha t.com Warning FailedScheduling 34m default-scheduler 0/1 nodes are available: 1 node(s) didn't have free ports for the requested pod ports. preemption: 0/1 nodes are availabl e: 1 node(s) didn't have free ports for the requested pod ports.. Warning FailedCreatePodSandBox 34m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to add hostport mapping for sandbox k8s_metal3-baremetal-operator-7ccb58f44b-xlnnd_openshift-machine-api_5f6d8c69-a508-47f3-a6b1-7701b9d3617e_0(c4a8b353e3ec105d2bff2eb1670b82a0f226ac1088b739a256deb9dfae6ebe54): cannot open hostport 60000 for pod k8s _metal3-baremetal-operator-7ccb58f44b-xlnnd_openshift-machine-api_5f6d8c69-a508-47f3-a6b1-7701b9d3617e_0_: listen tcp4 :60000: bind: address already in use Warning FailedCreatePodSandBox 34m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to add hostport mapping for sandbox k8s_metal3-bare metal-operator-7ccb58f44b-xlnnd_openshift-machine-api_5f6d8c69-a508-47f3-a6b1-7701b9d3617e_0(9e6960899533109b02fbb569c53d7deffd1ac8185cef3d8677254f9ccf9387ff): cannot open hostport 60000 for pod k8s _metal3-baremetal-operator-7ccb58f44b-xlnnd_openshift-machine-api_5f6d8c69-a508-47f3-a6b1-7701b9d3617e_0_: listen tcp4 :60000: bind: address already in use
Version-Release number of selected component (if applicable):
4.14.0-rc.0
How reproducible:
so far once
Steps to Reproduce:
1. Deploy disconnected baremetal SNO node with dualstack networking with agent-based installer 2. 3.
Actual results:
metal3-baremetal-operator pod fails to start
Expected results:
metal3-baremetal-operator pod is running
Additional info:
Checking the ports on the node showed it was the `kube-apiserver` process bound to the port:
tcp ESTAB 0 0 [::1]:60000 [::1]:2379 users:(("kube-apiserver",pid=43687,fd=455))
After rebooting the node, all pods started as expected.
Description of problem:
Critical Alert Rules do not have runbook url
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
This bug is being raised by the OpenShift Monitoring team as part of an effort to detect invalid alert rules in OCP. 1. Check the details of the KubeSchedulerDown alert rule 2. 3.
Actual results:
The Alert Rule KubeSchedulerDown has Critical Severity, but does not have runbook_url annotation.
Expected results:
All critical alert rules must have a runbook_url annotation
Additional info:
Critical alerts must have a runbook; please refer to the style guide at https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#style-guide. The runbooks are located at github.com/openshift/runbooks. To resolve the bug:
- Add runbooks for the relevant alerts at github.com/openshift/runbooks
- Add the link to the runbook in the alert annotation 'runbook_url' (see the sketch below)
- Remove the exception in the origin test, added in PR https://github.com/openshift/origin/pull/27933
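In the rule definition this ends up as an extra annotation along these lines (a minimal sketch; the runbook path is a placeholder for the file actually added to github.com/openshift/runbooks):
- alert: KubeSchedulerDown
  labels:
    severity: critical
  annotations:
    # placeholder path; point this at the runbook committed to github.com/openshift/runbooks
    runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/KubeSchedulerDown.md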
Description of problem:
The reconciler removes the overlappingrangeipreservations.whereabouts.cni.cncf.io resources whether the pod is alive or not.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create pods and check the overlappingrangeipreservations.whereabouts.cni.cncf.io resources:
$ oc get overlappingrangeipreservations.whereabouts.cni.cncf.io -A NAMESPACE NAME AGE openshift-multus 2001-1b70-820d-4b04--13 4m53s openshift-multus 2001-1b70-820d-4b05--13 4m49s
2. Verify that the ip-reconciler cronjob removes the overlappingrangeipreservations.whereabouts.cni.cncf.io resources when it runs:
$ oc get cronjob -n openshift-multus NAME SCHEDULE SUSPEND ACTIVE LAST SCHEDULE AGE ip-reconciler */15 * * * * False 0 14m 4d13h $ oc get overlappingrangeipreservations.whereabouts.cni.cncf.io -A No resources found $ oc get cronjob -n openshift-multus NAME SCHEDULE SUSPEND ACTIVE LAST SCHEDULE AGE ip-reconciler */15 * * * * False 0 5s 4d13h
Actual results:
The overlappingrangeipreservations.whereabouts.cni.cncf.io resources are removed for each created pod by the ip-reconciler cronjob. The "overlapping ranges" are not used.
Expected results:
The overlappingrangeipreservations.whereabouts.cni.cncf.io resources should not be removed, regardless of whether a pod has used an IP in the overlapping ranges.
Additional info:
Description of problem:
A user defined taints in a machineset and then scaled up the machineset. The instance can join the cluster and the Node becomes Ready, but pods couldn't be deployed; checking the node YAML file shows the uninitialized taint was not removed.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-07-20-215234
How reproducible:
Always
Steps to Reproduce:
1.Setup a cluster on Azure 2.Create a machineset with taint taints: - effect: NoSchedule key: mapi value: mapi_test 3.Check node yaml file
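The taint from step 2 sits under the MachineSet's machine template, roughly like this (a trimmed sketch of the relevant part of the MachineSet):
spec:
  template:
    spec:
      taints:
      - effect: NoSchedule
        key: mapi
        value: mapi_test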
Actual results:
uninitialized taint still in node, but no providerID in node. $ oc get node NAME STATUS ROLES AGE VERSION zhsun724-mh4dt-master-0 Ready control-plane,master 9h v1.27.3+4aaeaec zhsun724-mh4dt-master-1 Ready control-plane,master 9h v1.27.3+4aaeaec zhsun724-mh4dt-master-2 Ready control-plane,master 9h v1.27.3+4aaeaec zhsun724-mh4dt-worker-westus21-8rzqw Ready worker 21m v1.27.3+4aaeaec zhsun724-mh4dt-worker-westus21-additional-q58zp Ready worker 9h v1.27.3+4aaeaec zhsun724-mh4dt-worker-westus21-additional-vwwhh Ready worker 9h v1.27.3+4aaeaec zhsun724-mh4dt-worker-westus21-v7k7s Ready worker 9h v1.27.3+4aaeaec zhsun724-mh4dt-worker-westus22-ggxql Ready worker 9h v1.27.3+4aaeaec zhsun724-mh4dt-worker-westus23-zf8l5 Ready worker 9h v1.27.3+4aaeaec $ oc edit node zhsun724-mh4dt-worker-westus21-8rzqw spec: taints: - effect: NoSchedule key: node.cloudprovider.kubernetes.io/uninitialized value: "true" - effect: NoSchedule key: mapi value: mapi_test
Expected results:
The uninitialized taint is removed and the providerID is set on the node.
Additional info:
must-gather: https://drive.google.com/file/d/12ypYmHN98j9lyWCS9Dgaqq5MLpftqEkS/view?usp=sharing
It seems the e2e-metal-ipi-ovn-dualstack job is permafailing the last couple of days.
sippy link
one common symptom seems to be that some nodes are not being fully provisioned.
here is an example from this job
you can see the clusteroperators are not happy; specifically, machine-api is stuck in init
Description of problem:
OCP 4.14 installation fails. Waiting for the UPI installation to complete using wait-for ends with a CO error:
```
$ openshift-install wait-for install-complete --log-level=debug
level=error msg=failed to initialize the cluster: Cluster operator control-plane-machine-set is not available
```
```
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          122m    Unable to apply 4.14.0-0.nightly-2023-07-18-085740: the cluster operator control-plane-machine-set is not available
```
```
$ oc get co | grep control-plane-machine-set
control-plane-machine-set   4.14.0-0.nightly-2023-07-18-085740   False   False   True   6h47m   Missing 3 available replica(s)
```
Version-Release number of selected component (if applicable):
Openshift on Openstack OCP 4.14.0-0.nightly-2023-07-18-085740 RHOS-16.2-RHEL-8-20230413.n.1 UPI installation
How reproducible:
Always
Steps to Reproduce:
Run the UPI openshift installation
Actual results:
UPI installation fail
Expected results:
UPI installation pass
Additional info:
$ oc logs -n openshift-machine-api control-plane-machine-set-operator-5cbb7f68cc-h5f4p | tail E0719 14:20:52.645504 1 controller.go:649] "msg"="Observed unmanaged control plane nodes" "error"="found unmanaged control plane nodes, the following node(s) do not have associated machines: ostest-c2drn-master-0, ostest-c2drn-master-1, ostest-c2drn-master-2" "controller"="controlplanemachineset" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="1984ddf9-506f-4d10-88e5-0787b305484e" "unmanagedNodes"="ostest-c2drn-master-0,ostest-c2drn-master-1,ostest-c2drn-master-2" I0719 14:20:52.645530 1 controller.go:268] "msg"="Cluster state is degraded. The control plane machine set will not take any action until issues have been resolved." "controller"="controlplanemachineset" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="1984ddf9-506f-4d10-88e5-0787b305484e" I0719 14:20:52.667462 1 controller.go:212] "msg"="Finished reconciling control plane machine set" "controller"="controlplanemachineset" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="1984ddf9-506f-4d10-88e5-0787b305484e" I0719 14:20:52.668013 1 controller.go:156] "msg"="Reconciling control plane machine set" "controller"="controlplanemachineset" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="3f095b75-21af-4475-b0fd-25052e8c3bce" I0719 14:20:52.668718 1 controller.go:121] "msg"="Reconciling control plane machine set" "controller"="controlplanemachinesetgenerator" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="e80d898c-9a8d-4774-8f22-fb464be45758" I0719 14:20:52.668780 1 controller.go:142] "msg"="Finished reconciling control plane machine set" "controller"="controlplanemachinesetgenerator" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="e80d898c-9a8d-4774-8f22-fb464be45758" I0719 14:20:52.669005 1 status.go:119] "msg"="Observed Machine Configuration" "controller"="controlplanemachineset" "name"="cluster" "namespace"="openshift-machine-api" "observedGeneration"=1 "readyReplicas"=0 "reconcileID"="3f095b75-21af-4475-b0fd-25052e8c3bce" "replicas"=0 "unavailableReplicas"=3 "updatedReplicas"=0 E0719 14:20:52.669237 1 controller.go:649] "msg"="Observed unmanaged control plane nodes" "error"="found unmanaged control plane nodes, the following node(s) do not have associated machines: ostest-c2drn-master-0, ostest-c2drn-master-1, ostest-c2drn-master-2" "controller"="controlplanemachineset" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="3f095b75-21af-4475-b0fd-25052e8c3bce" "unmanagedNodes"="ostest-c2drn-master-0,ostest-c2drn-master-1,ostest-c2drn-master-2" I0719 14:20:52.669267 1 controller.go:268] "msg"="Cluster state is degraded. The control plane machine set will not take any action until issues have been resolved." "controller"="controlplanemachineset" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="3f095b75-21af-4475-b0fd-25052e8c3bce" I0719 14:20:52.669842 1 controller.go:212] "msg"="Finished reconciling control plane machine set" "controller"="controlplanemachineset" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="3f095b75-21af-4475-b0fd-25052e8c3bce"
[cloud-user@installer-host ~]$ oc get nodes
NAME STATUS ROLES AGE VERSION
ostest-c2drn-master-0 Ready control-plane,master 6h55m v1.27.3+4aaeaec
ostest-c2drn-master-1 Ready control-plane,master 6h55m v1.27.3+4aaeaec
ostest-c2drn-master-2 Ready control-plane,master 6h55m v1.27.3+4aaeaec
ostest-c2drn-worker-0 Ready worker 6h36m v1.27.3+4aaeaec
ostest-c2drn-worker-1 Ready worker 6h35m v1.27.3+4aaeaec
ostest-c2drn-worker-2 Ready worker 6h36m v1.27.3+4aaeaec
Description of problem:
On the command-line-tools page, the title is "Command line tools" instead of "Command Line Tools"
Version-Release number of selected component (if applicable):
How reproducible:
1/1
Steps to Reproduce:
1. Go to the command-line-tools page
2. Check the title
Actual results:
the title is "Command line tools"
Expected results:
the title should be "Command Line Tools"
Additional info:
When implementing support for IPv6-primary dual-stack clusters, we have extended the available IP families to
const (
    IPFamiliesIPv4 IPFamiliesType = "IPv4"
    IPFamiliesIPv6 IPFamiliesType = "IPv6"
    IPFamiliesDualStack IPFamiliesType = "DualStack"
    IPFamiliesDualStackIPv6Primary IPFamiliesType = "DualStackIPv6Primary"
)
At the same time definitions of kubelet.service systemd unit still contain the code
{{- if eq .IPFamilies "DualStack"}}
--node-ip=${KUBELET_NODE_IPS} \
{{- else}}
--node-ip=${KUBELET_NODE_IP} \
{{- end}}
which only matches the "old" dual-stack family. Because of this, an IPv6-primary dual-stack renders node-ip param with only 1 IP address instead of 2 as required in dual-stack.
Description of problem:
The ACM dropdown has a filter and a "Clusters" title even though there are only ever two items in the dropdown: local cluster and all clusters. A customer has reported this as confusing, since it suggests that many clusters can be added to the dropdown.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
always
Steps to Reproduce:
1. Install the ACM dynamic plugin to the cluster
2. Open the cluster dropdown
Actual results:
Expected results:
Additional info:
Description of problem:
RHCOS is being published to new AWS regions (https://github.com/openshift/installer/pull/6861), but aws-sdk-go needs to be bumped to recognize those regions.
Version-Release number of selected component (if applicable):
master/4.14
How reproducible:
always
Steps to Reproduce:
1. openshift-install create install-config
2. Try to select ap-south-2 as a region
Actual results:
New regions are not found. New regions are: ap-south-2, ap-southeast-4, eu-central-2, eu-south-2, me-central-1.
Expected results:
Installer supports and displays the new regions in the Survey
Additional info:
See https://github.com/openshift/installer/blob/master/pkg/asset/installconfig/aws/regions.go#L13-L23
Description of problem:
oc patch project command is failing to annotate the project
Version-Release number of selected component (if applicable):
4.12
How reproducible:
100%
Steps to Reproduce:
1. Run the below patch command to update the annotation on an existing project:
~~~
oc patch project <PROJECT_NAME> --type merge --patch '{"metadata":{"annotations":{"openshift.io/display-name": "null","openshift.io/description": "This is a new project"}}}'
~~~
Actual results:
It produces the error output below:
~~~
The Project "<PROJECT_NAME>" is invalid:
* metadata.namespace: Invalid value: "<PROJECT_NAME>": field is immutable
* metadata.namespace: Forbidden: not allowed on this type
~~~
Expected results:
The `oc patch project` command should patch the project with specified annotation.
Additional info:
Tried to patch the project with OCP 4.11.26, and it worked as expected:
~~~
oc patch project <PROJECT_NAME> --type merge --patch '{"metadata":{"annotations":{"openshift.io/display-name": "null","openshift.io/description": "New project"}}}'
project.project.openshift.io/<PROJECT_NAME> patched
~~~
The issue is with OCP 4.12, where it is not working.
Description of problem:
When we rebased to 1.26, the rebase picked up https://github.com/kubernetes-sigs/cloud-provider-azure/pull/2653/, which made the Azure cloud node manager stop applying beta topology labels such as failure-domain.beta.kubernetes.io/zone. Since we haven't completed the removal cycle for these labels, we still need the node manager to apply them. In the future we must ensure that these labels remain available until users are no longer relying on them.
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Create a TP cluster on 4.13
2. Observe no beta label for zone or region
Actual results:
Beta labels are not present
Expected results:
Beta labels are present and should match GA labels
Additional info:
Created https://github.com/kubernetes-sigs/cloud-provider-azure/pull/3685 to try and make upstream allow this to be flagged
Description of problem:
When the configuration is installed with the config-image, the kubeadmin password is not accepted when logging into the console.
Version-Release number of selected component (if applicable):
How reproducible:
Every time
Steps to Reproduce:
1. Build and install unconfigured ignition
2. Build and install config-image
3. When able to ssh into host0, attempt to log into the console using the core user and generated kubeadmin-password.
Actual results:
The login fails.
Expected results:
The login should succeed.
Additional info:
Description of problem:
When creating an OCP cluster with Nutanix infrastructure and using DHCP instead of IPAM network config, the hostname of the VM is not set by DHCP. In this case we need to inject the desired hostname through cloud-init for both control-plane and worker nodes.
Version-Release number of selected component (if applicable):
How reproducible:
Reproducible when creating an OCP cluster with Nutanix infrastructure and using DHCP instead of IPAM network config.
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The aforementioned test in the e2e origin test suite sometimes fails because it can't connect to the API endpoint.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Sometimes
Steps to Reproduce:
1. See https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-e2e-azure-ovn-upgrade/1673703516675248128 2. 3.
Actual results:
The test failed.
Expected results:
The test should retry a couple of times with a delay when it didn't get an HTTP response from the endpoint (e.g. connection issue).
Additional info:
This is a clone of issue OCPBUGS-18137. The following is the description of the original issue:
—
Description of problem:
When a workload includes a node selector term on the label kubernetes.io/arch and the allowed values do not include amd64, the autoscaler does not trigger the scale-out of a valid non-amd64 machine set if its current replicas are 0 and (for 4.14+) no architecture capacity annotation is set (ref MIXEDARCH-129).
The issue is due to https://github.com/openshift/kubernetes-autoscaler/blob/f0ceeacfca57014d07f53211a034641d52d85cfd/cluster-autoscaler/cloudprovider/utils.go#L33
This bug should be considered at first on clusters having the same architecture for the control plane and the data plane.
In the case of multi-arch compute clusters, there is probably no alternative to letting the capacity annotation be properly set on the machine set, either manually or by the cloud provider actuator, as already discussed in the MIXEDARCH-129 work; otherwise, fall back to the control plane architecture.
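The failure mode can be illustrated with a deliberately simplified Go sketch (not the actual autoscaler code linked above): when scaling from zero with no capacity annotation, the template node is labelled with a hard-coded default architecture, so an arm64-only node selector never matches and no scale-up is triggered.
~~~
package main

import "fmt"

// buildTemplateNodeLabels is a simplified stand-in (not the real autoscaler
// code) for how a template node is labelled when a machine set scales from
// zero: with no running node and no capacity annotation, the architecture
// falls back to a hard-coded default.
func buildTemplateNodeLabels(annotatedArch string) map[string]string {
	arch := "amd64" // hard-coded fallback; the root cause described above
	if annotatedArch != "" {
		arch = annotatedArch
	}
	return map[string]string{"kubernetes.io/arch": arch}
}

// matchesArchSelector mimics the pod's required node affinity on kubernetes.io/arch.
func matchesArchSelector(labels map[string]string, allowed []string) bool {
	for _, v := range allowed {
		if labels["kubernetes.io/arch"] == v {
			return true
		}
	}
	return false
}

func main() {
	tmpl := buildTemplateNodeLabels("") // no architecture capacity annotation set
	fmt.Println(matchesArchSelector(tmpl, []string{"arm64"})) // false -> no scale-up is triggered
}
~~~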
Version-Release number of selected component (if applicable):
- ARM64 IPI on GCP 4.14
- ARM64 IPI on AWS and Azure <=4.13
- In general, non-amd64 single-arch clusters supporting autoscale from 0
How reproducible:
Always
Steps to Reproduce:
1. Create an arm64 IPI cluster on GCP
2. Set one of the machinesets to have 0 replicas: oc scale -n openshift-machine-api machineset/adistefa-a1-zn8pg-worker-f
3. Deploy the default autoscaler
4. Deploy the machine autoscaler for the given machineset
5. Deploy a workload with node affinity to arm64 only nodes, large resource requests and enough number of replicas.
Actual results:
From the pod events: pod didn't trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector
Expected results:
The cluster autoscaler scales the machineset with 0 replicas in order to provide resources for the pending pods.
Additional info:
---
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec: {}
---
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: worker-us-east-1a
  namespace: openshift-machine-api
spec:
  minReplicas: 0
  maxReplicas: 12
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: adistefa-a1-zn8pg-worker-f
---
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: openshift-machine-api
  name: 'my-deployment'
  annotations: {}
spec:
  selector:
    matchLabels:
      app: name
  replicas: 3
  template:
    metadata:
      labels:
        app: name
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/arch
                    operator: In
                    values:
                      - "arm64"
      containers:
        - name: container
          image: >-
            image-registry.openshift-image-registry.svc:5000/openshift/httpd:latest
          ports:
            - containerPort: 8080
              protocol: TCP
          env: []
          resources:
            requests:
              cpu: "2"
      imagePullSecrets: []
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
  paused: false
Description of problem:
Dev sandbox - CronJobs table/details UI doesn't have Suspend indication
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Always
Steps to Reproduce:
1. Create a sample CronJob with either @daily or @hourly as the schedule
2. Navigate to the Administrator/Workloads/CronJobs area
3. Observe that the table with CronJobs contains your created entry, but there is no column with a Suspend True/False indication
4. Navigate into that same cron job's details - still no presence of the Suspend state
5. Then invoke the 'oc get cj' command; example output could be:
NAME SCHEDULE SUSPEND ACTIVE LAST SCHEDULE AGE
example @hourly True 0 24m 34m
where you can see a separate SUSPEND column
Actual results:
Expected results:
Additional info:
Make SNO dev-preview on 4.13 for P and Z
As a HyperShift developer, I would like a config file created to control the creation frequency of RHTAP PRs so that the HyperShift repo & CI is not inundated with RHTAP PRs.
Description of problem:
At moment we are using an alpha version of controller-runtime on the machine-api-operator. Now that controller-runtime v0.15.0 is out, we want to bump to it.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The oc adm node-logs feature has been upstreamed and is part of k8s 1.27. This resulted in the addition of the kubelet configuration field enableSystemLogQuery to enable the feature. This feature has been enabled in the base kubelet configs in MCO. However, in situations where TechPreview is enabled, MCO generates a kubelet configuration that overwrites the default, and during that unmarshal/marshal cycle it drops the field it is not aware of. This is because MCO currently vendors k8s.io/kubelet at v0.25.1; it can be fixed by vendoring v0.27.1.
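The mechanism is the usual lossy round-trip through an older typed struct. The tiny stand-alone Go example below (the struct and field set are illustrative stand-ins, not MCO's real types) shows how a field unknown to the vendored type silently disappears.
~~~
package main

import (
	"encoding/json"
	"fmt"
)

// oldKubeletConfig stands in for the vendored v0.25 kubelet config type, which
// has no field for enableSystemLogQuery.
type oldKubeletConfig struct {
	LogFormat string `json:"logFormat,omitempty"`
}

func main() {
	// Rendered default config that includes the newer field.
	in := []byte(`{"logFormat":"json","enableSystemLogQuery":true}`)

	var cfg oldKubeletConfig
	_ = json.Unmarshal(in, &cfg) // the unknown field is silently dropped here

	out, _ := json.Marshal(cfg)
	fmt.Println(string(out)) // {"logFormat":"json"} -- enableSystemLogQuery is gone
}
~~~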
How reproducible:
Always
Steps to Reproduce:
1. Bring up a 4.14 cluster with TechPreview enabled
2. Run oc adm node-logs
Actual results:
Command returns "<a href="ec274df5b608cc7a149ece1ce673306c/">ec274df5b608cc7a149ece1ce673306c/</a>" which is the contents of /var/log/journal
Expected results:
Should return journal logs from the node
I took a quick cut of updating the OpenShift and k8s APIs to 1.27. Running into the following during make verify:
cmd/machine-config-controller/start.go:18:2: could not import github.com/openshift/machine-config-operator/pkg/controller/template (-: # github.com/openshift/machine-config-operator/pkg/controller/template
pkg/controller/template/render.go:396:91: cannot use cfg.FeatureGate (variable of type *"github.com/openshift/api/config/v1".FeatureGate) as featuregates.FeatureGateAccess value in argument to cloudprovider.IsCloudProviderExternal: *"github.com/openshift/api/config/v1".FeatureGate does not implement featuregates.FeatureGateAccess (missing method AreInitialFeatureGatesObserved)
pkg/controller/template/render.go:441:90: cannot use cfg.FeatureGate (variable of type *"github.com/openshift/api/config/v1".FeatureGate) as featuregates.FeatureGateAccess value in argument to cloudprovider.IsCloudProviderExternal: *"github.com/openshift/api/config/v1".FeatureGate does not implement featuregates.FeatureGateAccess (missing method AreInitialFeatureGatesObserved)) (typecheck)
"github.com/openshift/machine-config-operator/pkg/controller/template"
^
Here are some examples of how other operators have handled this.
This is a critical bug as oc adm node-logs runs as part of must-gather and debugging node issues with TechPreview jobs in CI is impossible without this working.
Description of problem:
When searching InstallPlans with a specific project selected, all InstallPlans are still listed; the selected project is not applied as a filter.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-06-05-112833
How reproducible:
Always
Steps to Reproduce:
1. Install some operators to a specific namespace and to all namespaces
$ oc get ip -A
NAMESPACE NAME CSV APPROVAL APPROVED
default install-tftg4 etcdoperator.v0.9.4 Automatic true
openshift-operators install-5g2l4 3scale-community-operator.v0.10.1 Automatic true
$ oc get sub -A
NAMESPACE NAME PACKAGE SOURCE CHANNEL
default etcd etcd community-operators singlenamespace-alpha
openshift-operators 3scale-community-operator 3scale-community-operator community-operators threescale-2.13
2. Navigate to the Home -> Search page, select project 'default' in the project dropdown, choose the 'InstallPlan' resource
3. Check the filtered lists
Actual results:
3. InstallPlans in all namespaces are listed
Expected results:
3. only the InstallPlan in 'default' project should be listed
Additional info:
This is a clone of issue OCPBUGS-18720. The following is the description of the original issue:
—
Description of problem:
Catalog pods in hypershift control plane in ImagePullBackOff
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Create a cluster in 4.14 HO + OCP 4.14.0-0.ci-2023-09-07-120503
2. Check controlplane pods; catalog pods in the control plane namespace are in ImagePullBackOff
Actual results:
jiezhao-mac:hypershift jiezhao$ oc get pods -n clusters-jie-test | grep catalog
catalog-operator-64fd787d9c-98wx5 2/2 Running 0 2m43s
certified-operators-catalog-7766fc5b8-4s66z 0/1 ImagePullBackOff 0 2m43s
community-operators-catalog-847cdbff6-wsf74 0/1 ImagePullBackOff 0 2m43s
redhat-marketplace-catalog-fccc6bbb5-2d5x4 0/1 ImagePullBackOff 0 2m43s
redhat-operators-catalog-86b6f66d5d-mpdsc 0/1 ImagePullBackOff 0 2m43s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 65m default-scheduler Successfully assigned clusters-jie-test/certified-operators-catalog-7766fc5b8-4s66z to ip-10-0-64-135.us-east-2.compute.internal
Normal AddedInterface 65m multus Add eth0 [10.128.2.141/23] from openshift-sdn
Normal Pulling 63m (x4 over 65m) kubelet Pulling image "from:imagestream"
Warning Failed 63m (x4 over 65m) kubelet Failed to pull image "from:imagestream": rpc error: code = Unknown desc = reading manifest imagestream in docker.io/library/from: requested access to the resource is denied
Warning Failed 63m (x4 over 65m) kubelet Error: ErrImagePull
Warning Failed 63m (x6 over 65m) kubelet Error: ImagePullBackOff
Normal BackOff 9s (x280 over 65m) kubelet Back-off pulling image "from:imagestream"
jiezhao-mac:hypershift jiezhao$
Expected results:
catalog pods are running
Additional info:
slack: https://redhat-internal.slack.com/archives/C01C8502FMM/p1694170060144859
Description of problem:
Running the following tests using OpenShift on OpenStack with Kuryr:
"[sig-cli] oc idle [apigroup:apps.openshift.io][apigroup:route.openshift.io][apigroup:project.openshift.io][apigroup:image.openshift.io] by all [Suite:openshift/conformance/parallel]"
"[sig-cli] oc idle [apigroup:apps.openshift.io][apigroup:route.openshift.io][apigroup:project.openshift.io][apigroup:image.openshift.io] by checking previous scale [Suite:openshift/conformance/parallel]"
"[sig-cli] oc idle [apigroup:apps.openshift.io][apigroup:route.openshift.io][apigroup:project.openshift.io][apigroup:image.openshift.io] by label [Suite:openshift/conformance/parallel]"
"[sig-cli] oc idle [apigroup:apps.openshift.io][apigroup:route.openshift.io][apigroup:project.openshift.io][apigroup:image.openshift.io] by name [Suite:openshift/conformance/parallel]"
fails while waiting for endpoints:
STEP: wait until endpoint addresses are scaled to 2 01/21/23 01:16:42.024
Jan 21 01:16:42.025: INFO: Running 'oc --namespace=e2e-test-oc-idle-h2mvt --kubeconfig=/tmp/configfile3007731725 get endpoints idling-echo --template={{ len (index .subsets 0).addresses }} --output=go-template'
Jan 21 01:16:42.158: INFO: Error running /usr/local/bin/oc --namespace=e2e-test-oc-idle-h2mvt --kubeconfig=/tmp/configfile3007731725 get endpoints idling-echo --template={{ len (index .subsets 0).addresses }} --output=go-template:
StdOut> Error executing template: template: output:1:8: executing "output" at <index .subsets 0>: error calling index: index of untyped nil. Printing more information for debugging the template:
template was: {{ len (index .subsets 0).addresses }}
raw data was: {"apiVersion":"v1","kind":"Endpoints","metadata":{"annotations":{"endpoints.kubernetes.io/last-change-trigger-time":"2023-01-21T01:16:40Z"},"creationTimestamp":"2023-01-21T01:16:40Z","labels":{"app":"idling-echo"},"managedFields":[{"apiVersion":"v1","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:annotations":{".":{},"f:endpoints.kubernetes.io/last-change-trigger-time":{}},"f:labels":{".":{},"f:app":{}}}},"manager":"kube-controller-manager","operation":"Update","time":"2023-01-21T01:16:40Z"}],"name":"idling-echo","namespace":"e2e-test-oc-idle-h2mvt","resourceVersion":"409973","uid":"91cd122e-b418-4e29-98c6-2ff757c74a15"}}
object given to template engine was: map[apiVersion:v1 kind:Endpoints metadata:map[annotations:map[endpoints.kubernetes.io/last-change-trigger-time:2023-01-21T01:16:40Z] creationTimestamp:2023-01-21T01:16:40Z labels:map[app:idling-echo] managedFields:[map[apiVersion:v1 fieldsType:FieldsV1 fieldsV1:map[f:metadata:map[f:annotations:map[.:map[] f:endpoints.kubernetes.io/last-change-trigger-time:map[]] f:labels:map[.:map[] f:app:map[]]]] manager:kube-controller-manager operation:Update time:2023-01-21T01:16:40Z]] name:idling-echo namespace:e2e-test-oc-idle-h2mvt resourceVersion:409973 uid:91cd122e-b418-4e29-98c6-2ff757c74a15]]
When using 60 seconds in PollImmediate instead of 30, the tests pass.
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2023-01-19-110743
How reproducible:
Consistently
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
DoD:
Currently we return early if we fail to apply a resource during installation: https://github.com/openshift/hypershift/blob/main/cmd/install/install.go#L248
There's no reason not to keep going, aggregate the errors, and return them at the end.
It would help in scenarios where one broken CR prevents everything else from being installed, e.g.
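A minimal sketch of that aggregation approach using the upstream apimachinery errors utility; the applyAll helper and its signature are assumptions for illustration, not the actual install code.
~~~
package main

import (
	"fmt"

	utilerrors "k8s.io/apimachinery/pkg/util/errors"
)

// applyAll applies every object and collects failures instead of returning on
// the first error, so one broken CR does not block the remaining resources.
func applyAll(apply func(o interface{}) error, objects []interface{}) error {
	var errs []error
	for _, o := range objects {
		if err := apply(o); err != nil {
			errs = append(errs, fmt.Errorf("failed to apply %v: %w", o, err))
		}
	}
	// A single aggregated error (nil if everything applied cleanly).
	return utilerrors.NewAggregate(errs)
}

func main() {
	objects := []interface{}{"crd-a", "crd-b", "deployment-c"}
	err := applyAll(func(o interface{}) error {
		if o == "crd-a" { // pretend one resource is broken
			return fmt.Errorf("admission webhook rejected it")
		}
		return nil
	}, objects)
	fmt.Println(err) // reports crd-a, while crd-b and deployment-c were still applied
}
~~~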
Description of problem:
We need to update the operator to be in sync with the K8s API version used by OCP 4.13. We also need to sync our samples libraries with the latest available libraries. Any deprecated libraries should be removed as well.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Kubernetes and other associated dependencies need to be updated to protect against potential vulnerabilities.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of the problem:
Events search should not be case sensitive
How reproducible:
100%
Steps to reproduce:
1. On UI View Cluster Events
2. Enter text on "Filter by text" field. (eg. "success" or "Success" )
Actual results:
Events filter is case sensitive.
See screenshots enclosed
Expected results:
Events filter should not be case sensitive
Description of problem:
The CRL list is capped at 1MB due to the configmap max size. If multiple public CRLs are needed for the ingress controller, the CRL PEM file will exceed 1MB.
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Create a CRL configmap with the following distribution points:
Issuer: C=US, O=DigiCert Inc, CN=DigiCert Global G2 TLS RSA SHA256 2020 CA1
Subject: SOME SIGNED CERT
X509v3 CRL Distribution Points:
Full Name:
URI:http://crl3.digicert.com/DigiCertGlobalG2TLSRSASHA2562020CA1-2.cr
# curl -o DigiCertGlobalG2TLSRSASHA2562020CA1-2.crl http://crl3.digicert.com/DigiCertGlobalG2TLSRSASHA2562020CA1-2.crl
# openssl crl -in DigiCertGlobalG2TLSRSASHA2562020CA1-2.crl -inform DER -out DigiCertGlobalG2TLSRSASHA2562020CA1-2.pem
# du -bsh DigiCertGlobalG2TLSRSASHA2562020CA1-2.pem
604K DigiCertGlobalG2TLSRSASHA2562020CA1-2.pem
I still need to find more intermediate CRLs to grow this.
Actual results:
2023-01-25T13:45:01.443Z ERROR operator.init controller/controller.go:273 Reconciler error {"controller": "crl", "object": {"name":"custom","namespace":"openshift-ingress-operator"}, "namespace": "openshift-ingress-operator", "name": "custom", "reconcileID": "d49d9b96-d509-4562-b3d9-d4fc315226c0", "error": "failed to ensure client CA CRL configmap for ingresscontroller openshift-ingress-operator/custom: failed to update configmap: ConfigMap \"router-client-ca-crl-custom\" is invalid: []: Too long: must have at most 1048576 bytes"}
Expected results:
First, be able to create a configmap where only the data counts toward the 1MB max (see additional info below for more details); second, provide some way to compress or otherwise allow a CRL list larger than 1MB.
Additional info:
Using only this CRL, at about 600K, still causes the issue, which could be due to the `last-applied-configuration` annotation on the configmap. This is added because we do an apply operation (update) on the configmap, and I am not sure if it counts towards the 1MB max. https://github.com/openshift/cluster-ingress-operator/blob/release-4.10/pkg/operator/controller/crl/crl_configmap.go#L295 Not sure if we could just replace the configmap instead.
Description of problem:
node-driver-registrar and hostpath containers in pod shared-resource-csi-driver-node-xxxxx under openshift-cluster-csi-drivers namespace are not pinned to reserved management cores.
Version-Release number of selected component (if applicable):
4.13
How reproducible:
Always
Steps to Reproduce:
1. Deploy SNO via ZTP with workload partitioning enabled
2. Check mgmt pods affinity
Actual results:
pods do not have workload partitioning annotation, and are not pinned to mgmt cores
Expected results:
All management pods should be pinned to reserved cores. Pods should be annotated with: target.workload.openshift.io/management: '{"effect":"PreferredDuringScheduling"}'
Additional info:
pod metadata metadata: annotations: k8s.ovn.org/pod-networks: '{"default":{"ip_addresses":["fd01:0:0:1::5f/64"],"mac_address":"0a:58:97:51:ad:31","gateway_ips":["fd01:0:0:1::1"],"ip_address":"fd01:0:0:1::5f/64","gateway_ip":"fd01:0:0:1::1"}}' k8s.v1.cni.cncf.io/network-status: |- [{ "name": "ovn-kubernetes", "interface": "eth0", "ips": [ "fd01:0:0:1::5f" ], "mac": "0a:58:97:51:ad:31", "default": true, "dns": {} }] k8s.v1.cni.cncf.io/networks-status: |- [{ "name": "ovn-kubernetes", "interface": "eth0", "ips": [ "fd01:0:0:1::5f" ], "mac": "0a:58:97:51:ad:31", "default": true, "dns": {} }] openshift.io/scc: privileged /var/lib/jenkins/workspace/ocp-far-edge-vran-deployment/cnf-gotests/test/ran/workloadpartitioning/tests/workload_partitioning.go:113 SNO management workload partitioning [It] should have management pods pinned to reserved cpus /var/lib/jenkins/workspace/ocp-far-edge-vran-deployment/cnf-gotests/test/ran/workloadpartitioning/tests/workload_partitioning.go:113 [FAILED] Expected <[]ranwphelper.ContainerInfo | len:3, cap:4>: [ { Name: "hostpath", Cpus: "2-55,58-111", Namespace: "openshift-cluster-csi-drivers", PodName: "shared-resource-csi-driver-node-vzvtc", Shares: 10, Pid: 41650, }, { Name: "cluster-proxy-service-proxy", Cpus: "2-55,58-111", Namespace: "open-cluster-management-agent-addon", PodName: "cluster-proxy-service-proxy-66599b78bf-k2dvr", Shares: 2, Pid: 35093, }, { Name: "node-driver-registrar", Cpus: "2-55,58-111", Namespace: "openshift-cluster-csi-drivers", PodName: "shared-resource-csi-driver-node-vzvtc", Shares: 10, Pid: 34782, }, ] to be empty In [It] at: /var/lib/jenkins/workspace/ocp-far-edge-vran-deployment/cnf-gotests/test/ran/workloadpartitioning/ranwphelper/ranwphelper.go:172 @ 02/22/23 01:05:00.268 cluster-proxy-service-proxy is reported in https://issues.redhat.com/browse/OCPBUGS-7652
The X-CSRF token is currently added automatically for any request using the `coFetch` functions. In some cases, plugins would like to use their own functions/libs, like axios. Console should enable retrieving the X-CSRF token.
Acceptance Criteria:
Description of problem:
The current version of openshift/cluster-dns-operator vendors Kubernetes 1.26 packages. OpenShift 4.14 is based on Kubernetes 1.27.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Check https://github.com/openshift/cluster-dns-operator/blob/release-4.14/go.mod
Actual results:
Kubernetes packages (k8s.io/api, k8s.io/apimachinery, and k8s.io/client-go) are at version v0.26
Expected results:
Kubernetes packages are at version v0.27.0 or later.
Additional info:
Using old Kubernetes API and client packages brings risk of API compatibility issues. controller-runtime will need to be bumped to v0.15.0 as well
Description of problem:
accidentally merged before fully reviewed
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The CPO does not currently respect the CVO runlevels as standalone OCP does.
The CPO reconciles everything all at once during upgrades which is resulting in FeatureSet aware components trying to start because the FeatureSet status is set for that version, leading to pod restarts.
It should roll things out in the following order for both initial install and upgrade, waiting between stages until rollout is complete:
In many cases, the /dev/disk/by-path symlink is the only way to stably identify a disk without having prior knowledge of the hardware from some external source (e.g. a spreadsheet of disk serial numbers). It should be possible to specify this path in the root device hints.
This is fixed by the first commit in the upstream Metal³ PR https://github.com/metal3-io/baremetal-operator/pull/1264
Description of problem:
The usage of "compute.platform.gcp.serviceAccount" needs to be clarified, and also the installation failure.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-07-16-230237
How reproducible:
Always
Steps to Reproduce:
1. "openshift-install explain installconfig.compute.platform.gcp.serviceAccount" 2. "create cluster" with an existing install-config having the field configured
Actual results:
1. It tells "The provided service account will be attached to control-plane nodes...", although the field is under compute.platform.gcp. 2. The installation failed on creating install config, with error "service accounts only valid for master nodes, provided for worker nodes".
Expected results:
1. shall "explain" command tell the field "serviceAccount" under "installconfig.compute.platform.gcp"? 2. please clarify how "compute.platform.gcp.serviceAccount" should be used
Additional info:
FYI the corresponding PR: https://github.com/openshift/installer/pull/7308
$ openshift-install version
openshift-install 4.14.0-0.nightly-2023-07-16-230237
built from commit c2d7db9d4eedf7b79fcf975f3cbd8042542982ca
release image registry.ci.openshift.org/ocp/release@sha256:e31716b6f12a81066c78362c2f36b9f18ad51c9768bdc894d596cf5b0f689681
release architecture amd64
$ openshift-install explain installconfig.compute.platform.gcp.serviceAccount
KIND: InstallConfig
VERSION: v1
RESOURCE: <string>
ServiceAccount is the email of a gcp service account to be used for shared vpn installations. The provided service account will be attached to control-plane nodes in order to provide the permissions required by the cloud provider in the host project.
$ openshift-install explain installconfig.controlPlane.platform.gcp.serviceAccount
KIND: InstallConfig
VERSION: v1
RESOURCE: <string>
ServiceAccount is the email of a gcp service account to be used for shared vpn installations. The provided service account will be attached to control-plane nodes in order to provide the permissions required by the cloud provider in the host project.
$ yq-3.3.0 r test2/install-config.yaml platform
gcp:
  projectID: openshift-qe
  region: us-central1
  computeSubnet: installer-shared-vpc-subnet-2
  controlPlaneSubnet: installer-shared-vpc-subnet-1
  network: installer-shared-vpc
  networkProjectID: openshift-qe-shared-vpc
$ yq-3.3.0 r test2/install-config.yaml credentialsMode
Passthrough
$ yq-3.3.0 r test2/install-config.yaml baseDomain
qe1.gcp.devcluster.openshift.com
$ yq-3.3.0 r test2/install-config.yaml metadata
creationTimestamp: null
name: jiwei-0718b
$ yq-3.3.0 r test2/install-config.yaml compute
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    gcp:
      ServiceAccount: ipi-xpn-minpt-permissions@openshift-qe.iam.gserviceaccount.com
      tags:
      - preserved-ipi-xpn-compute
  replicas: 2
$ yq-3.3.0 r test2/install-config.yaml controlPlane
architecture: amd64
hyperthreading: Enabled
name: master
platform:
  gcp:
    ServiceAccount: ipi-xpn-minpt-permissions@openshift-qe.iam.gserviceaccount.com
    tags:
    - preserved-ipi-xpn-control-plane
replicas: 3
$ openshift-install create cluster --dir test2
ERROR failed to fetch Metadata: failed to load asset "Install Config": failed to create install config: invalid "install-config.yaml" file: compute[0].platform.gcp.serviceAccount: Invalid value: "ipi-xpn-minpt-permissions@openshift-qe.iam.gserviceaccount.com": service accounts only valid for master nodes, provided for worker nodes
$
due to
kubeconfig didn't become available: timed out waiting for the condition
Description of problem:
When listing installed operators, we attempt to list subscriptions in all namespaces in order to associate subscriptions with CSVs. This prevents users without cluster-scope list privileges from seeing subscriptions on this page, which makes the uninstall action unavailable.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Install a namespaced operator
2. Log in as a user with project admin permissions where the operator was installed
3. Visit the installed operators page
4. Click the kebab menu for the operator from step 1
Actual results:
The only action available is to delete the CSV
Expected results:
The "Uninstall Operator" and "Edit Subscriptions" actions should show since the user has permission to view, edit, delete Subscription resources in this namespace.
Additional info:
Description of problem:
Remove changing the image name for a MachineSet if ClusterOSImage is set. Terraform has already created an image bucket based on OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE for us, so worker nodes should not use OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE directly and should instead use the image bucket.
Version-Release number of selected component (if applicable):
current master branch
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
When creating a pod controller (e.g. deployment) with pod spec that will be mutated by SCCs, the users might still get a warning about the pod not meeting given namespace pod security level.
Version-Release number of selected component (if applicable):
4.11
How reproducible:
100%
Steps to Reproduce:
1. create a namespace with restricted PSa warning level (the default)
2. create a deployment with a pod with an empty security context
Actual results:
You get a warning about the deployment's pod not meeting the NS's pod security admission requirements.
Expected results:
No warning if the pod for the deployment would be properly mutated by SCCs in order to fulfill the NS's pod security requirements.
Additional info:
originally implemented as a part of https://issues.redhat.com/browse/AUTH-337
The agent integration tests are failing with different errors when run multiple times locally:
Local Run 1:
level=fatal msg=failed to fetch Agent Installer PXE Files: failed to fetch dependency of "Agent Installer PXE Files": failed to generate asset "Agent Installer Artifacts": lstat /home/rwsu/.cache/agent/files_cache/libnmstate.so.2: no such file or directory [exit status 1] FAIL: testdata/agent/pxe/configurations/sno.txt:3: unexpected command failure
Local Run 2:
level=fatal msg=failed to fetch Agent Installer PXE Files: failed to fetch dependency of "Agent Installer PXE Files": failed to generate asset "Agent Installer Artifacts": file /usr/bin/agent-tui was not found [exit status 1] FAIL: testdata/agent/pxe/configurations/sno.txt:3: unexpected command failure
In CI (https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/7299/pull-ci-openshift-installer-master-agent-integration-tests/1677347591739674624), it has failed in this PR multiple times with this error:
level=fatal msg=failed to fetch Agent Installer PXE Files: failed to fetch dependency of "Agent Installer PXE Files": failed to generate asset "Agent Installer Artifacts": lstat /.cache/agent/files_cache/agent-tui: no such file or directory
[exit status 1]
FAIL: testdata/agent/pxe/configurations/sno.txt:3: unexpected command failure
I believe the issue is that the integration tests are running in parallel, and the extractFileFromImage function in pkg/asset/agent/image/oc.go is problematic because the cache is being cleared and then files are extracted to the same path. When the tests run in parallel, another test can clear the cached files, and when the current test tries to read a file from the cache directory, it has disappeared.
Adding
-parallel 1
to ./hack/go-integration-test.sh eliminates the errors, which is why I think it is a concurrency issue.
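One way to address that kind of race is to serialize access to the shared cache. The sketch below is a hypothetical illustration (the type and function names are not the installer's actual code) of holding a lock for the full extract-and-read window so another test cannot clear the file in between; per-test temporary cache directories would be another way to avoid the shared state entirely.
~~~
package cache

import (
	"path/filepath"
	"sync"
)

// fileCache is a hypothetical illustration (not the installer's actual type in
// pkg/asset/agent/image/oc.go) of serializing access to a shared on-disk cache
// so one test cannot clear the directory while another is still reading an
// extracted file from it.
type fileCache struct {
	mu  sync.Mutex
	dir string
}

// WithExtractedFile extracts (or reuses) a cached file and runs use while the
// lock is held, so concurrent callers cannot delete the file underneath each
// other.
func (c *fileCache) WithExtractedFile(name string, extract func(dst string) error, use func(path string) error) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	dst := filepath.Join(c.dir, name)
	if err := extract(dst); err != nil {
		return err
	}
	return use(dst)
}
~~~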
If the cluster enters the installing-pending-user-action state in assisted-service, it will not recover absent user action.
One way to reproduce this is to have the wrong boot order set in the host, so that it reboots into the agent ISO again instead of the installed CoreOS on disk. (I managed this in dev-scripts by setting a root device hint that pointed to a secondary disk, and only creating that disk once the VM was up. This does not add the new disk to the boot order list, and even if you set it manually it does not take effect until after a full shutdown of the VM - the soft reboot doesn't count.)
Currently we report:
cluster has stopped installing... working to recover installation
in a loop. This is not accurate (unlike in e.g. the install-failed state) - it cannot be recovered automatically.
Also we should only report this, or any other, status once when the status changes, and not continuously in a loop.
Description of problem:
Install failed with External platform type
Version-Release number of selected component (if applicable):
4.14.0-0.ci-2023-03-07-170635 (there is no available 4.14 nightly build, so the CI build is used)
How reproducible:
Always
Steps to Reproduce:
1.Set up a UPI vsphere cluster with platform set to External 2.Install failed liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version False True 141m Unable to apply 4.14.0-0.ci-2023-03-07-170635: the cluster operator cloud-controller-manager is not available liuhuali@Lius-MacBook-Pro huali-test % oc get co NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE authentication 4.14.0-0.ci-2023-03-07-170635 True False False 118m baremetal 4.14.0-0.ci-2023-03-07-170635 True False False 137m cloud-controller-manager 4.14.0-0.ci-2023-03-07-170635 cloud-credential 4.14.0-0.ci-2023-03-07-170635 True False False 140m cluster-autoscaler 4.14.0-0.ci-2023-03-07-170635 True False False 137m config-operator 4.14.0-0.ci-2023-03-07-170635 True False False 139m console 4.14.0-0.ci-2023-03-07-170635 True False False 124m control-plane-machine-set 4.14.0-0.ci-2023-03-07-170635 True False False 137m csi-snapshot-controller 4.14.0-0.ci-2023-03-07-170635 True False False 138m dns 4.14.0-0.ci-2023-03-07-170635 True False False 137m etcd 4.14.0-0.ci-2023-03-07-170635 True False False 137m image-registry 4.14.0-0.ci-2023-03-07-170635 True False False 127m ingress 4.14.0-0.ci-2023-03-07-170635 True False False 126m insights 4.14.0-0.ci-2023-03-07-170635 True False False 132m kube-apiserver 4.14.0-0.ci-2023-03-07-170635 True False False 134m kube-controller-manager 4.14.0-0.ci-2023-03-07-170635 True False False 136m kube-scheduler 4.14.0-0.ci-2023-03-07-170635 True False False 135m kube-storage-version-migrator 4.14.0-0.ci-2023-03-07-170635 True False False 138m machine-api 4.14.0-0.ci-2023-03-07-170635 True False False 137m machine-approver 4.14.0-0.ci-2023-03-07-170635 True False False 138m machine-config 4.14.0-0.ci-2023-03-07-170635 True False False 136m marketplace 4.14.0-0.ci-2023-03-07-170635 True False False 137m monitoring 4.14.0-0.ci-2023-03-07-170635 True False False 124m network 4.14.0-0.ci-2023-03-07-170635 True False False 139m node-tuning 4.14.0-0.ci-2023-03-07-170635 True False False 137m openshift-apiserver 4.14.0-0.ci-2023-03-07-170635 True False False 132m openshift-controller-manager 4.14.0-0.ci-2023-03-07-170635 True False False 138m openshift-samples 4.14.0-0.ci-2023-03-07-170635 True False False 131m operator-lifecycle-manager 4.14.0-0.ci-2023-03-07-170635 True False False 138m operator-lifecycle-manager-catalog 4.14.0-0.ci-2023-03-07-170635 True False False 138m operator-lifecycle-manager-packageserver 4.14.0-0.ci-2023-03-07-170635 True False False 132m service-ca 4.14.0-0.ci-2023-03-07-170635 True False False 138m storage 4.14.0-0.ci-2023-03-07-170635 True False False 138m liuhuali@Lius-MacBook-Pro huali-test % oc get infrastructure cluster -oyaml apiVersion: config.openshift.io/v1 kind: Infrastructure metadata: creationTimestamp: "2023-03-08T07:46:07Z" generation: 1 name: cluster resourceVersion: "527" uid: 096a54bc-8a35-4071-b750-cfac439c1916 spec: cloudConfig: name: "" platformSpec: external: platformName: vSphere type: External status: apiServerInternalURI: https://api-int.huliu-vs8x.qe.devcluster.openshift.com:6443 apiServerURL: https://api.huliu-vs8x.qe.devcluster.openshift.com:6443 controlPlaneTopology: HighlyAvailable etcdDiscoveryDomain: "" infrastructureName: huliu-vs8x-fk79b infrastructureTopology: HighlyAvailable platform: External platformStatus: external: {} type: External liuhuali@Lius-MacBook-Pro huali-test %
Actual results:
Install failed. the cluster operator cloud-controller-manager is not available
Expected results:
Install successfully
Additional info:
This if for testing https://issues.redhat.com/browse/OCPCLOUD-1772
Currently assisted installer doesn't verify that etcd is ok before reboot on the bootstrap node as wait_for_ceo in bootkube does nothing.
In 4.13 and backported to 4.12 etcd team had added status that we can check in assisted installer in order to decide if it is safe to reboot bootstrap or not. We should check it before running shutdown command.
We want to parametrize envoy configmap name: with that, we can configure a private envoy configuration that would bring the following advantages:
Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver/pull/36
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
If a JSON schema used by a chart contains an unknown value format (non-standard in JSON Schema but valid in the OpenAPI spec, for example), the Helm form view hangs on validation and stays in the "submitting" state.
As per the JSON Schema standard, the "format" keyword should only take an advisory role (like an annotation) and should not affect validation.
https://json-schema.org/understanding-json-schema/reference/string.html#format
Verified against 4.13, but probably applies to others.
100%
1. Go to Helm tab.
2. Click create in top right and select Repository
3. Paste following into YAML view and click Create:
apiVersion: helm.openshift.io/v1beta1
kind: ProjectHelmChartRepository
metadata:
  name: reproducer
spec:
  connectionConfig:
    url: 'https://raw.githubusercontent.com/tumido/helm-backstage/repo-multi-schema2'
4. Go to the Helm tab again (if redirected elsewhere)
5. Click create in top right and select Helm Release
6. In catalog filter select Chart repositories: Reproducer
7. Click on the single tile available (Backstage) and click Create
8. Switch to Form view
9. Leave default values and click Create
10. Stare at the always loading screen that never proceeds further.
Actual results:
The form stays in the submitting state and never finishes or displays any error in the UI.
Unknown format should not result in rejected validation. JSON Schema standard says that formats should not be used for validation.
This is not a schema violation by itself since Helm itself is happy about it and doesn't complain. The same chart can be successfully deployed via the YAML view.
See this component readiness page.
test=[sig-cluster-lifecycle] cluster upgrade should complete in 105.00 minutes
Appears to indicate we're now taking longer than 105 minutes about 7% of the time, previously never.
Slack thread: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1694547497553699
wking points out it may be a one time ovn IC thing. Find out what's up and route to appropriate team.
Description of problem:
Multiple instances of tabs under the ODF dashboard are seen, and sometimes a 404 error is also shown when each such tab is selected and the page is re-loaded: https://bugzilla.redhat.com/show_bug.cgi?id=2124829
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
We faced an issue where the quota was reached for VPCE. This is visible in the status of AWSEndpointService
- lastTransitionTime: "2023-03-01T10:23:08Z" message: 'failed to create vpc endpoint: VpcEndpointLimitExceeded' reason: AWSError status: "False" type: EndpointAvailable
but it should be propagated to the HC, since it blocks worker creation (ignition was not working), and for better visibility.
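A minimal sketch of that propagation using the apimachinery condition utilities; the helper and the condition type name are assumptions for illustration, not HyperShift's actual controller code.
~~~
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// propagateEndpointCondition copies the AWSEndpointService condition onto the
// HostedCluster's condition list so quota errors such as
// VpcEndpointLimitExceeded are visible where users look first.
func propagateEndpointCondition(hcConditions *[]metav1.Condition, src metav1.Condition) {
	meta.SetStatusCondition(hcConditions, metav1.Condition{
		Type:    "AWSEndpointAvailable", // assumed condition type for illustration
		Status:  src.Status,
		Reason:  src.Reason,
		Message: src.Message,
	})
}

func main() {
	var conds []metav1.Condition
	propagateEndpointCondition(&conds, metav1.Condition{
		Type:    "EndpointAvailable",
		Status:  metav1.ConditionFalse,
		Reason:  "AWSError",
		Message: "failed to create vpc endpoint: VpcEndpointLimitExceeded",
	})
	fmt.Println(conds[0].Type, conds[0].Status, conds[0].Message)
}
~~~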
Description of problem:
This is a follow-up on https://bugzilla.redhat.com/show_bug.cgi?id=2083087 and https://github.com/openshift/console/pull/12390
When creating a Deployment, DeploymentConfig, or Knative Service with enabled Pipeline, and then deleting it again with the enabled option "Delete other resources created by console" (only available on 4.13+ with the PR above) the automatically created Pipeline is not deleted.
When the user tries to create the same resource with a Pipeline again this fails with an error:
An error occurred
secrets "nodeinfo-generic-webhook-secret" already exists
Version-Release number of selected component (if applicable):
4.13
(we might want to backport this together with https://github.com/openshift/console/pull/12390 and OCPBUGS-5547)
How reproducible:
Always
Steps to Reproduce:
Actual results:
Case 1: Delete resources:
Case 2: Delete application:
Expected results:
Case 1: Delete resource:
Case 2: Delete application:
Additional info:
Description of problem:
For HOSTEDCP-1062, components without the label `hypershift.openshift.io/need-management-kas-access: "true"` cannot access the management cluster KAS resources. But the `kube-apiserver` in the HCP does not have the target label `hypershift.openshift.io/need-management-kas-access: "true"`, yet it can access the management KAS:
jiezhao-mac:hypershift jiezhao$ oc get pods -n clusters-jie-test | grep kube-apiserver
kube-apiserver-6799b6cfd8-wk8pv 3/3 Running 0 178m
jiezhao-mac:hypershift jiezhao$
jiezhao-mac:hypershift jiezhao$ oc get pods kube-apiserver-6799b6cfd8-wk8pv -n clusters-jie-test -o yaml | grep hypershift.openshift.io/need-management-kas-access
jiezhao-mac:hypershift jiezhao$
jiezhao-mac:hypershift jiezhao$ oc -n clusters-jie-test rsh pod/kube-apiserver-6799b6cfd8-wk8pv curl --connect-timeout 2 -Iks https://10.0.142.255:6443 -v
Defaulted container "apply-bootstrap" out of: apply-bootstrap, kube-apiserver, audit-logs, init-bootstrap (init), wait-for-etcd (init)
* Rebuilt URL to: https://10.0.142.255:6443/
..
< HTTP/2 403
HTTP/2 403
...
<
* Connection #0 to host 10.0.142.255 left intact
How reproducible:
refer test case: https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-65141
Steps to Reproduce:
https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-65141
Additional info:
router pod has the label and can access mgmt KAS. My expectation is that router pod shouldn't have the label and shouldn't access mgmt KAS.
$ oc get pods router-667cb7f844-lx8mv -n clusters-jie-test -o yaml | grep hypershift.openshift.io/need-management-kas-access
hypershift.openshift.io/need-management-kas-access: "true"
jiezhao-mac:hypershift jiezhao$ oc -n clusters-jie-test rsh pod/router-667cb7f844-lx8mv curl --connect-timeout 2 -Iks https://10.0.142.255:6443 -v
Rebuilt URL to: https://10.0.142.255:6443/
Trying 10.0.142.255...
...
< HTTP/2 403
HTTP/2 403
> Actually, router doesn't need it anymore after https://github.com/openshift/hypershift/pull/2778
Description of the problem:
Adding an invalid label (key or value) to a node returns error code 500 "Internal Server Error" instead of 400 "Bad Request".
How reproducible:
100%
Steps to reproduce:
1. Create a cluster
2. Boot node from ISO
3. Add invalid label, invalid key or value
e.g:
curl -s -H 'Content-Type: application/json' -X PATCH -d '{"node_labels": [{"key": "Label-1", "value": "Label1*1"},{"key": "worker.label2", "value": "Label-2"}]}' https://api.stage.openshift.com/api/assisted-install/v2/infra-envs/8603fe29-e67f-49ad-8ba7-7a256bcb3923/hosts/af629f1e-da67-4211-97f0-f27cb10471ff --header "Authorization: Bearer $(ocm token)"
Actual results:
Action failed with error code 500
{"code":"500","href":"","id":500,"kind":"Error","reason":"node_labels: Invalid value: \"Label1*1\": a valid label must be an empty string or consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyValue', or 'my_value', or '12345', regex used for validation is '(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?')"}
Expected results:
Action failed with error code 400
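A minimal sketch of validating the labels up front so malformed input maps to a 400 response; the handler shape is an assumption for illustration, but the validation helpers are the standard apimachinery ones.
~~~
package main

import (
	"fmt"
	"net/http"

	"k8s.io/apimachinery/pkg/util/validation"
)

// validateNodeLabel is a sketch (not assisted-service's actual handler) of
// validating a label up front so bad input maps to 400 Bad Request instead of
// surfacing later as a 500.
func validateNodeLabel(key, value string) (int, error) {
	if errs := validation.IsQualifiedName(key); len(errs) > 0 {
		return http.StatusBadRequest, fmt.Errorf("invalid label key %q: %v", key, errs)
	}
	if errs := validation.IsValidLabelValue(value); len(errs) > 0 {
		return http.StatusBadRequest, fmt.Errorf("invalid label value %q: %v", value, errs)
	}
	return http.StatusOK, nil
}

func main() {
	code, err := validateNodeLabel("Label-1", "Label1*1")
	fmt.Println(code, err) // 400 invalid label value "Label1*1": ...
}
~~~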
Description of problem:
Noticed an issue with the ignition server when testing some of the latest HO updates on our older control planes:
❯ oc logs ignition-server-5fd4c89764-bddss -n master-roks-dev-4-9
Defaulted container "ignition-server" out of: ignition-server, fetch-feature-gate (init)
Error: unknown flag: --feature-gate-manifest
This seems to be thrown because the flag doesn't exist in the ignition server source code for previous control plane versions; we're specifically only seeing this in 4.9 and 4.10, where the ignition server was not managed by the CPO.
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Install HO off main
2. Bring up 4.9/4.10 hosted control planes
3. Ignition server crashes
Actual results:
Ignition server crashes
Expected results:
Ignition server to run without issues
Additional info:
This is a clone of issue OCPBUGS-18246. The following is the description of the original issue:
—
Description of problem:
Role assignment for Azure AD Workload Identity performed by ccoctl does not provide an option to scope role assignments to a resource group containing customer vnet in a byo vnet installation workflow. https://docs.openshift.com/container-platform/4.13/installing/installing_azure/installing-azure-vnet.html
Version-Release number of selected component (if applicable):
4.14.0
How reproducible:
100%
Steps to Reproduce:
1. Create Azure resource group and vnet for OpenShift within that resource group.
2. Create Azure AD Workload Identity infrastructure with ccoctl.
3. Follow steps to configure existing vnet for installation setting networkResourceGroupName within the install config.
4. Attempt cluster installation.
Actual results:
Cluster installation fails.
Expected results:
Cluster installation succeeds.
Additional info:
ccoctl must be extended to accept a parameter specifying the network resource group name and scope relevant component role assignments to the network resource group in addition to the installation resource group.
Kubernetes 1.27 changes validation of CSR for non-RSA kubelet client/serving CSRs, see https://github.com/kubernetes/kubernetes/issues/109077 and the PR changing https://github.com/kubernetes/kubernetes/pull/111660.
For that reason our cluster-machine-approver needs to relax the validation in https://github.com/openshift/cluster-machine-approver/blob/d74f42bb37c4130ae1e91819d90ad40a51ec472b/pkg/controller/csr_check.go#L84-L86 such that it appropriately expects the necessary key usages.
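A simplified sketch of what a relaxed check could look like, accepting either usage set; this is an assumption for illustration, not the approver's actual code (the real validation lives in csr_check.go).
~~~
package approver

import certificatesv1 "k8s.io/api/certificates/v1"

// allowedUsageSets is a simplified sketch (not the cluster-machine-approver's
// actual code) of the relaxed check: accept CSRs whose usages match either the
// RSA-style set (with key encipherment) or the non-RSA set, since Kubernetes
// 1.27 no longer includes "key encipherment" for non-RSA client keys.
var allowedUsageSets = [][]certificatesv1.KeyUsage{
	{certificatesv1.UsageDigitalSignature, certificatesv1.UsageKeyEncipherment, certificatesv1.UsageClientAuth},
	{certificatesv1.UsageDigitalSignature, certificatesv1.UsageClientAuth},
}

func usagesAllowed(got []certificatesv1.KeyUsage) bool {
	for _, want := range allowedUsageSets {
		if sameUsageSet(got, want) {
			return true
		}
	}
	return false
}

func sameUsageSet(a, b []certificatesv1.KeyUsage) bool {
	if len(a) != len(b) {
		return false
	}
	seen := map[certificatesv1.KeyUsage]bool{}
	for _, u := range a {
		seen[u] = true
	}
	for _, u := range b {
		if !seen[u] {
			return false
		}
	}
	return true
}
~~~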
Description of problem:
When installing a HyperShift cluster into ap-southeast-3 (currently only availble in the production environment), the install will never succeed due to the hosted KCM pods stuck in CrashLoopBackoff
Version-Release number of selected component (if applicable):
4.12.18
How reproducible:
100%
Steps to Reproduce:
1. Install a HyperShift Cluster in ap-southeast-3 on AWS
Actual results:
kube-controller-manager-54fc4fff7d-2t55x 1/2 CrashLoopBackOff 7 (2m49s ago) 16m
kube-controller-manager-54fc4fff7d-dxldc 1/2 CrashLoopBackOff 7 (93s ago) 16m
kube-controller-manager-54fc4fff7d-ww4kv 1/2 CrashLoopBackOff 7 (21s ago) 15m
With selected "important" logs:
I0606 15:16:25.711483 1 event.go:294] "Event occurred" object="kube-system/kube-controller-manager" fieldPath="" kind="ConfigMap" apiVersion="v1" type="Normal" reason="LeaderElection" message="kube-controller-manager-54fc4fff7d-ww4kv_6dbab916-b4bf-447f-bbb2-5037864e7f78 became leader"
I0606 15:16:25.711498 1 event.go:294] "Event occurred" object="kube-system/kube-controller-manager" fieldPath="" kind="Lease" apiVersion="coordination.k8s.io/v1" type="Normal" reason="LeaderElection" message="kube-controller-manager-54fc4fff7d-ww4kv_6dbab916-b4bf-447f-bbb2-5037864e7f78 became leader"
W0606 15:16:25.741417 1 plugins.go:132] WARNING: aws built-in cloud provider is now deprecated. The AWS provider is deprecated and will be removed in a future release. Please use https://github.com/kubernetes/cloud-provider-aws
I0606 15:16:25.741763 1 aws.go:1279] Building AWS cloudprovider
F0606 15:16:25.742096 1 controllermanager.go:245] error building controller context: cloud provider could not be initialized: could not init cloud provider "aws": not a valid AWS zone (unknown region): ap-southeast-3a
Expected results:
The KCM pods are Running
Description of problem:
The CredentialsRequest for a credentials secret generated by CCO on an STS Manual Mode cluster does not have status set
Version-Release number of selected component (if applicable):
4.14.0
How reproducible:
4.14.0
Steps to Reproduce:
1. Create a Manual mode, STS cluster in AWS.
2. Create a CredentialsRequest which provides .spec.cloudTokenPath and .spec.providerSpec.stsIAMRoleARN.
3. Observe that secret is created by CCO in the target namespace specified by the CredentialsRequest.
4. Observe that the CredentialsRequest does not set status once the secret is generated. Specifically, the CredentialsRequest does not set .status.provisioned == true.
Actual results:
Status is not set on CredentialsRequest with provisioned secret.
Expected results:
Status is set on CredentialsRequest with provisioned secret.
Additional info:
Reported by Jan Safranek when testing integration with the aws-efs-csi-driver-operator.
Description of problem: When running in development mode [1], the Loaded enabled plugin count numbers in the Cluster Dashboard Dynamic Plugins popover may be incorrect. In order to make the experience less confusing for users working with the console in development mode, we need to:
Note there is additional work planned in https://issues.redhat.com/browse/CONSOLE-3185. This bug is intended to only capture improving the experience for development mode.
In the assisted pod I see data collection is enabled:
sh-4.4$ env | grep DATA
DATA_UPLOAD_ENDPOINT=https://console.redhat.com/api/ingress/v1/upload
ENABLE_DATA_COLLECTION=True
On https://issues.redhat.com/browse/RFE-2273 the customer analyzed quite correctly:
I have re-reviewed all of the provided data from the attached cases (DHL and ANZ) and have documented my findings below:
1) It looks like the request mentioned by the customer is sent to the Console API. Specifically `api/prometheus-tenancy/api/v1/*`
2) This is then forwarded to Cluster Monitoring (Thanos Querier) [0]
3) Thanos is configured to set the CORS headers to `*` due to the absence of the `--web.disable-cors` argument.[1]
4) The Thanos deployment is managed by the Cluster Monitoring Operator directly [2]
5) When using Postman, we can see the endpoint respond with a `access-control-allow-origin: *` [see image 1]
6) Manually setting the `--web.disable-cors` argument inside the Thanos Querier deployment, the `access-control-allow-origin: *` is removed.
7) Changing the Cluster Monitoring Operator deployment template[4] to include the flag and push the custom image into an OCP 4.10.31 cluster [3]
8) Seems like everything is working and the endpoint is not longer returning the CORS header. [see image 2]
We should set `--web.disable-cors` for our Thanos deployment. We don't load any cross-origin resources through the console>thanos querier path, so this should just work.
Description of the problem:
A base domain containing a double hyphen (`--`), like cat--rahul.com, is allowed by the UI and BE, and when the node is discovered, network validation fails.
The current domain is a particular case of using `--`, but note that the UI and BE allow sending many `-` characters as part of the domain name.
from agent logs:
Aug 28 11:28:55 master-0-0 next_step_runne[1918]: time="28-08-2023 11:28:55" level=info msg="Creating execution step for ntp-synchronizer ntp-synchronizer-70565cf4 args <[{\"ntp_source\":\"\"}]>" file="step_processor.go:123" request_id=5467e025-2683-4119-a55a-976bb7787279 Aug 28 11:28:55 master-0-0 next_step_runne[1918]: time="28-08-2023 11:28:55" level=info msg="Creating execution step for domain-resolution domain-resolution-f3917dea args <[{\"domains\":[{\"domain_name\":\"api.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"api-int.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"console-openshift-console.apps.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"validateNoWildcardDNS.dummy---dummy.cat--rahul.com.\"},{\"domain_name\":\"validateNoWildcardDNS.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"quay.io\"}]}]>" file="step_processor.go:123" request_id=5467e025-2683-4119-a55a-976bb7787279 Aug 28 11:28:55 master-0-0 next_step_runne[1918]: time="28-08-2023 11:28:55" level=info msg="Validating domain resolution with args [{\"domains\":[{\"domain_name\":\"api.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"api-int.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"console-openshift-console.apps.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"validateNoWildcardDNS.dummy---dummy.cat--rahul.com.\"},{\"domain_name\":\"validateNoWildcardDNS.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"quay.io\"}]}]" file="action.go:29" Aug 28 11:28:55 master-0-0 next_step_runne[1918]: time="28-08-2023 11:28:55" level=info msg="Validating inventory with args [fea3d7b9-a990-48a6-9a46-4417915072b0]" file="action.go:29" Aug 28 11:28:55 master-0-0 next_step_runne[1918]: time="28-08-2023 11:28:55" level=error msg="Failed to validate domain resolution: data, {\"domains\":[{\"domain_name\":\"api.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"api-int.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"console-openshift-console.apps.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"validateNoWildcardDNS.dummy---dummy.cat--rahul.com.\"},{\"domain_name\":\"validateNoWildcardDNS.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"quay.io\"}]}" file="action.go:42" error="validation failure list:\nvalidation failure list:\ndomains.0.domain_name in body should match '^([a-zA-Z0-9]+(-[a-zA-Z0-9]+)*[.])+[a-zA-Z]{2,}[.]?$'" Aug 28 11:28:55 master-0-0 next_step_runne[1918]: time="28-08-2023 11:28:55" level=info msg="Validating ntp synchronizer with args [{\"ntp_source\":\"\"}]" file="action.go:29" Aug 28 11:28:55 master-0-0 next_step_runne[1918]: time="28-08-2023 11:28:55" level=info msg="Validating free addresses with args [[\"192.168.123.0/24\"]]" file="action.go:29" Aug 28 11:28:55 master-0-0 next_step_runne[1918]: time="28-08-2023 11:28:55" level=info msg="Executing nsenter [--target 1 --cgroup --mount --ipc --net -- sh -c cp /etc/mtab /root/mtab-fea3d7b9-a990-48a6-9a46-4417915072b0 && podman run --privileged --pid=host --net=host --rm --quiet -v /var/log:/var/log -v /run/udev:/run/udev -v /dev/disk:/dev/disk -v /run/systemd/journal/socket:/run/systemd/journal/socket -v /var/log:/host/var/log:ro -v /proc/meminfo:/host/proc/meminfo:ro -v /sys/kernel/mm/hugepages:/host/sys/kernel/mm/hugepages:ro -v /proc/cpuinfo:/host/proc/cpuinfo:ro -v /root/mtab-fea3d7b9-a990-48a6-9a46-4417915072b0:/host/etc/mtab:ro -v /sys/block:/host/sys/block:ro -v /sys/devices:/host/sys/devices:ro -v /sys/bus:/host/sys/bus:ro -v /sys/class:/host/sys/class:ro -v /run/udev:/host/run/udev:ro -v /dev/disk:/host/dev/disk:ro 
registry-proxy.engineering.redhat.com/rh-osbs/openshift4-assisted-installer-agent-rhel8:v1.0.0-279 inventory]" file="execute.go:39" Aug 28 11:28:55 master-0-0 next_step_runne[1918]: time="28-08-2023 11:28:55" level=error msg="Unable to create runner for step <domain-resolution-f3917dea>, args <[{\"domains\":[{\"domain_name\":\"api.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"api-int.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"console-openshift-console.apps.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"validateNoWildcardDNS.dummy---dummy.cat--rahul.com.\"},{\"domain_name\":\"validateNoWildcardDNS.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"quay.io\"}]}]>" file="step_processor.go:126" error="validation failure list:\nvalidation failure list:\ndomains.0.domain_name in body should match '^([a-zA-Z0-9]+(-[a-zA-Z0-9]+)*[.])+[a-zA-Z]{2,}[.]?$'" request_id=5467e025-2683-4119-a55a-976bb7787279 Aug 28 11:28:55 master-0-0 next_step_runne[1918]: time="28-08-2023 11:28:55" level=info msg="Executing nsenter [--target 1 --cgroup --mount --ipc --net -- findmnt --raw --noheadings --output SOURCE,TARGET --target /run/media/iso]" file="execute.go:39" Aug 28 11:28:55 master-0-0 next_step_runne[1918]: time="28-08-2023 11:28:55" level=info msg="Executing nsenter [--target 1 --cgroup --mount --ipc --net -- sh -c podman ps --format '{{.Names}}' | grep -q '^free_addresses_scanner$' || podman run --privileged --net=host --rm --quiet --name free_addresses_scanner -v /var/log:/var/log -v /run/systemd/journal/socket:/run/systemd/journal/socket registry-proxy.engineering.redhat.com/rh-osbs/openshift4-assisted-installer-agent-rhel8:v1.0.0-279 free_addresses '[\"192.168.123.0/24\"]']" file="execute.go:39" Aug 28 11:28:55 master-0-0 next_step_runne[1918]: time="28-08-2023 11:28:55" level=info msg="Executing nsenter [--target 1 --cgroup --mount --ipc --net -- timeout 30 chronyc -n sources]" file="execute.go:39" Aug 28 11:28:55 master-0-0 next_step_runne[1918]: time="28-08-2023 11:28:55" level=warning msg="Sending step <domain-resolution-f3917dea> reply output <> error <validation failure list:\nvalidation failure list:\ndomains.0.domain_name in body should match '^([a-zA-Z0-9]+(-[a-zA-Z0-9]+)*[.])+[a-zA-Z]{2,}[.]?$'> exit-code <-1>" file="step_processor.go:76" request_id=5467e025-2683-4119-a55a-976bb7787279
How reproducible:
Create a cluster with the domain cat--rahul.com, using the UI fix that allows it.
Once the node is discovered, network validation fails with the error shown above.
Steps to reproduce:
see above
Actual results:
Unable to install cluster due to network validation failure
Expected results:
The domain should be accepted by the domain-name validation regex.
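For illustration, the validation pattern from the agent logs only allows single hyphens between alphanumeric runs, which is why cat--rahul.com is rejected. A quick local check with the same pattern (hypothetical inputs, not part of the original report):
$ echo "api.dummy---dummy.cat--rahul.com" | grep -E '^([a-zA-Z0-9]+(-[a-zA-Z0-9]+)*[.])+[a-zA-Z]{2,}[.]?$'
# no output: consecutive hyphens do not match
$ echo "api.example.cat-rahul.com" | grep -E '^([a-zA-Z0-9]+(-[a-zA-Z0-9]+)*[.])+[a-zA-Z]{2,}[.]?$'
api.example.cat-rahul.com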
Description of problem:
When modifying a secret in the Management Console that has a binary file included (such as a keystore), the keystore gets corrupted after the modification and therefore impacts application functionality (as the keystore can no longer be read).

$ openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -sha256 -days 365
$ cat cert.pem key.pem > file.crt.txt
$ openssl pkcs12 -export -in file.crt.txt -out mykeystore.pkcs12 -name myAlias -noiter -nomaciter
$ oc create secret generic keystore --from-file=mykeystore.pkcs12 --from-file=cert.pem --from-file=key.pem -n project-300

apiVersion: v1
kind: Pod
metadata:
  name: mypod
  namespace: project-300
spec:
  containers:
  - name: mypod
    image: quay.io/rhn_support_sreber/curl:latest
    volumeMounts:
    - name: foo
      mountPath: "/keystore"
      readOnly: true
  volumes:
  - name: foo
    secret:
      secretName: keystore
      optional: true

# Getting the md5sum of the file on the local laptop to compare with what is available in the pod
$ md5sum mykeystore.pkcs12
c189536854e59ab444720efaaa76a34a  mykeystore.pkcs12

sh-5.2# ls -al /keystore/..data/
total 16
drwxr-xr-x. 2 root root  100 Mar 24 11:19 .
drwxrwxrwt. 3 root root  140 Mar 24 11:19 ..
-rw-r--r--. 1 root root 1992 Mar 24 11:19 cert.pem
-rw-r--r--. 1 root root 3414 Mar 24 11:19 key.pem
-rw-r--r--. 1 root root 4380 Mar 24 11:19 mykeystore.pkcs12
sh-5.2# md5sum /keystore/..data/mykeystore.pkcs12
c189536854e59ab444720efaaa76a34a  /keystore/..data/mykeystore.pkcs12

Edit cert.pem in the secret using the Management Console, then recreate the pod with the same manifest as above:

$ oc delete pod mypod -n project-300

sh-5.2# ls -al /keystore/..data/
total 20
drwxr-xr-x. 2 root root   100 Mar 24 12:52 .
drwxrwxrwt. 3 root root   140 Mar 24 12:52 ..
-rw-r--r--. 1 root root  1992 Mar 24 12:52 cert.pem
-rw-r--r--. 1 root root  3414 Mar 24 12:52 key.pem
-rw-r--r--. 1 root root 10782 Mar 24 12:52 mykeystore.pkcs12
sh-5.2# md5sum /keystore/..data/mykeystore.pkcs12
56f04fa8059471896ed5a3c54ade707c  /keystore/..data/mykeystore.pkcs12

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-0.nightly-2023-03-23-204038   True        False         91m     Cluster version is 4.13.0-0.nightly-2023-03-23-204038

The modification was done in the Management Console by selecting the secret and then using: Actions -> Edit Secret -> modifying the value of cert.pem and submitting via the Save button.
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.13.0-0.nightly-2023-03-23-204038 and 4.12.6
How reproducible:
Always
Steps to Reproduce:
1. See above the details steps
Actual results:
# md5sum on the laptop for the file
$ md5sum mykeystore.pkcs12
c189536854e59ab444720efaaa76a34a  mykeystore.pkcs12

# md5sum of the file in the pod after the modification in the Management Console
sh-5.2# md5sum /keystore/..data/mykeystore.pkcs12
56f04fa8059471896ed5a3c54ade707c  /keystore/..data/mykeystore.pkcs12

The file got corrupted and is not usable anymore. The binary file should not be modified when editing the secret in the Management Console if no changes were made to its value.
Expected results:
The binary file should not be modified when editing the secret in the Management Console if no changes were made to its value.
Additional info:
A similar problem was already fixed in https://bugzilla.redhat.com/show_bug.cgi?id=1879638, but that was for the case where the binary file was uploaded. Possibly the secret edit functionality is also missing binary file support.
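One way to confirm whether the stored bytes actually changed after a console edit, using the secret and namespace names from the reproducer above (a hedged sketch, not part of the original report):
$ oc get secret keystore -n project-300 -o jsonpath='{.data.mykeystore\.pkcs12}' | base64 -d | md5sum
# compare the result against the checksum taken on the laptop (c189536854e59ab444720efaaa76a34a)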
Improve the logging format of the KNI haproxy logs to display tcplog output plus the frontend IP and frontend port.
The current logging format is not very verbose:
<134>Jun 2 22:54:02 haproxy[11]: Connect from ::1:42424 to ::1:9445 (main/TCP)
<134>Jun 2 22:54:04 haproxy[11]: Connect from ::1:42436 to ::1:9445 (main/TCP)
<134>Jun 2 22:54:04 haproxy[11]: Connect from ::1:42446 to ::1:9445 (main/TCP)
It lacks critical information for troubleshooting, such as load-balancing destination and timestamps.
https://www.haproxy.com/blog/introduction-to-haproxy-logging recommends the following for tcp mode:
When in TCP mode, which is set by adding mode tcp, you should also add [option tcplog](https://www.haproxy.com/documentation/hapee/1-8r1/onepage/#option%20tcplog).
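A minimal sketch of what that change would look like in haproxy.cfg; the section name and bind address are placeholders rather than the actual KNI-rendered config, and the existing log target is assumed to stay as already configured:
frontend main
    bind :::9445 v4v6
    mode tcp
    # option tcplog switches to the TCP log format: timestamps, frontend name,
    # chosen backend/server, timers and byte counts per connection
    option tcplog
    default_backend masters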
This fix contains the following changes, coming from the updated version of Kubernetes up to v1.27.6:
Changelog:
v1.27.6: https://github.com/kubernetes/kubernetes/blob/release-1.27/CHANGELOG/CHANGELOG-1.27.md#changelog-since-v1275
v1.27.5: https://github.com/kubernetes/kubernetes/blob/release-1.27/CHANGELOG/CHANGELOG-1.27.md#changelog-since-v1274
Please review the following PR: https://github.com/operator-framework/operator-marketplace/pull/535
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
CSI storage capacity tracking is GA since Kubernetes 1.24, yet must-gather does not collect CSIStorageCapacity objects. It would be useful for single node clusters with LVMO, but other clusters could benefit from it too.
Version-Release number of selected component (if applicable):
4.11.0
How reproducible:
always
Steps to Reproduce:
1. oc adm must-gather
Actual results:
Output does not contain CSIStorageCapacity objects
Expected results:
Output contains CSIStorageCapacity objects
Additional info:
We should go through all new additions to the storage APIs (storage.k8s.io/v1) and add any missing items.
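For comparison, the objects are visible in-cluster with a plain client call (assuming a CSI driver with storage capacity tracking is installed); they are simply absent from the must-gather output:
$ oc get csistoragecapacities.storage.k8s.io -A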
Description of problem:
CNO panics with net/http: abort Handler while installing an SNO cluster with OpenShiftSDN:
network   4.14.0-0.nightly-2023-07-05-191022   True   False   True   9h   Panic detected: net/http: abort Handler
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-07-05-191022
How reproducible:
sometimes
Steps to Reproduce:
1. Install an OpenShiftSDN cluster on SNO
Actual results:
Cluster (CNO) reports errors
Expected results:
Cluster should be installed fine
Additional info:
SOS: http://shell.lab.bos.redhat.com/~anusaxen/sosreport-rg-0707-tl6fd-master-0-2023-07-07-pyaruar.tar.xz MG: http://shell.lab.bos.redhat.com/~anusaxen/must-gather.local.4340060474822893433/
Hypershift needs to be able to specify a different release payload for control plane components without redeploying anything in the hosted cluster.
The csi-driver-node DaemonSet pods in the hosted cluster and the csi-driver-controller Deployment that runs in the control plane both use the same AWS_EBS_DRIVER_IMAGE and LIVENESS_PROBE_IMAGE values.
We need a way to specify these images separately for csi-driver-node and csi-driver-controller.
Description of problem:
Even in environments where container images are manually loaded into the container store, services will fail because they are written to always pull images prior to starting the container, instead of first checking (for example with podman image exists) whether the image is already present. A minimal sketch of such a guard is included under Additional info below.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
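A minimal sketch of the guard described above; the image name is a placeholder, and the real services would substitute their configured image:
$ podman image exists registry.example.com/some/image:latest || podman pull registry.example.com/some/image:latest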
Description of problem:
Business Automation Operands fail to load in the uninstall operator modal, with the alert message "Cannot load Operands. There was an error loading operands for this operator. Operands will need to be deleted manually...". The "Delete all operand instances for this operator" checkbox is not shown, so the test fails. https://search.ci.openshift.org/?search=Testing+uninstall+of+Business+Automation+Operator&maxAge=168h&context=1&type=junit&name=pull-ci-openshift-console-master-e2e-gcp-console&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The kube-controller-manager container cluster-policy-controller shows unusual error logs, such as:
I0214 10:49:34.698154 1 interface.go:71] Couldn't find informer for template.openshift.io/v1, Resource=templateinstances
I0214 10:49:34.698159 1 resource_quota_monitor.go:185] QuotaMonitor unable to use a shared informer for resource "template.openshift.io/v1, Resource=templateinstances": no informer found for template.openshift.io/v1, Resource=templateinstances
Version-Release number of selected component (if applicable):
How reproducible:
When the cluster-policy-controller restarts, you will see these logs.
Steps to Reproduce:
1.oc logs kube-controller-manager-master0 -n openshift-kube-controller-manager -c cluster-policy-controller
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/cluster-etcd-operator/pull/1042
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
The e2e test "TestDNSLogging" from https://github.com/openshift/cluster-dns-operator/tree/master/test/e2e fails intermittently.
Recently seen in:
Description of problem:
nmstate packages > 2.2.9 will cause MCD firstboot to fail. For now, let's pin the nmstate version and fix properly via https://github.com/openshift/machine-config-operator/pull/3720
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
No datapoints found for Long Running Requests by Resource and Long Running Requests by Instance of "API Performance" dashboard on web-console UI
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-06-13-223353
How reproducible:
always
Steps to Reproduce:
1. Install an OCP cluster with a 4.14 nightly payload. 2. Open the web console and view the "API Performance" dashboard.
Actual results:
1. The Long Running Requests by Resource and Long Running Requests by Instance panels show "No datapoints found".
Expected results:
2. Data should be shown on the Long Running Requests by Resource and Long Running Requests by Instance panels.
Additional info:
1. Got the same results on 4.13.
2. Did not find apiserver_longrunning_gauge in the Prometheus data, only apiserver_longrunning_requests:
$ token=`oc create token prometheus-k8s -n openshift-monitoring`
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/label/__name__/values' | jq | grep apiserver_longrunning_gauge
no result
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/label/__name__/values' | jq | grep apiserver_long
"apiserver_longrunning_requests",
Description of problem:
In the assisted-installer flow, the bootkube service is started on the Live ISO, so the root FS is read-only. The OKD installer attempts to pivot the booted OS to machine-os-content via `rpm-ostree rebase`. This is not necessary since we're already using SCOS on the Live ISO.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Print preview of Topology presents incorrect layout
Version-Release number of selected component (if applicable):
4.12.0
How reproducible:
Always
Steps to Reproduce:
1. Have 2 Knative/Serverless Functions deployed (in my case one is Quarkus and another is Spring Boot).
2. In the Topology UI, observe that you see their snippets properly within the Graph view.
3. Now switch to List view.
4. In my case the items I see in List view are this short list: Broker default, Operator Backed Service DW terminal-avby87, D workspaceb5975d64dbc54983, Service KSVC caller-function REV caller-function-00002, Service KSVC callme-function REV callme-function-00001.
5. Now, using the Chrome browser, press Ctrl+P, i.e. Print preview.
6. Observe that even in Landscape mode only the items up to the workspace item are displayed, with no further pages/info.
Actual results:
Incomplete Topology info from List view in Print Preview
Expected results:
Full and accurate Topology info from List view in Print Preview
Additional info:
Description of problem: Multus should implement per node certificates via integration in the CNO
Description of problem:
When installing a new cluster with TechPreviewNoUpgrade featureSet, Nodes never become Ready. Logs from control-plane components indicate that a resource associated with the DynamicResourceAllocation feature can't be found:
E0804 15:48:51.094383 1 reflector.go:147] vendor/k8s.io/client-go/informers/factory.go:150: Failed to watch *v1alpha2.PodSchedulingContext: failed to list *v1alpha2.PodSchedulingContext: the server could not find the requested resource (get podschedulingcontexts.resource.k8s.io)
It turns out we either need to:
1. Enable the resource.k8s.io/v1alpha2=true API in kube-apiserver.
2. Or disable the DynamicResourceAllocation feature as TP.
For now I added a commit to invalidate this feature in o/k and disable all related tests. Please let me know once this is sorted out so that I can drop that commit from the rebase PR.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always when installing a new cluster with TechPreviewNoUpgrade featureSet.
Steps to Reproduce:
1. Install cluster with TechPreviewNoUpgrade featureSet (this can be done passing an install-config.yaml to the installer). 2. Check logs from one the control-plane components.
Actual results:
Nodes are NotReady and ClusterOperators Degraded.
Expected results:
Cluster is installed successfully.
Additional info:
Slack thread: https://redhat-internal.slack.com/archives/C05HQGU8TFF/p1691154653507499 How to enable an API in KAS: https://kubernetes.io/docs/tasks/administer-cluster/enable-disable-api/
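A minimal sketch of option 1, following the linked upstream enable/disable-API docs; this shows only the raw kube-apiserver flag, and how it would be plumbed through the OpenShift API server operator is not covered here:
$ kube-apiserver --runtime-config=resource.k8s.io/v1alpha2=true <other flags unchanged>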
When making a change to the uninstaller for GCP, the linter picked up an error:
pkg/destroy/gcp/gcp.go:42:2: found a struct that contains a context.Context field (containedctx) Context context.Context
Contexts should not be added to structs. Instead the context should be created at the top level of the uninstaller OR a separate context can be used for each stage of the uninstallation process.
Currently this error can be bypassed by adding:
//nolint:containedctx
to the offending line
Description of problem:
We need to update the operator to be synced with the Kubernetes API version used by OCP 4.14. We also need to sync our samples libraries with the latest available libraries. Any deprecated libraries should be removed as well.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The help info for --oci-registries-config should be updated to reference --include-local-oci-catalogs. Current help text:
--oci-registries-config string   Registries config file location (used only with --use-oci-feature flag)
Since `--use-oci-feature` has been deprecated, the help text should reference --include-local-oci-catalogs instead.
Description of problem:
After updating the sysctl config map, the test waits up to 30s for the pod to be in ready state. From the logs, it could be seen that the allowlist controller takes more than 30s to reconcile when multiple tests are running in parallel. The internal logic of the allowlist controller waits up to 60s for the pods of the allowlist DS to be running. Therefore, it is logical to increase the timeout in the test to 60s.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Go to the console 2. Click on "Installed Operators" 3. Add an operator (Node Feature Discovery) 4. Click on "All instances", then on "Create new" (see image)
Actual results:
The drop-down items are empty, but as a user you can still click them and get to the new instance YAML.
Expected results:
For a better user experience, there should at least be some labels or clickable text.
Additional info:
Description of problem:
While installing clusters with the assisted installer, we have lately seen cases where one master joins very quickly and starts all the pods needed for cluster bootstrap to finish, but the second master joins only after that. Keepalived can't start when only one master has joined, as it doesn't have enough data to build its configuration files. In HA mode, cluster bootstrap should wait for at least 2 joined masters before removing the bootstrap control plane; without that, the installation will fail.
Version-Release number of selected component (if applicable):
How reproducible:
Start bm installation and start one master, wait till it starts all required pods and then add others.
Steps to Reproduce:
1. Start bm installation
2. Start one master
3. Wait till it starts all required pods
4. Add the others
Actual results:
no vip, installation fails
Expected results:
installation succeeds, vip moves to master
Additional info:
Description of problem:
After a replace upgrade from OCP 4.14 image to another 4.14 image first node is in NotReady. jiezhao-mac:hypershift jiezhao$ oc get node --kubeconfig=hostedcluster.kubeconfig NAME STATUS ROLES AGE VERSION ip-10-0-128-175.us-east-2.compute.internal Ready worker 72m v1.26.2+06e8c46 ip-10-0-134-164.us-east-2.compute.internal Ready worker 68m v1.26.2+06e8c46 ip-10-0-137-194.us-east-2.compute.internal Ready worker 77m v1.26.2+06e8c46 ip-10-0-141-231.us-east-2.compute.internal NotReady worker 9m54s v1.26.2+06e8c46 - lastHeartbeatTime: "2023-03-21T19:48:46Z" lastTransitionTime: "2023-03-21T19:42:37Z" message: 'container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?' reason: KubeletNotReady status: "False" type: Ready Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal Starting 11m kubelet Starting kubelet. Normal NodeHasSufficientMemory 11m (x2 over 11m) kubelet Node ip-10-0-141-231.us-east-2.compute.internal status is now: NodeHasSufficientMemory Normal NodeHasNoDiskPressure 11m (x2 over 11m) kubelet Node ip-10-0-141-231.us-east-2.compute.internal status is now: NodeHasNoDiskPressure Normal NodeHasSufficientPID 11m (x2 over 11m) kubelet Node ip-10-0-141-231.us-east-2.compute.internal status is now: NodeHasSufficientPID Normal NodeAllocatableEnforced 11m kubelet Updated Node Allocatable limit across pods Normal Synced 11m cloud-node-controller Node synced successfully Normal RegisteredNode 11m node-controller Node ip-10-0-141-231.us-east-2.compute.internal event: Registered Node ip-10-0-141-231.us-east-2.compute.internal in Controller Warning ErrorReconcilingNode 17s (x30 over 11m) controlplane nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation
ovnkube-master log:
I0321 20:55:16.270197 1 default_network_controller.go:667] Node add failed for ip-10-0-141-231.us-east-2.compute.internal, will try again later: nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation I0321 20:55:16.270209 1 obj_retry.go:326] Retry add failed for *v1.Node ip-10-0-141-231.us-east-2.compute.internal, will try again later: nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation I0321 20:55:16.270273 1 event.go:285] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-0-141-231.us-east-2.compute.internal", UID:"621e6289-ca5a-4e17-afff-5b49961cfb38", APIVersion:"v1", ResourceVersion:"52970", FieldPath:""}): type: 'Warning' reason: 'ErrorReconcilingNode' nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation I0321 20:55:17.851497 1 master.go:719] Adding or Updating Node "ip-10-0-137-194.us-east-2.compute.internal" I0321 20:55:25.965132 1 master.go:719] Adding or Updating Node "ip-10-0-128-175.us-east-2.compute.internal" I0321 20:55:45.928694 1 client.go:783] "msg"="transacting operations" "database"="OVN_Northbound" "operations"="[{Op:update Table:NB_Global Row:map[options:{GoMap:map[e2e_timestamp:1679432145 mac_prefix:2e:f9:d8 max_tunid:16711680 northd_internal_version:23.03.1-20.27.0-70.6 northd_probe_interval:5000 svc_monitor_mac:fe:cb:72:cf:f8:5f use_logical_dp_groups:true]}] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {c8b24290-296e-44a2-a4d0-02db7e312614}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]" I0321 20:55:46.270129 1 obj_retry.go:265] Retry object setup: *v1.Node ip-10-0-141-231.us-east-2.compute.internal I0321 20:55:46.270154 1 obj_retry.go:319] Adding new object: *v1.Node ip-10-0-141-231.us-east-2.compute.internal I0321 20:55:46.270164 1 master.go:719] Adding or Updating Node "ip-10-0-141-231.us-east-2.compute.internal" I0321 20:55:46.270201 1 default_network_controller.go:667] Node add failed for ip-10-0-141-231.us-east-2.compute.internal, will try again later: nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation I0321 20:55:46.270209 1 obj_retry.go:326] Retry add failed for *v1.Node ip-10-0-141-231.us-east-2.compute.internal, will try again later: nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation I0321 20:55:46.270284 1 event.go:285] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-0-141-231.us-east-2.compute.internal", UID:"621e6289-ca5a-4e17-afff-5b49961cfb38", APIVersion:"v1", ResourceVersion:"52970", FieldPath:""}): type: 'Warning' reason: 'ErrorReconcilingNode' nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation I0321 20:55:52.916512 1 reflector.go:559] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Namespace total 5 items received I0321 20:56:06.910669 1 reflector.go:559] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Pod total 12 items received I0321 20:56:15.928505 1 client.go:783] "msg"="transacting operations" "database"="OVN_Northbound" "operations"="[{Op:update Table:NB_Global Row:map[options:{GoMap:map[e2e_timestamp:1679432175 mac_prefix:2e:f9:d8 max_tunid:16711680 northd_internal_version:23.03.1-20.27.0-70.6 
northd_probe_interval:5000 svc_monitor_mac:fe:cb:72:cf:f8:5f use_logical_dp_groups:true]}] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {c8b24290-296e-44a2-a4d0-02db7e312614}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]" I0321 20:56:16.269611 1 obj_retry.go:265] Retry object setup: *v1.Node ip-10-0-141-231.us-east-2.compute.internal I0321 20:56:16.269637 1 obj_retry.go:319] Adding new object: *v1.Node ip-10-0-141-231.us-east-2.compute.internal I0321 20:56:16.269646 1 master.go:719] Adding or Updating Node "ip-10-0-141-231.us-east-2.compute.internal" I0321 20:56:16.269688 1 default_network_controller.go:667] Node add failed for ip-10-0-141-231.us-east-2.compute.internal, will try again later: nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation I0321 20:56:16.269697 1 obj_retry.go:326] Retry add failed for *v1.Node ip-10-0-141-231.us-east-2.compute.internal, will try again later: nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation I0321 20:56:16.269724 1 event.go:285] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-0-141-231.us-east-2.compute.internal", UID:"621e6289-ca5a-4e17-afff-5b49961cfb38", APIVersion:"v1", ResourceVersion:"52970", FieldPath:""}): type: 'Warning' reason: 'ErrorReconcilingNode' nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation
cluster-network-operator log:
I0321 21:03:38.487602 1 log.go:198] Set operator conditions: - lastTransitionTime: "2023-03-21T17:39:21Z" status: "False" type: ManagementStateDegraded - lastTransitionTime: "2023-03-21T19:53:10Z" message: DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - last change 2023-03-21T19:42:39Z reason: RolloutHung status: "True" type: Degraded - lastTransitionTime: "2023-03-21T17:39:21Z" status: "True" type: Upgradeable - lastTransitionTime: "2023-03-21T19:42:39Z" message: |- DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes) DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes) DaemonSet "/openshift-multus/multus" is not available (awaiting 1 nodes) DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes) reason: Deploying status: "True" type: Progressing - lastTransitionTime: "2023-03-21T17:39:26Z" status: "True" type: Available I0321 21:03:38.488312 1 log.go:198] Skipping reconcile of Network.operator.openshift.io: spec unchanged I0321 21:03:38.499825 1 log.go:198] Set ClusterOperator conditions: - lastTransitionTime: "2023-03-21T17:39:21Z" status: "False" type: ManagementStateDegraded - lastTransitionTime: "2023-03-21T19:53:10Z" message: DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - last change 2023-03-21T19:42:39Z reason: RolloutHung status: "True" type: Degraded - lastTransitionTime: "2023-03-21T17:39:21Z" status: "True" type: Upgradeable - lastTransitionTime: "2023-03-21T19:42:39Z" message: |- DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes) DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes) DaemonSet "/openshift-multus/multus" is not available (awaiting 1 nodes) DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes) reason: Deploying status: "True" type: Progressing - lastTransitionTime: "2023-03-21T17:39:26Z" status: "True" type: Available I0321 21:03:38.571013 1 log.go:198] Set HostedControlPlane conditions: - lastTransitionTime: "2023-03-21T17:38:24Z" message: All is well observedGeneration: 3 reason: AsExpected status: "True" type: ValidAWSIdentityProvider - lastTransitionTime: "2023-03-21T17:37:06Z" message: Configuration passes validation observedGeneration: 3 reason: AsExpected status: "True" type: ValidHostedControlPlaneConfiguration - lastTransitionTime: "2023-03-21T19:24:24Z" message: "" observedGeneration: 3 reason: QuorumAvailable status: "True" type: EtcdAvailable - lastTransitionTime: "2023-03-21T17:38:23Z" message: Kube APIServer deployment is available observedGeneration: 3 reason: AsExpected status: "True" type: KubeAPIServerAvailable - lastTransitionTime: "2023-03-21T20:26:29Z" message: "" observedGeneration: 3 reason: AsExpected status: "False" type: Degraded - lastTransitionTime: "2023-03-21T17:37:11Z" message: All is well observedGeneration: 3 reason: AsExpected status: "True" type: InfrastructureReady - lastTransitionTime: "2023-03-21T17:37:06Z" message: External DNS is not configured observedGeneration: 3 reason: StatusUnknown status: Unknown type: ExternalDNSReachable - lastTransitionTime: "2023-03-21T19:24:24Z" message: "" observedGeneration: 3 reason: AsExpected status: "True" type: Available - lastTransitionTime: "2023-03-21T17:37:06Z" message: Reconciliation active on resource observedGeneration: 3 reason: AsExpected status: "True" type: ReconciliationActive - 
lastTransitionTime: "2023-03-21T17:38:25Z" message: All is well reason: AsExpected status: "True" type: AWSDefaultSecurityGroupCreated - lastTransitionTime: "2023-03-21T19:30:54Z" message: 'Error while reconciling 4.14.0-0.nightly-2023-03-20-201450: the cluster operator network is degraded' observedGeneration: 3 reason: ClusterOperatorDegraded status: "False" type: ClusterVersionProgressing - lastTransitionTime: "2023-03-21T17:39:11Z" message: Condition not found in the CVO. observedGeneration: 3 reason: StatusUnknown status: Unknown type: ClusterVersionUpgradeable - lastTransitionTime: "2023-03-21T17:44:05Z" message: Done applying 4.14.0-0.nightly-2023-03-20-201450 observedGeneration: 3 reason: FromClusterVersion status: "True" type: ClusterVersionAvailable - lastTransitionTime: "2023-03-21T19:55:15Z" message: Cluster operator network is degraded observedGeneration: 3 reason: ClusterOperatorDegraded status: "True" type: ClusterVersionFailing - lastTransitionTime: "2023-03-21T17:39:11Z" message: Payload loaded version="4.14.0-0.nightly-2023-03-20-201450" image="registry.ci.openshift.org/ocp/release:4.14.0-0.nightly-2023-03-20-201450" architecture="amd64" observedGeneration: 3 reason: PayloadLoaded status: "True" type: ClusterVersionReleaseAccepted - lastTransitionTime: "2023-03-21T17:39:21Z" message: "" reason: AsExpected status: "False" type: network.operator.openshift.io/ManagementStateDegraded - lastTransitionTime: "2023-03-21T19:53:10Z" message: DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - last change 2023-03-21T19:42:39Z reason: RolloutHung status: "True" type: network.operator.openshift.io/Degraded - lastTransitionTime: "2023-03-21T17:39:21Z" message: "" reason: AsExpected status: "True" type: network.operator.openshift.io/Upgradeable - lastTransitionTime: "2023-03-21T19:42:39Z" message: |- DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes) DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes) DaemonSet "/openshift-multus/multus" is not available (awaiting 1 nodes) DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes) reason: Deploying status: "True" type: network.operator.openshift.io/Progressing - lastTransitionTime: "2023-03-21T17:39:27Z" message: "" reason: AsExpected status: "True" type: network.operator.openshift.io/Available I0321 21:03:39.450912 1 pod_watcher.go:125] Operand /, Kind= openshift-multus/multus updated, re-generating status I0321 21:03:39.450953 1 pod_watcher.go:125] Operand /, Kind= openshift-multus/multus updated, re-generating status I0321 21:03:39.493206 1 log.go:198] Set operator conditions: - lastTransitionTime: "2023-03-21T17:39:21Z" status: "False" type: ManagementStateDegraded - lastTransitionTime: "2023-03-21T19:53:10Z" message: DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - last change 2023-03-21T19:42:39Z reason: RolloutHung status: "True" type: Degraded - lastTransitionTime: "2023-03-21T17:39:21Z" status: "True" type: Upgradeable - lastTransitionTime: "2023-03-21T19:42:39Z" message: |- DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes) DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes) DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes) reason: Deploying status: "True" type: Progressing - lastTransitionTime: "2023-03-21T17:39:26Z" status: 
"True" type: Available I0321 21:03:39.494050 1 log.go:198] Skipping reconcile of Network.operator.openshift.io: spec unchanged I0321 21:03:39.508538 1 log.go:198] Set ClusterOperator conditions: - lastTransitionTime: "2023-03-21T17:39:21Z" status: "False" type: ManagementStateDegraded - lastTransitionTime: "2023-03-21T19:53:10Z" message: DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - last change 2023-03-21T19:42:39Z reason: RolloutHung status: "True" type: Degraded - lastTransitionTime: "2023-03-21T17:39:21Z" status: "True" type: Upgradeable - lastTransitionTime: "2023-03-21T19:42:39Z" message: |- DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes) DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes) DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes) reason: Deploying status: "True" type: Progressing - lastTransitionTime: "2023-03-21T17:39:26Z" status: "True" type: Available I0321 21:03:39.684429 1 log.go:198] Set HostedControlPlane conditions: - lastTransitionTime: "2023-03-21T17:38:24Z" message: All is well observedGeneration: 3 reason: AsExpected status: "True" type: ValidAWSIdentityProvider - lastTransitionTime: "2023-03-21T17:37:06Z" message: Configuration passes validation observedGeneration: 3 reason: AsExpected status: "True" type: ValidHostedControlPlaneConfiguration - lastTransitionTime: "2023-03-21T19:24:24Z" message: "" observedGeneration: 3 reason: QuorumAvailable status: "True" type: EtcdAvailable - lastTransitionTime: "2023-03-21T17:38:23Z" message: Kube APIServer deployment is available observedGeneration: 3 reason: AsExpected status: "True" type: KubeAPIServerAvailable - lastTransitionTime: "2023-03-21T20:26:29Z" message: "" observedGeneration: 3 reason: AsExpected status: "False" type: Degraded - lastTransitionTime: "2023-03-21T17:37:11Z" message: All is well observedGeneration: 3 reason: AsExpected status: "True" type: InfrastructureReady - lastTransitionTime: "2023-03-21T17:37:06Z" message: External DNS is not configured observedGeneration: 3 reason: StatusUnknown status: Unknown type: ExternalDNSReachable - lastTransitionTime: "2023-03-21T19:24:24Z" message: "" observedGeneration: 3 reason: AsExpected status: "True" type: Available - lastTransitionTime: "2023-03-21T17:37:06Z" message: Reconciliation active on resource observedGeneration: 3 reason: AsExpected status: "True" type: ReconciliationActive - lastTransitionTime: "2023-03-21T17:38:25Z" message: All is well reason: AsExpected status: "True" type: AWSDefaultSecurityGroupCreated - lastTransitionTime: "2023-03-21T19:30:54Z" message: 'Error while reconciling 4.14.0-0.nightly-2023-03-20-201450: the cluster operator network is degraded' observedGeneration: 3 reason: ClusterOperatorDegraded status: "False" type: ClusterVersionProgressing - lastTransitionTime: "2023-03-21T17:39:11Z" message: Condition not found in the CVO. 
observedGeneration: 3 reason: StatusUnknown status: Unknown type: ClusterVersionUpgradeable - lastTransitionTime: "2023-03-21T17:44:05Z" message: Done applying 4.14.0-0.nightly-2023-03-20-201450 observedGeneration: 3 reason: FromClusterVersion status: "True" type: ClusterVersionAvailable - lastTransitionTime: "2023-03-21T19:55:15Z" message: Cluster operator network is degraded observedGeneration: 3 reason: ClusterOperatorDegraded status: "True" type: ClusterVersionFailing - lastTransitionTime: "2023-03-21T17:39:11Z" message: Payload loaded version="4.14.0-0.nightly-2023-03-20-201450" image="registry.ci.openshift.org/ocp/release:4.14.0-0.nightly-2023-03-20-201450" architecture="amd64" observedGeneration: 3 reason: PayloadLoaded status: "True" type: ClusterVersionReleaseAccepted - lastTransitionTime: "2023-03-21T17:39:21Z" message: "" reason: AsExpected status: "False" type: network.operator.openshift.io/ManagementStateDegraded - lastTransitionTime: "2023-03-21T19:53:10Z" message: DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - last change 2023-03-21T19:42:39Z reason: RolloutHung status: "True" type: network.operator.openshift.io/Degraded - lastTransitionTime: "2023-03-21T17:39:21Z" message: "" reason: AsExpected status: "True" type: network.operator.openshift.io/Upgradeable - lastTransitionTime: "2023-03-21T19:42:39Z" message: |- DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes) DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes) DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes) reason: Deploying status: "True" type: network.operator.openshift.io/Progressing - lastTransitionTime: "2023-03-21T17:39:27Z" message: "" reason: AsExpected status: "True" type: network.operator.openshift.io/Available
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. management cluster 4.13 2. bring up the hostedcluster and nodepool in 4.14.0-0.nightly-2023-03-19-234132 3. upgrade the hostedcluster to 4.14.0-0.nightly-2023-03-20-201450 4. replace upgrade the nodepool to 4.14.0-0.nightly-2023-03-20-201450
Actual results:
First node is in NotReady
Expected results:
All nodes should be Ready
Additional info:
No issue with replace upgrade from 4.13 to 4.14
Description of problem:
While mirroring nvidia operator with oc-mirror 4.13 version, ImageContentSourcePolicy is not getting created properly
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Create the imageset file:

kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
archiveSize: 4
storageConfig:
  local:
    path: /home/name/nvidia
mirror:
  operators:
  - catalog: registry.redhat.io/redhat/certified-operator-index:v4.11
    packages:
    - name: nvidia-network-operator

2. Mirror to disk using oc-mirror 4.13:

$ oc-mirror -c imageset.yaml file:///home/name/nvidia/
$ ./oc-mirror version
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.13.0-202307242035.p0.gf11a900.assembly.stream-f11a900", GitCommit:"f11a9001caad8fe146c73baf2acc38ddcf3642b5", GitTreeState:"clean", BuildDate:"2023-07-24T21:25:46Z", GoVersion:"go1.19.10 X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}

3. Now generate the manifests:

$ oc-mirror --from /home/name/nvidia/ docker://registry:8443 --manifests-only

The generated ImageContentSourcePolicy contains:

- mirrors:
  - registry:8443/nvidia/cloud-native
  source: nvcr.io/nvidia

However, the correct mapping should be:

- mirrors:
  - registry/nvidia
  source: nvcr.io/nvidia

4. Perform the same steps with the 4.12.0 version and you will not hit this issue:

$ ./oc-mirror version
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.12.0-202304241542.p0.g5fc00fe.assembly.stream-5fc00fe", GitCommit:"5fc00fe735d8fb3b6125f358f5d6b9fe726fad10", GitTreeState:"clean", BuildDate:"2023-04-24T16:01:29Z", GoVersion:"go1.19.6", Compiler:"gc", Platform:"linux/amd64"}
Actual results:
Expected results:
Additional info:
`useDeleteModal` example is not formatted correctly on https://github.com/openshift/console/blob/master/frontend/packages/console-dynamic-plugin-sdk/docs/api.md#example-46 as it is missing the wrapping "```tsx" and "```" markdown.
Description of problem:
Builds navigation item is missing in Developer perspective
Version-Release number of selected component (if applicable):
4.14.0
How reproducible:
Always
Steps to Reproduce:
Actual results:
"Builds" is missing as a navigation item below "Search".
Expected results:
"Builds" navigation item should be displayed again when BuildConfigs CRD is available.
Additional info:
Might be dropped with PR https://github.com/openshift/console/pull/13097
Description of problem:
We disabled copies of CSVs in our clusters. The list of installed operators is still visible, but when we go (within the context of some user namespace) to Developer Catalog -> Operator Backed, the list is empty. When we enable copies of CSVs again, the Operator Backed catalog shows the expected items.
Version-Release number of selected component (if applicable):
OpenShift 4.13.1
How reproducible:
every time
Steps to Reproduce:
1. Install the Camel K operator (community version, stable channel)
2. Disable copies of CSVs by setting 'OLMConfig.spec.features.disableCopiedCSVs' to 'true' (a command sketch follows below)
3. Create a new namespace/project
4. Go to Developer Catalog -> Operator Backed
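A minimal command sketch for step 2; the OLMConfig resource is assumed to be the cluster-scoped object named "cluster" that OLM manages, so verify the name before applying:
$ oc patch olmconfig cluster --type merge -p '{"spec":{"features":{"disableCopiedCSVs":true}}}'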
Actual results:
the Operator Backed Catalog is empty
Expected results:
the Operator Backed Catalog should show Camel-K related items
Additional info:
Description of problem:
Dockerfile.fast relies on picking up the `bin` directory built on the host for inclusion in the HyperShift Operator image for development. Containerfile.operator, for RHTAP, relies on .dockerignore to prevent a `/bin` directory from being present in the podman build context with permissions that the user `default` (used by the golang build container) can't write to.
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1.make docker-build-fast
Actual results:
COPY bin/* /usr/bin/ fails due to bin not being included in the podman build context
Expected results:
The container builds successfully
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-19311. The following is the description of the original issue:
—
As a user, I would like to use the Import from Git form even if I don't have BC installed in my cluster, but I have installed the Pipelines operator.
No QA needed. Current CNO does not pass with newer linter version 1.53.1.
Description of problem:
Jenkins and Jenkins Agent Base image versions needs to be updated to use the latest images to mitigate known CVEs in plugins and Jenkins versions.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Tracker issue for bootimage bump in 4.14. This issue should block issues which need a bootimage bump to fix.
The previous bump was OCPBUGS-15999.
The PSA changes introduced in 4.12 meant that we had to figure out a way to ensure that customer workloads (3rd-party or otherwise) wouldn't grind to a halt because pods cannot be scheduled due to PSA. The solution found was to have another controller that can introspect a namespace to determine the best pod security standard to apply to it. This controller ignores payload namespaces (usually named with an openshift- prefix), but will reconcile non-payload openshift-* namespaces that have a special label applied to them. On the OLM side, we had to create a controller that applies the PSA label syncer label to non-payload openshift-* namespaces with operators (CSVs) installed in them.
OLM took a dependency on the cluster-policy-controller in order to get the list of payload namespaces. This dependency introduced a few challenges for our CI:
To avoid these issues, and since the list probably won't update very frequently, we'll make our own copy of the list and maintain it on our side, as this will be less busy work than the alternative.
Duplicate to use automation since original bug is restricted.
https://issues.redhat.com/browse/OCPBUGS-14022
Description of problem:
On attempting to perform an EUS->EUS upgrade from 4.12.z -> 4.14 (CI builds), I am consistently seeing that after upgrading OCP to 4.14, the worker MachineConfigPool goes to a degraded state, complaining about:

message: 'Node c01-dbn-412-tzm44-worker-0-7w6wg is reporting: "failed to run nmstatectl: fork/exec /run/machine-config-daemon-bin/nmstatectl: no such file or directory", Node c01-dbn-412-tzm44-worker-0-cmqsl is reporting: "failed to run nmstatectl: fork/exec /run/machine-config-daemon-bin/nmstatectl: no such file or directory", Node c01-dbn-412-tzm44-worker-0-qrp6v is reporting: "failed to run nmstatectl: fork/exec /run/machine-config-daemon-bin/nmstatectl: no such file or directory"'

And then clusterversion reports an error:

[cloud-user@ocp-psi-executor dbasunag]$ oc get clusterversion
NAME      VERSION                         AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-0.ci-2023-08-14-110508   True        True          125m    Unable to apply 4.14.0-0.ci-2023-08-14-152624: wait has exceeded 40 minutes for these operators: machine-config

This is consistently reproducible in clusters with knmstate installed.
Version-Release number of selected component (if applicable):
4.12.29 -> 4.13.0-0.ci-2023-08-14-110508->4.14.0-0.ci-2023-08-14-152624
How reproducible:
100%
Steps to Reproduce:
1. Perform an EUS upgrade on a cluster with CNV, ODF, knmstate
2. After pausing the worker MCP, upgrade OCP, ODF, CNV, knmstate to 4.13 - everything worked fine
3. After upgrading OCP to 4.14, when the master MCP is updated, the worker MCP went to a degraded state and clusterversion eventually reported an error (all the master nodes were updated)
Actual results:
[cloud-user@ocp-psi-executor dbasunag]$ oc get co NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE authentication 4.14.0-0.ci-2023-08-14-152624 True False False 9h baremetal 4.14.0-0.ci-2023-08-14-152624 True False False 2d23h cloud-controller-manager 4.14.0-0.ci-2023-08-14-152624 True False False 2d23h cloud-credential 4.14.0-0.ci-2023-08-14-152624 True False False 2d23h cluster-autoscaler 4.14.0-0.ci-2023-08-14-152624 True False False 2d23h config-operator 4.14.0-0.ci-2023-08-14-152624 True False False 2d23h console 4.14.0-0.ci-2023-08-14-152624 True False False 2d22h control-plane-machine-set 4.14.0-0.ci-2023-08-14-152624 True False False 2d23h csi-snapshot-controller 4.14.0-0.ci-2023-08-14-152624 True False False 2d23h dns 4.14.0-0.ci-2023-08-14-152624 True False False 2d23h etcd 4.14.0-0.ci-2023-08-14-152624 True False False 2d23h image-registry 4.14.0-0.ci-2023-08-14-152624 True False False 2d22h ingress 4.14.0-0.ci-2023-08-14-152624 True False False 2d22h insights 4.14.0-0.ci-2023-08-14-152624 True False False 2d22h kube-apiserver 4.14.0-0.ci-2023-08-14-152624 True False False 2d22h kube-controller-manager 4.14.0-0.ci-2023-08-14-152624 True False False 2d22h kube-scheduler 4.14.0-0.ci-2023-08-14-152624 True False False 2d22h kube-storage-version-migrator 4.14.0-0.ci-2023-08-14-152624 True False False 2d23h machine-api 4.14.0-0.ci-2023-08-14-152624 True False False 2d22h machine-approver 4.14.0-0.ci-2023-08-14-152624 True False False 2d23h machine-config 4.13.0-0.ci-2023-08-14-110508 True True True 2d23h Unable to apply 4.14.0-0.ci-2023-08-14-152624: error during syncRequiredMachineConfigPools: [context deadline exceeded, failed to update clusteroperator: [client rate limiter Wait returned an error: context deadline exceeded, error MachineConfigPool worker is not ready, retrying. 
Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 3)]] marketplace 4.14.0-0.ci-2023-08-14-152624 True False False 2d23h monitoring 4.14.0-0.ci-2023-08-14-152624 True False False 2d22h network 4.14.0-0.ci-2023-08-14-152624 True False False 2d23h node-tuning 4.14.0-0.ci-2023-08-14-152624 True False False 95m openshift-apiserver 4.14.0-0.ci-2023-08-14-152624 True False False 2d22h openshift-controller-manager 4.14.0-0.ci-2023-08-14-152624 True False False 2d22h openshift-samples 4.14.0-0.ci-2023-08-14-152624 True False False 98m operator-lifecycle-manager 4.14.0-0.ci-2023-08-14-152624 True False False 2d23h operator-lifecycle-manager-catalog 4.14.0-0.ci-2023-08-14-152624 True False False 2d23h operator-lifecycle-manager-packageserver 4.14.0-0.ci-2023-08-14-152624 True False False 2d22h service-ca 4.14.0-0.ci-2023-08-14-152624 True False False 2d23h storage 4.14.0-0.ci-2023-08-14-152624 True False False 2d23h [cloud-user@ocp-psi-executor dbasunag]$ [cloud-user@ocp-psi-executor dbasunag]$ oc get mcp NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE master rendered-master-693b054330417fe5e098b58716603fc8 True False False 3 3 3 0 2d23h worker rendered-worker-b2f5a9084e9919b4c1c491658c73bce5 False False True 3 0 0 3 2d23h [cloud-user@ocp-psi-executor dbasunag]$ [cloud-user@ocp-psi-executor dbasunag]$ oc get node NAME STATUS ROLES AGE VERSION c01-dbn-412-tzm44-master-0 Ready control-plane,master 2d23h v1.27.4+deb2c60 c01-dbn-412-tzm44-master-1 Ready control-plane,master 2d23h v1.27.4+deb2c60 c01-dbn-412-tzm44-master-2 Ready control-plane,master 2d23h v1.27.4+deb2c60 c01-dbn-412-tzm44-worker-0-7w6wg Ready worker 2d22h v1.25.11+1485cc9 c01-dbn-412-tzm44-worker-0-cmqsl Ready worker 2d22h v1.25.11+1485cc9 c01-dbn-412-tzm44-worker-0-qrp6v Ready worker 2d22h v1.25.11+1485cc9 [cloud-user@ocp-psi-executor dbasunag]$
Expected results:
EUS upgrade should work without error
Additional info:
Must-gather can be found here: https://drive.google.com/drive/folders/1SCZoYpGiRpOteTM-sTLmbfgr3hqsICVO?usp=drive_link
Description of problem:
CredentialsRequest for Azure AD Workload Identity missing disk encryption set read permissions. - Microsoft.Compute/diskEncryptionSets/read
Version-Release number of selected component (if applicable):
4.14.0
How reproducible:
Every time when creating a machine with a disk encryption set
Steps to Reproduce:
1. Create a workload identity cluster
2. Create a keyvault and a secret within the keyvault
3. Create a disk encryption set and point it to the keyvault; a system-assigned identity can be used
4. Create or modify an existing machineset to include the disk encryption set:
   managedDisk:
     diskEncryptionSet:
       id: /subscriptions/<subscription_id>/resourceGroups/<resource_id>/providers/Microsoft.Compute/diskEncryptionSets/<disk_encryption_set_name>
5. Scale the machineset
Actual results:
'failed to create vm <vm_name>: failure sending request for machine steven-wi-cluster-pzqvm-worker-eastus3-mfk5z: cannot create vm: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=403 -- Original Error: Code="LinkedAuthorizationFailed" Message="The client ''55c10ba9-f891-4f42-a697-0ab283b86c63'' with object id ''55c10ba9-f891-4f42-a697-0ab283b86c63'' has permission to perform action ''Microsoft.Compute/virtualMachines/write'' on scope ''/subscriptions/<subscription_id>/resourceGroups/<resource_group>/providers/Microsoft.Compute/virtualMachines/steven-wi-cluster-pzqvm-worker-eastus3-mfk5z''; however, it does not have permission to perform action ''read'' on the linked scope(s) ''/subscriptions/<subscription_id>/resourceGroups/<resource_group>/providers/Microsoft.Compute/diskEncryptionSets/test-disk-encryption-set'' or the linked scope(s) are invalid."'
Expected results:
The machine is able to create and join the cluster successfully.
Additional info:
Docs about preparing disk encryption sets on Azure: https://docs.openshift.com/container-platform/4.12/installing/installing_azure/enabling-user-managed-encryption-azure.html
The `kubectl.kubernetes.io/default-container` annotation can be set on a pod to specify the default container for logs and terminal. The console doesn't honor the annotation. For example:
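The example snippet did not survive here; below is a minimal hedged sketch of a pod using the annotation, with hypothetical names and images, where logs and terminal should default to the app container:
$ cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: default-container-example
  annotations:
    kubectl.kubernetes.io/default-container: app
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest
  - name: sidecar
    image: registry.example.com/sidecar:latest
EOF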
Description of problem:
Labels added in the Git import flow are not propagated to the pipeline resources when a pipeline is added
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Goto Git Import Form 2. Add Pipeline 3. Add labels 4. Submit the form
Actual results:
The added labels are not propagated to the pipeline resources
Expected results:
The added labels should be added to the pipeline resources
Additional info:
Please review the following PR: https://github.com/openshift/etcd/pull/208
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Cannot list Kepler CSV
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Always
Steps to Reproduce:
1. Install Kepler Community Operator 2. Create Kepler Instance 3. Console gets error and shows "Oops, something went wrong"
Actual results:
Console gets error and shows "Oops, something went wrong"
Expected results:
Should list Kepler Instance
Additional info:
OAuth-Proxy should send an Audit-Id header with its requests to the kube-apiserver so that we can easily track its requests and be able to tell which arrived and which were processed.
This comes from a time when the CI was in disarray and oauth-proxy requests were failing to reach the KAS but we did not know if at least any were processed or if they were just all plainly rejected somewhere in the middle.
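For illustration, the kube-apiserver accepts a client-supplied Audit-Id and echoes it back in the response headers, so requests tagged by oauth-proxy could be correlated with the audit log. A hedged sketch against an arbitrary endpoint (token, host and Audit-Id value are placeholders):
$ curl -sk -H "Authorization: Bearer $TOKEN" -H "Audit-Id: oauth-proxy-debug-0001" -D - -o /dev/null https://api.example.com:6443/version | grep -i audit-id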
Description of the problem:
assisted-service pod crashloops with kube-api enabled without the BMH CRD.
How reproducible:
100%
Steps to reproduce:
1. Deploy assisted-service with kube-api enabled
2. Either don't create or remove the BMH CRD (if removed you will need to restart the assisted-service pod)
3. Observe assisted-service pod
Actual results:
After a few minutes assisted-service will crash with a message like:
time="2023-01-12T14:26:03Z" level=fatal msg="failed to run manager" func=main.main.func1 file="/remote-source/assisted-service/app/cmd/main.go:204" error="failed to wait for baremetal-agent-controller caches to sync: timed out waiting for cache to be synced"
Expected results:
Either assisted service comes up without the BMAC controller and without errors or a clear error stating that the BMH CRD is required and is missing.
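A quick hedged check for whether the BareMetalHost CRD is present on the cluster before (or after) deploying assisted-service with kube-api enabled, using the CRD name as registered by metal3:
$ oc get crd baremetalhosts.metal3.io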
Description of problem:
The test for updating the sysctl whitelist fails to check the error returned when the pod running state is verified, so the test always passes. Because of this, we failed to detect a bug in the cluster network operator's allowlist controller.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/cluster-image-registry-operator/pull/855
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
OCP 4.11 ships the alertingrules CRD as a techpreview feature. Before graduating to GA we need to have e2e tests in the CMO repository.
AC:
Description of problem:
When running the nutanix-e2e-windows test from the WMCO PR https://github.com/openshift/windows-machine-config-operator/pull/1398, the MAPI nutanix-controller failed to create the Windows machine VM with the below error logs. It failed to marshal the windows-user-data to struct IgnitionConfig, since the windows-user-data is in powershell script format, but not the ignition data format.
I0424 17:37:43.472054 1 recorder.go:103] events "msg"="ci-op-zhi8pd1k-5c595-dnpj5-e2e-wm-f84vt: reconciler failed to Create machine: failed to get user data: Failed to unmarshal userData to IgnitionConfig. invalid character '<' looking for beginning of value" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"ci-op-zhi8pd1k-5c595-dnpj5-e2e-wm-f84vt","uid":"d3981cb0-4f98-4424-9252-b100521c2a93","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"31045"} "reason"="FailedCreate" "type"="Warning"
E0424 17:37:43.472923 1 controller.go:329] "msg"="Reconciler error" "error"="ci-op-zhi8pd1k-5c595-dnpj5-e2e-wm-f84vt: reconciler failed to Create machine: failed to get user data: Failed to unmarshal userData to IgnitionConfig. invalid character '<' looking for beginning of value" "controller"="machine-controller" "name"="ci-op-zhi8pd1k-5c595-dnpj5-e2e-wm-f84vt" "namespace"="openshift-machine-api" "object"={"name":"ci-op-zhi8pd1k-5c595-dnpj5-e2e-wm-f84vt","namespace":"openshift-machine-api"} "reconcileID"="16572b5d-2418-4f7c-b7a8-5f08f2659391"
Version-Release number of selected component (if applicable):
How reproducible:
When the Machine is configured to be a Windows node
Steps to Reproduce:
Run the ci/prow/nutanix-e2e-operator test.
Actual results:
The MAPI nutanix-controller failed to create the Windows VM with the error logs shown above.
Expected results:
The Windows VM and node can be successfully created and provisioned.
Additional info:
From deads2k: I think creating pods that should get rejected in the kube-system namespace would ensure it. OCP-classic is still struggling with customers who did naughty things.
Description of problem:
There are several labels used by the Nutanix platform which can vary between instances. If not set as ignore labels on the Cluster Autoscaler, features such as balancing similar node groups will not work predictably. The Cluster Autoscaler Operator should be updated with the following labels on Nutanix:
* nutanix.com/prism-element-name
* nutanix.com/prism-element-uuid
* nutanix.com/prism-host-name
* nutanix.com/prism-host-uuid
For reference see this code: https://github.com/openshift/cluster-autoscaler-operator/blob/release-4.14/pkg/controller/clusterautoscaler/clusterautoscaler.go#L72-L159
Version-Release number of selected component (if applicable):
master, 4.14
How reproducible:
always
Steps to Reproduce:
1. create a ClusterAutoscaler CR on Nutanix platform
2. inspect the deployment for the cluster-autoscaler
3. see that it does not have the ignore labels added as command line flags
Actual results:
labels are not added as flags
Expected results:
labels should be added as flags
Additional info:
This should probably be backported to 4.13 as well since the labels will be applied by the Nutanix CCM. A sketch of the expected flags follows.
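A minimal sketch of the flags the cluster-autoscaler Deployment would be expected to carry on Nutanix, shown as a container args fragment; the upstream --balancing-ignore-label flag is used here illustratively, not as a confirmed operator implementation detail:
containers:
- name: cluster-autoscaler
  args:
  # ignore Nutanix-specific labels when balancing similar node groups (illustrative)
  - --balancing-ignore-label=nutanix.com/prism-element-name
  - --balancing-ignore-label=nutanix.com/prism-element-uuid
  - --balancing-ignore-label=nutanix.com/prism-host-name
  - --balancing-ignore-label=nutanix.com/prism-host-uuid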
Description of problem:
Kubernetes and other associated dependencies need to be updated to protect against potential vulnerabilities.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/kubernetes-autoscaler/pull/255
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
We should log the vCenter version information in plain text.
There are cases where the vCenter version we receive from vCenter can be unparseable. I see errors in the problem-detector while parsing the version, and both the CSI driver and the operator depend on being able to determine the vCenter version.
A clone of https://issues.redhat.com/browse/OCPBUGS-11143 but for the downstream openshift/cloud-provider-azure
Description of problem:
On Azure, after deleting a master, the old machine is stuck in Deleting and some pods in the cluster are in ImagePullBackOff. Checking from the Azure console, the new master was not added to the load balancer backend; this appears to leave the machine without an internet connection.
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2023-02-12-024338
How reproducible:
Always
Steps to Reproduce:
1. Set up a cluster on Azure, networkType ovn
2. Delete a master
3. Check master and pod
Actual results:
Old machine stuck in Deleting, some pods are in ImagePullBackOff.
$ oc get machine
NAME PHASE TYPE REGION ZONE AGE
zhsunaz2132-5ctmh-master-0 Deleting Standard_D8s_v3 westus 160m
zhsunaz2132-5ctmh-master-1 Running Standard_D8s_v3 westus 160m
zhsunaz2132-5ctmh-master-2 Running Standard_D8s_v3 westus 160m
zhsunaz2132-5ctmh-master-flqqr-0 Running Standard_D8s_v3 westus 105m
zhsunaz2132-5ctmh-worker-westus-dhwfz Running Standard_D4s_v3 westus 152m
zhsunaz2132-5ctmh-worker-westus-dw895 Running Standard_D4s_v3 westus 152m
zhsunaz2132-5ctmh-worker-westus-xlsgm Running Standard_D4s_v3 westus 152m
$ oc describe machine zhsunaz2132-5ctmh-master-flqqr-0 -n openshift-machine-api |grep -i "Load Balancer"
Internal Load Balancer: zhsunaz2132-5ctmh-internal
Public Load Balancer: zhsunaz2132-5ctmh
$ oc get node
NAME STATUS ROLES AGE VERSION
zhsunaz2132-5ctmh-master-0 Ready control-plane,master 165m v1.26.0+149fe52
zhsunaz2132-5ctmh-master-1 Ready control-plane,master 165m v1.26.0+149fe52
zhsunaz2132-5ctmh-master-2 Ready control-plane,master 165m v1.26.0+149fe52
zhsunaz2132-5ctmh-master-flqqr-0 NotReady control-plane,master 109m v1.26.0+149fe52
zhsunaz2132-5ctmh-worker-westus-dhwfz Ready worker 152m v1.26.0+149fe52
zhsunaz2132-5ctmh-worker-westus-dw895 Ready worker 152m v1.26.0+149fe52
zhsunaz2132-5ctmh-worker-westus-xlsgm Ready worker 152m v1.26.0+149fe52
$ oc describe node zhsunaz2132-5ctmh-master-flqqr-0
Warning ErrorReconcilingNode 3m5s (x181 over 108m) controlplane [k8s.ovn.org/node-chassis-id annotation not found for node zhsunaz2132-5ctmh-master-flqqr-0, macAddress annotation not found for node "zhsunaz2132-5ctmh-master-flqqr-0" , k8s.ovn.org/l3-gateway-config annotation not found for node "zhsunaz2132-5ctmh-master-flqqr-0"]
$ oc get po --all-namespaces | grep ImagePullBackOf
openshift-cluster-csi-drivers azure-disk-csi-driver-node-l8ng4 0/3 Init:ImagePullBackOff 0 113m
openshift-cluster-csi-drivers azure-file-csi-driver-node-99k82 0/3 Init:ImagePullBackOff 0 113m
openshift-cluster-node-tuning-operator tuned-bvvh7 0/1 ImagePullBackOff 0 113m
openshift-dns node-resolver-2p4zq 0/1 ImagePullBackOff 0 113m
openshift-image-registry node-ca-vxv87 0/1 ImagePullBackOff 0 113m
openshift-machine-config-operator machine-config-daemon-crt5w 1/2 ImagePullBackOff 0 113m
openshift-monitoring node-exporter-mmjsm 0/2 Init:ImagePullBackOff 0 113m
openshift-multus multus-4cg87 0/1 ImagePullBackOff 0 113m
openshift-multus multus-additional-cni-plugins-mc6vx 0/1 Init:ImagePullBackOff 0 113m
openshift-ovn-kubernetes ovnkube-master-qjjsv 0/6 ImagePullBackOff 0 113m
openshift-ovn-kubernetes ovnkube-node-k8w6j 0/6 ImagePullBackOff 0 113m
Expected results:
Replacing the master succeeds.
Additional info:
Tested payload 4.13.0-0.nightly-2023-02-03-145213 with the same result. We had previously tested 4.13.0-0.nightly-2023-01-27-165107, where everything worked well.
The Helm view in the Dev console doesn't allow you to edit Helm repositories through the three-dots menu "Edit" option. It results in a 404.
Tried in 4.13 only, not sure if other versions are affected
1. Create a new Helm chart repository (/ns/<NAMESPACE>/helmchartrepositories/~new/form endpoint)
2. List all the custom Helm repositories ( /helm-releases/ns/<NAMESPACE>/repositories endpoint)
3. Click three dots menu on the right of any chart repository and select "Edit ProjectHelmChartRepository" (leads to /k8s/ns/<NAMESPACE>/helmchartrepositories/<REPO_NAME>/edit)
4. You land on 404 page
404 page, see the attached GIF
Edit view
Always
Observed in OCP 4.13 (Dev sandbox and OpenShift Local)
Follow steps 1 and 2 from the reproducer above
3. Click on Helm repository name
4. Click YAML tab to edit resource (/k8s/ns/<NAMESPACE>/helm.openshift.io~v1beta1~ProjectHelmChartRepository/<REPO_NAME>/yaml endpoint)
Description of the problem:
Since MGMT-13083 merged, disconnected jobs are failing in the ephemeral installer (specifically e2e-agent-sno-ipv6 and e2e-agent-ha-dualstack). Preparing for installation fails because we can't get the installer binary:
Apr 21 10:00:43 master-0 service[2298]: time="2023-04-20T22:00:43Z" level=info msg="Successfully extracted openshift-baremetal-install binary from the release to: /data/install-config-generate/installercache/virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:63357ac661a312dde07b60350ea72428463853ea9a09cdf9487d853496a97d58/openshift-baremetal-install" func="github.com/openshift/assisted-service/internal/oc.(*release).extractFromRelease" file="/src/internal/oc/release.go:376" cluster_id=a3945e90-44a8-436c-89ad-12d3a5820a26 go-id=18956 request_id= Apr 21 10:00:43 master-0 service[2298]: time="2023-04-20T22:00:43Z" level=error msg="failed generating install config for cluster a3945e90-44a8-436c-89ad-12d3a5820a26" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).generateClusterInstallConfig" file="/src/internal/bminventory/inventory.go:1738" cluster_id=a3945e90-44a8-436c-89ad-12d3a5820a26 error="failed to get installer path: Failed to create hard link to binary /data/install-config-generate/installercache/registry.build05.ci.openshift.org/ci-op-1w73h6fv/release@sha256:63357ac661a312dde07b60350ea72428463853ea9a09cdf9487d853496a97d58/openshift-baremetal-install: link /data/install-config-generate/installercache/registry.build05.ci.openshift.org/ci-op-1w73h6fv/release@sha256:63357ac661a312dde07b60350ea72428463853ea9a09cdf9487d853496a97d58/openshift-baremetal-install /data/install-config-generate/installercache/registry.build05.ci.openshift.org/ci-op-1w73h6fv/release@sha256:63357ac661a312dde07b60350ea72428463853ea9a09cdf9487d853496a97d58/ln_1682028043_openshift-baremetal-install: no such file or directory" go-id=18956 pkg=Inventory request_id= Apr 21 10:00:43 master-0 service[2298]: time="2023-04-20T22:00:43Z" level=warning msg="Cluster installation initialization failed" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).InstallClusterInternal.func3.1" file="/src/internal/bminventory/inventory.go:1339" cluster_id=a3945e90-44a8-436c-89ad-12d3a5820a26 error="failed generating install config for cluster a3945e90-44a8-436c-89ad-12d3a5820a26: failed to get installer path: Failed to create hard link to binary /data/install-config-generate/installercache/registry.build05.ci.openshift.org/ci-op-1w73h6fv/release@sha256:63357ac661a312dde07b60350ea72428463853ea9a09cdf9487d853496a97d58/openshift-baremetal-install: link /data/install-config-generate/installercache/registry.build05.ci.openshift.org/ci-op-1w73h6fv/release@sha256:63357ac661a312dde07b60350ea72428463853ea9a09cdf9487d853496a97d58/openshift-baremetal-install /data/install-config-generate/installercache/registry.build05.ci.openshift.org/ci-op-1w73h6fv/release@sha256:63357ac661a312dde07b60350ea72428463853ea9a09cdf9487d853496a97d58/ln_1682028043_openshift-baremetal-install: no such file or directory" go-id=18932 pkg=Inventory request_id=ca799c5a-c798-4a93-9bf8-7f27ed93ca20 Apr 21 10:00:43 master-0 service[2298]: time="2023-04-20T22:00:43Z" level=warning msg="Failed to prepare installation of cluster a3945e90-44a8-436c-89ad-12d3a5820a26" func="github.com/openshift/assisted-service/internal/cluster.(*Manager).HandlePreInstallError" file="/src/internal/cluster/cluster.go:985" cluster_id=a3945e90-44a8-436c-89ad-12d3a5820a26 error="failed generating install config for cluster a3945e90-44a8-436c-89ad-12d3a5820a26: failed to get installer path: Failed to create hard link to binary 
/data/install-config-generate/installercache/registry.build05.ci.openshift.org/ci-op-1w73h6fv/release@sha256:63357ac661a312dde07b60350ea72428463853ea9a09cdf9487d853496a97d58/openshift-baremetal-install: link /data/install-config-generate/installercache/registry.build05.ci.openshift.org/ci-op-1w73h6fv/release@sha256:63357ac661a312dde07b60350ea72428463853ea9a09cdf9487d853496a97d58/openshift-baremetal-install /data/install-config-generate/installercache/registry.build05.ci.openshift.org/ci-op-1w73h6fv/release@sha256:63357ac661a312dde07b60350ea72428463853ea9a09cdf9487d853496a97d58/ln_1682028043_openshift-baremetal-install: no such file or directory" go-id=18956 pkg=cluster-state request_id=
The issue appears to be that we extract the binary to a path including the mirror registry (installercache/virthost.ostest.test.metalkube.org:5000/localimages/local-release-image) but then look for it at a path representing the original pullspec (installercache/registry.build05.ci.openshift.org/ci-op-1w73h6fv/release)
How reproducible:
100%
Steps to reproduce:
1. Use the agent-based installer to install using a disconnected mirror registry in the ImageContentSources.
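For reference, a minimal sketch of the relevant install-config.yaml fragment for this scenario (registries taken from the logs above; all other fields omitted):
imageContentSources:
- mirrors:
  # local mirror registry used by the disconnected environment
  - virthost.ostest.test.metalkube.org:5000/localimages/local-release-image
  # original release pullspec being mirrored
  source: registry.build05.ci.openshift.org/ci-op-1w73h6fv/release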
Actual results:
Installation never starts, we just see a loop of:
evel=debug msg=Host worker-0: updated status from known to preparing-for-installation (Host finished successfully to prepare for installation) level=debug msg=Host worker-1: updated status from known to preparing-for-installation (Host finished successfully to prepare for installation) level=debug msg=Host master-0: updated status from known to preparing-for-installation (Host finished successfully to prepare for installation) level=debug msg=Host master-1: updated status from known to preparing-for-installation (Host finished successfully to prepare for installation) level=debug msg=Host master-2: updated status from known to preparing-for-installation (Host finished successfully to prepare for installation) level=debug msg=Host worker-0: updated status from preparing-for-installation to preparing-successful (Host finished successfully to prepare for installation) level=debug msg=Host worker-1: updated status from preparing-for-installation to preparing-successful (Host finished successfully to prepare for installation) level=debug msg=Host master-0: updated status from preparing-for-installation to preparing-successful (Host finished successfully to prepare for installation) level=debug msg=Host master-1: updated status from preparing-for-installation to preparing-successful (Host finished successfully to prepare for installation) level=debug msg=Host master-2: updated status from preparing-for-installation to preparing-successful (Host finished successfully to prepare for installation) level=debug msg=Host worker-0: updated status from preparing-successful to known (Host is ready to be installed) level=debug msg=Host worker-1: updated status from preparing-successful to known (Host is ready to be installed) level=debug msg=Host master-0: updated status from preparing-successful to known (Host is ready to be installed) level=debug msg=Host master-1: updated status from preparing-successful to known (Host is ready to be installed) level=debug msg=Host master-2: updated status from preparing-successful to known (Host is ready to be installed)
Expected results:
Cluster is installed.
Description of problem:
IngressVIP is getting attached to two nodes at once.
Version-Release number of selected component (if applicable):
4.11.39
How reproducible:
Always in customer cluster
Actual results:
IngressVIP is getting attached to two nodes at once.
Expected results:
IngressVIP should get attached to only one node.
Additional info:
This is a clone of issue OCPBUGS-18954. The following is the description of the original issue:
—
Description of problem:
While installing 3618 SNOs via ZTP using ACM 2.9, 15 clusters failed to complete install and have failed on the cluster-autoscaler operator. This represents the bulk of all cluster install failures in this testbed for OCP 4.14.0-rc.0. # cat aci.InstallationFailed.autoscaler | xargs -I % sh -c "echo -n '% '; oc --kubeconfig /root/hv-vm/kc/%/kubeconfig get clusterversion --no-headers " vm00527 version False True 20h Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available vm00717 version False True 14h Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available vm00881 version False True 19h Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available vm00998 version False True 18h Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available vm01006 version False True 17h Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available vm01059 version False True 15h Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available vm01155 version False True 14h Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available vm01930 version False True 17h Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available vm02407 version False True 16h Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available vm02651 version False True 18h Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available vm03073 version False True 19h Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available vm03258 version False True 20h Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available vm03295 version False True 14h Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available vm03303 version False True 15h Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available vm03517 version False True 18h Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
Version-Release number of selected component (if applicable):
Hub: 4.13.11
Deployed SNOs: 4.14.0-rc.0
ACM: 2.9 - 2.9.0-DOWNSTREAM-2023-09-07-04-47-52
How reproducible:
15 out of 20 failures (75% of the failures)
15 out of 3618 total attempted SNOs to be installed (~0.4% of all installs)
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
It appears that some show in the logs of the cluster-autoscaler-operator an error, Example: I0912 19:54:39.962897 1 main.go:15] Go Version: go1.20.5 X:strictfipsruntime I0912 19:54:39.962977 1 main.go:16] Go OS/Arch: linux/amd64 I0912 19:54:39.962982 1 main.go:17] Version: cluster-autoscaler-operator v4.14.0-202308301903.p0.gb57f5a9.assembly.stream-dirty I0912 19:54:39.963137 1 leaderelection.go:122] The leader election gives 4 retries and allows for 30s of clock skew. The kube-apiserver downtime tolerance is 78s. Worst non-graceful lease acquisition is 2m43s. Worst graceful lease acquisition is {26s}. I0912 19:54:39.975478 1 listener.go:44] controller-runtime/metrics "msg"="Metrics server is starting to listen" "addr"="127.0.0.1:9191" I0912 19:54:39.976939 1 server.go:187] controller-runtime/webhook "msg"="Registering webhook" "path"="/validate-clusterautoscalers" I0912 19:54:39.976984 1 server.go:187] controller-runtime/webhook "msg"="Registering webhook" "path"="/validate-machineautoscalers" I0912 19:54:39.977082 1 main.go:41] Starting cluster-autoscaler-operator I0912 19:54:39.977216 1 server.go:216] controller-runtime/webhook/webhooks "msg"="Starting webhook server" I0912 19:54:39.977693 1 certwatcher.go:161] controller-runtime/certwatcher "msg"="Updated current TLS certificate" I0912 19:54:39.977813 1 server.go:273] controller-runtime/webhook "msg"="Serving webhook server" "host"="" "port"=8443 I0912 19:54:39.977938 1 certwatcher.go:115] controller-runtime/certwatcher "msg"="Starting certificate watcher" I0912 19:54:39.978008 1 server.go:50] "msg"="starting server" "addr"={"IP":"127.0.0.1","Port":9191,"Zone":""} "kind"="metrics" "path"="/metrics" I0912 19:54:39.978052 1 leaderelection.go:245] attempting to acquire leader lease openshift-machine-api/cluster-autoscaler-operator-leader... 
I0912 19:54:39.982052 1 leaderelection.go:255] successfully acquired lease openshift-machine-api/cluster-autoscaler-operator-leader I0912 19:54:39.983412 1 controller.go:177] "msg"="Starting EventSource" "controller"="cluster_autoscaler_controller" "source"="kind source: *v1.ClusterAutoscaler" I0912 19:54:39.983462 1 controller.go:177] "msg"="Starting EventSource" "controller"="cluster_autoscaler_controller" "source"="kind source: *v1.Deployment" I0912 19:54:39.983483 1 controller.go:177] "msg"="Starting EventSource" "controller"="cluster_autoscaler_controller" "source"="kind source: *v1.Service" I0912 19:54:39.983501 1 controller.go:177] "msg"="Starting EventSource" "controller"="cluster_autoscaler_controller" "source"="kind source: *v1.ServiceMonitor" I0912 19:54:39.983520 1 controller.go:177] "msg"="Starting EventSource" "controller"="cluster_autoscaler_controller" "source"="kind source: *v1.PrometheusRule" I0912 19:54:39.983532 1 controller.go:185] "msg"="Starting Controller" "controller"="cluster_autoscaler_controller" I0912 19:54:39.986041 1 controller.go:177] "msg"="Starting EventSource" "controller"="machine_autoscaler_controller" "source"="kind source: *v1beta1.MachineAutoscaler" I0912 19:54:39.986065 1 controller.go:177] "msg"="Starting EventSource" "controller"="machine_autoscaler_controller" "source"="kind source: *unstructured.Unstructured" I0912 19:54:39.986072 1 controller.go:185] "msg"="Starting Controller" "controller"="machine_autoscaler_controller" I0912 19:54:40.095808 1 webhookconfig.go:72] Webhook configuration status: created I0912 19:54:40.101613 1 controller.go:219] "msg"="Starting workers" "controller"="cluster_autoscaler_controller" "worker count"=1 I0912 19:54:40.102857 1 controller.go:219] "msg"="Starting workers" "controller"="machine_autoscaler_controller" "worker count"=1 E0912 19:58:48.113290 1 leaderelection.go:327] error retrieving resource lock openshift-machine-api/cluster-autoscaler-operator-leader: Get "https://[fd02::1]:443/apis/coordination.k8s.io/v1/namespaces/openshift-machine-api/leases/cluster-autoscaler-operator-leader": net/http: TLS handshake timeout - error from a previous attempt: unexpected EOF E0912 20:02:48.135610 1 leaderelection.go:327] error retrieving resource lock openshift-machine-api/cluster-autoscaler-operator-leader: Get "https://[fd02::1]:443/apis/coordination.k8s.io/v1/namespaces/openshift-machine-api/leases/cluster-autoscaler-operator-leader": dial tcp [fd02::1]:443: connect: connection refused E0913 13:49:02.118757 1 leaderelection.go:327] error retrieving resource lock openshift-machine-api/cluster-autoscaler-operator-leader: Get "https://[fd02::1]:443/apis/coordination.k8s.io/v1/namespaces/openshift-machine-api/leases/cluster-autoscaler-operator-leader": dial tcp [fd02::1]:443: connect: connection refused
Description of problem:
Terraform will not create VMs for master and worker nodes for UPI vSphere when var.control_plane_ip_addresses and var.compute_ip_addresses are unset. When users use IPAM (as before) to reserve IPs instead of setting static IPs directly in var.control_plane_ip_addresses and var.compute_ip_addresses, then, based on upstream code #1 and #2, the count of masters and workers is always 0 and Terraform will not create any VMs for master and worker nodes. If we change the code as below, the IPAM case works as before:
control_plane_fqdns = [for idx in range(length(var.control_plane_ip_addresses)) : "control-plane-${idx}.${var.cluster_domain}"]
compute_fqdns = [for idx in range(length(var.compute_ip_addresses)) : "compute-${idx}.${var.cluster_domain}"]
==>>
control_plane_fqdns = [for idx in range(var.control_plane_count) : "control-plane-${idx}.${var.cluster_domain}"]
compute_fqdns = [for idx in range(var.compute_count) : "compute-${idx}.${var.cluster_domain}"]
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2023-03-11-033820
How reproducible:
always
Steps to Reproduce:
1. Trigger a job to install a cluster on vSphere with UPI.
2. If the IPs for master and worker VMs are applied from an IPAM server instead of setting static IPs directly in var.control_plane_ip_addresses and var.compute_ip_addresses, the VM creation will fail.
Actual results:
the VM creation will fail
Expected results:
VM creation succeeds.
Additional info:
#1 link: https://github.com/openshift/installer/blob/master/upi/vsphere/main.tf#L15-L16
#2 link: https://github.com/openshift/installer/blob/master/upi/vsphere/main.tf#L211
This bug only affects UPI vSphere installations when the user uses an IPAM server to reserve static IPs instead of setting static IPs directly in var.control_plane_ip_addresses and var.compute_ip_addresses. It does not currently affect QE tests, because we still install with the previous code.
Description of problem:
The "Create BuildConfig" button on the Dev console Builds page opens the form view, but in the default namespace.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Go to the Dev perspective
2. Click on Builds
3. Click on "Create BuildConfig"
Actual results:
"default" namespace is selected in the namespace selector
Expected results:
It should open the form in the active namespace
Additional info:
Description of problem:
In the HyperShift context: operands managed by operators running in the hosted control plane namespace in the management cluster do not honour affinity opinions (https://hypershift-docs.netlify.app/how-to/distribute-hosted-cluster-workloads/, https://github.com/openshift/hypershift/blob/main/support/config/deployment.go#L263-L265). These operands running management side should honour the same affinity, tolerations, node selector and priority rules as the operator. This could be done by looking at the operator deployment itself or at the HCP resource. Affected operands:
aws-ebs-csi-driver-controller
aws-ebs-csi-driver-operator
csi-snapshot-controller
csi-snapshot-webhook
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create a hypershift cluster.
2. Check affinity rules and node selector of the operands above.
Actual results:
Operands are missing affinity rules and node selector
Expected results:
Operands have the same affinity rules and node selector as the operator
Additional info:
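For illustration, a sketch of the scheduling opinions such an operand Deployment would be expected to inherit; the label and toleration keys below are illustrative placeholders, not the exact HyperShift keys:
spec:
  template:
    spec:
      nodeSelector:
        example.openshift.io/control-plane: "true"       # illustrative key/value
      tolerations:
      - key: example.openshift.io/control-plane           # illustrative key
        operator: Equal
        value: "true"
        effect: NoSchedule
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 50
            preference:
              matchExpressions:
              - key: example.openshift.io/hosted-cluster   # illustrative key
                operator: In
                values:
                - my-hosted-cluster                        # illustrative value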
The aggregated https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregated-gcp-ovn-rt-upgrade-4.14-minor-release-openshift-release-analysis-aggregator/1633554110798106624 job failed. Digging into one of them:
Deployments:
* ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4f28fbcd049025bab9719379492420f9eaab0426cdbbba43b395eb8421f10a17
  Digest: sha256:4f28fbcd049025bab9719379492420f9eaab0426cdbbba43b395eb8421f10a17
  Version: 413.86.202302230536-0 (2023-03-08T20:10:47Z)
  RemovedBasePackages: kernel-core kernel-modules kernel kernel-modules-extra 4.18.0-372.43.1.el8_6
  LayeredPackages: kernel-rt-core kernel-rt-kvm kernel-rt-modules kernel-rt-modules-extra
...
E0308 22:11:21.925030 74176 writer.go:200] Marking Degraded due to: failed to update OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd299b2bf3cc98fb70907f152b4281633064fe33527b5d6a42ddc418ff00eec1 : error running rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd299b2bf3cc98fb70907f152b4281633064fe33527b5d6a42ddc418ff00eec1: error: Importing: remote error: fetching blob: received unexpected HTTP status: 500 Internal Server Error
...
I0308 22:11:36.959143 74176 update.go:2010] Running: rpm-ostree override reset kernel kernel-core kernel-modules kernel-modules-extra --uninstall kernel-rt-core --uninstall kernel-rt-kvm --uninstall kernel-rt-modules --uninstall kernel-rt-modules-extra
...
E0308 22:12:35.525156 74176 writer.go:200] Marking Degraded due to: error running rpm-ostree override reset kernel kernel-core kernel-modules kernel-modules-extra --uninstall kernel-rt-core --uninstall kernel-rt-kvm --uninstall kernel-rt-modules --uninstall kernel-rt-modules-extra: error: Package/capability 'kernel-rt-core' is not currently requested : exit status 1
Something is going wrong here in our retry loop. I think it might be that we don't clear the pending deployment on failure. In other words, we need to run
rpm-ostree cleanup -p
before we retry.
This is fallout from https://github.com/openshift/machine-config-operator/pull/3580 - Although I suspect it may have been an issue before too.
"pipelines-as-code-pipelinerun-go" configMap is not been used for the Go repository while creating Pipeline Repository. "pipelines-as-code-pipelinerun-generic" configMap has been used.
Install Red Hat Pipeline operator
The `pipelines-as-code-pipelinerun-generic` PipelineRun template is shown on the overview page
The `pipelines-as-code-pipelinerun-go` PipelineRun template should be shown on the overview page
4.13
Description of problem:
We need to export the hook function from the module that's required in the dynamic core api, otherwise an exception will be thrown if the hook is imported/used by plugins.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Plugins using this hook throw an exception.
Expected results:
The hook should be imported and function properly.
Additional info:
Description of problem:
Enabling IPSec doesn't result in IPsec tunnels being created
Version-Release number of selected component (if applicable):
4.13
How reproducible:
Deploy & Enable IPSec
Steps to Reproduce:
1. 2. 3.
Actual results:
000 Total IPsec connections: loaded 0, active 0
000
000 State Information: DDoS cookies not required, Accepting new IKE connections
000 IKE SAs: total(0), half-open(0), open(0), authenticated(0), anonymous(0)
000 IPsec SAs: total(0), authenticated(0), anonymous(0)
Expected results:
Active connections > 0
Additional info:
$ oc -n openshift-ovn-kubernetes -c nbdb rsh ovnkube-master-qw4zv ovn-nbctl --no-leader-only get nb_global . ipsec
true
Description of problem:
While installing OCP on AWS, the user can set metadataService auth to Required in order to use IMDSv2; in that case the user requires all the VMs to use it. Currently the bootstrap node always runs with Optional, which can be blocked in the user's AWS account and will fail the installation process.
Version-Release number of selected component (if applicable):
4.14.0
How reproducible:
Install aws cluster and set metadataService to Required
Steps to Reproduce:
1. 2. 3.
Actual results:
Bootstrap has IMDSv2 set to optional
Expected results:
All VMs have IMDSv2 set to Required
Additional info:
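For reference, a sketch of the install-config.yaml fragment that requests IMDSv2 via the AWS metadataService setting (other fields omitted); the expectation is that the bootstrap instance honours it as well:
controlPlane:
  name: master
  platform:
    aws:
      metadataService:
        authentication: Required    # require IMDSv2 on control-plane instances
compute:
- name: worker
  platform:
    aws:
      metadataService:
        authentication: Required    # require IMDSv2 on compute instances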
Description of problem:
The newly introduced `--idms-file` flag in oc image extract is incorrectly mapped to the ICSPFile object instead of IDMSFile
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
SNO installation performed with the assisted-installer failed
Version-Release number of selected component (if applicable):
4.10.32
# oc get co authentication -o yaml
  - lastTransitionTime: '2023-01-30T00:51:11Z'
    message: 'IngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server
      OAuthServerConfigObservationDegraded: secret "v4-0-config-system-router-certs" not found
      OAuthServerDeploymentDegraded: 1 of 1 requested instances are unavailable for oauth-openshift.openshift-authentication (container is waiting in pending oauth-openshift-58b978d7f8-s6x4b pod)
      OAuthServerRouteEndpointAccessibleControllerDegraded: secret "v4-0-config-system-router-certs"
# oc logs ingress-operator-xxx-yyy -c ingress-operator
2023-01-30T08:14:13.701799050Z 2023-01-30T08:14:13.701Z ERROR operator.certificate_publisher_controller certificate-publisher/controller.go:80 failed to list ingresscontrollers for secret {"related": "", "error": "Index with name field:defaultCertificateName does not exist"}
Restarting the ingress-operator pod helped fix the issue, but a permanent fix is required. The bug (https://bugzilla.redhat.com/show_bug.cgi?id=2005351) was filed earlier but closed due to inactivity.
Description of problem:
Add storage admission plugin "storage.openshift.io/CSIInlineVolumeSecurity"
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create an OCP cluster v4.13
2. Check the config map kas-config
Actual results:
The CM does not include "storage.openshift.io/CSIInlineVolumeSecurity" storage plugin
Expected results:
The plugin should be included
Additional info:
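A minimal sketch of what the check expects to find in the kube-apiserver config (the surrounding structure is abbreviated and illustrative; only the relevant admission plugin entry matters):
apiServerArguments:
  enable-admission-plugins:
  # ...other enabled plugins...
  - storage.openshift.io/CSIInlineVolumeSecurity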
Please review the following PR: https://github.com/openshift/cloud-provider-openstack/pull/195
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Fix cnf compute tests to check scheduler settings under /sys/kernel/debug/sched/
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/prom-label-proxy/pull/355
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
NetworkPolicyLegacy tests time out on bump PRs; the latest is https://github.com/openshift/origin/pull/27912. Job example: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/27912/pull-ci-openshift-origin-master-e2e-gcp-ovn/1655997089001246720. The problem seems to be the 15-minute timeout; the test fails with "Interrupted by User". I think this is the change that affected it: https://github.com/kubernetes/kubernetes/pull/112923. From what I saw in the logs, "testCannotConnect" seems to reach the 5-minute timeout instead of completing in ~45 seconds based on the client pod command. But this is NetworkPolicyLegacy, so I'm not sure how much time we want to spend debugging it.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Slack thread https://redhat-internal.slack.com/archives/C04UQLWQAP3/p1683640905643069
This is a clone of issue OCPBUGS-17682. The following is the description of the original issue:
—
Description of problem:
Since the in-cluster prometheus-operator and UWM prometheus-operator pods are scheduled to master nodes, we enabled UWM and added topologySpreadConstraints for the in-cluster prometheus-operator and the UWM prometheus-operator (setting topologyKey to node-role.kubernetes.io/master). The topologySpreadConstraints take effect for the in-cluster prometheus-operator, but not for the UWM prometheus-operator.
apiVersion: v1
data:
  config.yaml: |
    enableUserWorkload: true
    prometheusOperator:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: node-role.kubernetes.io/master
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: prometheus-operator
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
For the in-cluster prometheus-operator, the topologySpreadConstraints settings are loaded into the prometheus-operator pod and deployment; see:
$ oc -n openshift-monitoring get deploy prometheus-operator -oyaml | grep topologySpreadConstraints -A7
      topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: prometheus-operator
        maxSkew: 1
        topologyKey: node-role.kubernetes.io/master
        whenUnsatisfiable: DoNotSchedule
      volumes:
$ oc -n openshift-monitoring get pod -l app.kubernetes.io/name=prometheus-operator -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
prometheus-operator-65496d5b78-fb9nq 2/2 Running 0 105s 10.128.0.71 juzhao-0813-szb9h-master-0.c.openshift-qe.internal <none> <none>
$ oc -n openshift-monitoring get pod prometheus-operator-65496d5b78-fb9nq -oyaml | grep topologySpreadConstraints -A7
  topologySpreadConstraints:
  - labelSelector:
      matchLabels:
        app.kubernetes.io/name: prometheus-operator
    maxSkew: 1
    topologyKey: node-role.kubernetes.io/master
    whenUnsatisfiable: DoNotSchedule
  volumes:
but the topologySpreadConstraints settings are not loaded to UWM prometheus-operator pod and deployment
$ oc -n openshift-user-workload-monitoring get cm user-workload-monitoring-config -oyaml
apiVersion: v1
data:
  config.yaml: |
    prometheusOperator:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: node-role.kubernetes.io/master
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: prometheus-operator
kind: ConfigMap
metadata:
  creationTimestamp: "2023-08-14T08:10:49Z"
  labels:
    app.kubernetes.io/managed-by: cluster-monitoring-operator
    app.kubernetes.io/part-of: openshift-monitoring
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
  resourceVersion: "212490"
  uid: 048f91cb-4da6-4b1b-9e1f-c769096ab88c
$ oc -n openshift-user-workload-monitoring get deploy prometheus-operator -oyaml | grep topologySpreadConstraints -A7
no result
$ oc -n openshift-user-workload-monitoring get pod -l app.kubernetes.io/name=prometheus-operator
NAME READY STATUS RESTARTS AGE
prometheus-operator-77bcdcbd9c-m5x8z 2/2 Running 0 15m
$ oc -n openshift-user-workload-monitoring get pod prometheus-operator-77bcdcbd9c-m5x8z -oyaml | grep topologySpreadConstraints
no result
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-08-11-055332
How reproducible:
always
Steps to Reproduce:
1. see the description 2. 3.
Actual results:
topologySpreadConstraints settings are not loaded to UWM prometheus-operator pod and deployment
Expected results:
topologySpreadConstraints settings loaded to UWM prometheus-operator pod and deployment
This is a clone of issue OCPBUGS-17391. The following is the description of the original issue:
—
the pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-local-to-shared-gateway-mode-migration job started failing recently when the
ovnkube-master daemonset would not finish rolling out after 360s.
taking the must gather to debug which happens a few minutes after the test
failure you can see that the daemonset is still not ready, so I believe that
increasing the timeout is not the answer.
some debug info:
➜ static-kas git:(master) oc --kubeconfig=/tmp/kk get daemonsets -A NAMESPACE NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE openshift-cluster-csi-drivers aws-ebs-csi-driver-node 6 6 6 6 6 kubernetes.io/os=linux 8h openshift-cluster-node-tuning-operator tuned 6 6 6 6 6 kubernetes.io/os=linux 8h openshift-dns dns-default 6 6 6 6 6 kubernetes.io/os=linux 8h openshift-dns node-resolver 6 6 6 6 6 kubernetes.io/os=linux 8h openshift-image-registry node-ca 6 6 6 6 6 kubernetes.io/os=linux 8h openshift-ingress-canary ingress-canary 3 3 3 3 3 kubernetes.io/os=linux 8h openshift-machine-api machine-api-termination-handler 0 0 0 0 0 kubernetes.io/os=linux,machine.openshift.io/interruptible-instance= 8h openshift-machine-config-operator machine-config-daemon 6 6 6 6 6 kubernetes.io/os=linux 8h openshift-machine-config-operator machine-config-server 3 3 3 3 3 node-role.kubernetes.io/master= 8h openshift-monitoring node-exporter 6 6 6 6 6 kubernetes.io/os=linux 8h openshift-multus multus 6 6 6 6 6 kubernetes.io/os=linux 9h openshift-multus multus-additional-cni-plugins 6 6 6 6 6 kubernetes.io/os=linux 9h openshift-multus network-metrics-daemon 6 6 6 6 6 kubernetes.io/os=linux 9h openshift-network-diagnostics network-check-target 6 6 6 6 6 beta.kubernetes.io/os=linux 9h openshift-ovn-kubernetes ovnkube-master 3 3 2 2 2 beta.kubernetes.io/os=linux,node-role.kubernetes.io/master= 9h openshift-ovn-kubernetes ovnkube-node 6 6 6 6 6 beta.kubernetes.io/os=linux 9h Name: ovnkube-master Selector: app=ovnkube-master Node-Selector: beta.kubernetes.io/os=linux,node-role.kubernetes.io/master= Labels: networkoperator.openshift.io/generates-operator-status=stand-alone Annotations: deprecated.daemonset.template.generation: 3 kubernetes.io/description: This daemonset launches the ovn-kubernetes controller (master) networking components. networkoperator.openshift.io/cluster-network-cidr: 10.128.0.0/14 networkoperator.openshift.io/hybrid-overlay-status: disabled networkoperator.openshift.io/ip-family-mode: single-stack release.openshift.io/version: 4.14.0-0.ci.test-2023-08-04-123014-ci-op-c6fp05f4-latest Desired Number of Nodes Scheduled: 3 Current Number of Nodes Scheduled: 3 Number of Nodes Scheduled with Up-to-date Pods: 2 Number of Nodes Scheduled with Available Pods: 2 Number of Nodes Misscheduled: 0 Pods Status: 3 Running / 0 Waiting / 0 Succeeded / 0 Failed Pod Template: Labels: app=ovnkube-master component=network kubernetes.io/os=linux openshift.io/component=network ovn-db-pod=true type=infra Annotations: networkoperator.openshift.io/cluster-network-cidr: 10.128.0.0/14 networkoperator.openshift.io/hybrid-overlay-status: disabled networkoperator.openshift.io/ip-family-mode: single-stack target.workload.openshift.io/management: {"effect": "PreferredDuringScheduling"} Service Account: ovn-kubernetes-controller
it seems there is one pod that is not coming up all the way and that pod has
two containers not ready (sbdb and nbdb). logs from those containers below:
➜ static-kas git:(master) oc --kubeconfig=/tmp/kk describe pod ovnkube-master-7qlm5 -n openshift-ovn-kubernetes | rg '^ [a-z].*:|Ready' northd: Ready: True nbdb: Ready: False kube-rbac-proxy: Ready: True sbdb: Ready: False ovnkube-master: Ready: True ovn-dbchecker: Ready: True ➜ static-kas git:(master) oc --kubeconfig=/tmp/kk logs ovnkube-master-7qlm5 -n openshift-ovn-kubernetes -c sbdb 2023-08-04T13:08:49.127480354Z + [[ -f /env/_master ]] 2023-08-04T13:08:49.127562165Z + trap quit TERM INT 2023-08-04T13:08:49.127609496Z + ovn_kubernetes_namespace=openshift-ovn-kubernetes 2023-08-04T13:08:49.127637926Z + ovndb_ctl_ssl_opts='-p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt' 2023-08-04T13:08:49.127637926Z + transport=ssl 2023-08-04T13:08:49.127645167Z + ovn_raft_conn_ip_url_suffix= 2023-08-04T13:08:49.127682687Z + [[ 10.0.42.108 == \: ]] 2023-08-04T13:08:49.127690638Z + db=sb 2023-08-04T13:08:49.127690638Z + db_port=9642 2023-08-04T13:08:49.127712038Z + ovn_db_file=/etc/ovn/ovnsb_db.db 2023-08-04T13:08:49.127854181Z + [[ ! ssl:10.0.102.2:9642,ssl:10.0.42.108:9642,ssl:10.0.74.128:9642 =~ .:10\.0\.42\.108:. ]] 2023-08-04T13:08:49.128199437Z ++ bracketify 10.0.42.108 2023-08-04T13:08:49.128237768Z ++ case "$1" in 2023-08-04T13:08:49.128265838Z ++ echo 10.0.42.108 2023-08-04T13:08:49.128493242Z + OVN_ARGS='--db-sb-cluster-local-port=9644 --db-sb-cluster-local-addr=10.0.42.108 --no-monitor --db-sb-cluster-local-proto=ssl --ovn-sb-db-ssl-key=/ovn-cert/tls.key --ovn-sb-db-ssl-cert=/ovn-cert/tls.crt --ovn-sb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt' 2023-08-04T13:08:49.128535253Z + CLUSTER_INITIATOR_IP=10.0.102.2 2023-08-04T13:08:49.128819438Z ++ date -Iseconds 2023-08-04T13:08:49.130157063Z 2023-08-04T13:08:49+00:00 - starting sbdb CLUSTER_INITIATOR_IP=10.0.102.2 2023-08-04T13:08:49.130170893Z + echo '2023-08-04T13:08:49+00:00 - starting sbdb CLUSTER_INITIATOR_IP=10.0.102.2' 2023-08-04T13:08:49.130170893Z + initialize=false 2023-08-04T13:08:49.130179713Z + [[ ! -e /etc/ovn/ovnsb_db.db ]] 2023-08-04T13:08:49.130318475Z + [[ false == \t\r\u\e ]] 2023-08-04T13:08:49.130406657Z + wait 9 2023-08-04T13:08:49.130493659Z + exec /usr/share/ovn/scripts/ovn-ctl -db-sb-cluster-local-port=9644 --db-sb-cluster-local-addr=10.0.42.108 --no-monitor --db-sb-cluster-local-proto=ssl --ovn-sb-db-ssl-key=/ovn-cert/tls.key --ovn-sb-db-ssl-cert=/ovn-cert/tls.crt --ovn-sb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt '-ovn-sb-log=-vconsole:info -vfile:off -vPATTERN:console:%D {%Y-%m-%dT%H:%M:%S.###Z} |%05N|%c%T|%p|%m' run_sb_ovsdb 2023-08-04T13:08:49.208399304Z 2023-08-04T13:08:49.208Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-sb.log 2023-08-04T13:08:49.213507987Z ovn-sbctl: unix:/var/run/ovn/ovnsb_db.sock: database connection failed (No such file or directory) 2023-08-04T13:08:49.224890005Z 2023-08-04T13:08:49Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connecting... 2023-08-04T13:08:49.224912156Z 2023-08-04T13:08:49Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connection attempt failed (No such file or directory) 2023-08-04T13:08:49.255474964Z 2023-08-04T13:08:49.255Z|00002|raft|INFO|local server ID is 7f92 2023-08-04T13:08:49.333342909Z 2023-08-04T13:08:49.333Z|00003|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 3.1.2 2023-08-04T13:08:49.348948944Z 2023-08-04T13:08:49.348Z|00004|reconnect|INFO|ssl:10.0.102.2:9644: connecting... 2023-08-04T13:08:49.349002565Z 2023-08-04T13:08:49.348Z|00005|reconnect|INFO|ssl:10.0.74.128:9644: connecting... 
2023-08-04T13:08:49.352510569Z 2023-08-04T13:08:49.352Z|00006|reconnect|INFO|ssl:10.0.102.2:9644: connected 2023-08-04T13:08:49.353870484Z 2023-08-04T13:08:49.353Z|00007|reconnect|INFO|ssl:10.0.74.128:9644: connected 2023-08-04T13:08:49.889326777Z 2023-08-04T13:08:49.889Z|00008|raft|INFO|server 2501 is leader for term 5 2023-08-04T13:08:49.890316765Z 2023-08-04T13:08:49.890Z|00009|raft|INFO|rejecting append_request because previous entry 5,1538 not in local log (mismatch past end of log) 2023-08-04T13:08:49.891199951Z 2023-08-04T13:08:49.891Z|00010|raft|INFO|rejecting append_request because previous entry 5,1539 not in local log (mismatch past end of log) 2023-08-04T13:08:50.225632838Z 2023-08-04T13:08:50Z|00003|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connecting... 2023-08-04T13:08:50.225677739Z 2023-08-04T13:08:50Z|00004|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connected 2023-08-04T13:08:50.227772827Z Waiting for OVN_Southbound to come up. 2023-08-04T13:08:55.716284614Z 2023-08-04T13:08:55.716Z|00011|raft|INFO|ssl:10.0.74.128:43498: learned server ID 3dff 2023-08-04T13:08:55.716323395Z 2023-08-04T13:08:55.716Z|00012|raft|INFO|ssl:10.0.74.128:43498: learned remote address ssl:10.0.74.128:9644 2023-08-04T13:08:55.724570375Z 2023-08-04T13:08:55.724Z|00013|raft|INFO|ssl:10.0.102.2:47804: learned server ID 2501 2023-08-04T13:08:55.724599466Z 2023-08-04T13:08:55.724Z|00014|raft|INFO|ssl:10.0.102.2:47804: learned remote address ssl:10.0.102.2:9644 2023-08-04T13:08:59.348572779Z 2023-08-04T13:08:59.348Z|00015|memory|INFO|32296 kB peak resident set size after 10.1 seconds 2023-08-04T13:08:59.348648190Z 2023-08-04T13:08:59.348Z|00016|memory|INFO|atoms:35959 cells:31476 monitors:0 n-weak-refs:749 raft-connections:4 raft-log:1543 txn-history:100 txn-history-atoms:7100 ➜ static-kas git:(master) oc --kubeconfig=/tmp/kk logs ovnkube-master-7qlm5 -n openshift-ovn-kubernetes -c nbdb 2023-08-04T13:08:48.779743434Z + [[ -f /env/_master ]] 2023-08-04T13:08:48.779743434Z + trap quit TERM INT 2023-08-04T13:08:48.779825516Z + ovn_kubernetes_namespace=openshift-ovn-kubernetes 2023-08-04T13:08:48.779825516Z + ovndb_ctl_ssl_opts='-p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt' 2023-08-04T13:08:48.779825516Z + transport=ssl 2023-08-04T13:08:48.779825516Z + ovn_raft_conn_ip_url_suffix= 2023-08-04T13:08:48.779825516Z + [[ 10.0.42.108 == \: ]] 2023-08-04T13:08:48.779825516Z + db=nb 2023-08-04T13:08:48.779825516Z + db_port=9641 2023-08-04T13:08:48.779825516Z + ovn_db_file=/etc/ovn/ovnnb_db.db 2023-08-04T13:08:48.779887606Z + [[ ! ssl:10.0.102.2:9641,ssl:10.0.42.108:9641,ssl:10.0.74.128:9641 =~ .:10\.0\.42\.108:. 
]] 2023-08-04T13:08:48.780159182Z ++ bracketify 10.0.42.108 2023-08-04T13:08:48.780167142Z ++ case "$1" in 2023-08-04T13:08:48.780172102Z ++ echo 10.0.42.108 2023-08-04T13:08:48.780314224Z + OVN_ARGS='--db-nb-cluster-local-port=9643 --db-nb-cluster-local-addr=10.0.42.108 --no-monitor --db-nb-cluster-local-proto=ssl --ovn-nb-db-ssl-key=/ovn-cert/tls.key --ovn-nb-db-ssl-cert=/ovn-cert/tls.crt --ovn-nb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt' 2023-08-04T13:08:48.780314224Z + CLUSTER_INITIATOR_IP=10.0.102.2 2023-08-04T13:08:48.780518588Z ++ date -Iseconds 2023-08-04T13:08:48.781738820Z 2023-08-04T13:08:48+00:00 - starting nbdb CLUSTER_INITIATOR_IP=10.0.102.2, K8S_NODE_IP=10.0.42.108 2023-08-04T13:08:48.781753021Z + echo '2023-08-04T13:08:48+00:00 - starting nbdb CLUSTER_INITIATOR_IP=10.0.102.2, K8S_NODE_IP=10.0.42.108' 2023-08-04T13:08:48.781753021Z + initialize=false 2023-08-04T13:08:48.781753021Z + [[ ! -e /etc/ovn/ovnnb_db.db ]] 2023-08-04T13:08:48.781816342Z + [[ false == \t\r\u\e ]] 2023-08-04T13:08:48.781936684Z + wait 9 2023-08-04T13:08:48.781974715Z + exec /usr/share/ovn/scripts/ovn-ctl -db-nb-cluster-local-port=9643 --db-nb-cluster-local-addr=10.0.42.108 --no-monitor --db-nb-cluster-local-proto=ssl --ovn-nb-db-ssl-key=/ovn-cert/tls.key --ovn-nb-db-ssl-cert=/ovn-cert/tls.crt --ovn-nb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt '-ovn-nb-log=-vconsole:info -vfile:off -vPATTERN:console:%D {%Y-%m-%dT%H:%M:%S.###Z} |%05N|%c%T|%p|%m' run_nb_ovsdb 2023-08-04T13:08:48.851644059Z 2023-08-04T13:08:48.851Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log 2023-08-04T13:08:48.852091247Z ovn-nbctl: unix:/var/run/ovn/ovnnb_db.sock: database connection failed (No such file or directory) 2023-08-04T13:08:48.861365357Z 2023-08-04T13:08:48Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting... 2023-08-04T13:08:48.861365357Z 2023-08-04T13:08:48Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connection attempt failed (No such file or directory) 2023-08-04T13:08:48.875126148Z 2023-08-04T13:08:48.875Z|00002|raft|INFO|local server ID is c503 2023-08-04T13:08:48.911846610Z 2023-08-04T13:08:48.911Z|00003|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 3.1.2 2023-08-04T13:08:48.918864408Z 2023-08-04T13:08:48.918Z|00004|reconnect|INFO|ssl:10.0.102.2:9643: connecting... 2023-08-04T13:08:48.918934490Z 2023-08-04T13:08:48.918Z|00005|reconnect|INFO|ssl:10.0.74.128:9643: connecting... 2023-08-04T13:08:48.923439162Z 2023-08-04T13:08:48.923Z|00006|reconnect|INFO|ssl:10.0.102.2:9643: connected 2023-08-04T13:08:48.925166154Z 2023-08-04T13:08:48.925Z|00007|reconnect|INFO|ssl:10.0.74.128:9643: connected 2023-08-04T13:08:49.861650961Z 2023-08-04T13:08:49Z|00003|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting... 2023-08-04T13:08:49.861747153Z 2023-08-04T13:08:49Z|00004|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connected 2023-08-04T13:08:49.875272530Z 2023-08-04T13:08:49.875Z|00008|raft|INFO|server fccb is leader for term 6 2023-08-04T13:08:49.875302480Z 2023-08-04T13:08:49.875Z|00009|raft|INFO|rejecting append_request because previous entry 6,1732 not in local log (mismatch past end of log) 2023-08-04T13:08:49.876027164Z Waiting for OVN_Northbound to come up. 
2023-08-04T13:08:55.694760761Z 2023-08-04T13:08:55.694Z|00010|raft|INFO|ssl:10.0.74.128:57122: learned server ID d382 2023-08-04T13:08:55.694800872Z 2023-08-04T13:08:55.694Z|00011|raft|INFO|ssl:10.0.74.128:57122: learned remote address ssl:10.0.74.128:9643 2023-08-04T13:08:55.706904913Z 2023-08-04T13:08:55.706Z|00012|raft|INFO|ssl:10.0.102.2:43230: learned server ID fccb 2023-08-04T13:08:55.706931733Z 2023-08-04T13:08:55.706Z|00013|raft|INFO|ssl:10.0.102.2:43230: learned remote address ssl:10.0.102.2:9643 2023-08-04T13:08:58.919567770Z 2023-08-04T13:08:58.919Z|00014|memory|INFO|21944 kB peak resident set size after 10.1 seconds 2023-08-04T13:08:58.919643762Z 2023-08-04T13:08:58.919Z|00015|memory|INFO|atoms:8471 cells:7481 monitors:0 n-weak-refs:200 raft-connections:4 raft-log:1737 txn-history:72 txn-history-atoms:8165 ➜ static-kas git:(master)
This seems to happen very frequently now, but was not happening before around July 21st.
Description of problem:
When attempting to add nodes to a long-lived 4.12.3 cluster, net new nodes are not able to join the cluster. They are provisioned in the cloud provider (AWS), but never actually join as a node.
Version-Release number of selected component (if applicable):
4.12.3
How reproducible:
Consistent
Steps to Reproduce:
1. On a long lived cluster, add a new machineset
Actual results:
Machines reach "Provisioned" but don't join the cluster
Expected results:
Machines join cluster as nodes
Additional info:
Currently, the installer has a dependency on the main assisted-service Go module. This means that we pull in all of its dependencies, which include libnmstate (the Rust one). In practice, this means that we can't update assisted-service at least until AGENT-139 is implemented. And since the main assisted-service module and the API module should be in lockstep, this means we can't update to pick up recent changes to the ZTP API either.
Please review the following PR: https://github.com/openshift/cluster-autoscaler-operator/pull/271
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The following test case is failing: Error: Timeout - Async callback was not invoked within timeout specified by jasmine.DEFAULT_TIMEOUT_INTERVAL. exception Error: Timeout - Async callback was not invoked within timeout specified by jasmine.DEFAULT_TIMEOUT_INTERVAL. The test scenario is failing with an 85% failure rate:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_console/12892/pull-ci-openshift-console-master-e2e-gcp-console/1668916100596764672
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_console/12892/pull-ci-openshift-console-master-e2e-gcp-console/1668916100596764672/artifacts/e2e-gcp-console/test/artifacts/gui_test_screenshots/c8b0a6b0614b41eee9ea123ffe9a3bea.png
Description of problem:
We have OCP 4.10 installed along with Tigera 3.13 with no issues. We could also update OCP to 4.11 and 4.12 along with a Tigera upgrade to 3.15 and 3.16; the upgrade works with no issue. The problem appears when we install Tigera 3.16 along with OCP 4.12 (fresh install). Tigera support says the OCP install parameters need to be updated to accommodate new Tigera updates. It's either the Terraform plug-in or a file called main.tf that needs updating. Please engage someone from the Red Hat OCP engineering team.
Ref doc: https://access.redhat.com/solutions/6980264
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
install Tigera 3.16 along with OCP 4.12. (fresh install)
Actual results:
Installation fails with the error: "rpc error: code = ResourceExhausted desc = grpc: received message larger than max (5330750 vs. 4194304)"
Expected results:
Just like 4.10, 4.12 installation should work with Tigera calico
Additional info:
Description of problem:
According to https://docs.aws.amazon.com/vpc/latest/userguide/amazon-vpc-limits.html, the default number of security groups per network interface is 5 and can be at most 16, so we should have a pre-check on the number of provided custom security groups. When it is more than 15 (the maximum is 16, but the installer also creates its own ${var.cluster_id}-master-sg/${var.cluster_id}-worker-sg group), the installer should quit and warn the user about this.
Version-Release number of selected component (if applicable):
registry.ci.openshift.org/ocp/release:4.14.0-0.nightly-2023-07-11-092038
How reproducible:
Always
Steps to Reproduce:
1. Set 16 security group IDs in compute.platform.aws.additionalSecurityGroupIDs:
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    aws:
      additionalSecurityGroupIDs:
      - sg-06e63a6ad731c10cc
      - sg-054614d4f4eb5751d
      - sg-05c4fe202c8e2c28c
      - sg-0c948fa8b85bf4af1
      - sg-0cfb0c91c0b48f0de
      - sg-0eff6077ca727c921
      - sg-0d2d1f41f1ac9801c
      - sg-047c67d5decb64563
      - sg-0ee63f164c0ab8b04
      - sg-033ff80fa12e43c7f
      - sg-0ccad43754d9652cd
      - sg-04e4cbca2b5d50c3a
      - sg-0d133411fdcb0a4e0
      - sg-0b2b0e0d515b2f561
      - sg-045fde620b3e702da
      - sg-07e0493a65749973c
  replicas: 3
2. The installation failed because the workers couldn't be provisioned.
Actual results:
[root@preserve-gpei-worker k_files]# oc get machines -A NAMESPACE NAME PHASE TYPE REGION ZONE AGE openshift-machine-api gpei-0613g-wp7zw-master-0 Running m6i.xlarge us-west-2 us-west-2a 66m openshift-machine-api gpei-0613g-wp7zw-master-1 Running m6i.xlarge us-west-2 us-west-2b 66m openshift-machine-api gpei-0613g-wp7zw-master-2 Running m6i.xlarge us-west-2 us-west-2a 66m openshift-machine-api gpei-0613g-wp7zw-worker-us-west-2a-7rszc Failed 62m openshift-machine-api gpei-0613g-wp7zw-worker-us-west-2a-pwnvp Failed 62m openshift-machine-api gpei-0613g-wp7zw-worker-us-west-2b-n2cs9 Failed 62m [root@preserve-gpei-worker k_files]# oc describe machine gpei-0613g-wp7zw-worker-us-west-2b-n2cs9 -n openshift-machine-api Name: gpei-0613g-wp7zw-worker-us-west-2b-n2cs9 .. Spec: Lifecycle Hooks: Metadata: Provider Spec: Value: Ami: Id: ami-01bfc200595c748a1 API Version: machine.openshift.io/v1beta1 Block Devices: Ebs: Metadata Service Options: Placement: Availability Zone: us-west-2b Region: us-west-2 Security Groups: Filters: Name: tag:Name Values: gpei-0613g-wp7zw-worker-sg Id: sg-033ff80fa12e43c7f Id: sg-045fde620b3e702da Id: sg-047c67d5decb64563 Id: sg-04e4cbca2b5d50c3a Id: sg-054614d4f4eb5751d Id: sg-05c4fe202c8e2c28c Id: sg-06e63a6ad731c10cc Id: sg-07e0493a65749973c Id: sg-0b2b0e0d515b2f561 Id: sg-0c948fa8b85bf4af1 Id: sg-0ccad43754d9652cd Id: sg-0cfb0c91c0b48f0de Id: sg-0d133411fdcb0a4e0 Id: sg-0d2d1f41f1ac9801c Id: sg-0ee63f164c0ab8b04 Id: sg-0eff6077ca727c921 Subnet: Id: subnet-0641814f00311bd9c Tags: Name: kubernetes.io/cluster/gpei-0613g-wp7zw Value: owned User Data Secret: Name: worker-user-data Status: Conditions: Last Transition Time: 2023-07-13T09:58:02Z Status: True Type: Drainable Last Transition Time: 2023-07-13T09:58:02Z Message: Instance has not been created Reason: InstanceNotCreated Severity: Warning Status: False Type: InstanceExists Last Transition Time: 2023-07-13T09:58:02Z Status: True Type: Terminable Error Message: error launching instance: You have exceeded the maximum number of security groups allowed per network interface.
Expected results:
The installer should abort and warn that the number of provided custom security groups exceeds the maximum allowed.
Additional info:
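As a rough illustration of the requested pre-check, here is a minimal sketch; the function name and error wording are assumptions, not the installer's actual validation code.

```
package main

import "fmt"

// maxCustomSecurityGroups reflects the AWS limit of 16 security groups per
// network interface, minus the one group the installer creates itself
// (${var.cluster_id}-master-sg / ${var.cluster_id}-worker-sg).
const maxCustomSecurityGroups = 15

// validateSecurityGroupCount is a hypothetical pre-check: fail fast when the
// install-config supplies more custom security group IDs than AWS allows.
func validateSecurityGroupCount(pool string, ids []string) error {
	if len(ids) > maxCustomSecurityGroups {
		return fmt.Errorf("%s pool: %d additional security groups provided, but at most %d are allowed (AWS caps a network interface at 16 and the installer attaches its own group)", pool, len(ids), maxCustomSecurityGroups)
	}
	return nil
}

func main() {
	ids := make([]string, 16) // e.g. the 16 IDs from the install-config above
	fmt.Println(validateSecurityGroupCount("worker", ids))
}
```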
Related to TRT-849, we want to write a test to see how often this is happening before we undertake a major effort to get to the bottom of it.
The test will need to process disruption across all backends, look for DNS lookup disruptions, and then see if we have overlap with non-DNS lookup disruptions within those timeframes.
We have some precedent for similar code in KubePodNotReady alerts that we handle differently if in proximity to other intervals.
The test should flake rather than fail; we can then see how often it's happening in Sippy and on which platforms. With SQL we could likely pinpoint certain build clusters as well.
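A minimal sketch of the overlap check such a test could use; the interval type and function names are assumptions, not the actual origin/TRT monitor code.

```
package main

import (
	"fmt"
	"time"
)

// interval is a simplified stand-in for a disruption interval; the real
// monitor types carry much more metadata.
type interval struct {
	Backend   string
	DNSLookup bool
	From, To  time.Time
}

// overlaps reports whether two intervals share any span of time.
func overlaps(a, b interval) bool {
	return a.From.Before(b.To) && b.From.Before(a.To)
}

// dnsCorrelated returns the non-DNS disruptions that overlap at least one
// DNS-lookup disruption, which is the signal the proposed test would flake on.
func dnsCorrelated(intervals []interval) []interval {
	var out []interval
	for _, candidate := range intervals {
		if candidate.DNSLookup {
			continue
		}
		for _, dns := range intervals {
			if dns.DNSLookup && overlaps(candidate, dns) {
				out = append(out, candidate)
				break
			}
		}
	}
	return out
}

func main() {
	t0 := time.Now()
	hits := dnsCorrelated([]interval{
		{Backend: "kube-api", DNSLookup: false, From: t0, To: t0.Add(5 * time.Second)},
		{Backend: "kube-api", DNSLookup: true, From: t0.Add(2 * time.Second), To: t0.Add(8 * time.Second)},
	})
	fmt.Println(len(hits)) // 1: the non-DNS disruption overlaps a DNS lookup disruption
}
```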
Description of problem:
According to the Red Hat documentation https://docs.openshift.com/container-platform/4.12/networking/ovn_kubernetes_network_provider/configuring-egress-ips-ovn.html, the maximum number of IP aliases per node is 10 - "Per node, the maximum number of IP aliases, both IPv4 and IPv6, is 10.". Looking at the code base, the number of allowed IPs is calculated as Capacity = defaultGCPPrivateIPCapacity (which is set to 10) + cloudPrivateIPsCount (that is number of available IPs from the range) - currentIPv4Usage (number of assigned v4 IPs) - currentIPv6Usage (number of assigned v6 IPs) https://github.com/openshift/cloud-network-config-controller/blob/master/pkg/cloudprovider/gcp.go#L18-L22 Speaking to GCP, they support up to 100 alias IP ranges (not IPs) per vNIC. Can Red Hat confirm 1) If there is a limitation of 10 from OCP and why? 2) If there isn't a limit, what is the maximum number of egress IPs that could be supported per node?
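For reference, a minimal sketch restating the capacity calculation quoted above; the constant name mirrors the one in the report, everything else is illustrative.

```
package main

import "fmt"

// defaultGCPPrivateIPCapacity mirrors the constant referenced in the report
// (set to 10 in cloud-network-config-controller's GCP provider).
const defaultGCPPrivateIPCapacity = 10

// nodeEgressIPCapacity restates the calculation described above: the base
// capacity plus the unassigned IPs from the range, minus what is already
// assigned for IPv4 and IPv6.
func nodeEgressIPCapacity(cloudPrivateIPsCount, currentIPv4Usage, currentIPv6Usage int) int {
	return defaultGCPPrivateIPCapacity + cloudPrivateIPsCount - currentIPv4Usage - currentIPv6Usage
}

func main() {
	// Example: 4 IPs left in the range, 3 IPv4 and 0 IPv6 already assigned.
	fmt.Println(nodeEgressIPCapacity(4, 3, 0)) // 11
}
```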
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Case: 03487893
It is one of the most highlighted bugs from our customer.
This is a clone of issue OCPBUGS-13044. The following is the description of the original issue:
—
Description of problem:
During cluster installations/upgrades with an imageContentSourcePolicy in place but with access to quay.io, the ICSP is not honored to pull the machine-os-content image from a private registry.
Version-Release number of selected component (if applicable):
$ oc logs -n openshift-machine-config-operator ds/machine-config-daemon -c machine-config-daemon|head -1 Found 6 pods, using pod/machine-config-daemon-znknf I0503 10:53:00.925942 2377 start.go:112] Version: v4.12.0-202304070941.p0.g87fedee.assembly.stream-dirty (87fedee690ae487f8ae044ac416000172c9576a5)
How reproducible:
100% in clusters with ICSP configured BUT with access to quay.io
Steps to Reproduce:
1. Create mirror repo: $ cat <<EOF > /tmp/isc.yaml kind: ImageSetConfiguration apiVersion: mirror.openshift.io/v1alpha2 archiveSize: 4 storageConfig: registry: imageURL: quay.example.com/mirror/oc-mirror-metadata skipTLS: true mirror: platform: channels: - name: stable-4.12 type: ocp minVersion: 4.12.13 graph: true EOF $ oc mirror --dest-skip-tls --config=/tmp/isc.yaml docker://quay.example.com/mirror/oc-mirror-metadata <...> info: Mirroring completed in 2m27.91s (138.6MB/s) Writing image mapping to oc-mirror-workspace/results-1683104229/mapping.txt Writing UpdateService manifests to oc-mirror-workspace/results-1683104229 Writing ICSP manifests to oc-mirror-workspace/results-1683104229 2. Confirm machine-os-content digest: $ oc adm release info 4.12.13 -o jsonpath='{.references.spec.tags[?(@.name=="machine-os-content")].from}'|jq { "kind": "DockerImage", "name": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a1660c8086ff85e569e10b3bc9db344e1e1f7530581d742ad98b670a81477b1b" } $ oc adm release info 4.12.14 -o jsonpath='{.references.spec.tags[?(@.name=="machine-os-content")].from}'|jq { "kind": "DockerImage", "name": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ed68d04d720a83366626a11297a4f3c5761c0b44d02ef66fe4cbcc70a6854563" } 3. Create 4.12.13 cluster with ICSP at install time: $ grep imageContentSources -A6 ./install-config.yaml imageContentSources: - mirrors: - quay.example.com/mirror/oc-mirror-metadata/openshift/release source: quay.io/openshift-release-dev/ocp-v4.0-art-dev - mirrors: - quay.example.com/mirror/oc-mirror-metadata/openshift/release-images source: quay.io/openshift-release-dev/ocp-release
Actual results:
1. After the installation is completed, no pulls for a166 (4.12.13-x86_64-machine-os-content) are logged in the Quay usage logs whereas e.g. digest 22d2 (4.12.13-x86_64-machine-os-images) are reported to be pulled from the mirror. 2. After upgrading to 4.12.14 no pulls for ed68 (4.12.14-x86_64-machine-os-content) are logged in the mirror-registry while the image was pulled as part of `oc image extract` in the machine-config-daemon: [core@master-1 ~]$ sudo less /var/log/pods/openshift-machine-config-operator_machine-config-daemon-7fnjz_e2a3de54-1355-44f9-a516-2f89d6c6ab8f/machine-config-daemon/0.log 2023-05-03T10:51:43.308996195+00:00 stderr F I0503 10:51:43.308932 11290 run.go:19] Running: nice -- ionice -c 3 oc image extract -v 10 --path /:/run/mco-extensions/os-extensions-content-4035545447 --registry- config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ad48fe01f3e82584197797ce2151eecdfdcce67ae1096f06412e5ace416f66ce 2023-05-03T10:51:43.418211869+00:00 stderr F I0503 10:51:43.418008 184455 client_mirrored.go:174] Attempting to connect to quay.io/openshift-release-dev/ocp-v4.0-art-dev 2023-05-03T10:51:43.418211869+00:00 stderr F I0503 10:51:43.418174 184455 round_trippers.go:466] curl -v -XGET -H "User-Agent: oc/4.12.0 (linux/amd64) kubernetes/31aa3e8" 'https://quay.io/v2/' 2023-05-03T10:51:43.419618513+00:00 stderr F I0503 10:51:43.419517 184455 round_trippers.go:495] HTTP Trace: DNS Lookup for quay.io resolved to [{34.206.15.82 } {54.209.210.231 } {52.5.187.29 } {52.3.168.193 } {52.21.36.23 } {50.17.122.58 } {44.194.68.221 } {34.194.241.136 } {2600:1f18:483:cf01:ebba:a861:1150:e245 } {2600:1f18:483:cf02:40f9:477f:ea6b:8a2b } {2600:1f18:483:cf02:8601:2257:9919:cd9e } {2600:1f18:483:cf01 :8212:fcdc:2a2a:50a7 } {2600:1f18:483:cf00:915d:9d2f:fc1f:40a7 } {2600:1f18:483:cf02:7a8b:1901:f1cf:3ab3 } {2600:1f18:483:cf00:27e2:dfeb:a6c7:c4db } {2600:1f18:483:cf01:ca3f:d96e:196c:7867 }] 2023-05-03T10:51:43.429298245+00:00 stderr F I0503 10:51:43.429151 184455 round_trippers.go:510] HTTP Trace: Dial to tcp:34.206.15.82:443 succeed
Expected results:
All images are pulled from the location as configured in the ICSP.
Additional info:
Description of problem:
When CNO is managed by HyperShift, multus-admission-controller does not have the correct RollingUpdate parameters to meet the HyperShift requirements outlined here: https://github.com/openshift/hypershift/blob/646bcef53e4ecb9ec01a05408bb2da8ffd832a14/support/config/deployment.go#L81
```
There are two standard cases currently with hypershift: HA mode where there are 3 replicas spread across zones and then non ha with one replica. When only 3 zones are available you need to be able to set maxUnavailable in order to progress the rollout. However, you do not want to set that in the single replica case because it will result in downtime.
```
So when multus-admission-controller has more than one replica, the RollingUpdate parameters should be:
```
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 0
    maxUnavailable: 1
```
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create an OCP cluster using HyperShift
2. Check the rolling update parameters of multus-admission-controller
Actual results:
the operator has default parameters: {"rollingUpdate":{"maxSurge":"25%","maxUnavailable":"25%"},"type":"RollingUpdate"}
Expected results:
{"rollingUpdate":{"maxSurge":0,"maxUnavailable":1},"type":"RollingUpdate"}
Additional info:
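A minimal sketch of the conditional described above, using the Kubernetes apps/v1 types; the function name and replica handling are assumptions, not the actual CNO code.

```
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// multusAdmissionControllerStrategy sketches the behaviour described above:
// with a single replica keep the default strategy (no forced unavailability),
// with multiple replicas allow one pod down and no surge so the rollout can
// progress when exactly three zones are available.
func multusAdmissionControllerStrategy(replicas int32) appsv1.DeploymentStrategy {
	if replicas <= 1 {
		return appsv1.DeploymentStrategy{Type: appsv1.RollingUpdateDeploymentStrategyType}
	}
	maxSurge := intstr.FromInt(0)
	maxUnavailable := intstr.FromInt(1)
	return appsv1.DeploymentStrategy{
		Type: appsv1.RollingUpdateDeploymentStrategyType,
		RollingUpdate: &appsv1.RollingUpdateDeployment{
			MaxSurge:       &maxSurge,
			MaxUnavailable: &maxUnavailable,
		},
	}
}

func main() {
	fmt.Printf("%+v\n", multusAdmissionControllerStrategy(3))
}
```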
As a user I want to see what differs between the Machine's (current) ProviderSpec and the Control Plane Machine Set (desired) ProviderSpec so that I can understand why the CPMSO is replacing my control plane machine.
Work spawned out of discussions in https://redhat-internal.slack.com/archives/CCX9DB894/p1678820665803259 and https://redhat-internal.slack.com/archives/C04UB95G802
We believe we are already logging this; it would be good to emit either an event or the diff into the status. Whoever takes this card should investigate the best way of surfacing this.
Outcome:
Please review the following PR: https://github.com/openshift/cluster-kube-controller-manager-operator/pull/726
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-monitoring-operator/pull/1952
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The MCO's "Certificate Observability" CRD fields (introduced in MCO-607) are non-RFC3339 formatted strings and are unparseable as the API standard metav1.Time For context, the MCO is currently migrating its API to openshift/api where it needs to comply with API standards, and if these strings are still present in the API when 4.14 ships, we will be unable to upgrade from the shipping version to the one where the API has migrated, so we need to adjust this now before it ships.
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Create a cluster
2. Observe ControllerConfig status.controllerCertificates
3. Observe MachineConfigPool status.certExpirys
Actual results:
Types are wrong, and strings are formatted thusly: 2033-08-12 01:47:54 +0000 UTC
Expected results:
ControllerConfig and MachineConfigPool do not contain certificate observability fields formatted as "2033-08-12 01:47:54 +0000 UTC". They should either contain certificate observability fields formatted as "2006-01-02T15:04:05Z07:00" (RFC3339) or not contain them at all.
Additional info:
If we ship 4.14 with these strings as they are, we will be stuck like that and unable to easily upgrade out of it (because the new MCO, which treats the fields as metav1.Time, will be unable to parse the old strings), e.g.:
2023-08-15T05:03:40.989575279Z W0815 05:03:40.989527 1 reflector.go:533] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:101: failed to list *v1.MachineConfigPool: parsing time "2033-08-12 01:47:54 +0000 UTC" as "2006-01-02T15:04:05Z07:00": cannot parse " 01:47:54 +0000 UTC" as "T"
2023-08-15T05:03:40.989575279Z E0815 05:03:40.989555 1 reflector.go:148] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:101: Failed to watch *v1.MachineConfigPool: failed to list *v1.MachineConfigPool: parsing time "2033-08-12 01:47:54 +0000 UTC" as "2006-01-02T15:04:05Z07:00": cannot parse " 01:47:54 +0000 UTC" as "T"
2023-08-15T05:04:05.304139210Z W0815 05:04:05.304088 1 reflector.go:533] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:101: failed to list *v1.ControllerConfig: parsing time "2033-08-12 01:47:54 +0000 UTC" as "2006-01-02T15:04:05Z07:00": cannot parse " 01:47:54 +0000 UTC" as "T"
2023-08-15T05:04:05.304139210Z E0815 05:04:05.304121 1 reflector.go:148] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:101: Failed to watch *v1.ControllerConfig: failed to list *v1.ControllerConfig: parsing time "2033-08-12 01:47:54 +0000 UTC" as "2006-01-02T15:04:05Z07:00": cannot parse " 01:47:54 +0000 UTC" as "T"
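For illustration, a minimal standalone sketch (plain Go, no MCO code) showing why the current strings fail to parse and what an RFC3339 value looks like:

```
package main

import (
	"fmt"
	"time"
)

func main() {
	notAfter := time.Date(2033, 8, 12, 1, 47, 54, 0, time.UTC)

	// What the fields currently contain: Go's default time.Time formatting.
	// metav1.Time only (un)marshals RFC3339, so this string fails to parse.
	fmt.Println(notAfter.String()) // 2033-08-12 01:47:54 +0000 UTC

	// What an API-conformant value looks like.
	fmt.Println(notAfter.Format(time.RFC3339)) // 2033-08-12T01:47:54Z

	// Round-trip check: RFC3339 parses, the default format does not.
	if _, err := time.Parse(time.RFC3339, notAfter.String()); err != nil {
		fmt.Println("default format rejected:", err)
	}
}
```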
Allow creating a single NAT gateway for a multi-zone hosted cluster. The route table in other zones should point to the one NAT gateway.
This allows running a cluster in multiple zones with a single NAT gateway, since NAT gateways can be expensive to run in AWS.
Please review the following PR: https://github.com/openshift/cloud-provider-alibaba-cloud/pull/30
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Some repositories require the bugzilla/valid-bug label to be present. Complement to https://issues.redhat.com/browse/WRKLDS-700.
Description of problem:
ControllerConfig renders properly until the Infrastructure object changes, then:
- 'Kind' and 'APIVersion' are no longer present on the object resulting from a "get" for that object via the lister, and
- as a result, the embedded dns and infrastructure objects in ControllerConfig fail to validate
- this results in ControllerConfig failing to sync
Version-Release number of selected component (if applicable):
4.14 machine-config-operator
How reproducible:
I can reproduce it every time
Steps to Reproduce:
1. Build a 4.14 cluster
2. Update Infrastructure non-destructively, e.g.: oc annotate infrastructure cluster break.the.mco=yep
3. Watch the machine-config-operator pod logs (or oc get co, the error will propagate) to see the validation errors for the new controllerconfig
Actual results:
2023-05-17T20:45:04.627320107Z I0517 20:45:04.627281 1 event.go:285] Event(v1.ObjectReference{Kind:"", Namespace:"", Name:"machine-config", UID:"d52d09f4-f7bb-497a-a5c3-92861aa6796f", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'OperatorDegraded: MachineConfigControllerFailed' Failed to resync 4.14.0-0.ci.test-2023-05-17-193937-ci-op-dcrr8kjq-latest because: ControllerConfig.machineconfiguration.openshift.io "machine-config-controller" is invalid: [spec.infra.apiVersion: Required value: must not be empty, spec.infra.kind: Required value: must not be empty, <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]
Expected results:
machine-config-operator quietly syncs controllerconfig :)
Additional info:
The MCO itself is not doing this. It's not part of resourcemerge or anything like that. It's happening "below" us. The short version here is that when using a typed client, the group,version,kind (GVK) gets stripped during decoding because it's redundant (you already know the type). For "top level" objects, it gets put back during an update request automatically, but it doesn't recurse into embedded objects (which Infrastructure and DNS are). So we end up with embedded objects that are missing explicit GVKs and won't validate. Why does it only happen after the objects change? We're using a lister, and the lister's "strip-on-decode" behavior seems a little inconsistent. Sometimes the GVK is populated. If you use a direct client "get", the GVK will never be populated. There is a lot of history on this behavior, it won't be changed any time soon, here are some entry points: - https://github.com/kubernetes/kubernetes/pull/63972 - https://github.com/kubernetes/kubernetes/issues/80609
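A minimal sketch of the kind of workaround this implies, assuming access to a runtime.Scheme that knows the config.openshift.io types; the function and variable names are illustrative, not the MCO's actual code.

```
package main

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
	"k8s.io/apimachinery/pkg/runtime"
)

// restoreGVK looks the object's kind up in the scheme and sets it explicitly,
// since objects coming back from a typed lister or client may have an empty
// TypeMeta, which then fails validation once the object is embedded.
func restoreGVK(scheme *runtime.Scheme, obj runtime.Object) error {
	gvks, _, err := scheme.ObjectKinds(obj)
	if err != nil || len(gvks) == 0 {
		return fmt.Errorf("could not determine GroupVersionKind: %v", err)
	}
	obj.GetObjectKind().SetGroupVersionKind(gvks[0])
	return nil
}

func main() {
	scheme := runtime.NewScheme()
	if err := configv1.Install(scheme); err != nil {
		panic(err)
	}
	infra := &configv1.Infrastructure{} // as if freshly returned by a lister, TypeMeta empty
	if err := restoreGVK(scheme, infra); err != nil {
		panic(err)
	}
	fmt.Println(infra.GetObjectKind().GroupVersionKind()) // config.openshift.io/v1, Kind=Infrastructure
}
```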
Description of problem:
test "operator conditions control-plane-machine-set" fails https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1574/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-upgrade/1634410710559625216 control-plane-machine-set operator is Unavailable, because it doesn't reconcile node events. If a node becomes ready later than the referencing Machine, Node update event will not trigger reconciliation.
Version-Release number of selected component (if applicable):
How reproducible:
depends on the sequence of Node vs Machine events
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
operator logs https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1574/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-upgrade/1634410710559625216/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/pods/openshift-machine-api_control-plane-machine-set-operator-5d5848c465-g4q2p_control-plane-machine-set-operator.log machines https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1574/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-upgrade/1634410710559625216/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/machines.json nodes https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1574/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-upgrade/1634410710559625216/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/nodes.json
Please review the following PR: https://github.com/openshift/cluster-dns-operator/pull/357
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-18304. The following is the description of the original issue:
—
Description of problem:
https://github.com/openshift/installer/pull/6770 reverted part of https://github.com/openshift/installer/pull/5788, which had set guestinfo.domain for the bootstrap machine. This breaks some OKD installations, which require that setting.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
NodePool conditions AllMachinesReady and AllNodesHealthy are used by Cluster Service to detect problems on customer nodes. Every time a NodePool is updated, it triggers an update in a ManifestWork that is processed by CS to build a user message about why a specific machine pool/node pool is not healthy. Because the message is not sorted when more than one machine is involved, the NodePool is updated multiple times even though the state is the same. For example, CS may capture message sequences like the following, where the same machine states appear in a different order across updates:
Machine rosa-vws58-workshop-69b55d58b-mq44p: UnhealthyNode Machine rosa-vws58-workshop-69b55d58b-97n47: UnhealthyNode
Machine rosa-vws58-workshop-69b55d58b-mq44p: NodeConditionsFailed Machine rosa-vws58-workshop-69b55d58b-97n47: Deleting
Machine rosa-vws58-workshop-69b55d58b-97n47: UnhealthyNode Machine rosa-vws58-workshop-69b55d58b-mq44p: UnhealthyNode
Machine rosa-vws58-workshop-69b55d58b-97n47: Deleting Machine rosa-vws58-workshop-69b55d58b-mq44p: NodeConditionsFailed
Machine rosa-vws58-workshop-69b55d58b-mq44p: UnhealthyNode Machine rosa-vws58-workshop-69b55d58b-97n47: UnhealthyNode
Machine rosa-vws58-workshop-69b55d58b-mq44p: NodeConditionsFailed Machine rosa-vws58-workshop-69b55d58b-97n47: Deleting
Expected results:
The HyperShift Operator should sort the messages where multiple machines/nodes are involved: https://github.com/openshift/hypershift/blob/86af31a5a5cdee3da0d7f65f3bd550f4ec9cac55/hypershift-operator/controllers/nodepool/nodepool_controller.go#L2509
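A minimal sketch of the suggested fix, with hypothetical function and input names; the real controller builds the message from Machine conditions rather than a plain map.

```
package main

import (
	"fmt"
	"sort"
	"strings"
)

// buildNodePoolMessage sketches the deterministic ordering suggested above:
// collect one line per machine, sort the lines, then join, so the same set of
// machine states always produces the same condition message and does not
// retrigger NodePool/ManifestWork updates.
func buildNodePoolMessage(machineReasons map[string]string) string {
	lines := make([]string, 0, len(machineReasons))
	for name, reason := range machineReasons {
		lines = append(lines, fmt.Sprintf("Machine %s: %s", name, reason))
	}
	sort.Strings(lines)
	return strings.Join(lines, "\n")
}

func main() {
	msg := buildNodePoolMessage(map[string]string{
		"rosa-vws58-workshop-69b55d58b-mq44p": "UnhealthyNode",
		"rosa-vws58-workshop-69b55d58b-97n47": "Deleting",
	})
	fmt.Println(msg) // stable output regardless of map iteration order
}
```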
Description of problem:
we can see TypeErrors on operand creation page
Version-Release number of selected component (if applicable):
cluster-bot cluster launch 4.14-ci,openshift/console#12525
How reproducible:
Always
Steps to Reproduce:
1. Create mock CRD and CSV files in project 'test':
$ oc project test
$ oc apply -f mock-crd-and-csv.yaml
customresourcedefinition.apiextensions.k8s.io/mock-k8s-dropdown-resources.test.tectonic.com created
clusterserviceversion.operators.coreos.com/mock-k8s-resource-dropdown-operator created
2. Go to the CR creation page: Operators -> Installed Operators -> Mock K8sResourcePrefixOperator -> Mock Resource tab -> click the 'Create MockK8sDropdownResource' button
Actual results:
2. we can see errors Description: e is undefined Component trace: g@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/create-operand-chunk-b03c5cb69a738de3ba86.min.js:1:17026 v@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/create-operand-chunk-b03c5cb69a738de3ba86.min.js:1:54359 div N@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:173048 R@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:173543 _@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/create-operand-chunk-b03c5cb69a738de3ba86.min.js:1:20749 10807/t.a@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/create-operand-chunk-b03c5cb69a738de3ba86.min.js:1:145 4156/t.default@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/create-operand-chunk-b03c5cb69a738de3ba86.min.js:1:22586 s@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:223444 t@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/vendors~main-chunk-8e90f77cf4a58a9d5a52.min.js:21:69403 T t@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/vendors~main-chunk-8e90f77cf4a58a9d5a52.min.js:21:71448 Suspense i@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:435931 section m@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/vendor-patternfly-core-chunk-277c96b9c656c5dae20f.min.js:1:170312 div div t.a@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:1501506 div div c@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/vendor-patternfly-core-chunk-277c96b9c656c5dae20f.min.js:1:699298 d@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/vendor-patternfly-core-chunk-277c96b9c656c5dae20f.min.js:1:219161 div d@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/vendor-patternfly-core-chunk-277c96b9c656c5dae20f.min.js:1:89596 l@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:1151500 H<@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:442786 S@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/vendors~main-chunk-8e90f77cf4a58a9d5a52.min.js:87:86675 main div v@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/vendor-patternfly-core-chunk-277c96b9c656c5dae20f.min.js:1:466912 div div c@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/vendor-patternfly-core-chunk-277c96b9c656c5dae20f.min.js:1:311348 div div 
c@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/vendor-patternfly-core-chunk-277c96b9c656c5dae20f.min.js:1:699298 d@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/vendor-patternfly-core-chunk-277c96b9c656c5dae20f.min.js:1:219161 div d@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/vendor-patternfly-core-chunk-277c96b9c656c5dae20f.min.js:1:89596 Jn@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/vendors~main-chunk-8e90f77cf4a58a9d5a52.min.js:36:185686 t.default@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:854425 5404/t.default@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/quick-start-chunk-0b68859d1eaa39849249.min.js:1:1264 s@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:223444 t.a@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:1581508 ee@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:1599747 St@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/vendors~main-chunk-8e90f77cf4a58a9d5a52.min.js:36:142700 ee@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:1599747 ee@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:1599747 i@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:809765 t.a@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:1575685 t.a@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:1575874 t.a@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:1573290 te@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:1599889 ne<@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:1603021 r@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/vendors~main-chunk-8e90f77cf4a58a9d5a52.min.js:36:122338 t@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/vendors~main-chunk-8e90f77cf4a58a9d5a52.min.js:21:69403 t@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/vendors~main-chunk-8e90f77cf4a58a9d5a52.min.js:21:71448 t@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/vendors~main-chunk-8e90f77cf4a58a9d5a52.min.js:21:66008 
re@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:1603332 t.a@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:783751 t.a@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:1084331 s@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-4f4f3b36aabdf0eb831f.min.js:1:635039 t.a@https://console-openshift-console.apps.ci-ln-ykgji4b-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/vendors~main-chunk-8e90f77cf4a58a9d5a52.min.js:135:257437 Suspense
Expected results:
2. operand creation form/yaml page should be loaded successfully
Additional info:
mock-crd-and-csv.yaml and screenshot are at https://drive.google.com/drive/folders/1Z432vVMArHLgCgzu5IMGi9_oq3iRtezx
There is a workloads change introducing the DeploymentConfigs and Builds APIs as capabilities, which gives the cluster admin the option to enable/disable each of these APIs.
In case the DeploymentConfigs capability is disabled we should remove the `Deployment Config` subsection from `Workloads` nav section.
In case the Builds capability is disabled we should remove the `Builds` and `Build Configs` subsection from `Workloads` nav section.
This is a clone of issue OCPBUGS-7893. The following is the description of the original issue:
—
Description of problem:
The TaskRun duration diagram on the "Metrics" tab of a Pipeline is set to show only 4 TaskRuns in the legend, regardless of the number of TaskRuns on the diagram.
Expected results:
All TaskRuns should be displayed in the legend.
Please review the following PR: https://github.com/openshift/vmware-vsphere-csi-driver/pull/61
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Hello, one of our customers had several cni-sysctl-allowlist-ds pods created (around 10,000 pods) in the openshift-multus namespace. That caused several issues in the cluster, as nodes were full of pods and ran out of IPs. After deleting them, the situation has improved, but we want to know the root cause of this issue. Searching in the network-operator pod logs, it seems that the customer faced some networking issues. After this issue, we can see that the cni-sysctl-allowlist pods started to be created. Could we know why the cni-sysctl-allowlist-ds pods were created?
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Unable to successfully create a HyperShift KubeVirt HostedCluster on BM; the control plane's pod/importer-prime-xxx can't become ready
Version-Release number of selected component (if applicable):
4.14
How reproducible:
100%
Steps to Reproduce:
1. HyperShift install operator 2. HyperShift create cluster KubeVirt xxx
Actual results:
➜ oc get pod -n clusters-3d9ec3c7e495f1c58da1 | grep "importer-prime" importer-prime-90175dc9-21bf-4f13-a021-6c42a2e19652 1/2 Error 16 (5m13s ago) 57m importer-prime-9f153661-1c2c-4b61-84fd-0a2d83f30699 1/2 Error 16 (5m4s ago) 57m importer-prime-cb817383-58bd-4480-a7e1-49ae42368cae 1/2 CrashLoopBackOff 15 (4m51s ago) 57m ➜ oc logs importer-prime-90175dc9-21bf-4f13-a021-6c42a2e19652 -c importer -n clusters-3d9ec3c7e495f1c58da1 I0728 18:41:20.106447 1 importer.go:103] Starting importer E0728 18:41:20.107346 1 importer.go:133] exit status 1, blockdev: cannot open /dev/cdi-block-volume: Permission denied kubevirt.io/containerized-data-importer/pkg/util.GetAvailableSpaceBlock /remote-source/app/pkg/util/util.go:136 kubevirt.io/containerized-data-importer/pkg/util.GetAvailableSpaceByVolumeMode /remote-source/app/pkg/util/util.go:106 main.main /remote-source/app/cmd/cdi-importer/importer.go:131 runtime.main /usr/lib/golang/src/runtime/proc.go:250 runtime.goexit /usr/lib/golang/src/runtime/asm_amd64.s:1598 ➜ oc get hostedcluster -n clusters 3d9ec3c7e495f1c58da1 -ojsonpath='{.status.version.desired}' | jq { "image": "registry.build01.ci.openshift.org/ci-op-ywf2rxrx/release@sha256:940a0463d1203888fb4e5fa4a09b69dc4eb3cc5d70dee22e1155c677aafca197", "version": "4.14.0-0.ci-2023-07-28-090906" } ➜ oc get hostedcluster -n clusters 3d9ec3c7e495f1c58da1 NAME VERSION KUBECONFIG PROGRESS AVAILABLE PROGRESSING MESSAGE 3d9ec3c7e495f1c58da1 3d9ec3c7e495f1c58da1-admin-kubeconfig Partial True False The hosted control plane is available ➜ oc get clusterversion version -ojsonpath='{.status.desired.image}' registry.build01.ci.openshift.org/ci-op-ywf2rxrx/release@sha256:940a0463d1203888fb4e5fa4a09b69dc4eb3cc5d70dee22e1155c677aafca197 ➜ oc get vmi -A No resources found
Expected results:
All pods on the control plane should be ready
Additional info:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/41772/rehearse-41772-periodic-ci-openshift-hypershift-release-4.14-periodics-e2e-kubevirt-baremetalds-conformance/1684954151244533760
Description of problem:
container_network* metrics stop reporting after a container restarts. Other container_* metrics continue to report for the same pod.
How reproducible:
Issue can be reproduced by triggering a container restart
Steps to Reproduce:
1. Restart container
2. Check metrics and see container_network* not reporting
Additional info:
Ticket with more detailed debugging process OHSS-16739
First showed on https://amd64.ocp.releases.ci.openshift.org/releasestream/4.14.0-0.nightly/release/4.14.0-0.nightly-2023-08-16-042125
Did not appear to happen on https://amd64.ocp.releases.ci.openshift.org/releasestream/4.14.0-0.nightly/release/4.14.0-0.nightly-2023-08-15-200133
Changelog is getting huge but I diffed these two PRs:
❯ diff 1.txt 2.txt 2a3 > Use go 1.18 when setting up environment (#5422) #5422 15a17 > CFE-688: Update install-config CRD to support gcp labels and tags #7126 23a26,27 > OCPBUGS-17711: Revert “pkg/cli/admin/release/extract: Add –included and –install-config” #1527 > Update openshift/api #1525 28a33 > pkg/aws/actuator: Drop comment which suggested passthrough permission verification #590 49a55,59 > cluster-control-plane-machine-set-operator > > OCPCLOUD-2130: Add subnet to Azure FD, fix for optional fields in FD #229 > Full changelog > 64a75 > IR-373: remove node-ca daemon #867 126a138,147 > cluster-storage-operator > > STOR-1274: use granular permissons for Azure credential requests #388 > Full changelog > > cluster-version-operator > > CNF-9385: add ImageRegistry capability #950 > Full changelog > 132a154,158 > container-networking-plugins > > OCPBUGS-17681: Default CNI binaries to RHEL 8 #116 > Full changelog > 143a170,174 > haproxy-router > > OCPBUGS-17653: haproxy/template: mitigate CVE-2023-40225 #505 > Full changelog > 193a225,229 > monitoring-plugin > > OCPBUGS-17650: Fix /monitoring/ redirects #68 > Full changelog > 204a241,245 > openstack-machine-api-provider > > Bump CAPO to match branch release-0.7 #80 > Full changelog > 206a248,249 > OCPBUGS-17157: scripts: add a Go-based bumper, sync upstream #534 > Add ncdc to DOWNSTREAM_OWNERS #539 223a267 > update watch-endpoint-slices to usable shape #28184
Description of problem:
A runtime error is encountered when running the console backend in off-cluster mode against only one cluster (non-multicluster configuration)
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Follow readme instructions for running bridge locally 2. 3.
Actual results:
Bridge crashes with a runtime error
Expected results:
Bridge should run normally
Additional info:
Description of problem:
Alert Rules do not have summary/description
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
This bug is being raised by the OpenShift Monitoring team as part of an effort to detect invalid Alert Rules in OCP. Check the details of the following Alert Rules:
1. KubeletHealthState
2. MCCDrainError
3. MCDPivotError
4. MCDRebootError
5. SystemMemoryExceedsReservation
Actual results:
These Alert Rules do not have Summary/Description annotation, but have a 'message' annotation. OpenShift alerts must use 'description' -- consider renaming the annotation
Expected results:
Alerts should have Summary/Description annotation.
Additional info:
Alerts must have a summary/description annotation; please refer to the style guide at https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#style-guide. To resolve the bug:
- Rename the 'message' annotation to a summary/description annotation
- Remove the exception in the origin test, added in PR https://github.com/openshift/origin/pull/27944
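A minimal sketch of the renaming step listed above, operating on a plain annotations map; the sample annotation text is illustrative, and the real fix belongs in the MCO's PrometheusRule manifests rather than in code like this.

```
package main

import "fmt"

// ensureDescription sketches the first resolution step above: if an alert only
// carries the legacy 'message' annotation, move its text into 'description' so
// the alert complies with the OpenShift alerting style guide.
func ensureDescription(annotations map[string]string) {
	if msg, ok := annotations["message"]; ok {
		if _, hasDesc := annotations["description"]; !hasDesc {
			annotations["description"] = msg
		}
		delete(annotations, "message")
	}
}

func main() {
	a := map[string]string{"message": "Drain failed on {{ $labels.exported_node }}"} // illustrative text
	ensureDescription(a)
	fmt.Println(a) // map[description:Drain failed on {{ $labels.exported_node }}]
}
```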
```
alert TargetDown fired for 13 seconds with labels:
```
Checking kubelet logs for all the nodes:
```
Aug 07 10:11:49.788245 libvirt-ppc64le-1-1-9-kfv8v-master-0 crio[1244]: time="2021-08-07 10:11:49.788169211Z" level=info msg="Started container dd7e2473c51870c1894531af9a3935b907340a31216f85c32e391bddf22d7fd0: openshift-machine-config-operator/machine-config-daemon-7r2bb/machine-config-daemon" id=15456b41-39c9-41ce-8f10-71398df6dd26 name=/runtime.v1alpha2.RuntimeService/StartContainer
Aug 07 10:11:49.265439 libvirt-ppc64le-1-1-9-kfv8v-master-1 crio[1242]: time="2021-08-07 10:11:49.264443242Z" level=info msg="Created container 0651d7904d63a3f2c1fa9177d2ccf890c8fc769e96c836074aa8cc28a8bd7e04: openshift-machine-config-operator/machine-config-daemon-pk29l/machine-config-daemon" id=a622e284-7d45-4b72-b271-c39081c2c77a name=/runtime.v1alpha2.RuntimeService/CreateContainer
Aug 07 10:11:49.602420 libvirt-ppc64le-1-1-9-kfv8v-master-2 crio[1243]: time="2021-08-07 10:11:49.602359290Z" level=info msg="Started container 5a24f464210595cd394aacd4e98903a196d67762a53d764bd6f4a6010cc17acf: openshift-machine-config-operator/machine-config-daemon-69fw6/machine-config-daemon" id=89b0650c-741e-4c61-ab49-f68aa82cb302 name=/runtime.v1alpha2.RuntimeService/StartContainer
Aug 07 10:15:54.666525 libvirt-ppc64le-1-1-9-kfv8v-worker-0-gddxw crio[1252]: time="2021-08-07 10:15:54.666233168Z" level=info msg="Started container 8ba32989af629e00c35578c51e9b5612ca8ddcf97b32f2b500d777a6eb2ff2e1: openshift-machine-config-operator/machine-config-daemon-5tb88/machine-config-daemon" id=4fa0e2ba-54aa-41a8-ab7b-7a3b6f6a9998 name=/runtime.v1alpha2.RuntimeService/StartContainer
Aug 07 10:16:14.170188 libvirt-ppc64le-1-1-9-kfv8v-worker-0-p76x7 crio[1235]: time="2021-08-07 10:16:14.170137303Z" level=info msg="Started container 78d933af1e7100050332b1df62e67d1fc71ca735c7a7d3c060411f61f32a0c74: openshift-machine-config-operator/machine-config-daemon-k6l8w/machine-config-daemon" id=c344fd94-abeb-4393-87f3-5bcaba21d45f name=/runtime.v1alpha2.RuntimeService/StartContainer
```
All containers started before the test started (before 2021-08-07T10:28:00Z, see https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.9-ocp-jenkins-e2e-remote-libvirt-ppc64le/1423947091704549376/build-log.txt). Checking https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.9-ocp-jenkins-e2e-remote-libvirt-ppc64le/1423947091704549376/artifacts/ocp-jenkins-e2e-remote-libvirt-ppc64le/gather-libvirt/artifacts/pods.json:
```
machine-config-daemon-5tb88_machine-config-daemon.log: assigned to libvirt-ppc64le-1-1-9-kfv8v-worker-0-gddxw, 0 restarts, ready since 2021-08-07T10:16:07Z
machine-config-daemon-k6l8w_machine-config-daemon.log: assigned to libvirt-ppc64le-1-1-9-kfv8v-worker-0-p76x7, 0 restarts, ready since 2021-08-07T10:16:14Z
machine-config-daemon-69fw6_machine-config-daemon.log: assigned to libvirt-ppc64le-1-1-9-kfv8v-master-2, 0 restarts, ready since 2021-08-07T10:11:49Z
machine-config-daemon-pk29l_machine-config-daemon.log: assigned to libvirt-ppc64le-1-1-9-kfv8v-master-1, 0 restarts, ready since 2021-08-07T10:11:49Z
machine-config-daemon-7r2bb_machine-config-daemon.log: assigned to libvirt-ppc64le-1-1-9-kfv8v-master-0, 0 restarts, ready since 2021-08-07T10:11:49Z
```
All containers were running since they got created and never restarted.
The incident (alert TargetDown fired for 13 seconds) occurred at August 7, 2021 10:33:18 AM. The test suite finished 2021-08-07T10:33:40Z.
Based on the TargetDown definition (see https://github.com/openshift/cluster-monitoring-operator/blob/001eccd81ff51af0ed7a9d463dd35bfa9b75d102/assets/cluster-monitoring-operator/prometheus-rule.yaml#L16-L28):
The machine-config-daemon target was down for 15m and 13s. Given the alert fired at 10:33:18 and the test suite only started around 10:28:00 (roughly 5m18s earlier), the target must have been down before the test suite started to run.
This pattern repeats in other jobs as well:
Please review the following PR: https://github.com/openshift/cluster-api-provider-aws/pull/459
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
VPC endpoint service cannot be cleaned up by HyperShift operator when the OIDC provider of the customer cluster has been deleted.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Sometimes
Steps to Reproduce:
1.Create a HyperShift hosted cluster 2.Delete the HyperShift cluster's OIDC provider in AWS 3.Delete the HyperShift hosted cluster
Actual results:
Cluster is stuck deleting
Expected results:
Cluster deletes
Additional info:
The hypershift operator is stuck trying to delete the AWS endpoint service but it can't be deleted because it gets an error that there are active connections.
Description of problem:
Bump Kubernetes to 0.27.1 and bump dependencies
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
On a freshly installed cluster, the control-plane-machineset-operator begins rolling a new master node, but the machine remains in a Provisioned state and never joins as a node. Its status is:
Drain operation currently blocked by: [{Name:EtcdQuorumOperator Owner:clusteroperator/etcd}]
The cluster is left in this state until an admin manually removes the stuck master node, at which point a new master machine is provisioned and successfully joins the cluster.
Version-Release number of selected component (if applicable):
4.12.4
How reproducible:
Observed at least 4 times over the last week, but unsure on how to reproduce.
Actual results:
A master node remains in a stuck Provisioned state and requires manual deletion to unstick the control plane machine set process.
Expected results:
No manual interaction should be necessary.
Additional info:
Description of problem:
The certificates synced by MCO in 4.13 onwards are more comprehensive and correct, and out of sync issues will surface much faster. See https://issues.redhat.com/browse/MCO-499 for details
Version-Release number of selected component (if applicable):
4.13
How reproducible:
Always
Steps to Reproduce:
1.Install 4.13, pause MCPs 2. 3.
Actual results:
Within ~24 hours the cluster will fire critical clusterdown alerts
Expected results:
No alerts fire
Additional info:
This PR will allow the installation of non-latest Operator channels and associated versions. https://github.com/openshift/console/pull/12743
When a version is installed that is not the `currentCSV` default version for a channel, the data returns `installed: false` and `installState: "Not Installed"`.
So the UI doesn't place an "Installed" label on the operator card in OperatorHub and the user doesn't see that it's already installed when viewing the operator details.
Version-Release number of selected component (if applicable):
4.14 cluster
Steps to Reproduce:
Animated screen gif of installed Data Grid version 8.4.3, the default latest version is 8.4.4
https://drive.google.com/file/d/1KVMCdflBYsI3yiLf2oQv69MoStgA5kof/view?usp=sharing
Actual results:
obj data returns `installState: "Not Installed"` and `installed: false`
Expected results:
obj data returns `installState: "Installed"` and `installed: true`
Additional info:
Requires 4.14 cluster to support installing previous versions and channels
Description of problem:
On 4.14, 'MachineAPI' is marked as an optional capability; disabling it disables two operators, machine-api and cluster-autoscaler. Epic link: https://issues.redhat.com/browse/CNF-6318 The machine-api operator is required for a common IPI cluster (not SNO and not compact), so if "MachineAPI" is disabled in install-config.yaml, a common IPI installation will fail. Suggest adding a pre-check on the installer side for common IPI (not SNO and not compact) clusters when running "openshift-install create cluster": if MachineAPI is disabled, the installer should exit with a corresponding message.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-06-30-131338
How reproducible:
Always
Steps to Reproduce:
1. Prepare install-config.yaml and set baselineCapabilitySet as None, make sure that compute node number is greater than 0. 2. Run command "openshift-install create cluster" to install common IPI 3.
Actual results:
Installation failed since missing machine-api operator
Expected results:
The installer should have a pre-check for this scenario and exit with an error message if MachineAPI is disabled
Additional info:
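A minimal sketch of the requested pre-check, using a trimmed-down stand-in for the install config; the field and function names are assumptions, not the installer's actual validation code.

```
package main

import (
	"errors"
	"fmt"
)

// installConfig is a hypothetical stand-in for the fields the real installer
// would inspect when validating capabilities.
type installConfig struct {
	EnabledCapabilities  map[string]bool
	ControlPlaneReplicas int
	ComputeReplicas      int
}

// validateMachineAPICapability sketches the pre-check: a common IPI topology
// (not SNO, not compact) needs the machine-api operator, so the install should
// fail fast when the MachineAPI capability is disabled.
func validateMachineAPICapability(ic installConfig) error {
	commonIPI := ic.ControlPlaneReplicas > 1 && ic.ComputeReplicas > 0
	if commonIPI && !ic.EnabledCapabilities["MachineAPI"] {
		return errors.New("the MachineAPI capability is required when compute replicas > 0; enable it or set compute replicas to 0")
	}
	return nil
}

func main() {
	err := validateMachineAPICapability(installConfig{
		EnabledCapabilities:  map[string]bool{}, // e.g. baselineCapabilitySet: None
		ControlPlaneReplicas: 3,
		ComputeReplicas:      3,
	})
	fmt.Println(err)
}
```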
Description of the problem:
We get the disk serial from ghw, which gets it from looking at 2 udev properties. There are a couple more recent udev properties that should be tried first, as lsblk does:
I have a PR open on ghw that should solve the issue. We'll need to update our version of ghw once it's merged.
See more info in the ABI ticket: https://issues.redhat.com/browse/OCPBUGS-18174
Please review the following PR: https://github.com/openshift/cloud-provider-azure/pull/59
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
This test tends to be flaky, depending on how the cert changes are propagated. We rotate 2 of the 7 certs in the bundle; if the changes don't get batched together, the assertion that verifies after the cert changes happens too soon, causing the test to fail.
Version-Release number of selected component (if applicable):
4.14.0
Description of problem:
The statefulset thanos-ruler-user-workload has no serviceName. As the documentation describes, serviceName is a must for a StatefulSet. I'm not sure if we need a service here, but one question: if we don't need a service, why not use a regular Deployment? Thanks!
MacBook-Pro:k8sgpt jianzhang$ oc explain statefulset.spec.serviceName KIND: StatefulSet VERSION: apps/v1FIELD: serviceName <string>DESCRIPTION: serviceName is the name of the service that governs this StatefulSet. This service must exist before the StatefulSet, and is responsible for the network identity of the set. Pods get DNS/hostnames that follow the pattern: pod-specific-string.serviceName.default.svc.cluster.local where "pod-specific-string" is managed by the StatefulSet controller. MacBook-Pro:k8sgpt jianzhang$ oc get statefulset -n openshift-user-workload-monitoring -o=jsonpath={.spec.serviceName} MacBook-Pro:k8sgpt jianzhang$ MacBook-Pro:k8sgpt jianzhang$ oc get statefulset -n openshift-user-workload-monitoring NAME READY AGE prometheus-user-workload 2/2 4h44m thanos-ruler-user-workload 2/2 4h44m MacBook-Pro:k8sgpt jianzhang$ oc get svc -n openshift-user-workload-monitoring NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE prometheus-operated ClusterIP None <none> 9090/TCP,10901/TCP 4h44m prometheus-operator ClusterIP None <none> 8443/TCP 4h44m prometheus-user-workload ClusterIP 172.30.46.204 <none> 9091/TCP,9092/TCP,10902/TCP 4h44m prometheus-user-workload-thanos-sidecar ClusterIP None <none> 10902/TCP 4h44m thanos-ruler ClusterIP 172.30.110.49 <none> 9091/TCP,9092/TCP,10901/TCP 4h44m thanos-ruler-operated ClusterIP None <none> 10902/TCP,10901/TCP 4h44m
Version-Release number of selected component (if applicable):
MacBook-Pro:k8sgpt jianzhang$ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.14.0-0.nightly-2023-05-31-080250 True False 7h30m Cluster version is 4.14.0-0.nightly-2023-05-31-080250
How reproducible:
always
Steps to Reproduce:
1. Install OCP 4.14 cluster. 2. Check cluster's statefulset instances or run `k8sgpt analyze -d` 3.
Actual results:
MacBook-Pro:k8sgpt jianzhang$ k8sgpt analyze -d Service nfs-provisioner/example.com-nfs does not exist AI Provider: openai 0 openshift-user-workload-monitoring/thanos-ruler-user-workload(thanos-ruler-user-workload) - Error: StatefulSet uses the service openshift-user-workload-monitoring/ which does not exist. Kubernetes Doc: serviceName is the name of the service that governs this StatefulSet. This service must exist before the StatefulSet, and is responsible for the network identity of the set. Pods get DNS/hostnames that follow the pattern: pod-specific-string.serviceName.default.svc.cluster.local where "pod-specific-string" is managed by the StatefulSet controller.
Expected results:
The statefulset has a serviceName set.
Additional info:
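For illustration, a minimal sketch of a conformant spec using the Kubernetes apps/v1 types; picking thanos-ruler-operated as the governing service is an assumption based on the service list above, not a confirmed fix.

```
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// The StatefulSet names an existing headless service as the governing
	// service, which is what spec.serviceName is for; pods then get stable
	// DNS names of the form <pod>.<serviceName>.<namespace>.svc.
	sts := appsv1.StatefulSet{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "thanos-ruler-user-workload",
			Namespace: "openshift-user-workload-monitoring",
		},
		Spec: appsv1.StatefulSetSpec{
			ServiceName: "thanos-ruler-operated", // assumed governing service
		},
	}
	fmt.Println(sts.Spec.ServiceName)
}
```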
Description of problem:
The script for checking the certs for Openshift install on openstack fails. https://docs.openshift.com/container-platform/4.12/installing/installing_openstack/preparing-to-install-on-openstack.html#security-osp-validating-certificates_preparing-to-install-on-openstack I see that the command "openstack catalog list --format json --column Name --column Endpoints" returns output as, ----------- [ { "Name": "heat-cfn", "Endpoints": "RegionOne\n admin: http://10.254.x.x:8000/v1\nRegionOne\n public: https://<domain_name>:8000/v1\nRegionOne\n internal: http://10.254.x.x:8000/v1\n" }, { "Name": "cinderv2", "Endpoints": "RegionOne\n admin: http://10.254.x.x:8776/v2/f36f2db6bb434484b71a45aa84b9d790\nRegionOne\n internal: http://10.254.x.x:8776/v2/f36f2db6bb434484b71a45aa84b9d790\nRegionOne\n public: https://<domain_name>:8776/v2/f36f2db6bb434484b71a45aa84b9d790\n" }, { "Name": "glance", "Endpoints": "RegionOne\n public: https://<domain_name>:9292\nRegionOne\n admin: http://10.254.x.x:9292\nRegionOne\n internal: http://10.254.x.x:9292\n" }, { "Name": "keystone", "Endpoints": "RegionOne\n internal: http://10.254.x.x:5000\nRegionOne\n admin: http://10.254.x.x:35357\nRegionOne\n public: https://<domain_name>:5000\n" }, { "Name": "swift", "Endpoints": "RegionOne\n admin: https://ch-dc-s3-gsn-33.eecloud.nsn-net.net:10032/swift/v1\nRegionOne\n public: https://ch-dc-s3-gsn-33.eecloud.nsn-net.net:10032/swift/v1\nRegionOne\n internal: https://ch-dc-s3-gsn-33.eecloud.nsn-net.net:10032/swift/v1\n" }, { "Name": "nova", "Endpoints": "RegionOne\n public: https://<domain_name>:8774/v2.1\nRegionOne\n internal: http://10.254.x.x:8774/v2.1\nRegionOne\n admin: http://10.254.x.x:8774/v2.1\n" }, { "Name": "heat", "Endpoints": "RegionOne\n internal: http://10.254.x.x:8004/v1/f36f2db6bb434484b71a45aa84b9d790\nRegionOne\n public: https://<domain_name>:8004/v1/f36f2db6bb434484b71a45aa84b9d790\nRegionOne\n admin: http://10.254.x.x:8004/v1/f36f2db6bb434484b71a45aa84b9d790\n" }, { "Name": "cinder", "Endpoints": "" }, { "Name": "cinderv3", "Endpoints": "RegionOne\n public: https://<domain_name>:8776/v3/f36f2db6bb434484b71a45aa84b9d790\nRegionOne\n admin: http://10.254.x.x:8776/v3/f36f2db6bb434484b71a45aa84b9d790\nRegionOne\n internal: http://10.254.x.x:8776/v3/f36f2db6bb434484b71a45aa84b9d790\n" }, { "Name": "neutron", "Endpoints": "RegionOne\n internal: http://10.254.x.x:9696\nRegionOne\n public: https://<domain_name>:9696\nRegionOne\n admin: http://10.254.x.x:9696\n" }, { "Name": "placement", "Endpoints": "RegionOne\n internal: http://10.254.x.x:8778\nRegionOne\n admin: http://10.254.x.x:8778\nRegionOne\n public: https://<domain_name>:8778\n" } ] ----------- Which then expected to be filtered with jq as " | jq -r '.[] | .Name as $name | .Endpoints[] | [$name, .interface, .url] | join(" ")'| sort " But it fails with error as, ---------------- ./certs.sh jq: error (at <stdin>:46): Cannot iterate over string ("RegionOne\...) Further check the script following commands execution is failing openstack catalog list --format json --column Name --column Endpoints \ > | jq -r '.[] | .Name as $name | .Endpoints[] | [$name, .interface, .url] | join(" ")' jq: error (at <stdin>:46): Cannot iterate over string ("RegionOne\...) ---------------- Where certs.sh is the script we copied from documentation. I did some debugs to get the things .interface,.url to internal,public,admin fields from endpoint but I'm not sure if that's way it is on openstack so marking this as BZ to have reviewed.
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.12 on the 3.18.1 release of OpenStack
How reproducible:
- Always
Steps to Reproduce:
1. Copy the script from the documentation and run it on the OpenStack release above.
Actual results:
The script fails while parsing the catalog output.
Expected results:
The script shouldn't fail.
Additional info:
Invoking 'create cluster-manifests' fails when imageContentSources is missing in install-config yaml:
$ openshift-install agent create cluster-manifests
INFO Consuming Install Config from target directory
FATAL failed to write asset (Mirror Registries Config) to disk: failed to write file: open .: is a directory
install-config.yaml:
apiVersion: v1alpha1
metadata:
  name: appliance
rendezvousIP: 192.168.122.116
hosts:
  - hostname: sno
    installerArgs: '["--save-partlabel", "agent*", "--save-partlabel", "rhcos-*"]'
    interfaces:
      - name: enp1s0
        macAddress: 52:54:00:e7:05:72
    networkConfig:
      interfaces:
        - name: enp1s0
          type: ethernet
          state: up
          mac-address: 52:54:00:e7:05:72
          ipv4:
            enabled: true
            dhcp: true
Description of problem:
The following changes are required for openshift/route-controller-manager#22 refactoring.
add POD_NAME to route-controller-manager deployment
introduce route-controller-defaultconfig and customize the lease name openshift-route-controllers to override the default supplied by library-go
add RBAC for infrastructures, which library-go uses to configure leader election
Description of problem:
We are seeing flakes in HyperShift CI jobs: https://search.ci.openshift.org/?search=Alerting+rule+%22CsvAbnormalFailedOver2Min%22&maxAge=48h&context=1&type=bug%2Bissue%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job Sample failure: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1692/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-hypershift/1664244482360479744 { fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:148]: Incompliant rules detected: Alerting rule "CsvAbnormalFailedOver2Min" (group: olm.csv_abnormal.rules) has no 'description' annotation, but has a 'message' annotation. OpenShift alerts must use 'description' -- consider renaming the annotation Alerting rule "CsvAbnormalFailedOver2Min" (group: olm.csv_abnormal.rules) has no 'summary' annotation Alerting rule "CsvAbnormalOver30Min" (group: olm.csv_abnormal.rules) has no 'description' annotation, but has a 'message' annotation. OpenShift alerts must use 'description' -- consider renaming the annotation Alerting rule "CsvAbnormalOver30Min" (group: olm.csv_abnormal.rules) has no 'summary' annotation Ginkgo exit error 1: exit with code 1}
Version-Release number of selected component (if applicable):
4.14 CI
How reproducible:
sometimes
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Serverless -> Eventing -> Channels: values under the Conditions column are in English. Translator comments: "x OK/y" should be translated as "x个 OK(共y个)"
Version-Release number of selected component (if applicable):
4.13.0-ec.1
How reproducible:
always
Steps to Reproduce:
1. Navigate to Serverless -> Eventing -> Channels.
2. Observe that values under the Conditions column are in English.
Actual results:
Content is in English.
Expected results:
Content should be in the target language. "x OK/y" should be translated as "x个 OK(共y个)".
Additional info:
screenshot provided
Description of the problem:
The OCI platform is available only from OCP 4.14; it should not be possible to create an OCI cluster with OCP < 4.14.
How reproducible:
You can reproduce with aicli
Steps to reproduce:
$ aicli --integration create cluster agentil-test-oci-19 -P platform='{"type": "oci"}' -P pull_secret=<your pull secret> -P user_managed_networking=true -P minimal=true -P openshift_version=4.13
Actual results:
[agentil@fedora Downloads]$ aicli --integration create cluster agentil-test-oci-19 -P platform='{"type": "oci"}' -P pull_secret=~/Downloads/pull-secret.txt -P user_managed_networking=true -P minimal=true -P openshift_version=4.13 Creating cluster agentil-test-oci-19 Using karmalabs.corp as DNS domain as no one was provided Forcing network_type to OVNKubernetes Using version 4.13.2 Creating infraenv agentil-test-oci-19_infra-env Using karmalabs.corp as DNS domain as no one was provided [agentil@fedora Downloads]$ aicli --integration info cluster agentil-test-oci-19 ams_subscription_id: 2QvJWtlvlUIvFtCmOIPiwkHRirC api_vips: [] base_dns_domain: karmalabs.corp cluster_networks: [{'cluster_id': '65f2a1fa-efd2-419a-9bf0-802e595a0a63', 'cidr': '10.128.0.0/14', 'host_prefix': 23}] connectivity_majority_groups: {"IPv4":[],"IPv6":[]} controller_logs_collected_at: 0001-01-01 00:00:00+00:00 controller_logs_started_at: 0001-01-01 00:00:00+00:00 cpu_architecture: x86_64 created_at: 2023-06-08 12:42:36.327854+00:00 disk_encryption: {'enable_on': 'none', 'mode': 'tpmv2', 'tang_servers': None} email_domain: redhat.com feature_usage: {"Cluster Tags":{"id":"CLUSTER_TAGS","name":"Cluster Tags"},"Hyperthreading":{"data":{"hyperthreading_enabled":"all"},"id":"HYPERTHREADING","name":"Hyperthreading"},"OVN network type":{"id":"OVN_NETWORK_TYPE","name":"OVN network type"},"Platform selection":{"data":{"platform_type":"oci"},"id":"PLATFORM_SELECTION","name":"Platform selection"},"User Managed Networking With Multi Node":{"id":"USER_MANAGED_NETWORKING_WITH_MULTI_NODE","name":"User Managed Networking With Multi Node"}} high_availability_mode: Full hyperthreading: all id: 65f2a1fa-efd2-419a-9bf0-802e595a0a63 ignition_endpoint: {'url': None, 'ca_certificate': None} imported: False ingress_vips: [] install_completed_at: 0001-01-01 00:00:00+00:00 install_started_at: 0001-01-01 00:00:00+00:00 ip_collisions: {} machine_networks: [] monitored_operators: [{'cluster_id': '65f2a1fa-efd2-419a-9bf0-802e595a0a63', 'name': 'console', 'version': None, 'namespace': None, 'subscription_name': None, 'operator_type': 'builtin', 'properties': None, 'timeout_seconds': 3600, 'status': None, 'status_info': None, 'status_updated_at': datetime.datetime(1, 1, 1, 0, 0, tzinfo=tzutc())}, {'cluster_id': '65f2a1fa-efd2-419a-9bf0-802e595a0a63', 'name': 'cvo', 'version': None, 'namespace': None, 'subscription_name': None, 'operator_type': 'builtin', 'properties': None, 'timeout_seconds': 3600, 'status': None, 'status_info': None, 'status_updated_at': datetime.datetime(1, 1, 1, 0, 0, tzinfo=tzutc())}] name: agentil-test-oci-19 network_type: OVNKubernetes ocp_release_image: quay.io/openshift-release-dev/ocp-release:4.13.2-x86_64 openshift_version: 4.13.2 org_id: 11009103 platform: {'type': 'oci'} progress: {'total_percentage': None, 'preparing_for_installation_stage_percentage': None, 'installing_stage_percentage': None, 'finalizing_stage_percentage': None} schedulable_masters: False schedulable_masters_forced_true: True service_networks: [{'cluster_id': '65f2a1fa-efd2-419a-9bf0-802e595a0a63', 'cidr': '172.30.0.0/16'}] status: insufficient status_info: Cluster is not ready for install status_updated_at: 2023-06-08 12:42:36.324000+00:00 tags: aicli updated_at: 2023-06-08 12:42:43.362119+00:00 user_managed_networking: True user_name: agentil@redhat.com
Expected results:
The cluster creation should fail because the version of OCP is incompatible with OCI platform.
Description of problem:
When authenticating openshift-install with the gcloud CLI, rather than using a service account key file, the installer throws an error because https://github.com/openshift/installer/blob/master/pkg/asset/machines/gcp/machines.go#L170-L178 ALWAYS expects to extract a service account to pass through to nodes in XPN installs. An alternative approach would be to handle the lack of a service account without error, and allow the required service accounts to be passed in through another mechanism.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Create install config for gcp xpn install
2. Authenticate installer without service account key file (either gcloud cli auth or through a VM).
Actual results:
Expected results:
Additional info:
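To illustrate the alternative approach suggested in the description, here is a hypothetical Go sketch: it tolerates credentials that carry no embedded service-account key (gcloud CLI or VM auth) and lets an explicitly provided value win. The type, function names, and the override parameter are assumptions for illustration, not the installer's actual code.

package sketch

import (
	"encoding/json"
	"fmt"
)

// gcpCredentials only carries the field we care about; key-file credentials
// embed client_email, gcloud CLI / VM credentials do not.
type gcpCredentials struct {
	ClientEmail string `json:"client_email"`
}

// serviceAccountEmail returns the service account to pass through to nodes.
// An empty result is not an error here: the caller would be expected to supply
// the service accounts through another mechanism (assumed, e.g. install-config).
func serviceAccountEmail(jsonKey []byte, override string) (string, error) {
	if override != "" {
		return override, nil // explicitly provided value wins
	}
	if len(jsonKey) == 0 {
		return "", nil // no key file present; do not fail hard
	}
	var c gcpCredentials
	if err := json.Unmarshal(jsonKey, &c); err != nil {
		return "", fmt.Errorf("failed to parse credentials JSON: %w", err)
	}
	return c.ClientEmail, nil
}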
As discussed in https://issues.redhat.com/browse/MON-1634, adding ownerref will be put on hold for now until CMO has a CR.
In the meantime we'll add (hopefully temporary) labels to emphasize ownership. This will help guide users for now and help us highlight relations and how we can/want to express them using ownerref in the future. (See option 1 and option 2 in the doc above.)
Description of problem:
"oc adm upgrade --to-multi-arch" command have no guard in cases where there's cluster conditions that may interfere with the transition, such as: Invalid=True, Failing=True, and Progressing=True
Steps to Reproduce:
Either apply the command while an upgrade is in progress, or while there are cluster conditions such as Invalid=True or Failing=True.
Actual results:
accepts the command
Expected results:
The command should warn about the interfering condition and allow proceeding only if --allow-upgrade-with-warnings is applied.
Description of problem:
The e2e-nutanix test run failed at the bootstrap stage when testing the PR https://github.com/openshift/cloud-provider-nutanix/pull/7. The bootstrap failure could be reproduced with manual testing by creating a Nutanix OCP cluster with the latest nutanix-ccm image.
time="2023-03-06T12:25:56-05:00" level=error msg="Bootstrap failed to complete: timed out waiting for the condition"
time="2023-03-06T12:25:56-05:00" level=error msg="Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane."
time="2023-03-06T12:25:56-05:00" level=warning msg="The bootstrap machine is unable to resolve API and/or API-Int Server URLs"
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
From the PR https://github.com/openshift/cloud-provider-nutanix/pull/7, trigger the e2e-nutanix test. The test will fail at bootstrap stage with the described errors.
Actual results:
The e2e-nutanix test run failed at bootstrapping with the errors: level=error msg=Bootstrap failed to complete: timed out waiting for the condition level=error msg=Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane.
Expected results:
The e2e-nutanix test will pass
Additional info:
Investigation showed the root cause was the Nutanix cloud-controller-manager pod did not have permission to get/list ConfigMap resource. The error logs from the Nutanix cloud-controller-manager pod: E0307 16:08:31.753165 1 reflector.go:140] pkg/provider/client.go:124: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-cloud-controller-manager:cloud-controller-manager" cannot list resource "configmaps" in API group "" at the cluster scope I0307 16:09:30.050507 1 reflector.go:257] Listing and watching *v1.ConfigMap from pkg/provider/client.go:124 W0307 16:09:30.052278 1 reflector.go:424] pkg/provider/client.go:124: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-cloud-controller-manager:cloud-controller-manager" cannot list resource "configmaps" in API group "" at the cluster scope E0307 16:09:30.052308 1 reflector.go:140] pkg/provider/client.go:124: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-cloud-controller-manager:cloud-controller-manager" cannot list resource "configmaps" in API group "" at the cluster scope
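For reference, the missing access corresponds to a rule along these lines for the cloud-controller-manager service account; this is an illustrative sketch expressed with the rbac/v1 Go types, not the actual manifest or code from the cloud-provider-nutanix repository.

package sketch

import rbacv1 "k8s.io/api/rbac/v1"

// configMapReadRule grants the get/list/watch access on ConfigMaps that the
// error logs above show is missing for the CCM service account.
var configMapReadRule = rbacv1.PolicyRule{
	APIGroups: []string{""}, // core API group
	Resources: []string{"configmaps"},
	Verbs:     []string{"get", "list", "watch"},
}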
This is a clone of issue OCPBUGS-18772. The following is the description of the original issue:
—
MCO installs resolve-prepender NetworkManager script on the nodes. In order to find out node details it needs to pull baremetalRuntimeCfgImage. However, this image needs to be pulled just the first time, in the followup attempts this script just verifies that this image is available.
This is not desirable in situations where the mirror / quay are unavailable or having a temporary problem - these kinds of issues should not prevent the node from starting kubelet. During certificate rotation testing I noticed that a node with a significant time skew won't start kubelet, as it tries to pull baremetalRuntimeCfgImage before kubelet starts - but the image is already on the node and doesn't need refreshing.
Manifests are copied from the object store (either S3 or pod) into the node that is performing the role of bootstrap during installation (or to the single node in an SNO setup)
They are copied into one of two directories according to the directory into which they were uploaded to the object store.
<cluster-id>/manifests/manifests/* will end up being copied to /run/ephemeral/var/opt/openshift/manifests/
<cluster-id>/manifests/openshift/* will end up being copied to /run/ephemeral/var/opt/openshift/openshift/manifest
After this step, any files that have been written to /run/ephemeral/var/opt/openshift/openshift/ are also copied to /run/ephemeral/var/opt/openshift/manifests/, any identically named files are overwritten as part of this operation.
This behaviour is entirely expected and correct; however, it does lead to an issue if a user chooses to upload files with identical names to both directories, for example:
File 1: <cluster-id>/manifests/manifests/manifest1.yaml
File 2: <cluster-id>/manifests/openshift/manifest1.yaml
In that case only File 2 would end up being applied; File 1 would be overwritten during the bootkube phase.
We should prevent this from happening by treating any attempt to introduce the same file in two places as illegal, meaning that if File 2 is present, we should prevent the upload of File 1 and vice versa during the creation/update of a manifest.
Description of problem:
Now that the bug to include libnmstate.2.2.x has been resolved (https://issues.redhat.com/browse/OCPBUGS-11659), we are seeing a boot issue in which agent-tui can't start. It looks like it is failing to find the libnmstate.so.2 symlink; when it is run directly we see:
$ /usr/local/bin/agent-tui
/usr/local/bin/agent-tui: error while loading shared libraries: libnmstate.so.2: cannot open shared object file: No such file or directory
As a result, neither the console nor SSH is available during bootstrap, which makes debugging difficult. However, it does not affect the installation; we still get a successful install. The bootstrap screenshots are attached.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
If the user specifies a DNS name in an EgressNetworkPolicy for which the upstream server returns a truncated DNS response, openshift-sdn does not fall back to TCP as expected but simply treats this as a failure.
Version-Release number of selected component (if applicable):
4.11 (originally reproduced on 4.9)
How reproducible:
Always
Steps to Reproduce:
1. Setup an EgressNetworkPolicy that points to a domain where a truncated response is returned while querying via UDP. 2. 3.
Actual results:
Error, DNS resolution not completed.
Expected results:
Request retried via TCP and succeeded.
Additional info:
In comments.
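As context for the expected behavior in the EgressNetworkPolicy report above, here is a minimal Go sketch of the UDP-to-TCP retry pattern using the github.com/miekg/dns library; it is an illustration of the technique, not the openshift-sdn implementation.

package sketch

import (
	"github.com/miekg/dns"
)

// resolveWithTCPFallback queries a name over UDP first and repeats the query
// over TCP when the response comes back truncated (TC bit set) or UDP fails.
func resolveWithTCPFallback(name, server string) (*dns.Msg, error) {
	m := new(dns.Msg)
	m.SetQuestion(dns.Fqdn(name), dns.TypeA)

	udp := &dns.Client{Net: "udp"}
	in, _, err := udp.Exchange(m, server) // server is "host:53"
	if err == nil && in != nil && !in.Truncated {
		return in, nil
	}

	// Truncated (or failed) UDP answer: retry the same query over TCP.
	tcp := &dns.Client{Net: "tcp"}
	in, _, err = tcp.Exchange(m, server)
	return in, err
}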
Description of problem:
When the user edits a Deployment and changes only the rollout "Strategy type", the form cannot be saved because the Save button stays disabled.
Version-Release number of selected component (if applicable):
4.13
How reproducible:
Always
Steps to Reproduce:
1. Edit a Deployment in the console form view.
2. Change only the rollout "Strategy type" and try to save.
Actual results:
Save button stays disabled
Expected results:
Save button should enable when changing a value (that doesn't make the form state invalid)
Additional info:
Description of problem:
egressip cannot be assigned on hypershift hosted cluster node
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2023-03-09-162945
How reproducible:
100%
Steps to Reproduce:
1. Set up a hypershift env.
2. Label egress IP nodes on the hosted cluster:
% oc get node
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-129-175.us-east-2.compute.internal   Ready    worker   3h20m   v1.26.2+bc894ae
ip-10-0-129-244.us-east-2.compute.internal   Ready    worker   3h20m   v1.26.2+bc894ae
ip-10-0-141-41.us-east-2.compute.internal    Ready    worker   3h20m   v1.26.2+bc894ae
ip-10-0-142-54.us-east-2.compute.internal    Ready    worker   3h20m   v1.26.2+bc894ae
% oc label node/ip-10-0-129-175.us-east-2.compute.internal k8s.ovn.org/egress-assignable=""
node/ip-10-0-129-175.us-east-2.compute.internal labeled
% oc label node/ip-10-0-129-244.us-east-2.compute.internal k8s.ovn.org/egress-assignable=""
node/ip-10-0-129-244.us-east-2.compute.internal labeled
% oc label node/ip-10-0-141-41.us-east-2.compute.internal k8s.ovn.org/egress-assignable=""
node/ip-10-0-141-41.us-east-2.compute.internal labeled
% oc label node/ip-10-0-142-54.us-east-2.compute.internal k8s.ovn.org/egress-assignable=""
node/ip-10-0-142-54.us-east-2.compute.internal labeled
3. Create an EgressIP:
% cat egressip.yaml
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: egressip-1
spec:
  egressIPs: [ "10.0.129.180" ]
  namespaceSelector:
    matchLabels:
      env: ovn-tests
% oc apply -f egressip.yaml
egressip.k8s.ovn.org/egressip-1 created
4. Check the EgressIP assignment.
Actual results:
The egressip is not assigned to any node:
% oc get egressip
NAME         EGRESSIPS      ASSIGNED NODE   ASSIGNED EGRESSIPS
egressip-1   10.0.129.180
Expected results:
The egressip can be assigned to one of the hosted cluster nodes.
Additional info:
Description of problem:
Starting with 4.12.0-0.nightly-2023-03-13-172313, the machine API operator began receiving an invalid version tag, either due to a missing or invalid VERSION_OVERRIDE (https://github.com/openshift/machine-api-operator/blob/release-4.12/hack/go-build.sh#L17-L20) value being passed to the build. This is resulting in all jobs invoked by the 4.12 nightlies failing to install.
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2023-03-13-172313 and later
How reproducible:
consistently in 4.12 nightlies only (CI builds do not seem to be impacted).
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Example of failure https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.12-e2e-aws-csi/1635331349046890496/artifacts/e2e-aws-csi/gather-extra/artifacts/pods/openshift-machine-api_machine-api-operator-866d7647bd-6lhl4_machine-api-operator.log
Description of problem:
The command `$ oc explain route.spec.tls.insecureEdgeTerminationPolicy` shows different values than the ones the API actually accepts.
Version-Release number of selected component (if applicable):
4.10.z
How reproducible:
100%
Steps to Reproduce:
1. $ oc explain route.spec.tls.insecureEdgeTerminationPolicy
KIND:     Route
VERSION:  route.openshift.io/v1
FIELD:    insecureEdgeTerminationPolicy <string>
DESCRIPTION:
  insecureEdgeTerminationPolicy indicates the desired behavior for insecure connections to a route. While each router may make its own decisions on which ports to expose, this is normally port 80.
  * Allow - traffic is sent to the server on the insecure port (default)
  * Disable - no traffic is allowed on the insecure port.
  * Redirect - clients are redirected to the secure port.
2. Set the option to 'Disable' in any secure route:
$ oc edit route <route-name>
spec:
  host: hello.example.com
  port:
    targetPort: https
  tls:
    insecureEdgeTerminationPolicy: Disable
3. After editing the route and setting `insecureEdgeTerminationPolicy: Disable`, it gives the error:
Danger alert: An error occurred
Error "Invalid value: "Disable": invalid value for InsecureEdgeTerminationPolicy option, acceptable values are None, Allow, Redirect, or empty" for field "spec.tls.insecureEdgeTerminationPolicy".
Actual results:
Based on the API Usage information, the Disable value for insecureEdgeTerminationPolicy field is not acceptable.
Expected results:
The `oc explain route.spec.tls.insecureEdgeTerminationPolicy` must show the correct values.
Additional info:
Description of problem:
We are not error checking the response when we request console plugins in getConsolePlugins. If this request fails, we still try to access the "Items" property of the response, which is nil, causing an exception to be thrown. We need to make sure the request succeeded before referencing any properties of the response.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Run bridge locally without setting the requisite env vars
Actual results:
A runtime exception is thrown from the getConsolePlugins function and bridge terminates
Expected results:
An error should be logged and bridge should continue to run
Additional info:
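As additional context, a minimal Go sketch of the intended pattern: check the list error before touching Items, log it, and continue without dynamic plugins so bridge keeps running. The list function and types here are stand-ins, not the console's actual client code.

package sketch

import (
	"k8s.io/klog/v2"
)

// pluginList stands in for the real API response type; only the shape matters here.
type pluginList struct {
	Items []struct{ Name string }
}

// consolePluginNames checks the list error before touching Items and falls
// back to an empty set so bridge keeps running without dynamic plugins.
func consolePluginNames(list func() (*pluginList, error)) []string {
	plugins, err := list()
	if err != nil || plugins == nil {
		klog.Errorf("Failed to list console plugins, continuing without dynamic plugins: %v", err)
		return nil
	}
	names := make([]string, 0, len(plugins.Items))
	for _, p := range plugins.Items {
		names = append(names, p.Name)
	}
	return names
}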
As an ODC helm backend developer I would like to be able to bump version of helm to 3.12 to stay synched up with the version we will ship with OCP 4.14
Normal activity we do every time a new OCP version is release to stay current
NA
NA
Bump the helm version to 3.12; run, build, and unit test, and make sure everything is working as expected. Last time we had a conflict with the DevFile backend.
Might have dependencies on the DevFile team to move some dependencies forward
NA
Console Helm dependency is moved to 3.12
Dependencies identified
Blockers noted and expected delivery timelines set
Design is implementable
Acceptance criteria agreed upon
Story estimated
Unknown
Verified
Unsatisfied
Description of problem:
NAT gateway is not yet a supported feature and the current implementation is a partial non-zonal solution.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
always
Steps to Reproduce:
1. Set OutboundType = NatGateway 2. Deploy cluster 3.
Actual results:
Install successful
Expected results:
Install requires TechPreviewNoUpgrade before proceeding
Additional info:
Description of problem:
Per the discussion in https://github.com/openshift/openshift-docs/pull/59549#discussion_r1184195239, the text in the dev console when creating a function says a func.yaml file must be present OR it must use the s2i build strategy, when in fact both things are required.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Go to +Add -> Create Serverless function and use a repo URL that doesn't fit the requirements in order to see the error
Actual results:
Expected results:
Additional info:
Version:
$ openshift-install version
./openshift-install 4.9.11
built from commit 4ee186bb88bf6aeef8ccffd0b5d4e98e9ddd895f
release image quay.io/openshift-release-dev/ocp-release@sha256:0f72e150329db15279a1aeda1286c9495258a4892bc5bf1bf5bb89942cd432de
release architecture amd64
Platform: Openstack
install type: IPI
What happened?
Image streams use the Swift container to store images. After running many image streams, the Swift container holds a huge number of objects, and destroying the cluster then takes a very long time, proportional to the size of the Swift container.
What did you expect to happen?
The destroy command should clean up the resources in a reasonable time.
How to reproduce it (as minimally and precisely as possible)?
Deploy OCP, run a workload that creates a lot of image streams, then destroy the cluster; the destroy command takes a long time to complete.
Anything else we need to know?
Here is the output of the swift stat command and the time it took to complete the destroy job:
$ swift stat vlan609-26jxm-image-registry-nseyclolgfgxoaiysrlejlhvoklcawbxt
Account: AUTH_2b4d979a2a9e4cf88b2509e9c5e0e232
Container: vlan609-26jxm-image-registry-nseyclolgfgxoaiysrlejlhvoklcawbxt
Objects: 723756
Bytes: 652448740473
Read ACL:
Write ACL:
Sync To:
Sync Key:
Meta Name: vlan609-26jxm-image-registry-nseyclolgfgxoaiysrlejlhvoklcawbxt
Meta Openshiftclusterid: vlan609-26jxm
Content-Type: application/json; charset=utf-8
X-Timestamp: 1640248399.77606
Last-Modified: Thu, 23 Dec 2021 08:34:48 GMT
Accept-Ranges: bytes
X-Storage-Policy: Policy-0
X-Trans-Id: txb0717d5198e344a5a095d-0061c93b70
X-Openstack-Request-Id: txb0717d5198e344a5a095d-0061c93b70
Time took to complete the destroy: 6455.42s
If the user provides a partial, empty, or invalid CA certificate in the ignition endpoint override, the ignitionDownloadable/API_VIP validation will fail, but the user will not know why.
In the agent log we will see this error:
Failed to download worker.ign: unable to parse cert
One option to let the user know about the problem is to return the error as part of the APIVipConnectivityResponse when the download fails, and to use that value in the failing validation message. This is a bit tricky: the current error messages are not user facing, so we would need to adjust them, and it also requires API changes.
Another option is to validate the parameters the user provides up front.
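A minimal Go sketch of that up-front validation, assuming the CA bundle is handed in as a PEM-encoded string; this is an illustration, not the assisted-service code.

package sketch

import (
	"crypto/x509"
	"errors"
	"strings"
)

// validateCABundle checks that a user-supplied CA bundle contains at least one
// parseable PEM certificate, so the failure can be reported at submission time
// instead of surfacing later as "unable to parse cert".
func validateCABundle(pemData string) error {
	if strings.TrimSpace(pemData) == "" {
		return errors.New("CA certificate is empty")
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM([]byte(pemData)) {
		return errors.New("CA certificate contains no valid PEM-encoded certificates")
	}
	return nil
}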
Description of the problem:
While scale testing ACM 2.8, sometimes none of the SNOs are discovered. Upon review, the agent on the SNOs is attempting to return the inspection data to the API VIP address instead of the IP address of the node hosting the metal3 pod. Presumably, in the runs where the agents were discovered, the API VIP address happened to be on the same node as the metal3 pod.
How reproducible:
Roughly 66% of the time with a 3-node cluster (whenever the API VIP is not on the node hosting the metal3 pod).
Steps to reproduce:
1.
2.
3.
Actual results:
Ironic agents attempt to access "fc00:1004::3", which is the API VIP address:
2023-03-12 17:52:51.441 1 CRITICAL ironic-python-agent [-] Unhandled error: requests.exceptions.ConnectionError: HTTPSConnectionPool(host='fc00:1004::3', port=5050): Max retries exceeded with url: /v1/continue (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f94354114c0>: Failed to establish a new connection: [Errno 111] ECONNREFUSED')) 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent Traceback (most recent call last): 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/urllib3/connection.py", line 169, in _new_conn 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent conn = connection.create_connection( 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/urllib3/util/connection.py", line 96, in create_connection 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent raise err 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/urllib3/util/connection.py", line 86, in create_connection 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent sock.connect(sa) 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/eventlet/greenio/base.py", line 253, in connect 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent socket_checkerr(fd) 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/eventlet/greenio/base.py", line 51, in socket_checkerr 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent raise socket.error(err, errno.errorcode[err]) 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent ConnectionRefusedError: [Errno 111] ECONNREFUSED 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent During handling of the above exception, another exception occurred: 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent Traceback (most recent call last): 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/urllib3/connectionpool.py", line 699, in urlopen 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent httplib_response = self._make_request( 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/urllib3/connectionpool.py", line 382, in _make_request 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent self._validate_conn(conn) 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/urllib3/connectionpool.py", line 1010, in _validate_conn 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent conn.connect() 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/urllib3/connection.py", line 353, in connect 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent conn = self._new_conn() 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/urllib3/connection.py", line 181, in _new_conn 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent raise NewConnectionError( 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7f94354114c0>: Failed to establish a new connection: [Errno 111] ECONNREFUSED 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent During handling of the above exception, another exception occurred: 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent 2023-03-12 17:52:51.441 1 ERROR 
ironic-python-agent Traceback (most recent call last): 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/requests/adapters.py", line 439, in send 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent resp = conn.urlopen( 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/urllib3/connectionpool.py", line 755, in urlopen 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent retries = retries.increment( 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/urllib3/util/retry.py", line 574, in increment 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent raise MaxRetryError(_pool, url, error or ResponseError(cause)) 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='fc00:1004::3', port=5050): Max retries exceeded with url: /v1/continue (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f94354114c0>: Failed to establish a new connection: [Errno 111] ECONNREFUSED')) 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent During handling of the above exception, another exception occurred: 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent Traceback (most recent call last): 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent File "/usr/bin/ironic-python-agent", line 10, in <module> 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent sys.exit(run()) 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/ironic_python_agent/cmd/agent.py", line 50, in run 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent agent.IronicPythonAgent(CONF.api_url, 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/ironic_python_agent/agent.py", line 471, in run 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent uuid = inspector.inspect() 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/ironic_python_agent/inspector.py", line 106, in inspect 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent resp = call_inspector(data, failures) 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/ironic_python_agent/inspector.py", line 145, in call_inspector 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent resp = _post_to_inspector() 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/tenacity/__init__.py", line 329, in wrapped_f 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent return self.call(f, *args, **kw) 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/tenacity/__init__.py", line 409, in call 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent do = self.iter(retry_state=retry_state) 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/tenacity/__init__.py", line 368, in iter 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent raise retry_exc.reraise() 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/tenacity/__init__.py", line 186, in reraise 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent raise self.last_attempt.result() 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 439, in result 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent return 
self.__get_result() 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 391, in __get_result 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent raise self._exception 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/tenacity/__init__.py", line 412, in call 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent result = fn(*args, **kwargs) 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/ironic_python_agent/inspector.py", line 142, in _post_to_inspector 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent return requests.post(CONF.inspection_callback_url, data=data, 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/requests/api.py", line 119, in post 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent return request('post', url, data=data, json=json, **kwargs) 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/requests/api.py", line 61, in request 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent return session.request(method=method, url=url, **kwargs) 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/requests/sessions.py", line 542, in request 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent resp = self.send(prep, **send_kwargs) 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/requests/sessions.py", line 655, in send 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent r = adapter.send(request, **kwargs) 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/requests/adapters.py", line 516, in send 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent raise ConnectionError(e, request=request) 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent requests.exceptions.ConnectionError: HTTPSConnectionPool(host='fc00:1004::3', port=5050): Max retries exceeded with url: /v1/continue (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f94354114c0>: Failed to establish a new connection: [Errno 111] ECONNREFUSED')) 2023-03-12 17:52:51.441 1 ERROR ironic-python-agent
You can see the metal3 pod node and ip address:
# oc get po -n openshift-machine-api metal3-5cc95d74d8-lqd9x -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES metal3-5cc95d74d8-lqd9x 5/5 Running 0 2d16h fc00:1004::7 e27-h05-000-r650 <none> <none>
The addresses on the e27-h05-000-r650 node:
[root@e27-h05-000-r650 ~]# ip a | grep "fc00" inet6 fc00:1004::4/128 scope global nodad deprecated inet6 fc00:1004::7/64 scope global noprefixroute
You can see the API VIP (fc00:1004::3) is actually on a different node, e27-h03-000-r650:
[root@e27-h03-000-r650 ~]# ip a | grep "fc00" inet6 fc00:1004::3/128 scope global nodad deprecated inet6 fc00:1004::6/64 scope global noprefixroute
Expected results:
Versions:
Hub and SNO OCP 4.12.2
ACM - 2.8.0-DOWNSTREAM-2023-02-28-23-06-27
Description of problem:
nodeip-configuration.service is failed on cluster nodes:
systemctl status nodeip-configuration.service × nodeip-configuration.service - Writes IP address configuration so that kubelet and crio services select a valid node IP Loaded: loaded (/etc/systemd/system/nodeip-configuration.service; enabled; preset: disabled) Active: failed (Result: exit-code) since Tue 2023-08-15 16:28:09 UTC; 18h ago Main PID: 3709 (code=exited, status=0/SUCCESS) CPU: 237ms Aug 15 16:28:09 openshift-worker-2.lab.eng.tlv2.redhat.com configure-ip-forwarding.sh[3761]: ++ [[ -z bond0.354 ]] Aug 15 16:28:09 openshift-worker-2.lab.eng.tlv2.redhat.com configure-ip-forwarding.sh[3761]: ++ echo bond0.354 Aug 15 16:28:09 openshift-worker-2.lab.eng.tlv2.redhat.com configure-ip-forwarding.sh[3760]: + iface=bond0.354 Aug 15 16:28:09 openshift-worker-2.lab.eng.tlv2.redhat.com configure-ip-forwarding.sh[3760]: + echo 'Node IP interface determined as: bond0.354. Enabling IP forwarding...' Aug 15 16:28:09 openshift-worker-2.lab.eng.tlv2.redhat.com configure-ip-forwarding.sh[3760]: Node IP interface determined as: bond0.354. Enabling IP forwarding... Aug 15 16:28:09 openshift-worker-2.lab.eng.tlv2.redhat.com configure-ip-forwarding.sh[3760]: + sysctl -w net.ipv4.conf.bond0.354.forwarding=1 Aug 15 16:28:09 openshift-worker-2.lab.eng.tlv2.redhat.com configure-ip-forwarding.sh[3767]: sysctl: cannot stat /proc/sys/net/ipv4/conf/bond0/354/forwarding: No such file or directory Aug 15 16:28:09 openshift-worker-2.lab.eng.tlv2.redhat.com systemd[1]: nodeip-configuration.service: Control process exited, code=exited, status=1/FAILURE Aug 15 16:28:09 openshift-worker-2.lab.eng.tlv2.redhat.com systemd[1]: nodeip-configuration.service: Failed with result 'exit-code'. Aug 15 16:28:09 openshift-worker-2.lab.eng.tlv2.redhat.com systemd[1]: Failed to start Writes IP address configuration so that kubelet and crio services select a valid node IP.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-08-08-005757
How reproducible:
so far once
Steps to Reproduce:
1. Deploy a multinode spoke cluster with GitOps-ZTP.
2. Configure the baremetal network to be on top of a VLAN interface:
- name: bond0.354
  description: baremetal network
  type: vlan
  state: up
  vlan:
    base-iface: bond0
    id: 354
  ipv4:
    enabled: true
    dhcp: false
    address:
      - ip: 10.x.x.20
        prefix-length: 26
  ipv6:
    enabled: false
    dhcp: false
    autoconf: false
Actual results:
Cluster is deployed but nodeip-configuration.service is Failed
Expected results:
nodeip-configuration.service is Active
Please review the following PR: https://github.com/openshift/thanos/pull/104
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/router/pull/473
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Pipeline metrics page crash
Pipeline metrics page should work
Always
4.14.0-0.nightly-2023-05-29-174116
It is a regression introduced after this commit was merged: https://github.com/openshift/console/pull/12821/commits/c2d24932cd41b1b4c89d7b9fa5ca46d18b0d2d29#diff-782cbf3ae7050932e76be67d990d9cdaa02e322ea6c2b53083a677ed311ff612R40
Description of the problem:
In Staging, deleting a host in the UI results in the host re-registering after ~15 minutes.
How reproducible:
100%
Steps to reproduce:
1. Before cluster installation, delete random host using UI
2. Wait 15 mins
3. Host re-registers without rebooting
Actual results:
The agent automatically re-registers itself after 15 minutes.
Expected results:
The agent should register again only after a reboot.
Description of problem:
The test TestPrometheusRemoteWrite/assert_remote_write_cluster_id_relabel_config_works is flaky and keeps blocking PR merges. After investigation it seems like the timeout to wait for the expected value is simply too short.
Description of problem:
The hypershift CLI tool allows any string as the cluster name. But later, when the cluster is to be imported, the name needs to conform to RFC 1123. So the user needs to read the error, destroy the cluster, and then try again with a proper name. This experience can be improved.
Version-Release number of selected component (if applicable):
4.13.4
How reproducible:
Always
Steps to Reproduce:
1. hypershift create cluster kubevirt --name virt-4.12 ... 2. try to import it
Actual results:
cluster fails to import due to its name
Expected results:
validate the cluster name in the hypershift cli, fail early
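A minimal Go sketch of that early check, using the RFC 1123 validators from k8s.io/apimachinery. Whether the label or the subdomain form is the right rule is an assumption here; the reported name "virt-4.12" fails the label check, which matches the import failure described above. This is an illustration, not the hypershift CLI code.

package sketch

import (
	"fmt"
	"strings"

	"k8s.io/apimachinery/pkg/util/validation"
)

// validateClusterName rejects names that are not RFC 1123 labels before any
// cloud resources are created, so the user fails fast with a clear message.
func validateClusterName(name string) error {
	if errs := validation.IsDNS1123Label(name); len(errs) > 0 {
		return fmt.Errorf("invalid cluster name %q: %s", name, strings.Join(errs, "; "))
	}
	return nil
}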
Additional info:
Reported by IBM.
Apparently, they run in such a way that status.Version.Desired.Version is not guaranteed to be a parseable semantic version. Thus isUpgradeble returns an error and blocks upgrade, even if the force upgrade annotation is present.
We should check for the annotation first and if the upgrade is being forced, we don't need to do the z-stream upgrade check.
https://redhat-internal.slack.com/archives/C01C8502FMM/p1689279310050439
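A minimal Go sketch of the proposed ordering for the check above: honor the force flag first, and only then try to parse the desired version as semver. Function names are illustrative, and the use of github.com/blang/semver is an assumption, not necessarily the library the operator vendors.

package sketch

import (
	"fmt"

	"github.com/blang/semver"
)

// checkZStreamUpgrade skips the z-stream check entirely when the upgrade is
// forced, so an unparseable desired version no longer blocks a forced upgrade.
func checkZStreamUpgrade(desiredVersion string, forced bool) error {
	if forced {
		return nil // forced via annotation: no semver parsing needed
	}
	if _, err := semver.Parse(desiredVersion); err != nil {
		return fmt.Errorf("cannot parse desired version %q: %w", desiredVersion, err)
	}
	// ... the usual z-stream comparison would follow here ...
	return nil
}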
Description of problem:
ccoctl does not prevent the user from using the same resource group name for the OIDC and installation resource groups, which can result in resources existing in the resource group used for cluster installation. The OpenShift installer requires that the installation resource group be empty, so the OIDC and installation resource groups must be distinct. ccoctl currently allows providing either --oidc-resource-group-name or --installation-resource-group-name but does not indicate a problem when those resource group names are the same. When the same resource group name is provided using a combination of the --name, --oidc-resource-group-name and --installation-resource-group-name parameters, ccoctl should exit with an error indicating that the resource group names must be different.
Version-Release number of selected component (if applicable):
4.14.0
How reproducible:
100%
Steps to Reproduce:
1. Run ccoctl azure create-all with a combination of --name, --oidc-resource-group-name or --installation-resource-group-name resulting in OIDC and installation resource group names being the same. ./ccoctl azure create-all --name "abutchertest" --region centralus --subscription-id "${SUBSCRIPTION_ID}"--credentials-requests-dir "${MYDIR}/credreqs" --oidc-resource-group-name test "abutchertest" --dnszone-resource-group-name "${DNS_RESOURCE_GROUP}" ccoctl will default the installation resource group to match the provided --name parameter "abutchertest" which results in OIDC and installation resource groups being "abutchertest" since --oidc-resource-group uses the same name. This means that OIDC resources will be created in the resource group that will be configured for the OpenShift installer within the install-config.yaml. 2. Run the OpenShift installer having set .platform.azure.resourceGroupName in the install-config.yaml to be "abutchertest" and receive error that the installation resource group is not empty when running the installer. The resource identified will contain user-assigned managed identities meant to be created in the OIDC resource group which must be separate from the installation resource group. FATAL failed to fetch Terraform Variables: failed to fetch dependency of "Terraform Variables": failed to generate asset "Platform Provisioning Check": platform.azure.resourceGroupName: Invalid value: "abutchertest": resource group must be empty but it has 8 resources like...
Actual results:
ccoctl allows OIDC and installation resource group names to be the same.
Expected results:
ccoctl does not allow OIDC and installation resource groups to be the same.
Additional info:
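A hedged Go sketch of the requested guard. The defaulting shown (installation resource group defaults to --name; the "-oidc" suffix for the OIDC default is invented for illustration) is an assumption, not ccoctl's actual behavior or code.

package sketch

import "fmt"

// resolveResourceGroups applies the assumed defaulting and rejects the case
// where the OIDC and installation resource groups end up identical.
func resolveResourceGroups(name, oidcRG, installRG string) (string, string, error) {
	if installRG == "" {
		installRG = name // installation resource group defaults to --name
	}
	if oidcRG == "" {
		oidcRG = name + "-oidc" // assumed default, for illustration only
	}
	if oidcRG == installRG {
		return "", "", fmt.Errorf("--oidc-resource-group-name %q must differ from the installation resource group %q", oidcRG, installRG)
	}
	return oidcRG, installRG, nil
}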
Please review the following PR: https://github.com/openshift/aws-ebs-csi-driver/pull/220
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The OpenShift console fails to render the monitoring dashboard when a proxy is expected to be used. Additionally, WebSocket connections fail because they do not use the proxy.
Version-Release number of selected component (if applicable):
4.13
How reproducible:
Always
Steps to Reproduce:
1. Connect to a cluster using backplane and use one of IT's proxies.
2. Execute "ocm backplane console -b".
3. Attempt to view the monitoring dashboard.
Actual results:
The monitoring dashboard fails to load with an EOF error, and the terminal is spammed with EOF errors.
Expected results:
The monitoring dashboard should be rendered correctly, and the terminal should not be spammed with error logs.
Additional info:
With changes like those in https://github.com/openshift/console/pull/12877 applied, the monitoring dashboard works through the proxy.
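The general pattern behind that PR is that both the HTTP transport and the WebSocket dialer must consult the proxy environment (HTTP_PROXY/HTTPS_PROXY/NO_PROXY). The Go sketch below illustrates that pattern; it is not the console's actual code.

package sketch

import (
	"crypto/tls"
	"net/http"

	"github.com/gorilla/websocket"
)

// newProxyAwareClients builds an HTTP client and a websocket dialer that both
// honor the standard proxy environment variables.
func newProxyAwareClients(tlsConfig *tls.Config) (*http.Client, *websocket.Dialer) {
	transport := &http.Transport{
		Proxy:           http.ProxyFromEnvironment,
		TLSClientConfig: tlsConfig,
	}
	dialer := &websocket.Dialer{
		Proxy:           http.ProxyFromEnvironment,
		TLSClientConfig: tlsConfig,
	}
	return &http.Client{Transport: transport}, dialer
}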
Description of problem:
When the OIDC provider is deleted on the customer side, AWS resource deletion is not skipped in cases where the ValidAWSIdentityProvider condition is 'Unknown'. This results in clusters being stuck during deletion.
Version-Release number of selected component (if applicable):
4.12.z, 4.13.z, 4.14.z
How reproducible:
Irregular
Steps to Reproduce:
1. 2. 3.
Actual results:
Cluster stuck in uninstallation
Expected results:
Clusters not stuck in uninstallation, AWS customer resources being skipped for removal
Additional info:
Added must-gather for all HyperShift-related namespaces. The bug seems to be at https://github.com/openshift/hypershift/pull/2281/files#diff-f90ab1b32c9e1b349f04c32121d59f5e9081ccaf2be490f6782165d2960bc6c7R295: 'Unknown' needs to be added to the check of whether the OIDC provider is valid.
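A minimal Go sketch of the adjusted check, treating Unknown the same as False, written with the k8s.io/apimachinery condition helpers; this is an illustration of the requested behavior, not the actual HyperShift code.

package sketch

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// identityProviderUnavailable returns true when the ValidAWSIdentityProvider
// condition is missing, False, or Unknown, so AWS resource cleanup that
// depends on the identity provider can be skipped in all of those cases.
func identityProviderUnavailable(conditions []metav1.Condition) bool {
	cond := meta.FindStatusCondition(conditions, "ValidAWSIdentityProvider")
	return cond == nil || cond.Status != metav1.ConditionTrue
}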
Description of problem:
A customer has reported that the Thanos querier pods would be OOM-killed when loading the API performance dashboard with large time ranges (e.g. >= 1 week)
Version-Release number of selected component (if applicable):
4.10
How reproducible:
Always for the customer
Steps to Reproduce:
1. Open the "API performance" dashboard in the admin console. 2. Select a time range of 2 weeks. 3.
Actual results:
The dashboard fails to refresh and the thanos-query pods are killed.
Expected results:
The dashboard loads without error.
Additional info:
The issue arises for the customer because they have very large clusters (hundreds of nodes) which generate lots of metrics. In practice the queries executed by the dashboard are costly because they access lots of series (probably > tens of thousands). To make it more efficient, the "upstream" dashboard from kubernetes-monitoring/kubernetes-mixin uses recording rules [1] instead of raw queries. While it decreases a bit the accuracy (one can only distinguish between read & write API requests), it's the only solution to avoid overloading the Thanos query endpoint. [1] https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/05a58f765eda05902d4f7dd22098a2b870f7ca1e/dashboards/apiserver.libsonnet#L50-L75
Description of problem:
In the metric `cluster:capacity_cpu_cores:sum` there is an attribute label `label_node_role_kubernetes_io` that has `infra` or `master`. There is no label for `worker`. If the infra nodes are missing this label, they get added into the "unlabeled" worker nodes.
For example, this cluster has all three types: `cluster:capacity_cpu_cores:sum{_id="0702a3b1-c2d8-427f-865d-3ce7dc3a2be7"}`
But this cluster has the infra and worker merged: `cluster:capacity_cpu_cores:sum{_id="0e60ac76-d61a-4e6d-a4f3-269110b6b1f9"}`
If I count clusters that have sockets with infra but capacity_cpu without infra, I get 7,617 clusters for 2023-03-15. If I count clusters that have sockets with infra but capacity_cpu with infra, I get 2,015 clusters for 2023-03-15. That means there are 5,602 clusters missing the infra label.
This metric is used to identify the vCPU/CPU count used in TeleSense, which is presented to the Sales teams and upper management. If there is another metric we should use, please let me know. Otherwise, this needs to be fixed.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
refer to Slack thread: https://redhat-internal.slack.com/archives/C0VMT03S5/p1678967355450719
Description of problem:
ROSA is being branded via custom branding; as a result, the favicon disappears since we do not want any Red Hat/Openshift-specific branding to appear when custom branding is in use. Since ROSA is a Red Hat product, it should get a branding option added to the console so all the correct branding including favicon appears.
Version-Release number of selected component (if applicable):
4.14.0, 4.13.z, 4.12.z, 4.11.z
How reproducible:
Always
Steps to Reproduce:
1. View a ROSA cluster 2. Note the absence of the OpenShift logo favicon
Description of problem:
Daemonset cni-sysctl-allowlist-ds is missing annotation for workload partitioning.
Version-Release number of selected component (if applicable):
How reproducible:
Executing the daemonset shows the pod missing the workload annotation
Steps to Reproduce:
1. Run Daemonset 2. 3.
Actual results:
No workload annotation present.
Expected results:
annotations:
  target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'
Additional info:
Description of problem:
vSphere dual-stack added support for both IPv4 and IPv6 in the kubelet --node-ip; however, the masters are booting without the IPv6 address in --node-ip:
"Ignoring filtered route {Ifindex: 2 Dst: <nil> Src: 192.168.130.19 Gw: 192.168.130.1 Flags: [] Table: 254}" "Ignoring filtered route {Ifindex: 2 Dst: 192.168.130.0/24 Src: 192.168.130.19 Gw: <nil> Flags: [] Table: 254}" "Ignoring filtered route {Ifindex: 2 Dst: fd65:a1a8:60ad:271c::22/128 Src: <nil> Gw: <nil> Flags: [] Table: 254}" "Ignoring filtered route {Ifindex: 2 Dst: fe80::/64 Src: <nil> Gw: <nil> Flags: [] Table: 254}" "Ignoring filtered route {Ifindex: 2 Dst: <nil> Src: <nil> Gw: fe80::9eb4:f9fa:2b8d:8372 Flags: [] Table: 254}" "Writing Kubelet service override with content [Service]\nEnvironment=\"KUBELET_NODE_IP=192.168.130.19\" \"KUBELET_NODE_IPS=192.168.130.19\"\n"
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-08-28-154013
How reproducible:
Intermittent (DHCPv6 related)
Steps to Reproduce:
1. Install vSphere dual-stack IPI with DHCPv6 networking:
clusterNetwork:
- cidr: 10.128.0.0/14
  hostPrefix: 23
- cidr: fd65:10:128::/56
  hostPrefix: 64
machineNetwork:
- cidr: 192.168.0.0/16
- cidr: fd65:a1a8:60ad:271c::/64
networkType: OVNKubernetes
Actual results:
Masters missing IPv6 address in KUBELET_NODE_IPS
Install fails with
time="2023-08-30T19:54:19Z" level=error msg="failed to initialize the cluster: Cluster operators authentication, console, ingress, monitoring are not available"
Expected results:
Both IPv4 and IPv6 address in KUBELET_NODE_IPS
Install succeeds
Additional info:
Do we set ipv6.may-fail with NetworkManager?
Description of problem:
After upgrading a plugin image the browser continues to request old plugin files
How reproducible:
100%
Steps to Reproduce:
1. Build and deploy a plugin generated from console-plugin-template repo
2. open one of the plugin pages in the browser
3. Make a change in the code of that page, rebuild and deploy a new image
4. Try to view this page in Firefox - you'll get a 404 error. In Chrome you'll get the old page
The root cause:
The plugin JS file names are auto-generated, so the new image has different JS file names. But the plugin-entry.js filename remains the same; that file is cached by default and therefore continues to request the old files.
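One way to think about a fix is to make the stable entry file non-cacheable so the browser always re-fetches it and picks up the new hashed chunk names. Console plugins are normally served by an nginx container, so this Go handler is only an illustration of the header change, not the actual serving stack.

package sketch

import (
	"net/http"
	"strings"
)

// noCacheEntryPoint wraps a static file handler and disables caching for the
// stable entry file while leaving the hashed chunk files cacheable.
func noCacheEntryPoint(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if strings.HasSuffix(r.URL.Path, "plugin-entry.js") {
			w.Header().Set("Cache-Control", "no-cache")
		}
		next.ServeHTTP(w, r)
	})
}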
Description of problem: The openshift-manila-csi-driver namespace should have the "workload.openshift.io/allowed=management" annotation.
This is currently not the case:
❯ oc describe ns openshift-manila-csi-driver Name: openshift-manila-csi-driver Labels: kubernetes.io/metadata.name=openshift-manila-csi-driver pod-security.kubernetes.io/audit=privileged pod-security.kubernetes.io/enforce=privileged pod-security.kubernetes.io/warn=privileged Annotations: include.release.openshift.io/self-managed-high-availability: true openshift.io/node-selector: openshift.io/sa.scc.mcs: s0:c24,c4 openshift.io/sa.scc.supplemental-groups: 1000560000/10000 openshift.io/sa.scc.uid-range: 1000560000/10000 Status: Active No resource quota. No LimitRange resource.
It is causing CI jobs to fail with:
{ fail [github.com/openshift/origin/test/extended/cpu_partitioning/platform.go:82]: projects [openshift-manila-csi-driver] do not contain the annotation map[workload.openshift.io/allowed:management] Expected <[]string | len:1, cap:1>: [ "openshift-manila-csi-driver", ] to be empty Ginkgo exit error 1: exit with code 1}
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
thanos-sidecar is panicking after the image was rebuilt in this payload https://amd64.ocp.releases.ci.openshift.org/releasestream/4.14.0-0.nightly/release/4.14.0-0.nightly-2023-04-18-045408 Example job: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-metal-ipi-sdn-bm/1648276769645531136 Logs: - containerID: cri-o://c62dcc73b8203bfd968ffca95bba8607e24a06492948a0179cde6a57a897d431 image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a007b49153ee517ab4fe0600d217832bac0fd6152b5a709da291b60c82a5875d imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a007b49153ee517ab4fe0600d217832bac0fd6152b5a709da291b60c82a5875d lastState: terminated: containerID: cri-o://c62dcc73b8203bfd968ffca95bba8607e24a06492948a0179cde6a57a897d431 exitCode: 2 finishedAt: '2023-04-18T12:30:20Z' message: "panic: Something in this program imports go4.org/unsafe/assume-no-moving-gc\ \ to declare that it assumes a non-moving garbage collector, but your version\ \ of go4.org/unsafe/assume-no-moving-gc hasn't been updated to assert that\ \ it's safe against the go1.20 runtime. If you want to risk it, run with\ \ environment variable ASSUME_NO_MOVING_GC_UNSAFE_RISK_IT_WITH=go1.20 set.\ \ Notably, if go1.20 adds a moving garbage collector, this program is unsafe\ \ to use.\n\ngoroutine 1 [running]:\ngo4.org/unsafe/assume-no-moving-gc.init.0()\n\ \t/go/src/github.com/improbable-eng/thanos/vendor/go4.org/unsafe/assume-no-moving-gc/untested.go:25\ \ +0x1ba\n" reason: Error startedAt: '2023-04-18T12:30:20Z' name: thanos-sidecar ready: false restartCount: 14 started: false state: waiting: message: back-off 5m0s restarting failed container=thanos-sidecar pod=prometheus-k8s-0_openshift-monitoring(bafeb85b-3980-4153-90bc-a302b93c3465) reason: CrashLoopBackOff
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-04-18-045408
How reproducible:
Always
Steps to Reproduce:
1. Install 4.14.0-0.nightly-2023-04-18-045408
Actual results:
thanos-sidecar panics and cluster doesn't install
Expected results:
Additional info:
Description of problem:
Deployed an OCP cluster using the hypershift agent with the 4.14.0-ec.4 release on Power. We are observing that loading the OperatorHub page in the GUI throws a 404 error.
Version-Release number of selected component (if applicable):
OCP 4.14.0-ec.4
How reproducible:
Every time
Steps to Reproduce:
1. Deploy Hypershift cluster
2. Go to GUI and check OperatorHub
Actual results:
OperatorHub page in GUI is throwing 404 error
Expected results:
OperatorHub page should show Operators
Additional information:
Failure status in olm operator pod from management cluster:
# oc get pod olm-operator-754779f559-846tw -n clusters-hypershift-015 -oyaml message: | 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')" monitor=clusteroperator time="2023-08-17T10:58:37Z" level=error msg="initialization error - failed to ensure name=\"\" - ClusterOperator.config.openshift.io \"\\\"\\\"\" is invalid: metadata.name: Invalid value: \"\\\"\\\"\": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')" monitor=clusteroperator time="2023-08-17T10:59:37Z" level=error msg="initialization error - failed to ensure name=\"\" - ClusterOperator.config.openshift.io \"\\\"\\\"\" is invalid: metadata.name: Invalid value: \"\\\"\\\"\": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')" monitor=clusteroperator time="2023-08-17T11:00:37Z" level=error msg="initialization error - failed to ensure name=\"\" - ClusterOperator.config.openshift.io \"\\\"\\\"\" is invalid: metadata.name: Invalid value: \"\\\"\\\"\": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')" monitor=clusteroperator I0817 11:01:33.000390 1 trace.go:205] Trace[2006040218]: "DeltaFIFO Pop Process" ID:system:controller:route-controller,Depth:152,Reason:slow event handlers blocking the queue (17-Aug-2023 11:01:28.947) (total time: 456ms): Trace[2006040218]: [456.950035ms] [456.950035ms] END 2023/08/17 11:01:41 http: TLS handshake error from 10.244.0.10:33355: read tcp 172.17.53.0:8443->10.244.0.10:33355: read: connection reset by peer reason: Error startedAt: "2023-08-14T11:03:46Z"
Screenshot: https://drive.google.com/file/d/1I_XkX15xEl9ZBtAIZ2yp70twD4z2ASlS/view?usp=sharing
Must gather logs:
https://drive.google.com/file/d/1AkmzC_TUi9z6p13funrSygBm2CgepbpU/view?usp=sharing
Description of problem:
maxUnavailable defaults to 50% for anything under 4: https://github.com/openshift/cluster-ingress-operator/blob/master/pkg/operator/controller/ingress/poddisruptionbudget.go#L71
Based on PDB rounding logic, it always rounds up to the next whole integer, so 1.5 becomes 2.
spec:
  maxUnavailable: 50%
  selector:
    matchLabels:
      ingresscontroller.operator.openshift.io/deployment-ingresscontroller: default
currentHealthy: 3
desiredHealthy: 1
disruptionsAllowed: 2
Whereas with 4 router pods, we only allow 1 of 4 to be disrupted at a time.
Version-Release number of selected component (if applicable):
4.x
How reproducible:
Always
Steps to Reproduce:
1. Set 3 replicas 2. Look at the disruptionsAllowed on the PDB
Actual results:
You can take down 2 of 3 routers at once, leaving no HA.
Expected results:
With 3+ routers, we should always ensure 2 are up with the PDB.
Additional info:
Reduce the maxUnavailable to 25% for >= 3 pods instead of 4
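A minimal Go sketch of the proposed policy follows, assuming a hypothetical helper (maxUnavailablePercent) rather than the operator's actual code; it only illustrates how 25% at 3+ replicas keeps disruptionsAllowed at 1 once percentage rounding is applied.

package main

import (
	"fmt"
	"math"
)

// maxUnavailablePercent is a sketch of the proposed policy: keep the current
// 50% behaviour for 1-2 replicas, but drop to 25% at 3 or more replicas so
// that rounding up never allows more than one disruption for 3-4 routers.
// The function name and thresholds are illustrative, not the operator's code.
func maxUnavailablePercent(replicas int) int {
	if replicas >= 3 {
		return 25
	}
	return 50
}

// disruptionsAllowed mirrors how a percentage maxUnavailable is resolved:
// the percentage of replicas is rounded up to the next whole pod.
func disruptionsAllowed(replicas, percent int) int {
	return int(math.Ceil(float64(replicas) * float64(percent) / 100.0))
}

func main() {
	for _, r := range []int{2, 3, 4} {
		p := maxUnavailablePercent(r)
		fmt.Printf("replicas=%d maxUnavailable=%d%% disruptionsAllowed=%d\n",
			r, p, disruptionsAllowed(r, p))
	}
}

With this sketch, 3 replicas at 25% allow only 1 disruption, so 2 routers stay up.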
Description of problem:
An empty page is returned when a normal user tries to view the Route Metrics page
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-06-13-223353
How reproducible:
Always
Steps to Reproduce:
1. Check any Route's Metrics page with a cluster-admin user, for example /k8s/ns/openshift-monitoring/routes/alertmanager-main/metrics; the route metrics page and charts are loaded successfully
2. Grant a normal user admin permission on the 'openshift-monitoring' project
   $ oc adm policy add-role-to-user admin testuser-1 -n openshift-monitoring
   clusterrole.rbac.authorization.k8s.io/admin added: "testuser-1"
3. Log in with the normal user 'testuser-1' and check the Networking -> Routes -> alertmanager-main -> Metrics page again
Actual results:
3. empty page is returned
Expected results:
3. If the normal user doesn't have permission to view Route Metrics, we should either hide the 'Metrics' tab or show an error message instead of a completely empty page
Additional info:
Description of problem:
The operator catalog images used in 4.13 hosted clusters are the ones from 4.12
Version-Release number of selected component (if applicable):
4.13.z
How reproducible:
Always
Steps to Reproduce:
1. Create a 4.13 HostedCluster 2. Inspect the image tags used for catalog imagestreams (oc get imagestreams -n CONTROL_PLANE_NAMESPACE)
Actual results:
image tags point to 4.12 catalog images
Expected results:
image tags point to 4.13 catalog images
Additional info:
These image tags need to be updated: https://github.com/openshift/hypershift/blob/release-4.13/control-plane-operator/controllers/hostedcontrolplane/olm/catalogs.go#L117-L120
In order to ship a high quality Azure CCM we want to downstream important bugfixes that were recently merged upstream.
Description of problem:
The MCO must have compatibility in place one OCP version in advance if we want to bump the Ignition spec version; otherwise downgrades will fail. This is NOT needed in 4.14, only in 4.13.
Version-Release number of selected component (if applicable):
4.13
How reproducible:
Always
Steps to Reproduce:
1. None atm, this is preventative for the future 2. 3.
Actual results:
N/A
Expected results:
N/A
Additional info:
As part of a single run, we are fetching the same things over and over again and hence making API calls that should not even be needed.
For example:
1. The privileges check verifies permissions of the datastore, which is also verified by the storageclass check. What is more, each of those checks fetches the datacenter and datastore, resulting in several duplicate API calls.
Exit Criteria:
1. Remove duplicate checks
2. Avoid fetching the same API object repeatedly as part of the same system check (see the caching sketch below)
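A minimal Go sketch of the idea in exit criterion 2 follows, assuming a hypothetical per-run checkContext cache rather than the actual check framework; it shows how the datacenter/datastore lookups could be fetched once and reused across checks.

package main

import "fmt"

// checkContext is a hypothetical per-run cache shared by all system checks so
// that objects like the datacenter and datastore are fetched from the vCenter
// API at most once per run.
type checkContext struct {
	datastores map[string]string // name -> hypothetical datastore handle
	fetches    int
}

func (c *checkContext) datastore(name string) string {
	if ds, ok := c.datastores[name]; ok {
		return ds // reuse the cached object instead of another API call
	}
	c.fetches++ // stands in for the real vCenter round trip
	ds := "datastore:" + name
	c.datastores[name] = ds
	return ds
}

func main() {
	ctx := &checkContext{datastores: map[string]string{}}
	// The privileges check and the storageclass check both need the same
	// datastore; with the cache only the first lookup hits the API.
	_ = ctx.datastore("vsanDatastore")
	_ = ctx.datastore("vsanDatastore")
	fmt.Println("API fetches:", ctx.fetches) // prints: API fetches: 1
}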
Description of the problem:
In staging, with BE 2.18.0, trying to create a new cluster via the UI with P/Z CPU architecture and OCP 4.10 returns the following response:
Non x86_64 CPU architectures for version 4.10 are supported only with User Managed Networking
How reproducible:
100%
Steps to reproduce:
1.
2.
3.
Actual results:
Expected results:
The message should make the issue clearer for the user, for example:
P/Z CPU architecture is only supported with OCP version >= 4.12
Description of problem:
2022-09-12T13:48:57.505323919Z {"level":"info","ts":1662990537.5052269,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"qe2/master-1-0"}
2022-09-12T13:48:57.566917845Z {"level":"info","ts":1662990537.5668473,"logger":"provisioner.ironic","msg":"no node found, already deleted","host":"qe2~master-1-0"}
2022-09-12T13:48:57.566945972Z {"level":"info","ts":1662990537.566904,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"qe2/master-1-0","provisioningState":"available","requeue":true,"after":600}
2022-09-12T13:49:13.556690278Z {"level":"info","ts":1662990553.556591,"logger":"controllers.HostFirmwareSettings","msg":"start","hostfirmwaresettings":"qe2/master-1-0"}
2022-09-12T13:49:13.614818643Z {"level":"info","ts":1662990553.6147015,"logger":"controllers.HostFirmwareSettings","msg":"retrieving firmware settings and saving to resource","hostfirmwaresettings":"qe2/master-1-0","node":"48d24898-1911-4f43-82b0-0b15f8484ae7"}
2022-09-12T13:49:13.629455616Z {"level":"info","ts":1662990553.6293764,"logger":"controllers.HostFirmwareSettings","msg":"provisioner returns error","hostfirmwaresettings":"qe2/master-1-0","RequeueAfter:":30}
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Detach a BMH 2. Check BMO logs for errors 3. Check Ironic logs for errors
Actual results:
BMO and Ironic logs have errors related to the already deleted node.
Expected results:
No noise in the logs.
Additional info:
Description of problem:
tested https://issues.redhat.com/browse/OCPBUGS-10387 with PR
launch 4.14-ci,openshift/cluster-monitoring-operator#1926 no-spot
3 masters, 3 workers, each node is with 4 cpus, no infra node
$ oc get node
NAME                                         STATUS   ROLES                  AGE   VERSION
ip-10-0-132-193.us-east-2.compute.internal   Ready    control-plane,master   23m   v1.26.2+d2e245f
ip-10-0-135-65.us-east-2.compute.internal    Ready    control-plane,master   23m   v1.26.2+d2e245f
ip-10-0-149-72.us-east-2.compute.internal    Ready    worker                 14m   v1.26.2+d2e245f
ip-10-0-158-0.us-east-2.compute.internal     Ready    worker                 14m   v1.26.2+d2e245f
ip-10-0-229-135.us-east-2.compute.internal   Ready    worker                 17m   v1.26.2+d2e245f
ip-10-0-234-36.us-east-2.compute.internal    Ready    control-plane,master   23m   v1.26.2+d2e245f
labels see below
control-plane: node-role.kubernetes.io/control-plane: ""
master: node-role.kubernetes.io/master: ""
worker: node-role.kubernetes.io/worker: ""
search with "cluster:capacity_cpu_cores:sum" on admin console "Observe -> Metrics", label_node_role_kubernetes_io=master and label_node_role_kubernetes_io="" are both calculated twice
Name | label_beta_kubernetes_io_instance_type | label_kubernetes_io_arch | label_node_openshift_io_os_id | label_node_role_kubernetes_io | prometheus | Value
cluster:capacity_cpu_cores:sum | m6a.xlarge | amd64 | rhcos | | openshift-monitoring/k8s | 12
cluster:capacity_cpu_cores:sum | m6a.xlarge | amd64 | rhcos | master | openshift-monitoring/k8s | 12
cluster:capacity_cpu_cores:sum | m6a.xlarge | amd64 | rhcos | | openshift-monitoring/k8s | 12
cluster:capacity_cpu_cores:sum | m6a.xlarge | amd64 | rhcos | master | openshift-monitoring/k8s | 12
checked from thanos-querier API, same result with that from console UI(console UI used thanos-querier API)
$ token=`oc create token prometheus-k8s -n openshift-monitoring` $ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=cluster:capacity_cpu_cores:sum' | jq { "status": "success", "data": { "resultType": "vector", "result": [ { "metric": { "__name__": "cluster:capacity_cpu_cores:sum", "label_beta_kubernetes_io_instance_type": "m6a.xlarge", "label_kubernetes_io_arch": "amd64", "label_node_openshift_io_os_id": "rhcos", "prometheus": "openshift-monitoring/k8s" }, "value": [ 1682394655.248, "12" ] }, { "metric": { "__name__": "cluster:capacity_cpu_cores:sum", "label_beta_kubernetes_io_instance_type": "m6a.xlarge", "label_kubernetes_io_arch": "amd64", "label_node_openshift_io_os_id": "rhcos", "label_node_role_kubernetes_io": "master", "prometheus": "openshift-monitoring/k8s" }, "value": [ 1682394655.248, "12" ] }, { "metric": { "__name__": "cluster:capacity_cpu_cores:sum", "label_beta_kubernetes_io_instance_type": "m6a.xlarge", "label_kubernetes_io_arch": "amd64", "label_node_openshift_io_os_id": "rhcos", "prometheus": "openshift-monitoring/k8s" }, "value": [ 1682394655.248, "12" ] }, { "metric": { "__name__": "cluster:capacity_cpu_cores:sum", "label_beta_kubernetes_io_instance_type": "m6a.xlarge", "label_kubernetes_io_arch": "amd64", "label_node_openshift_io_os_id": "rhcos", "label_node_role_kubernetes_io": "master", "prometheus": "openshift-monitoring/k8s" }, "value": [ 1682394655.248, "12" ] } ] } }
no such issue if we query the expr for "cluster:capacity_cpu_cores:sum" directly
Name | label_beta_kubernetes_io_instance_type | label_kubernetes_io_arch | label_node_openshift_io_os_id | label_node_role_kubernetes_io | prometheus | Value
cluster:capacity_cpu_cores:sum | m6a.xlarge | amd64 | rhcos | | openshift-monitoring/k8s | 12
cluster:capacity_cpu_cores:sum | m6a.xlarge | amd64 | rhcos | master | openshift-monitoring/k8s | 12
Deduplication should be done for the thanos-querier API.
Version-Release number of selected component (if applicable):
tested https://issues.redhat.com/browse/OCPBUGS-10387 with PR
How reproducible:
always
Steps to Reproduce:
1. see the description 2. 3.
Actual results:
node role is calculated twice in thanos-querier API
Expected results:
node role should be calculated only once in thanos-querier API
Description of problem:
When updating an s390x cluster from 4.10.35 to 4.11.34, I got the following message in the UI:
Updating this cluster to 4.11.34 is supported, but not recommended as it might not be optimized for some components in this cluster. Exposure to KeepalivedMulticastSkew is unknown due to an evaluation failure: client-side throttling: only 9m20.476632575s has elapsed since the last match call completed for this cluster condition backend; this cached cluster condition request has been queued for later execution On OpenStack, oVirt, and vSphere infrastructure, updates to 4.11 can cause degraded cluster operators as a result of a multicast-to-unicast keepalived transition, until all nodes have updated to 4.11. https://access.redhat.com/solutions/7007826
As we discussed on Slack [1], the message could be more user friendly, something like this [2]: "Throttling risk evaluation, 2 risks to evaluate, next evaluation in 9m59s."
[1] https://redhat-internal.slack.com/archives/CEGKQ43CP/p1683621220358259
[2] https://redhat-internal.slack.com/archives/CEGKQ43CP/p1683643286581299?thread_ts=1683621220.358259&cid=CEGKQ43CP
Version-Release number of selected component (if applicable):
4.11.34
How reproducible:
Have a cluster on 4.10.35 (or presumably any 4.10.z) and update to 4.11.34
Steps to Reproduce:
1. Open the web console
2. On the Dashboard/Overview, click on "Update cluster"
3. Change the channel to stable-4.11
4. Select a new version and, from the drop-down menu, click on "Include supported but not recommended versions"
5. Select 4.11.34
6. The message from the problem description appears
Actual results:
Unclear message
Expected results:
Clear message
Description of problem:
etcd-backup fails with 'FIPS mode is enabled, but the required OpenSSL library is not available' on 4.13 FIPS enabled cluster
Version-Release number of selected component (if applicable):
OCP 4.13
How reproducible:
Steps to Reproduce:
1. run etcd-backup script on FIPS enabled OCP 4.13 2. 3.
Actual results:
backup script fails with:
+ etcdctl snapshot save /home/core/assets/backup/snapshot_2023-08-28_125218.db
FIPS mode is enabled, but the required OpenSSL library is not available
Expected results:
successful run of etcd-backup script
Additional info:
4.13 uses RHEL9-based RHCOS while the etcd image still uses RHEL8, and this could be the main issue. If so, the image should be rebuilt with RHEL9.
Description of problem:
STS cluster awareness was in tech preview for testing and quality assurance before release. The unit tests that were created, and their runs, have indicated no change in cluster operation. QE has reported several bugs and they've been fixed. A periodic e2e test, which verifies that a Secret is generated when an STS cluster is detected and proper AWS resource access tokens are present in the CredentialsRequest, has been passing and has also passed when run manually on several follow-on PRs.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The Azure CCM will panic when it loses its leader election lease. This is contrary to the behaviour of other components which exit intentionally. See https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-azure-modern/1632791244243472384
Version-Release number of selected component (if applicable):
How reproducible:
Force the CCM to lose leader election; this can happen during upgrades.
Steps to Reproduce:
1. 2. 3.
Actual results:
Code will panic, eg E0306 18:09:14.315039 1 runtime.go:77] Observed a panic: leaderelection lost goroutine 1 [running]: k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1adc660?, 0x219b9c0}) /go/src/github.com/openshift/cloud-provider-azure/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x99 k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x81e22e?}) /go/src/github.com/openshift/cloud-provider-azure/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x75 panic({0x1adc660, 0x219b9c0}) /usr/lib/golang/src/runtime/panic.go:884 +0x212 sigs.k8s.io/cloud-provider-azure/cmd/cloud-controller-manager/app.NewCloudControllerManagerCommand.func1.1() /go/src/github.com/openshift/cloud-provider-azure/cmd/cloud-controller-manager/app/controllermanager.go:138 +0x27 k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run.func1() /go/src/github.com/openshift/cloud-provider-azure/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:203 +0x1f k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run(0xc0002c0d80, {0x21bce08, 0xc0001ac008}) /go/src/github.com/openshift/cloud-provider-azure/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:213 +0x14d k8s.io/client-go/tools/leaderelection.RunOrDie({0x21bce08, 0xc0001ac008}, {{0x21c0e00, 0xc0002c0c60}, 0x1fe5d61a00, 0x18e9b26e00, 0x60db88400, {0xc000418080, 0x1fc4978, 0x0}, ...}) /go/src/github.com/openshift/cloud-provider-azure/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:226 +0x94 sigs.k8s.io/cloud-provider-azure/cmd/cloud-controller-manager/app.NewCloudControllerManagerCommand.func1(0xc000170000?, {0x1ea43e2?, 0xd?, 0xd?}) /go/src/github.com/openshift/cloud-provider-azure/cmd/cloud-controller-manager/app/controllermanager.go:130 +0x3a7 github.com/spf13/cobra.(*Command).execute(0xc000170000, {0xc00019e010, 0xd, 0xd}) /go/src/github.com/openshift/cloud-provider-azure/vendor/github.com/spf13/cobra/command.go:876 +0x67b github.com/spf13/cobra.(*Command).ExecuteC(0xc000170000) /go/src/github.com/openshift/cloud-provider-azure/vendor/github.com/spf13/cobra/command.go:990 +0x3bd github.com/spf13/cobra.(*Command).Execute(...) 
/go/src/github.com/openshift/cloud-provider-azure/vendor/github.com/spf13/cobra/command.go:918 main.main() /go/src/github.com/openshift/cloud-provider-azure/cmd/cloud-controller-manager/controller-manager.go:47 +0xc5 panic: leaderelection lost [recovered] panic: leaderelection lost goroutine 1 [running]: k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x81e22e?}) /go/src/github.com/openshift/cloud-provider-azure/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:56 +0xd7 panic({0x1adc660, 0x219b9c0}) /usr/lib/golang/src/runtime/panic.go:884 +0x212 sigs.k8s.io/cloud-provider-azure/cmd/cloud-controller-manager/app.NewCloudControllerManagerCommand.func1.1() /go/src/github.com/openshift/cloud-provider-azure/cmd/cloud-controller-manager/app/controllermanager.go:138 +0x27 k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run.func1() /go/src/github.com/openshift/cloud-provider-azure/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:203 +0x1f k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run(0xc0002c0d80, {0x21bce08, 0xc0001ac008}) /go/src/github.com/openshift/cloud-provider-azure/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:213 +0x14d k8s.io/client-go/tools/leaderelection.RunOrDie({0x21bce08, 0xc0001ac008}, {{0x21c0e00, 0xc0002c0c60}, 0x1fe5d61a00, 0x18e9b26e00, 0x60db88400, {0xc000418080, 0x1fc4978, 0x0}, ...}) /go/src/github.com/openshift/cloud-provider-azure/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:226 +0x94 sigs.k8s.io/cloud-provider-azure/cmd/cloud-controller-manager/app.NewCloudControllerManagerCommand.func1(0xc000170000?, {0x1ea43e2?, 0xd?, 0xd?}) /go/src/github.com/openshift/cloud-provider-azure/cmd/cloud-controller-manager/app/controllermanager.go:130 +0x3a7 github.com/spf13/cobra.(*Command).execute(0xc000170000, {0xc00019e010, 0xd, 0xd}) /go/src/github.com/openshift/cloud-provider-azure/vendor/github.com/spf13/cobra/command.go:876 +0x67b github.com/spf13/cobra.(*Command).ExecuteC(0xc000170000) /go/src/github.com/openshift/cloud-provider-azure/vendor/github.com/spf13/cobra/command.go:990 +0x3bd github.com/spf13/cobra.(*Command).Execute(...) /go/src/github.com/openshift/cloud-provider-azure/vendor/github.com/spf13/cobra/command.go:918 main.main() /go/src/github.com/openshift/cloud-provider-azure/cmd/cloud-controller-manager/controller-manager.go:47 +0xc5
Expected results:
Code should exit without panicking
Additional info:
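A minimal Go sketch of the expected behaviour follows, assuming a hypothetical onStoppedLeading callback; in the real controller it would be wired into client-go's leaderelection.LeaderCallbacks, and the point is simply to log and exit instead of panicking.

package main

import (
	"log"
	"os"
)

// onStoppedLeading is a sketch of the behaviour expected from the Azure CCM's
// leader-election callback: when the lease is lost, log and exit so the pod is
// restarted cleanly, instead of the panic("leaderelection lost") seen in the
// trace above. Wiring it into client-go's
// leaderelection.LeaderCallbacks{OnStoppedLeading: onStoppedLeading} is left
// out to keep the sketch self-contained.
func onStoppedLeading() {
	log.Println("leader election lost, exiting")
	os.Exit(0)
}

func main() {
	// In the real controller this is invoked by the leader elector; calling
	// it directly here just demonstrates the clean-exit behaviour.
	onStoppedLeading()
}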
Description of problem:
The modal displayed when installing a Helm chart shows a Documentation link field. This field can never be populated with a value and is always N/A. An annotation for the documentation URL doesn't exist in https://github.com/redhat-certification/chart-verifier/blob/main/docs/helm-chart-annotations.md#provider-annotations
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Go to Helm chart catalog 2. View any chart 3. See documentation = "N/A"
Actual results:
N/A
Expected results:
A way to populate the value
Additional info:
The value is consumed here: https://github.com/openshift/console/blob/2e8624014065d09ba40164221dd612d882f20395/frontend/packages/console-shared/src/components/catalog/details/CatalogDetailsPanel.tsx But it is never extracted from a chart: https://github.com/openshift/console/blob/2e8624014065d09ba40164221dd612d882f20395/frontend/packages/helm-plugin/src/catalog/utils/catalog-utils.tsx#L138 It is probably because no such annotation exists in chart certification requirements/recommendations: https://github.com/redhat-certification/chart-verifier/blob/main/docs/helm-chart-annotations.md#provider-annotations
This is a clone of issue OCPBUGS-19674. The following is the description of the original issue:
—
Description of problem:
Version-Release number of selected component (if applicable):
When using a route to expose the API server endpoint in a HostedCluster, the .status.controlPlaneEndpoint.port is reported as 6443 (the internal port) instead of 443 which is the port that is externally exposed via the route.
How reproducible:
Always
Steps to Reproduce:
1. Create a HostedCluster with a custom DNS name using route as the strategy 2. Inspect .status.controlPlaneEndpoint
Actual results:
It has 6443 as the port
Expected results:
It has 443 as the port
Additional info:
Please review the following PR: https://github.com/openshift/cloud-provider-openstack/pull/188
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Based on bugs from the ART team (example: https://issues.redhat.com/browse/OCPBUGS-12347), 4.14 images should be built with Go 1.20, but the prometheus container image is built with go1.19.6
$ token=`oc create token prometheus-k8s -n openshift-monitoring` $ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/label/goversion/values' | jq { "status": "success", "data": [ "go1.19.6", "go1.20.3" ] }
searched from thanos API
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query={__name__=~".*",goversion="go1.19.6"}' | jq { "status": "success", "data": { "resultType": "vector", "result": [ { "metric": { "__name__": "prometheus_build_info", "branch": "rhaos-4.14-rhel-8", "container": "kube-rbac-proxy", "endpoint": "metrics", "goarch": "amd64", "goos": "linux", "goversion": "go1.19.6", "instance": "10.128.2.19:9092", "job": "prometheus-k8s", "namespace": "openshift-monitoring", "pod": "prometheus-k8s-0", "prometheus": "openshift-monitoring/k8s", "revision": "fe01b9f83cb8190fc8f04c16f4e05e87217ab03e", "service": "prometheus-k8s", "tags": "unknown", "version": "2.43.0" }, "value": [ 1682576802.496, "1" ] }, ...
prometheus-k8s-0 container names: [prometheus config-reloader thanos-sidecar prometheus-proxy kube-rbac-proxy kube-rbac-proxy-thanos]; the prometheus image is built with go1.19.6
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- prometheus --version prometheus, version 2.43.0 (branch: rhaos-4.14-rhel-8, revision: fe01b9f83cb8190fc8f04c16f4e05e87217ab03e) build user: root@402ffbe02b57 build date: 20230422-00:43:08 go version: go1.19.6 platform: linux/amd64 tags: unknown $ oc -n openshift-monitoring exec -c config-reloader prometheus-k8s-0 -- prometheus-config-reloader --version prometheus-config-reloader, version 0.63.0 (branch: rhaos-4.14-rhel-8, revision: ce71a7d) build user: root build date: 20230424-15:53:51 go version: go1.20.3 platform: linux/amd64 $ oc -n openshift-monitoring exec -c thanos-sidecar prometheus-k8s-0 -- thanos --version thanos, version 0.31.0 (branch: rhaos-4.14-rhel-8, revision: d58df6d218925fd007e16965f50047c9a4194c42) build user: root@c070c5e6af32 build date: 20230422-00:44:21 go version: go1.20.3 platform: linux/amd64 # owned by oauth team, not responsible by Monitoring $ oc -n openshift-monitoring exec -c prometheus-proxy prometheus-k8s-0 -- oauth-proxy --version oauth2_proxy was built with go1.18.10 # below isssue is tracked by bug OCPBUGS-12821 $ oc -n openshift-monitoring exec -c kube-rbac-proxy prometheus-k8s-0 -- kube-rbac-proxy --version Kubernetes v0.0.0-master+$Format:%H$ $ oc -n openshift-monitoring exec -c kube-rbac-proxy-thanos prometheus-k8s-0 -- kube-rbac-proxy --version Kubernetes v0.0.0-master+$Format:%H$
The following files should be fixed:
https://github.com/openshift/prometheus/blob/master/.ci-operator.yaml#L4
https://github.com/openshift/prometheus/blob/master/Dockerfile.ocp#L1
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-04-26-154754
How reproducible:
always
Actual results:
4.14 prometheus is built with go1.19.6
Expected results:
4.14 prometheus image should be built with go1.20
Additional info:
no functional impact
Along with the external disruption tests via the API DNS name, we should also check that the apiserver is not disrupted via the api-int and service network endpoints.
Description of problem:
The CCMs at the moment are given RBAC permissions of "get, list, watch" on secrets across all namespaces. This was a security concern raised by the OpenShift Security team. The Nutanix CCM currently creates a secrets informer and a configmaps informer at the cluster scope; these are then passed into the NewProvider call for the prism environment. Within the prism environment, the configmap and secret informers are used once each, and only to list a single namespace. We should modify the informer creation to limit it to just the namespaces required (see the sketch under Additional info below). This would reduce the scope of RBAC required and meet the OpenShift security requirements.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
Actual results:
Expected results:
Additional info:
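A minimal Go sketch of the namespace-scoped informer idea follows, using client-go's informers.WithNamespace option with a fake client; the namespace name is an assumption for illustration, not the CCM's actual configuration.

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes/fake"
)

func main() {
	// A fake client stands in for the real cluster client the CCM builds.
	client := fake.NewSimpleClientset()

	// Hypothetical namespace; the real value would be the namespace the
	// prism environment actually reads its Secret and ConfigMap from.
	const ccmNamespace = "openshift-cloud-controller-manager"

	// Scoping the factory with WithNamespace means the secrets and
	// configmaps informers only list/watch that one namespace, so the
	// RBAC grant can be a namespaced Role instead of a cluster-wide one.
	factory := informers.NewSharedInformerFactoryWithOptions(
		client,
		10*time.Minute,
		informers.WithNamespace(ccmNamespace),
	)

	secretInformer := factory.Core().V1().Secrets().Informer()
	configMapInformer := factory.Core().V1().ConfigMaps().Informer()

	fmt.Println("informers created:", secretInformer != nil, configMapInformer != nil)
}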
This is a clone of issue OCPBUGS-19356. The following is the description of the original issue:
—
Backport facilitator for linked issue.
Description of problem:
Fixed by @wking, opening this bug for Jira linking. The cluster-dns-operator sets the status condition's lastTransitionTime whenever the status (true, false, unknown), reason, or message changes on a condition. It should only set the lastTransitionTime if the condition status changes. Otherwise this can have an effect on status flapping between true and false. See https://github.com/openshift/api/blob/master/config/v1/types_cluster_operator.go#L129
Version-Release number of selected component (if applicable):
4.15 and earlier
How reproducible:
100%
Steps to Reproduce:
1. Put cluster-dns-operator in a Degraded condition by stopping a pod, notice the lastTransitionTime 2. Wait 1 second and stop another pod, which only updates the condition message
Actual results:
Notice the lastTransitionTime for the Degraded condition changes when the message changes, even though the status is still Degraded=true
Expected results:
The lastTransitionTime should not change unless the Degraded status itself changes; changes to the message or reason alone should not update it.
Additional info:
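A minimal Go sketch of the expected condition handling follows, using a simplified condition type rather than the ClusterOperator API; it only illustrates that lastTransitionTime is carried over when the status value does not change.

package main

import (
	"fmt"
	"time"
)

// condition is a simplified stand-in for a ClusterOperator status condition.
type condition struct {
	Type               string
	Status             string
	Reason             string
	Message            string
	LastTransitionTime time.Time
}

// setCondition is a sketch of the expected behaviour: Reason and Message may
// be refreshed on every reconcile, but LastTransitionTime only moves when the
// Status value itself flips (e.g. Degraded False -> True).
func setCondition(existing *condition, desired condition, now time.Time) {
	if existing.Status != desired.Status {
		existing.LastTransitionTime = now
	}
	existing.Status = desired.Status
	existing.Reason = desired.Reason
	existing.Message = desired.Message
}

func main() {
	t0 := time.Now()
	degraded := condition{Type: "Degraded", Status: "True", LastTransitionTime: t0}

	// A second reconcile with the same status but a new message must not
	// move the transition time.
	setCondition(&degraded, condition{Type: "Degraded", Status: "True",
		Message: "another pod stopped"}, t0.Add(time.Second))

	fmt.Println(degraded.LastTransitionTime.Equal(t0)) // prints: true
}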
Description of problem:
A QE Prow CI job updates hostedcluster.spec.pullSecret for some QE catalog source configurations. 4.13 jobs failed with the error message: Error from server (HostedCluster.spec.pullSecret.name: Invalid value: "9509a26c339de31aa3c9-pull-secret-new": Attempted to change an immutable field): admission webhook "hostedclusters.hypershift.openshift.io" denied the request: HostedCluster.spec.pullSecret.name: Invalid value: "9509a26c339de31aa3c9-pull-secret-new": Attempted to change an immutable field
Version-Release number of selected component (if applicable):
4.13
How reproducible:
4.13 job: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/pr-logs/pull/openshift_release/41339/rehearse-41339-periodic-ci-openshift-openshift-tests-private-release-4.13-amd64-nightly-aws-ipi-ovn-hypershift-guest-p1-f7/1689831180221812736
Steps to Reproduce:
see the above job
Actual results:
job failed to config pull secret for hostedcluster
Expected results:
job could run successfully
Additional info:
1. The 4.14 hypershift QE CI jobs were successfully executed with the same code. 2. I can update the 4.13 hostedcluster spec.pullSecret in my local hypershift environment. It seems to be caused by some limitation that only exists in Prow.
slack thread: https://redhat-internal.slack.com/archives/C01C8502FMM/p1691736890938529
Description of problem:
TRT has unfortunately had to revert this breaking change to get CI and/or nightly payloads flowing again. The original PR was https://github.com/openshift/cluster-storage-operator/pull/381. The revert PR: https://github.com/openshift/cluster-storage-operator/pull/384 The following evidence helped us pushing for the revert: In the nightly payload runs, periodic-ci-openshift-release-master-nightly-4.14-e2e-metal-ipi-sdn-bm has been consistently failing in the last three nightly payloads. But the run in the revert PR passed. To restore your change, create a new PR that reverts the revert and layers additional separate commit(s) on top that addresses the problem. Contact information for TRT is available at https://source.redhat.com/groups/public/atomicopenshift/atomicopenshift_wiki/how_to_contact_the_technical_release_team. Please reach out if you need assistance in relanding your change or have feedback about this process.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
When a machine is created with a compute availability zone (defined via mpool.zones) and a storage root volume (defined as mpool.rootVolume) and that rootVolume has no specified zones, CAPO will use the compute AZ for the volume AZ. This can be problematic if the AZ doesn't exist in Cinder. Source: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/9d183bd479fe9aed4f6e7ac3d5eee46681c518e7/pkg/cloud/services/compute/instance.go#L439-L442
Version-Release number of selected component (if applicable):
All versions supporting rootVolume AZ.
Steps to Reproduce:
1. In install-config.yaml, add "zones" with valid Nova AZs, and a rootVolume without "zones". Your OpenStack cloud must not have Cinder AZs (only Nova AZs).
2. The Day 1 deployment will go fine; Terraform will create the machines with no AZ.
3. Day 2 operations on machines will fail since CAPO tries to use the Nova AZ for the root volume if no volume AZ is provided, but since the AZs don't match between Cinder and Nova, the machine will never be created.
Actual results:
Machine not created
Expected results:
Machine created in the right AZ for both Nova & Cinder
Description of problem:
- Calico virtual NICs should be excluded from the node_exporter collector.
- All NICs beginning with cali* should be added to collector.netclass.ignored-devices to ensure that metrics are not collected.
- node_exporter is meant to collect metrics for physical interfaces only.
Version-Release number of selected component (if applicable):
OpenShift 4.12
How reproducible:
Always
Steps to Reproduce:
Run an OpenShift cluster using Calico SDN. Go to Observe -> Metrics and run the following PromQL query: "group by(device) (node_network_info)". Observe that Calico virtual NICs are present.
Actual results:
Calico virtual NICs are present in OCP Metrics.
Expected results:
Only physical network interfaces should be present.
Additional info:
Similar to this bug, but for Calico virtual NICs: https://issues.redhat.com/browse/OCPBUGS-1321
We've removed the SR-IOV code that was using python-grpcio and python-protobuf. These are gone from Python's requirements.txt, but we never removed them from the RPM spec we use to build Kuryr in OpenShift. This should be fixed.
When updating from 4.12 to 4.13, the incoming ovn-k8s-cni-overlay expects RHEL 9, and fails to run on the still-RHEL-8 4.12 nodes.
4.13 and 4.14 ovn-k8s-cni-overlay vs. 4.12 RHCOS's RHEL 8.
100%
Picked up in TestGrid.
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-gcp-ovn-rt-upgrade/1677232369326624768/artifacts/e2e-gcp-ovn-rt-upgrade/gather-extra/artifacts/nodes/ci-op-y7r1x9z3-3a480-9swt7-master-2/journal | zgrep dns-operator | tail -n1 Jul 07 12:34:30.202100 ci-op-y7r1x9z3-3a480-9swt7-master-2 kubenswrapper[2168]: E0707 12:34:30.201720 2168 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"dns-operator-78cbdc89fd-kckcd_openshift-dns-operator(5c97a52b-f774-40ae-8c17-a17b30812596)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"dns-operator-78cbdc89fd-kckcd_openshift-dns-operator(5c97a52b-f774-40ae-8c17-a17b30812596)\\\": rpc error: code = Unknown desc = failed to create pod network sandbox k8s_dns-operator-78cbdc89fd-kckcd_openshift-dns-operator_5c97a52b-f774-40ae-8c17-a17b30812596_0(1fa1dd2b35100b0f1ec058d79042a316b909e38711fcadbf87bd9a1e4b62e0d3): error adding pod openshift-dns-operator_dns-operator-78cbdc89fd-kckcd to CNI network \\\"multus-cni-network\\\": plugin type=\\\"multus\\\" name=\\\"multus-cni-network\\\" failed (add): [openshift-dns-operator/dns-operator-78cbdc89fd-kckcd/5c97a52b-f774-40ae-8c17-a17b30812596:ovn-kubernetes]: error adding container to network \\\"ovn-kubernetes\\\": netplugin failed: \\\"/var/lib/cni/bin/ovn-k8s-cni-overlay: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by /var/lib/cni/bin/ovn-k8s-cni-overlay)\\\\n/var/lib/cni/bin/ovn-k8s-cni-overlay: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by /var/lib/cni/bin/ovn-k8s-cni-overlay)\\\\n\\\"\"" pod="openshift-dns-operator/dns-operator-78cbdc89fd-kckcd" podUID=5c97a52b-f774-40ae-8c17-a17b30812596
Successful update.
Both 4.14 and 4.13 control planes can be associated with 4.12 compute nodes, because of EUS-to-EUS updates.
This is a clone of issue OCPBUGS-19550. The following is the description of the original issue:
—
Multus doesn't need to watch pods on other nodes. To save memory and CPU set MULTUS_NODE_NAME to filter pods that multus watches.
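A minimal Go sketch of node-scoped pod watching follows, using client-go's WithTweakListOptions and a spec.nodeName field selector with a fake client; this illustrates the filtering idea, not Multus' actual implementation.

package main

import (
	"fmt"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes/fake"
)

func main() {
	// MULTUS_NODE_NAME would normally be injected via the downward API.
	nodeName := os.Getenv("MULTUS_NODE_NAME")
	if nodeName == "" {
		nodeName = "worker-0" // fallback for this sketch only
	}

	client := fake.NewSimpleClientset()

	// Filtering the pod list/watch by spec.nodeName means the informer only
	// caches pods scheduled to this node, cutting memory and apiserver load.
	factory := informers.NewSharedInformerFactoryWithOptions(
		client,
		10*time.Minute,
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.FieldSelector = fields.OneTermEqualSelector("spec.nodeName", nodeName).String()
		}),
	)

	podInformer := factory.Core().V1().Pods().Informer()
	fmt.Println("node-scoped pod informer created:", podInformer != nil)
}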
Description of problem: Multus currently uses a certificate that is valid for 10 minutes; we need to add configuration for certificates that are valid for 24 hours
Description of problem:
Similar to OCPBUGS-11636, ccoctl needs to be updated to account for the S3 bucket changes described in https://aws.amazon.com/blogs/aws/heads-up-amazon-s3-security-changes-are-coming-in-april-of-2023/. These changes have rolled out to us-east-2 and the China regions as of today and will roll out to additional regions in the near future. See OCPBUGS-11636 for additional information.
Version-Release number of selected component (if applicable):
How reproducible:
Reproducible in affected regions.
Steps to Reproduce:
1. Use "ccoctl aws create-all" flow to create STS infrastructure in an affected region like us-east-2. Notice that document upload fails because the s3 bucket is created in a state that does not allow usage of ACLs with the s3 bucket.
Actual results:
./ccoctl aws create-all --name abutchertestue2 --region us-east-2 --credentials-requests-dir ./credrequests --output-dir _output
2023/04/11 13:01:06 Using existing RSA keypair found at _output/serviceaccount-signer.private
2023/04/11 13:01:06 Copying signing key for use by installer
2023/04/11 13:01:07 Bucket abutchertestue2-oidc created
2023/04/11 13:01:07 Failed to create Identity provider: failed to upload discovery document in the S3 bucket abutchertestue2-oidc: AccessControlListNotSupported: The bucket does not allow ACLs
    status code: 400, request id: 2TJKZC6C909WVRK7, host id: zQckCPmozx+1yEhAj+lnJwvDY9rG14FwGXDnzKIs8nQd4fO4xLWJW3p9ejhFpDw3c0FE2Ggy1Yc=
Expected results:
"ccoctl aws create-all" successfully creates IAM and S3 infrastructure. OIDC discovery and JWKS documents are successfully uploaded to the S3 bucket and are publicly accessible.
Additional info:
CI is flaky because the TestRouterCompressionOperation test fails.
I have seen these failures on 4.14 CI jobs.
Presently, search.ci reports the following stats for the past 14 days:
Found in 7.71% of runs (16.58% of failures) across 402 total runs and 24 jobs (46.52% failed)
GCP is most impacted:
pull-ci-openshift-cluster-ingress-operator-master-e2e-gcp-operator (all) - 44 runs, 86% failed, 37% of failures match = 32% impact
Azure and AWS are also impacted:
pull-ci-openshift-cluster-ingress-operator-master-e2e-azure-operator (all) - 36 runs, 64% failed, 43% of failures match = 28% impact
pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator (all) - 38 runs, 79% failed, 23% of failures match = 18% impact
1. Post a PR and have bad luck.
2. Check https://search.ci.openshift.org/?search=compression+error%3A+expected&maxAge=336h&context=1&type=build-log&name=cluster-ingress-operator&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job.
The test fails:
TestAll/serial/TestRouterCompressionOperation
=== RUN   TestAll/serial/TestRouterCompressionOperation
    router_compression_test.go:209: compression error: expected "gzip", got "" for canary route
CI passes, or it fails on a different test.
Description of problem:
// Defines resource requests and limits for the Alertmanager container.
should be
// Defines resource requests and limits for the Thanos Ruler container.
Please review the following PR: https://github.com/openshift/kube-rbac-proxy/pull/66
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Kubernetes and other associated dependencies need to be updated to protect against potential vulnerabilities.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Since the introduction of https://github.com/openshift/origin/pull/27570, the openshift-tests binary now looks up the cluster infra resource for later usage (setting the TEST_PROVIDER env var when running the run-test command to inject details about the cluster). Since MicroShift does not have this resource, the returned value is nil and the binary panics when it is used later in the code.
Version-Release number of selected component (if applicable):
How reproducible:
Run openshift-tests and it immediately panics
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Due to the removal of the in-tree AWS provider (https://github.com/kubernetes/kubernetes/pull/115838) we need to ensure that KCM is setting the --external-cloud-volume-plugin flag accordingly, especially since the CSI migration was GA'd in 4.12/1.25.
Description of problem:
In the Topology side panel, in the PipelineRuns section, clicking the "Start last run" button displays an error alert message
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Create a deployment with pipeline 2. Click on deployment to open side panel 3. Click "Start last run" button in PipelineRuns section
Actual results:
Error alert message is displayed
Expected results:
Should be able to run the last run
Additional info:
Description of problem:
We have seen unit tests flaking on the mapping within the OnDelete policy tests for the control plane machine set. It turns out there is a race condition: given the right timing, if a reconcile is in progress while a machine is marked for deletion, the load balancing part of the algorithm fails to apply properly.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The Dockerfile should not reference any CI images.
Description of problem:
Sync "Debug in Terminal" feature with 3.x pods in web console The types of pods that enable the "Debug in terminal" feature should be in alignment with those in v3.11. See code here: https://github.com/openshift/origin-web-console/blob/c37982397087036321312172282e139da378eff2/app/scripts/directives/resources.js#L33-L53
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The commit "UPSTREAM: <carry>: Force using host go always and use host libriaries" introduced a build failure for the Windows kubelet that is showing up only in release-4.11 for an unknown reason but could potentially occur on other releases too.
Version-Release number of selected component (if applicable):
WMCO version: 9.0.0 and below
How reproducible:
Always on release-4.11
Steps to Reproduce:
1. Clone the WMCO repo 2. Build the WMCO image
Actual results:
WMCO image build fails
Expected results:
WMCO image build should succeed
Description of problem:
Most content on the "Command Line Tools" page is not internationalized (i18n).
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2023-03-10-165006
How reproducible:
Always
Steps to Reproduce:
1.Go to "?"-> "Command Line Tools" page. Add "?pseudolocalization=true&lng=en" at the end of the url. Check if all contents are i18n. 2. 3.
Actual results:
1. Most of contents are not i18n.
Expected results:
1.All contents should be i18n.
Additional info:
Please review the following PR: https://github.com/openshift/cluster-capi-operator/pull/104
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The OCP upgrade is blocked because the cluster operator csi-snapshot-controller fails to start its deployment with a fatal 'read-only filesystem' message
Version-Release number of selected component (if applicable):
Red Hat OpenShift 4.11 rhacs-operator.v3.72.1
How reproducible:
At least once in user's cluster while upgrading
Steps to Reproduce:
1. Have a OCP 4.11 installed 2. Install ACS on top of the OCP cluster 3. Upgrade OCP to the next z-stream version
Actual results:
Upgrade gets blocked: waiting on csi-snapshot-controller
Expected results:
Upgrade should succeed
Additional info:
The stackrox SCCs (stackrox-admission-control, stackrox-collector and stackrox-sensor) set `readOnlyRootFilesystem` to `true`. If an SCC is not explicitly defined/requested, other Pods might receive one of these SCCs, which will make the deployment fail with a `read-only filesystem` message.
Description of problem:
When installing a 3 master + 2 worker BM IPv6 cluster with proxy, worker BMHs are failing inspection with the message: "Could not contact ironic-inspector for version discovery: Unable to find a version discovery document". This causes the installation to fail due to nodes with worker role never joining the cluster. However, when installing with no workers, the issue does not reproduce and the cluster installs successfully.
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2023-01-04-203333
How reproducible:
100%
Steps to Reproduce:
1. Attempt to install an IPv6 cluster with 3 masters + 2 workers and proxy with baremetal installer
Actual results:
Installation never completes because a number of pods are in Pending status
Expected results:
Workers join the cluster and installation succeeds
Additional info:
$ oc get events
LAST SEEN   TYPE     REASON              OBJECT                               MESSAGE
174m        Normal   InspectionError     baremetalhost/openshift-worker-0-1   Failed to inspect hardware. Reason: unable to start inspection: Could not contact ironic-inspector for version discovery: Unable to find a version discovery document at https://[fd2e:6f44:5dd8::37]:5050, the service is unavailable or misconfigured. Required version range (any - any), version hack disabled.
174m        Normal   InspectionError     baremetalhost/openshift-worker-0-0   Failed to inspect hardware. Reason: unable to start inspection: Could not contact ironic-inspector for version discovery: Unable to find a version discovery document at https://[fd2e:6f44:5dd8::37]:5050, the service is unavailable or misconfigured. Required version range (any - any), version hack disabled.
174m        Normal   InspectionStarted   baremetalhost/openshift-worker-0-0   Hardware inspection started
174m        Normal   InspectionStarted   baremetalhost/openshift-worker-0-1   Hardware inspection started
This is actually a better design since BMO does not need to be coupled with Ironic (unlike Ironic and httpd, for example). But the current architecture also has two real issues:
The main thing to fix is to make BMO talk to Ironic via its external IP instead of localhost.
Description of problem:
RHEL-7 already comes with {{xz}} installed, but in RHEL-8 it needs to be explicitly installed.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
always
Steps to Reproduce:
1. Use an image based on Dockerfile.upi.ci.rhel8 2. Trigger a CI job that uses the xz tool 3.
Actual results:
/bin/sh: xz: command not found
tar: /tmp/secret/terraform_state.tar.xz: Wrote only 4096 of 10240 bytes
tar: Child returned status 127
tar: Error is not recoverable: exiting now
Expected results:
no errors
Additional info:
Step: https://github.com/openshift/release/blob/master/ci-operator/step-registry/upi/install/vsphere/upi-install-vsphere-commands.sh#L185 And investigation by Jinyun Ma: https://github.com/openshift/release/pull/39991#issuecomment-1581937323
Description of problem:
The Machine and its respective Node should indicate the proper zones, but the Machine doesn't indicate the proper zone on a multiple vCenter zones cluster
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2023-02-07-064924
How reproducible:
always
Steps to Reproduce:
1.Create a multiple vCenter zones cluster sh-4.4$ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.13.0-0.nightly-2023-02-07-064924 True False 58m Cluster version is 4.13.0-0.nightly-2023-02-07-064924 sh-4.4$ oc get machine NAME PHASE TYPE REGION ZONE AGE jima15b-x4584-master-0 Running us-east 88m jima15b-x4584-master-1 Running us-east 88m jima15b-x4584-master-2 Running us-west 88m jima15b-x4584-worker-0-26hml Running us-east 81m jima15b-x4584-worker-1-zljp8 Running us-east 81m jima15b-x4584-worker-2-kkdzf Running us-west 81m 2.Check machine labels and node labels sh-4.4$ oc get machine jima15b-x4584-worker-0-26hml -oyaml apiVersion: machine.openshift.io/v1beta1 kind: Machine metadata: annotations: machine.openshift.io/instance-state: poweredOn creationTimestamp: "2023-02-09T02:28:03Z" finalizers: - machine.machine.openshift.io generateName: jima15b-x4584-worker-0- generation: 2 labels: machine.openshift.io/cluster-api-cluster: jima15b-x4584 machine.openshift.io/cluster-api-machine-role: worker machine.openshift.io/cluster-api-machine-type: worker machine.openshift.io/cluster-api-machineset: jima15b-x4584-worker-0 machine.openshift.io/region: us-east machine.openshift.io/zone: "" name: jima15b-x4584-worker-0-26hml namespace: openshift-machine-api sh-4.4$ oc get node jima15b-x4584-worker-0-26hml --show-labels NAME STATUS ROLES AGE VERSION LABELS jima15b-x4584-worker-0-26hml Ready worker 9m4s v1.26.0+9eb81c2 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=vsphere-vm.cpu-4.mem-16gb.os-unknown,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east,failure-domain.beta.kubernetes.io/zone=us-east-1a,kubernetes.io/arch=amd64,kubernetes.io/hostname=jima15b-x4584-worker-0-26hml,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=vsphere-vm.cpu-4.mem-16gb.os-unknown,node.openshift.io/os_id=rhcos,topology.csi.vmware.com/openshift-region=us-east,topology.csi.vmware.com/openshift-zone=us-east-1a,topology.kubernetes.io/region=us-east,topology.kubernetes.io/zone=us-east-1a
Actual results:
Machine doesn’t indicate proper zone, it's machine.openshift.io/zone: ""
Expected results:
Machine should indicate proper zone
Additional info:
Discussed here https://redhat-internal.slack.com/archives/GE2HQ9QP4/p1675848293159359
Description of problem:
when checking the bug https://issues.redhat.com/browse/OCPBUGS-15976, we found that the default ingresscontroller's DNSReady condition is True even though DNS records failed to be published to the public zone, and co/ingress doesn't report any error.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-07-05-191022
How reproducible:
100%
Steps to Reproduce:
1. install Azure cluster configured for manual mode with Azure Workload Identity 2. check dnsrecords of default-wildcard $ oc -n openshift-ingress-operator get dnsrecords default-wildcard -oyaml <---snip---> - conditions: - lastTransitionTime: "2023-07-10T04:23:55Z" message: 'The DNS provider failed to ensure the record: failed to update dns ...... reason: ProviderError status: "False" type: Published dnsZone: id: /subscriptions/xxxxx/resourceGroups/os4-common/providers/Microsoft.Network/dnszones/qe.azure.devcluster.openshift.com 3. Check ingresscontroller status $ oc -n openshift-ingress-operator get ingresscontroller default -oyaml <---snip---> - lastTransitionTime: "2023-07-10T04:23:55Z" message: The record is provisioned in all reported zones. reason: NoFailedZones status: "True" type: DNSReady 4. Check co/ingress status $ oc get co/ingress NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE ingress 4.14.0-0.nightly-2023-07-05-191022 True False False 127m
Actual results:
1. DNSReady is True and message shows: The record is provisioned in all reported zones. 2. co/ingress doesn't report any error
Expected results:
DNSReady should be False since the record failed to be published to the public zone (see the sketch under Additional info below)
Additional info:
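A minimal Go sketch of the expected aggregation follows, using simplified types rather than the ingress operator's actual status code; it only illustrates that a single unpublished zone should make DNSReady False.

package main

import "fmt"

// zoneStatus is a simplified view of one dnsrecords zone condition.
type zoneStatus struct {
	Zone      string
	Published bool
}

// dnsReady is a sketch of the expected aggregation: the ingresscontroller's
// DNSReady condition should only be True when the record is Published in
// every configured zone, so a public-zone failure surfaces as DNSReady=False.
func dnsReady(zones []zoneStatus) (bool, string) {
	for _, z := range zones {
		if !z.Published {
			return false, "record not published in zone " + z.Zone
		}
	}
	return true, "record published in all zones"
}

func main() {
	zones := []zoneStatus{
		{Zone: "private", Published: true},
		{Zone: "public", Published: false}, // the failure from this bug
	}
	ready, msg := dnsReady(zones)
	fmt.Println(ready, msg) // prints: false record not published in zone public
}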
This is a clone of issue OCPBUGS-19314. The following is the description of the original issue:
—
As a user, I don't want to see the "DeploymentConfigs" option in the User settings when I have not installed it in the cluster.
Description of problem:
When deploying a 4.14 spoke, the agentclusterinstall is stuck at the finalizing stage
clusterversions on the spoke report "Unable to apply 4.14.0-0.ci-2023-06-13-083232: the cluster operator monitoring is not available"
Please note: the console operator is disabled on purpose; this is needed in the telco case to reduce platform resource usage
[kni@registry.kni-qe-28 ~]$ oc get clusterversions.config.openshift.io -A
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version False True 46m Unable to apply 4.14.0-0.ci-2023-06-13-083232: the cluster operator monitoring is not available
[kni@registry.kni-qe-28 ~]$ oc get clusterversions.config.openshift.io -n version -o yaml apiVersion: v1 items: - apiVersion: config.openshift.io/v1 kind: ClusterVersion metadata: creationTimestamp: "2023-06-13T15:16:32Z" generation: 2 name: version resourceVersion: "20061" uid: f8fc0c3e-009d-4d86-a05d-2fd0aba59528 spec: capabilities: additionalEnabledCapabilities: - marketplace - NodeTuning baselineCapabilitySet: None channel: stable-4.14 clusterID: 5cfc0491-5a23-4383-935b-71e3c793e875 status: availableUpdates: null capabilities: enabledCapabilities: - NodeTuning - marketplace knownCapabilities: - CSISnapshot - Console - Insights - NodeTuning - Storage - baremetal - marketplace - openshift-samples conditions: - lastTransitionTime: "2023-06-13T15:16:33Z" message: 'Unable to retrieve available updates: Get "https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.14&id=5cfc0491-5a23-4383-935b-71e3c793e875&version=4.14.0-0.ci-2023-06-13-083232": dial tcp 54.211.39.83:443: connect: network is unreachable' reason: RemoteFailed status: "False" type: RetrievedUpdates - lastTransitionTime: "2023-06-13T15:16:33Z" message: Capabilities match configured spec reason: AsExpected status: "False" type: ImplicitlyEnabledCapabilities - lastTransitionTime: "2023-06-13T15:16:33Z" message: Payload loaded version="4.14.0-0.ci-2023-06-13-083232" image="registry.kni-qe-28.ptp.lab.eng.bos.redhat.com:5000/openshift-release-dev/ocp-release@sha256:826bb878c5a1469ee8bb991beebc38a4e25b8f5cef9cdf1931ef99ffe5ffbc80" architecture="amd64" reason: PayloadLoaded status: "True" type: ReleaseAccepted - lastTransitionTime: "2023-06-13T15:16:33Z" status: "False" type: Available - lastTransitionTime: "2023-06-13T15:41:36Z" message: Cluster operator monitoring is not available reason: ClusterOperatorNotAvailable status: "True" type: Failing - lastTransitionTime: "2023-06-13T15:16:33Z" message: 'Unable to apply 4.14.0-0.ci-2023-06-13-083232: the cluster operator monitoring is not available' reason: ClusterOperatorNotAvailable status: "True" type: Progressing desired: image: registry.kni-qe-28.ptp.lab.eng.bos.redhat.com:5000/openshift-release-dev/ocp-release@sha256:826bb878c5a1469ee8bb991beebc38a4e25b8f5cef9cdf1931ef99ffe5ffbc80 version: 4.14.0-0.ci-2023-06-13-083232 history: - completionTime: null image: registry.kni-qe-28.ptp.lab.eng.bos.redhat.com:5000/openshift-release-dev/ocp-release@sha256:826bb878c5a1469ee8bb991beebc38a4e25b8f5cef9cdf1931ef99ffe5ffbc80 startedTime: "2023-06-13T15:16:33Z" state: Partial verified: false version: 4.14.0-0.ci-2023-06-13-083232 observedGeneration: 2 versionHash: H6tRc6p_ZWU= kind: List metadata: resourceVersion: "" [kni@registry.kni-qe-28 ~]$ oc get co -A NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE authentication 4.14.0-0.ci-2023-06-13-083232 True False False 14m cloud-controller-manager 4.14.0-0.ci-2023-06-13-083232 True False False 24m cloud-credential 4.14.0-0.ci-2023-06-13-083232 True False False 25m cluster-autoscaler 4.14.0-0.ci-2023-06-13-083232 True False False 24m config-operator 4.14.0-0.ci-2023-06-13-083232 True False False 25m control-plane-machine-set 4.14.0-0.ci-2023-06-13-083232 True False False 24m dns 4.14.0-0.ci-2023-06-13-083232 True False False 19m etcd 4.14.0-0.ci-2023-06-13-083232 True False False 22m image-registry 4.14.0-0.ci-2023-06-13-083232 True False False 14m ingress 4.14.0-0.ci-2023-06-13-083232 True False False 25m kube-apiserver 4.14.0-0.ci-2023-06-13-083232 True False False 18m kube-controller-manager 
4.14.0-0.ci-2023-06-13-083232 True False False 19m kube-scheduler 4.14.0-0.ci-2023-06-13-083232 True False False 17m kube-storage-version-migrator 4.14.0-0.ci-2023-06-13-083232 True False False 25m machine-api 4.14.0-0.ci-2023-06-13-083232 True False False 25m machine-approver 4.14.0-0.ci-2023-06-13-083232 True False False 24m machine-config 4.14.0-0.ci-2023-06-13-083232 True False False 21m marketplace 4.14.0-0.ci-2023-06-13-083232 True False False 25m monitoring False True True 14m reconciling Console Plugin failed: creating ConsolePlugin object failed: the server could not find the requested resource (post consoleplugins.console.openshift.io) network 4.14.0-0.ci-2023-06-13-083232 True False False 26m node-tuning 4.14.0-0.ci-2023-06-13-083232 True False False 25m openshift-apiserver 4.14.0-0.ci-2023-06-13-083232 True False False 14m openshift-controller-manager 4.14.0-0.ci-2023-06-13-083232 True False False 18m operator-lifecycle-manager 4.14.0-0.ci-2023-06-13-083232 True False False 25m operator-lifecycle-manager-catalog 4.14.0-0.ci-2023-06-13-083232 True False False 25m operator-lifecycle-manager-packageserver 4.14.0-0.ci-2023-06-13-083232 True False False 19m service-ca 4.14.0-0.ci-2023-06-13-083232 True False False 25m
Version-Release number of selected component (if applicable):
4.14
How reproducible:
100%
Steps to Reproduce:
1. Deploy RAN DU spoke cluster via gitops ZTP approach with multiple base capabilities disabled including Console operator. spec: capabilities: additionalEnabledCapabilities: - marketplace - NodeTuning baselineCapabilitySet: None channel: stable-4.14 2. Monitor ocp deployment on spoke.
Actual results:
Deployment fails while finalizing the agentclusterinstall. clusterversions on the spoke report "the cluster operator monitoring is not available"
Expected results:
Successful spoke deployment
Additional info:
After manually enabling console in clusterversion, the monitoring operator succeeded and OCP install completed
must-gather logs:
https://drive.google.com/file/d/19zO21jqcVTIkAdGS2DEqQuhg2oGUmuNY/view?usp=sharing
https://drive.google.com/file/d/1PXjZmBdMwHWNwkaXr2wE9tTtBRJWYeKP/view?usp=sharing
Description of problem:
While reviewing PRs in CoreDNS 1.11.0, we stumbled upon https://github.com/coredns/coredns/pull/6179, which describes a CoreDNS crash in the kubernetes plugin if you create an EndpointSlice object that contains a port without a port number. I reproduced this myself and was able to bring down all of CoreDNS so that the cluster was put into a degraded state. We've bumped to CoreDNS 1.11.1 in 4.15, so this is a concern for < 4.15.
Version-Release number of selected component (if applicable):
Less than or equal to 4.14
How reproducible:
100%
Steps to Reproduce:
1. Create an EndpointSlice with a port that has no port number:
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: example-abc
addressType: IPv4
ports:
- name: ""
2. Shortly after creating this object, all DNS pods continuously crash:
oc get -n openshift-dns pods
NAME                READY   STATUS             RESTARTS     AGE
dns-default-57lmh   1/2     CrashLoopBackOff   1 (3s ago)   79m
dns-default-h6cvm   1/2     CrashLoopBackOff   1 (4s ago)   79m
dns-default-mn7qd   1/2     CrashLoopBackOff   1 (3s ago)   79m
dns-default-mxq5g   1/2     CrashLoopBackOff   1 (3s ago)   79m
dns-default-wdrff   1/2     CrashLoopBackOff   1 (3s ago)   79m
dns-default-zs7cd   1/2     CrashLoopBackOff   1 (3s ago)   79m
Actual results:
DNS Pods crash
Expected results:
DNS Pods should NOT crash
Additional info:
Description of problem:
The dynamic demo plugin locales are missing a correct plural string. The dynamic demo plugin doesn't use the script the console uses to transform plural strings, so we need to update the plural string manually.
This would help with further validation of the i18n dependencies update changes, and also with the investigation of the [Dynamic plugin translation support for plurals broken](https://issues.redhat.com/browse/OCPBUGS-11285) bug
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Deploy the dynamic demo plugin on a cluster 2. Go to the Overview page 3.
Actual results:
The Node Worker string is NOT in the correct plural format
Expected results:
The Node Worker string is in the correct plural format
Additional info:
Description of problem:
In order for Windows nodes to use the openshift-cluster-csi-drivers/internal-feature-states.csi.vsphere.vmware.com ConfigMap, which contains the configuration for vSphere CSI, `csi-windows-support` must be set to true. This is documented here: https://github.com/kubernetes-sigs/vsphere-csi-driver/blob/833421f42475809b4f76ea125095b5120af0f8e1/docs/book/features/csi_driver_on_windows.md#how-to-enable-vsphere-csi-with-windows-nodes Without this, a separate ConfigMap must be created and used by a user deploying Windows vSphere CSI drivers.
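As a hedged illustration (not a documented procedure), a client-go sketch that flips the feature-state key in the existing ConfigMap; the ConfigMap name, namespace, and key come from the description above, everything else (kubeconfig loading, error handling) is assumed:

package main

import (
    "context"
    "log"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Assumes a kubeconfig in the default location; adjust as needed.
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        log.Fatal(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)

    ctx := context.Background()
    cm, err := client.CoreV1().ConfigMaps("openshift-cluster-csi-drivers").
        Get(ctx, "internal-feature-states.csi.vsphere.vmware.com", metav1.GetOptions{})
    if err != nil {
        log.Fatal(err)
    }

    // Enable Windows support in the vSphere CSI feature states.
    if cm.Data == nil {
        cm.Data = map[string]string{}
    }
    cm.Data["csi-windows-support"] = "true"

    if _, err := client.CoreV1().ConfigMaps(cm.Namespace).Update(ctx, cm, metav1.UpdateOptions{}); err != nil {
        log.Fatal(err)
    }
}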
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Add a Windows node to the cluster 2. Deploy vsphere csi daemonset for windows nodes as documented upstream 3. Add a Windows pod with a pvc mount
Actual results:
The pod is unable to mount the volume as windows support is not enabled
Expected results:
The pod can mount the volume
Additional info:
Description of problem:
When we expand the baremetal IPI cluster with a static IP, no information is logged if the nmstate output is "--- {}\n", and the customized image is generated without the static network configuration.
Version-Release number of selected component (if applicable):
4.11
How reproducible:
100%
Steps to Reproduce:
1. Expand the baremetal IPI cluster with a node using the invalid nmstate data below:
   ---
   apiVersion: v1
   kind: Secret
   metadata:
     name: openshift-worker-0-network-config-secret
   type: Opaque
   stringData:
     nmstate: |
       foo: bar: baz
   ---
   apiVersion: v1
   kind: Secret
   metadata:
     name: openshift-worker-0-bmc-secret
     namespace: openshift-machine-api
   type: Opaque
   data:
     username: YWRtaW4K
     password: cGFzc3dvcmQK
   ---
   apiVersion: metal3.io/v1alpha1
   kind: BareMetalHost
   metadata:
     name: openshift-worker-0
     namespace: openshift-machine-api
   spec:
     online: True
     bootMACAddress: 52:54:00:11:22:b4
     bmc:
       address: ipmi://192.168.123.1:6233
       credentialsName: openshift-worker-0-bmc-secret
       disableCertificateVerification: True
       username: admin
       password: password
     rootDeviceHints:
       deviceName: "/dev/sda"
     preprovisioningNetworkDataName: openshift-worker-0-network-config-secret
2. Check whether an IP is configured on the node.
Actual results:
No static network configuration in the metal3 customized image.
Expected results:
Information should be logged and the metal3 customized image should not be generated.
Additional info:
https://github.com/openshift/image-customization-controller/pull/72
This is a clone of issue OCPBUGS-17724. The following is the description of the original issue:
—
Environment: OCP 4.12.24
Installation Method: IPI: Manual Mode + STS using a customer-provided AWS IAM Role
I am trying to deploy an OCP4 cluster on AWS for my customer. The customer does not permit creation of IAM users, so I am performing a Manual Mode with STS IPI installation instead. I have been given an IAM role to assume for the OCP installation, but unfortunately the customer's AWS Organizational Service Control Policy (SCP) does not permit the use of the iam:GetUser permission.
(I have informed my customer that iam:GetUser is an installation requirement - it's clearly documented in our docs, and I have raised a ticket with their internal support team requesting that their SCP is amended to include iam:GetUser; however, I have been informed that my request is likely to be rejected).
With this limitation understood, I still attempted to install OCP4. Surprisingly, I was able to deploy an OCP (4.12) cluster without any apparent issues, however when I tried to destroy the cluster I encountered the following error from the installer (note: fields in brackets <> have been redacted):
DEBUG search for IAM roles
DEBUG iterating over a page of 74 IAM roles
DEBUG search for IAM users
DEBUG iterating over a page of 1 IAM users
INFO get tags for <ARN of the IAM user>: AccessDenied: User:<ARN of my user> is not authorized to perform: iam:GetUser on resource: <IAM username> with an explicit deny in a service control policy
INFO status code: 403, request id: <request ID>
DEBUG search for IAM instance profiles
INFO error while finding resources to delete error=get tags for <ARN of IAM user> AccessDenied: User:<ARN of my user> is not authorized to perform: iam:GetUser on resource: <IAM username> with an explicit deny in a service control policy status code: 403, request id: <request ID>
Similarly, the error in AWS CloudTrail logs shows the following (note: some fields in brackets have been redacted):
User: arn:aws:sts::<AWS account no>:assumed-role/<role-name>/<user name> is not authorized to perform: iam:GetUser on resource <IAM User> with an explicit deny in a service control policy
It appears that the destroy operation is failing when the installer is trying to list tags on the only IAM user in the customer's AWS account. As discussed, the SCP does not permit the use of iam:GetUser and consequently this API call on the IAM user is denied. The installer then enters an endless loop as it continuously retries the operation. We have potentially identified the iamUserSearch function within the installer code at pkg/destroy/aws/iamhelpers.go as the area where this call is failing.
There does not appear to be a handler for the "AccessDenied" API error in this function. Therefore we request that the access denied event is gracefully handled and skipped over when processing IAM users, allowing the installer to continue with the destroy operation, much in the same way that a similar access denied event is handled within the iamRoleSearch function when processing IAM roles.
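A minimal sketch of the requested handling, assuming the aws-sdk-go v1 error types the installer already uses; tagsOrSkip and listUserTags are hypothetical names, not the installer's actual code:

package main

import (
    "errors"
    "log"

    "github.com/aws/aws-sdk-go/aws/awserr"
)

// listUserTags is a stand-in for the call that fails in iamUserSearch.
type listUserTags func(userName string) (map[string]string, error)

// tagsOrSkip returns the user's tags, but treats AccessDenied as "skip this
// user and keep going" instead of aborting the whole destroy run.
func tagsOrSkip(list listUserTags, userName string) (map[string]string, bool, error) {
    tags, err := list(userName)
    if err != nil {
        var aerr awserr.Error
        if errors.As(err, &aerr) && aerr.Code() == "AccessDenied" {
            log.Printf("skipping IAM user %s: access denied listing tags", userName)
            return nil, false, nil // skipped, not fatal
        }
        return nil, false, err
    }
    return tags, true, nil
}

func main() {
    fake := func(string) (map[string]string, error) {
        return nil, awserr.New("AccessDenied", "explicit deny in a service control policy", nil)
    }
    if _, ok, err := tagsOrSkip(fake, "example-user"); err == nil && !ok {
        log.Println("user skipped, destroy continues")
    }
}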
We therefore request that the following is considered and addressed:
1. Re-assess if the iam:GetUser permission is actually needed for cluster installation/cluster operations.
2. If the permission is required, then the installer should provide a warning or halt the installation.
3. During a "destroy" cluster operation, the installer should gracefully handle AccessDenied errors from the API, "skip over" any IAM Users that the installer does not have permission to list tags for, and then continue gracefully with the destroy operation.
The controller should wait until the service times out on the CVO and not time out by itself.
Please review the following PR: https://github.com/openshift/machine-api-provider-ibmcloud/pull/18
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/oauth-server/pull/119
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When an HCP Service LB is created, for example for an IngressController, the CAPA controller calls ModifyNetworkInterfaceAttribute. It references the default security group for the VPC in addition to the security group created for the cluster (with the right tags). Ideally, the LBs (and any other HCP components) should not use the default VPC SecurityGroup.
Version-Release number of selected component (if applicable):
All 4.12 and 4.13
How reproducible:
100%
Steps to Reproduce:
1. Create HCP 2. Wait for Ingress to come up. 3. Look in CloudTrail for ModifyNetworkInterfaceAttribute, and see default security group referenced
Actual results:
Default security group is used
Expected results:
Default security group should not be used
Additional info:
This is problematic as we are attempting to scope our AWS permissions as tightly as possible. The goal is to only use resources that are tagged with `red-hat-managed: true` so that our IAM Policies can be conditioned to only access these resources. Using the Security Group created for the cluster should be sufficient, and the default Security Group does not need to be used, so if the usage can be removed here, we can secure our AWS policies that much better. Similar to OCPBUGS-11894
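A hedged aws-sdk-go v1 sketch of selecting only security groups tagged red-hat-managed: true instead of falling back to the VPC default group; the region and session setup are placeholder assumptions:

package main

import (
    "fmt"
    "log"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
    sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("us-east-1")))
    client := ec2.New(sess)

    // Only consider security groups carrying the red-hat-managed tag, so IAM
    // policies conditioned on that tag are sufficient and the VPC default
    // security group is never touched.
    out, err := client.DescribeSecurityGroups(&ec2.DescribeSecurityGroupsInput{
        Filters: []*ec2.Filter{
            {Name: aws.String("tag:red-hat-managed"), Values: []*string{aws.String("true")}},
        },
    })
    if err != nil {
        log.Fatal(err)
    }
    for _, sg := range out.SecurityGroups {
        fmt.Println(aws.StringValue(sg.GroupId), aws.StringValue(sg.GroupName))
    }
}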
Description of problem:
The oc idle tests do not expect the deprecation warning in their output and break.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Run the test 2. Watch it fail 3.
Actual results:
Error running /usr/bin/oc --namespace=e2e-test-oc-idle-hns4c --kubeconfig=/tmp/configfile3347652119 describe deploymentconfigs v1 DeploymentConfig is deprecated in v4.14+, unavailable in v4.10000+ deploymentconfig.apps.openshift.io: StdOut> Warning: apps.openshift.io/v1 DeploymentConfig is deprecated in v4.14+, unavailable in v4.10000+ Error from server (NotFound): deploymentconfigs.apps.openshift.io "v1 DeploymentConfig is deprecated in v4.14+, unavailable in v4.10000+ deploymentconfig.apps.openshift.io" not found StdErr> Warning: apps.openshift.io/v1 DeploymentConfig is deprecated in v4.14+, unavailable in v4.10000+ Error from server (NotFound): deploymentconfigs.apps.openshift.io "v1 DeploymentConfig is deprecated in v4.14+, unavailable in v4.10000+ deploymentconfig.apps.openshift.io" not found exit status 1
Expected results:
Tests should pass
Additional info:
I have tracked down the problem to this line: https://github.com/openshift/origin/blob/master/test/extended/cli/idle.go#LL49C40-L49C40 deploymentConfigName gets assigned to "v1 DeploymentConfig is deprecated in v4.14+, unavailable in v4.10000+ deploymentconfig.apps.openshift.io", which leads to the next command not finding a deployment config.
Description of problem:
The target.workload.openshift.io/management annotation causes CNO operator pods to wait for nodes to appear. Eventually they give up waiting and get scheduled. This annotation should not be set for the hosted control plane topology, since we should not wait for nodes to exist for the CNO to be scheduled.
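As a hedged illustration of the intended behavior (the function and the annotation value shown are assumptions, not CNO's actual code): omit the workload-management annotation when the control-plane topology is External, as it is for hosted control planes.

package main

import (
    "fmt"

    configv1 "github.com/openshift/api/config/v1"
)

const managementAnnotation = "target.workload.openshift.io/management"

// podAnnotations returns the annotations to stamp on operator pods. For
// hosted (external) control planes there are no nodes to wait for, so the
// management workload annotation is omitted. The annotation value here is
// illustrative only.
func podAnnotations(topology configv1.TopologyMode) map[string]string {
    annotations := map[string]string{}
    if topology != configv1.ExternalTopologyMode {
        annotations[managementAnnotation] = `{"effect": "PreferredDuringScheduling"}`
    }
    return annotations
}

func main() {
    fmt.Println(podAnnotations(configv1.ExternalTopologyMode))        // empty map - no waiting on nodes
    fmt.Println(podAnnotations(configv1.HighlyAvailableTopologyMode)) // annotation present
}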
Version-Release number of selected component (if applicable):
4.14, 4.13
How reproducible:
always
Steps to Reproduce:
1. Create IBM ROKS cluster 2. Wait for cluster to come up 3.
Actual results:
Cluster takes a long time to come up because CNO pods take ~15 min to schedule.
Expected results:
Cluster comes up quickly
Additional info:
Note: Verification for the fix has already happened on the IBM Cloud side. All OCP QE needs to do is to make sure that the fix doesn't cause any regression to the regular OCP use case.
Description of problem:
Techpreview parallel jobs are failing due to changes in the insights operator Example failure: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-e2e-gcp-sdn-techpreview/1663408887002304512 Looks like it's from https://github.com/openshift/insights-operator/pull/764 https://sippy.dptools.openshift.org/sippy-ng/jobs/4.14/analysis?filters=%7B%22items%22%3A%5B%7B%22id%22%3A0%2C%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22equals%22%2C%22value%22%3A%22periodic-ci-openshift-release-master-ci-4.14-e2e-aws-sdn-techpreview%22%7D%2C%7B%22id%22%3A1%2C%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22equals%22%2C%22value%22%3A%22periodic-ci-openshift-release-master-ci-4.14-e2e-gcp-sdn-techpreview%22%7D%2C%7B%22id%22%3A2%2C%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22equals%22%2C%22value%22%3A%22periodic-ci-openshift-release-master-nightly-4.14-e2e-vsphere-ovn-techpreview%22%7D%5D%2C%22linkOperator%22%3A%22or%22%7D
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of the problem:
On a cluster with 3 masters, after attaching 5 additional disks across the 3 masters, the device storage sets for the operator show only 3 storage devices and not the expected 5 additional disks.
How reproducible:
80%
OCP 4.12, OCS 4.12.1
Also reproduces on OCP 4.11.
Steps to reproduce:
1. Create a cluster with 3 master nodes
2. Attach 2 additional disks to master 1, 2 additional disks to master 2, and 1 additional disk to master 3
3. Check the count of storage devices on the operator
Actual results:
The operator shows a device set count of 3
Expected results:
The device set count should equal the number of valid additional attached disks (= 5)
Description of problem:
When deploying hosts using ironic's agent both the ironic service address and inspector address are required. The ironic service is proxied such that it can be accessed at a consistent endpoint regardless of where the pod is running. This is not the case for the inspection service. This means that if the inspection service moves after we find the address, provisioning will fail. In particular this non-matching behavior is frustrating when using the CBO [GetIronicIP function|https://github.com/openshift/cluster-baremetal-operator/blob/6f0a255fdcc7c0e5c04166cb9200be4cee44f4b7/provisioning/utils.go#L95-L127] as one return value is usable forever but the other needs to somehow be re-queried every time the pod moves.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Relatively
Steps to Reproduce:
1. Retrieve the inspector IP from GetIronicIP 2. Reschedule the inspector service pod 3. Provision a host
Actual results:
Ironic python agent raises an exception
Expected results:
Host provisions
Additional info:
This was found while deploying clusters using ZTP. In this scenario specifically, an image containing the ironic inspector IP is valid for an extended period of time. The same image can be used for multiple hosts and possibly multiple different spoke clusters. Our controller shouldn't be expected to watch the ironic pod to ensure we update the image whenever it moves. The best we can do is re-query the inspector IP whenever a user makes changes to the image, but that may still not be often enough.
Description of problem:
When a CatalogSource name starts with a number, the pod does not run well. We should add a validation check for the name: if the name does not match the validation regex '[a-z]([-a-z0-9]*[a-z0-9])?', print a message and refuse to create the CatalogSource.
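A minimal sketch of the kind of pre-creation check being asked for, using the apimachinery validator that implements the same DNS-1035 rule; the surrounding function is hypothetical:

package main

import (
    "fmt"
    "strings"

    "k8s.io/apimachinery/pkg/util/validation"
)

// validateCatalogSourceName rejects names (such as "611-oci-index") that
// cannot be used as a Service name, before the CatalogSource is created.
func validateCatalogSourceName(name string) error {
    if errs := validation.IsDNS1035Label(name); len(errs) > 0 {
        return fmt.Errorf("invalid CatalogSource name %q: %s", name, strings.Join(errs, "; "))
    }
    return nil
}

func main() {
    fmt.Println(validateCatalogSourceName("611-oci-index")) // error: must start with an alphabetic character
    fmt.Println(validateCatalogSourceName("oci-611-index")) // <nil>
}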
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. skopeo copy --all --format v2s2 docker://icr.io/cpopen/ibm-zcon-zosconnect-catalog@sha256:6f02ecef46020bcd21bdd24a01f435023d5fc3943972ef0d9769d5276e178e76 oci:///home1/611/oci-index
2. Change the work directory: `cd home1/611/oci-index`
3. Run the oc-mirror command:
   cat config.yaml
   kind: ImageSetConfiguration
   apiVersion: mirror.openshift.io/v1alpha2
   storageConfig:
     local:
       path: /home1/ocilocalstorage
   mirror:
     operators:
     - catalog: oci:///home1/611/oci-index
   `oc-mirror --config config.yaml docker://ec2-18-217-58-249.us-east-2.compute.amazonaws.com:5000/multi-oci --dest-skip-tls --include-local-oci-catalogs`
4. Apply the CatalogSource and ICSP YAML files.
5. Check the CatalogSource pod.
Actual results:
[root@preserve-fedora36 oci-index]# oc get pod --show-labels NAME READY STATUS RESTARTS AGE LABELS 611-oci-index-2sfh8 0/1 Terminating 0 4s olm.catalogSource=611-oci-index,olm.pod-spec-hash=6b8656f87 611-oci-index-dbj9b 0/1 ContainerCreating 0 1s olm.catalogSource=611-oci-index,olm.pod-spec-hash=6b8656f87 611-oci-index-w4tfd 0/1 Terminating 0 2s olm.catalogSource=611-oci-index,olm.pod-spec-hash=6b8656f87 611-oci-index-zj8nn 0/1 Terminating 0 3s olm.catalogSource=611-oci-index,olm.pod-spec-hash=6b8656f87 oc get catalogsource 611-oci-index -oyaml apiVersion: operators.coreos.com/v1alpha1 kind: CatalogSource metadata: creationTimestamp: "2023-05-10T03:01:36Z" generation: 1 name: 611-oci-index namespace: openshift-marketplace resourceVersion: "97108" uid: 2287434b-9e70-4865-b1a1-95997165f94e spec: image: ec2-18-217-58-249.us-east-2.compute.amazonaws.com:5000/multi-oci/home1/611/oci-index:6f02ec sourceType: grpc status: message: 'couldn''t ensure registry server - error ensuring service: 611-oci-index: Service "611-oci-index" is invalid: metadata.name: Invalid value: "611-oci-index": a DNS-1035 label must consist of lower case alphanumeric characters or ''-'', start with an alphabetic character, and end with an alphanumeric character (e.g. ''my-name'', or ''abc-123'', regex used for validation is ''[a-z]([-a-z0-9]*[a-z0-9])?'')' reason: RegistryServerError
Expected results:
The CatalogSource should not be created when its name does not match the validation regex.
Additional info:
After renaming the CatalogSource to oci-611-index, the pod runs well, and the operator and instance can be created.
Description of problem:
The current version of openshift/cluster-ingress-operator vendors Kubernetes 1.26 packages. OpenShift 4.14 is based on Kubernetes 1.27.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Check https://github.com/openshift/cluster-ingress-operator/blob/release-4.14/go.mod
Actual results:
Kubernetes packages (k8s.io/api, k8s.io/apimachinery, and k8s.io/client-go) are at version v0.26
Expected results:
Kubernetes packages are at version v0.27.0 or later.
Additional info:
Using old Kubernetes API and client packages brings a risk of API compatibility issues. controller-runtime will need to be bumped to v0.15 as well.
After the 'runbook_url' annotation test was increased in severity in https://github.com/openshift/origin/pull/27933, it started permafailing.
Example logs:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ironic-image/379/pull-ci-openshift-ironic-image-master-prevalidation-e2e-metal-ipi-virtualmedia-prevalidation/1666311316056313856
This is a clone of issue OCPBUGS-19376. The following is the description of the original issue:
—
Description of problem:
IPI installation using the service account attached to a GCP VM always fails with the error "unable to parse credentials"
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-09-15-233408
How reproducible:
Always
Steps to Reproduce:
1. "create install-config" 2. edit install-config.yaml to insert "credentialsMode: Manual" 3. "create manifests" 4. manually create the required credentials and copy the manifests to installation-dir/manifests directory 5. launch the bastion host along with binding to the pre-configured service account ipi-on-bastion-sa@openshift-qe.iam.gserviceaccount.com and scopes being "cloud-platform" 6. copy the installation-dir and openshift-install to the bastion host 7. try "create cluster" on the bastion host
Actual results:
The installation failed on "Creating infrastructure resources"
Expected results:
The installation should succeed.
Additional info:
(1) FYI the 4.12 epic: https://issues.redhat.com/browse/CORS-2260 (2) 4.12.34 doesn't have the issue (Flexy-install/234112/). (3) 4.13.13 doesn’t have the issue (Flexy-install/234126/). (4) The 4.14 errors (Flexy-install/234113/): 09-19 16:13:44.919 level=info msg=Consuming Master Ignition Config from target directory 09-19 16:13:44.919 level=info msg=Consuming Bootstrap Ignition Config from target directory 09-19 16:13:44.919 level=info msg=Consuming Worker Ignition Config from target directory 09-19 16:13:44.919 level=info msg=Credentials loaded from gcloud CLI defaults 09-19 16:13:49.071 level=info msg=Creating infrastructure resources... 09-19 16:13:50.950 level=error 09-19 16:13:50.950 level=error msg=Error: unable to parse credentials 09-19 16:13:50.950 level=error 09-19 16:13:50.950 level=error msg= with provider["openshift/local/google"], 09-19 16:13:50.950 level=error msg= on main.tf line 10, in provider "google": 09-19 16:13:50.950 level=error msg= 10: provider "google" { 09-19 16:13:50.950 level=error 09-19 16:13:50.950 level=error msg=unexpected end of JSON input 09-19 16:13:50.950 level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failure applying terraform for "cluster" stage: failed to create cluster: failed to apply Terraform: exit status 1 09-19 16:13:50.950 level=error 09-19 16:13:50.950 level=error msg=Error: unable to parse credentials 09-19 16:13:50.950 level=error 09-19 16:13:50.950 level=error msg= with provider["openshift/local/google"], 09-19 16:13:50.950 level=error msg= on main.tf line 10, in provider "google": 09-19 16:13:50.950 level=error msg= 10: provider "google" { 09-19 16:13:50.950 level=error 09-19 16:13:50.950 level=error msg=unexpected end of JSON input 09-19 16:13:50.950 level=error
The agent does not replace localhost.localdomain node names with MAC addresses
when the cluster network configuration is static IPs with a VLAN.
Found in agent log
Dec 20 17:37:42 localhost.localdomain inventory[2284]: time="20-12-2022 17:37:42" level=info msg="Replaced original forbidden hostname with calculated one" file="inventory.go:63" calculated=localhost.localdomain original=localhost.localdomain
As a result:
Cluster is not ready yet.
The cluster is not ready yet. Some hosts have an ineligible name. To change the hostname, click on it.
How reproducible:
1. Provision libvirt VMs and network with VLAN
2. Create cluster and select Static IP Network configuration
3. Fill in all required fields in the form view and press Next
4. Generate and download ISO
5. Wait until the nodes are up and discovered
Actual results:
Nodes have localhost.localdomain names
Expected results:
Nodes are named after the host's MAC address
Description of problem:
Cluster Network Operator managed component multus-admission-controller does not conform to Hypershift control plane expectations. When CNO is managed by Hypershift, multus-admission-controller must run with a non-root security context. If Hypershift runs the control plane on a Kubernetes (as opposed to OpenShift) management cluster, it adds a pod or container security context to most deployments with a runAsUser clause inside. In the Hypershift CPO, the security context of deployment containers, including CNO, is set when it detects that SCCs are not available, see https://github.com/openshift/hypershift/blob/9d04882e2e6896d5f9e04551331ecd2129355ecd/support/config/deployment.go#L96-L100. In such a case, CNO should do the same: set the security context for its managed deployment multus-admission-controller to meet the Hypershift standard.
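A hedged corev1 sketch of what setting that security context could look like when no SCCs are detected; the UID and the detection flag are placeholders, not Hypershift's or CNO's actual logic:

package main

import (
    "fmt"

    appsv1 "k8s.io/api/apps/v1"
    corev1 "k8s.io/api/core/v1"
)

// applyNonRootSecurityContext mirrors the behavior described above: when SCCs
// are not available on the management cluster, run the deployment's pods as a
// fixed non-root UID.
func applyNonRootSecurityContext(d *appsv1.Deployment, sccAvailable bool, uid int64) {
    if sccAvailable {
        return // on OpenShift management clusters SCCs handle this
    }
    runAsNonRoot := true
    d.Spec.Template.Spec.SecurityContext = &corev1.PodSecurityContext{
        RunAsUser:    &uid,
        RunAsNonRoot: &runAsNonRoot,
    }
}

func main() {
    d := &appsv1.Deployment{}
    applyNonRootSecurityContext(d, false, 1001) // 1001 is an illustrative UID
    fmt.Printf("%+v\n", d.Spec.Template.Spec.SecurityContext)
}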
How reproducible:
Always
Steps to Reproduce:
1. Create an OCP cluster using Hypershift with a Kube management cluster 2. Check the pod security context of multus-admission-controller
Actual results:
no pod security context is set
Expected results:
pod security context is set with runAsUser: xxxx
Additional info:
This is the highest priority item from https://issues.redhat.com/browse/OCPBUGS-7942 and it needs to be fixed ASAP as it is a security issue preventing IBM from releasing Hypershift-managed Openshift service.
Description of the problem:
When a 9.2-based live ISO is used in the AgentServiceConfig, after booting into the CD, the spoke console is stuck at acquiring the live PXE rootfs with a "could not resolve host" error.
It seems the DNS server configured in nmstate is not applied to the spoke.
How reproducible:
100%
Steps to reproduce:
2. install SNO via ZTP
3. Monitor install CRs on hub
Actual results:
Expected results:
Extra info:
Description of the problem:
Infraenv creation data missing
How reproducible:
data is propagated only on infraenv update
Steps to reproduce:
1. create new cluster
2. check elastic data: some special feature is missing
Please review the following PR: https://github.com/openshift/machine-api-provider-nutanix/pull/42
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-13152. The following is the description of the original issue:
—
Description of problem:
With OCPBUGS-11099 our Pipeline Plugin supports the TektonConfig config "embedded-status: minimal" option that will be the default in OpenShift Pipelines 1.11+.
But since this change, the Pipeline pages load the TaskRuns for all Pipeline and PipelineRun rows. To decrease the risk of a performance issue, we should make this call only if status.tasks isn't defined.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
Actual results:
The list page loads a list of TaskRuns for each Pipeline / PipelineRun even if the PipelineRun already contains the related data (status.tasks)
Expected results:
No unnecessary network calls. When the admin changes the TektonConfig "embedded-status" option to minimal, the UI should still work and load the TaskRuns as it does today.
Additional info:
None
Description of the problem:
#!/bin/bash
while sleep 0.5; do
  for i in {1..10}; do
    curl -I -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" 'https://api.stage.openshift.com/api/assisted-install/v2/infra-envs/3dc00d41-46bf-4b83-9874-f21812263c97/downloads/files?discovery_iso_type=full-iso&file_name=discovery.ign' > /dev/null &
  done
done
The script above causes assisted-service CPU usage to spike and the 99th percentile of request latency to jump to 10s.
How reproducible:
100%
Steps to reproduce:
1. run script above
2. check response time/cpu usage
3.
Actual results:
response time really slow / 504
Expected results:
service continues to run smoothly
Description of the problem:
Change the user message from: "Host is not compatible with cluster platform %s; either disable this host or choose a compatible cluster platform (%v)" to "Host is not compatible with cluster platform %s; either disable this host or discover a new, compatible host."
How reproducible:
100%
Steps to reproduce:
1.
2.
3.
Actual results:
Expected results:
Description of problem:
Fix grammatical error in feedback modal. Remove 'the' before openshift text.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The OCP FeatureGate object gets a new status field where the enabled feature gates are listed. We should use this new field instead of parsing FeatureGate.Spec.
This should be fully transparent to users: they still set FeatureGate.Spec, and they should still observe that the SharedResource CSI driver + operator is installed when they enable the TechPreviewNoUpgrade feature set there.
Enhancement: https://github.com/openshift/cluster-storage-operator/pull/368
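A hedged Go sketch of consuming the new status field instead of parsing the spec; the field names follow the 4.14 openshift/api config/v1 types as best understood here, and the gate name is only an example, so both should be double-checked against the current API:

package main

import (
    "fmt"

    configv1 "github.com/openshift/api/config/v1"
)

// gateEnabled reports whether a given feature gate is listed as enabled in
// FeatureGate.Status for the given payload version, rather than being
// inferred from FeatureGate.Spec.
func gateEnabled(fg *configv1.FeatureGate, version string, gate configv1.FeatureGateName) bool {
    for _, details := range fg.Status.FeatureGates {
        if details.Version != version {
            continue
        }
        for _, enabled := range details.Enabled {
            if enabled.Name == gate {
                return true
            }
        }
    }
    return false
}

func main() {
    fg := &configv1.FeatureGate{} // normally fetched with a config client
    fmt.Println(gateEnabled(fg, "4.14.0", "CSIDriverSharedResource"))
}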
Sanitize OWNERS/OWNER_ALIASES:
1) OWNERS must have:
component: "Storage / Kubernetes External Components"
2) OWNER_ALIASES must have all team members of Storage team.
Description of problem:
Metrics page is broken
Version-Release number of selected component (if applicable):
Openshift Pipelines 1.9.0 on 4.12
How reproducible:
Always
Steps to Reproduce:
1. Install Openshift Pipelines 1.9.0 2. Create a pipeline and run it several times 3. Update metrics.pipelinerun.duration-type and metrics.taskrun.duration-type to lastvalue 4. Navigate to created pipeline 5. Switch to Metrics tab
Actual results:
The Metrics page shows an error
Expected results:
Metrics of the pipeline should be shown
Additional info:
Description of problem:
There are different versions and channels for the operator, but they may use the same 'latest' tag; when mirroring them as `additionalImages`, we got the error below:
[root@ip-172-31-249-209 jian]# oc-mirror --config mirror.yaml file:///root/jian/test/ ... ... sha256:672b4bee759f8115e5538a44c37c415b362fc24b02b0117fd4bdcc129c53e0a1 file://brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator:latest sha256:d90aecc425e1b2e0732d0a90bc84eb49eb1139e4d4fd8385070d00081c80b71c file://brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator:latest error: unable to push manifest to file://brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator:latest: symlink sha256:f6b6a15c4477615ff202e73d77fc339977aeeca714b9667196509d53e2d2e4f5 /root/jian/test/oc-mirror-workspace/src/v2/openshift4/ose-cluster-kube-descheduler-operator/manifests/latest.download: file exists error: unable to push manifest to file://brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator:latest: symlink sha256:6a1de43c60d021921973e81c702e163a49300254dc3b612fd62ed2753efe4f06 /root/jian/test/oc-mirror-workspace/src/v2/openshift4/ose-cluster-kube-descheduler-operator/manifests/latest.download: file exists info: Mirroring completed in 22.48s (125.8MB/s) error: one or more errors occurred while uploading images
Version-Release number of selected component (if applicable):
[root@ip-172-31-249-209 jian]# oc-mirror version Client Version: version.Info{Major:"0", Minor:"1", GitVersion:"v0.1.0", GitCommit:"6ead1890b7a21b6586b9d8253b6daf963717d6c3", GitTreeState:"clean", BuildDate:"2022-08-25T05:27:39Z", GoVersion:"go1.17.12", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
always
Steps to Reproduce:
1. use the below config: [cloud-user@preserve-olm-env2 mirror-tmp]$ cat mirror.yaml apiVersion: mirror.openshift.io/v1alpha1 kind: ImageSetConfiguration # archiveSize: 4 mirror: additionalImages: - name: brew.registry.redhat.io/rh-osbs/openshift-ose-cluster-kube-descheduler-operator-bundle@sha256:46a62d73aeebfb72ccc1743fc296b74bf2d1f80ec9ff9771e655b8aa9874c933 - name: brew.registry.redhat.io/rh-osbs/openshift-ose-cluster-kube-descheduler-operator-bundle@sha256:9e549c09edc1793bef26f2513e72e589ce8f63a73e1f60051e8a0ae3d278f394 - name: brew.registry.redhat.io/rh-osbs/openshift-ose-cluster-kube-descheduler-operator-bundle@sha256:c16891ee9afeb3fcc61af8b2802e56605fff86a505e62c64717c43ed116fd65e - name: brew.registry.redhat.io/rh-osbs/openshift-ose-cluster-kube-descheduler-operator-bundle@sha256:5c37bd168645f3d162cb530c08f4c9610919d4dada2f22108a24ecdea4911d60 - name: brew.registry.redhat.io/rh-osbs/openshift-ose-cluster-kube-descheduler-operator-bundle@sha256:89a6abbf10908e9805d8946ad78b98a13a865cefd185d622df02a8f31900c4c1 - name: brew.registry.redhat.io/rh-osbs/openshift-ose-cluster-kube-descheduler-operator-bundle@sha256:de5b339478e8e1fc3bfd6d0b6784d91f0d3fbe0a133354be9e9d65f3d7906c2d - name: brew.registry.redhat.io/rh-osbs/openshift-ose-cluster-kube-descheduler-operator-bundle@sha256:fdf774c4365bde48d575913d63ef3db00c9b4dda5c89204029b0840e6dc410b1 - name: brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator@sha256:d90aecc425e1b2e0732d0a90bc84eb49eb1139e4d4fd8385070d00081c80b71c - name: brew.registry.redhat.io/openshift4/ose-descheduler@sha256:15cc75164335fa178c80db4212d11e4a793f53d2b110c03514ce4c79a3717ca0 - name: brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator@sha256:9e66db3a282ee442e71246787eb24c218286eeade7bce4d1149b72288d3878ad - name: brew.registry.redhat.io/openshift4/ose-descheduler@sha256:546b14c1f3fb02b1a41ca9675ac57033f2b01988b8c65ef3605bcc7d2645be60 - name: brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator@sha256:12d7061012fd823b57d7af866a06bb0b1e6c69ec8d45c934e238aebe3d4b68a5 - name: brew.registry.redhat.io/openshift4/ose-descheduler@sha256:41025e3e3b72f94a3290532bdd6cabace7323c3086a9ce434774162b4b1dd601 - name: brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator@sha256:672b4bee759f8115e5538a44c37c415b362fc24b02b0117fd4bdcc129c53e0a1 - name: brew.registry.redhat.io/openshift4/ose-descheduler@sha256:92542b22911fbd141fadc53c9737ddc5e630726b9b53c477f4dfe71b9767961f - name: brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator@sha256:f6b6a15c4477615ff202e73d77fc339977aeeca714b9667196509d53e2d2e4f5 - name: brew.registry.redhat.io/openshift4/ose-descheduler@sha256:1feb7073dec9341cadcc892df39ae45c427647fb034cf09dce1b7aa120bbb459 - name: brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator@sha256:7ca05f93351959c0be07ec3af84ffe6bb5e1acea524df210b83dd0945372d432 - name: brew.registry.redhat.io/openshift4/ose-descheduler@sha256:c0fe8830f8fdcbe8e6d69b90f106d11086c67248fa484a013d410266327a4aed - name: brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator@sha256:6a1de43c60d021921973e81c702e163a49300254dc3b612fd62ed2753efe4f06 - name: brew.registry.redhat.io/openshift4/ose-descheduler@sha256:b386d0e1c9e12e9a3a07aa101257c6735075b8345a2530d60cf96ff970d3d21a 2. Run the $ oc-mirror --config mirror.yaml file:///root/jian/test/
Actual results:
error: unable to push manifest to file://brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator:latest: symlink sha256:f6b6a15c4477615ff202e73d77fc339977aeeca714b9667196509d53e2d2e4f5 /root/jian/test/oc-mirror-workspace/src/v2/openshift4/ose-cluster-kube-descheduler-operator/manifests/latest.download: file exists error: unable to push manifest to file://brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator:latest: symlink sha256:6a1de43c60d021921973e81c702e163a49300254dc3b612fd62ed2753efe4f06 /root/jian/test/oc-mirror-workspace/src/v2/openshift4/ose-cluster-kube-descheduler-operator/manifests/latest.download: file exists
Expected results:
No error
Additional info:
CI is flaky because of test failures such as the following:
[sig-auth][Feature:SCC][Early] should not have pod creation failures during install [Suite:openshift/conformance/parallel] Run #0: Failed { fail [github.com/openshift/origin/test/extended/authorization/scc.go:69]: 1 pods failed before test on SCC errors Error creating: pods "azure-file-csi-driver-node-" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, provider restricted-v2: .spec.securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[2]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[3]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[4]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[5]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[6]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[7]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[9]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[10]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.initContainers[0].securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, spec.initContainers[0].securityContext.containers[0].hostPort: Invalid value: 10302: Host ports are not allowed to be used, spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed, spec.containers[0].securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, spec.containers[0].securityContext.containers[0].hostPort: Invalid value: 10302: Host ports are not allowed to be used, spec.containers[1].securityContext.privileged: Invalid value: true: Privileged containers are not allowed, spec.containers[1].securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, spec.containers[1].securityContext.containers[0].hostPort: Invalid value: 10302: Host ports are not allowed to be used, spec.containers[2].securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, spec.containers[2].securityContext.containers[0].hostPort: Invalid value: 10302: Host ports are not allowed to be used, provider "restricted": Forbidden: not usable by user or serviceaccount, provider "nonroot-v2": Forbidden: not usable by user or serviceaccount, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork-v2": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount] for DaemonSet.apps/v1/azure-file-csi-driver-node -n openshift-cluster-csi-drivers happened 12 times Ginkgo exit error 1: exit with code 1} Run #1: Failed { fail [github.com/openshift/origin/test/extended/authorization/scc.go:69]: 1 pods failed before test on SCC errors Error 
creating: pods "azure-file-csi-driver-node-" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, provider restricted-v2: .spec.securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[2]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[3]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[4]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[5]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[6]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[7]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[9]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[10]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.initContainers[0].securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, spec.initContainers[0].securityContext.containers[0].hostPort: Invalid value: 10302: Host ports are not allowed to be used, spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed, spec.containers[0].securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, spec.containers[0].securityContext.containers[0].hostPort: Invalid value: 10302: Host ports are not allowed to be used, spec.containers[1].securityContext.privileged: Invalid value: true: Privileged containers are not allowed, spec.containers[1].securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, spec.containers[1].securityContext.containers[0].hostPort: Invalid value: 10302: Host ports are not allowed to be used, spec.containers[2].securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, spec.containers[2].securityContext.containers[0].hostPort: Invalid value: 10302: Host ports are not allowed to be used, provider "restricted": Forbidden: not usable by user or serviceaccount, provider "nonroot-v2": Forbidden: not usable by user or serviceaccount, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork-v2": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount] for DaemonSet.apps/v1/azure-file-csi-driver-node -n openshift-cluster-csi-drivers happened 12 times Ginkgo exit error 1: exit with code 1}
This particular failure comes from https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-ingress-operator/901/pull-ci-openshift-cluster-ingress-operator-master-e2e-azure-ovn/1638557668689842176. Search.ci has additional similar errors.
I have seen these failures in 4.14 CI jobs.
Presently, search.ci shows the following stats for the past two days:
Found in 0.00% of runs (0.01% of failures) across 131399 total runs and 7623 jobs (19.50% failed) in 1.01s
1. Post a PR and have bad luck.
2. Check search.ci: https://search.ci.openshift.org/?search=pods+%22azure-file-csi-driver-%28controller%7Cnode%29-%22+is+forbidden&maxAge=168h&context=1&type=bug%2Bissue%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
CI fails.
CI passes, or fails on some other test failure, and the failures don't show up in search.ci.
Description of problem:
With a new S3 bucket, the HostedCluster failed with the condition: - lastTransitionTime: “2023-04-13T14:17:11Z” message: ‘failed to upload /.well-known/openid-configuration to the heli-hypershift-demo-oidc-2 s3 bucket: aws returned an error: AccessControlListNotSupported’ observedGeneration: 3 reason: OIDCConfigurationInvalid status: “False” type: ValidOIDCConfiguration
Version-Release number of selected component (if applicable):
How reproducible:
1 create s3 bucket $ aws s3api create-bucket --create-bucket-configuration LocationConstraint=us-east-2 --region=us-east-2 --bucket heli-hypershift-demo-oidc-2 { "Location": "http://heli-hypershift-demo-oidc-2.s3.amazonaws.com/" } [cloud-user@heli-rhel-8 ~]$ aws s3api delete-public-access-block --bucket heli-hypershift-demo-oidc-2 2 install HO and create a hc on aws us-west-2 3. hc failed with condition: - lastTransitionTime: “2023-04-13T14:17:11Z” message: ‘failed to upload /.well-known/openid-configuration to the heli-hypershift-demo-oidc-2 s3 bucket: aws returned an error: AccessControlListNotSupported’ observedGeneration: 3 reason: OIDCConfigurationInvalid status: “False” type: ValidOIDCConfiguration
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
create a hc successfully
Additional info:
The dns operator appears to have begun frequently spamming kube Events in some serial jobs across multiple clouds (especially GCP and Azure; AWS is less common, but there are some failures with the same signature).
The pathological events test fails, and it appears this started on May 5th. See the Pass Rate By NURP+ Combination panel for where this is most common.
As of the date of filing, pass rates are:
56% - gcp, amd64, sdn, ha, serial, techpreview
57% - gcp, amd64, sdn, ha, serial
60% - azure, amd64, ovn, ha, serial
60% - azure, amd64, ovn, ha, serial, techpreview
The events seem to consistently appear as follows on all clouds:
ns/openshift-dns service/dns-default hmsg/ade328ddf3 - pathological/true reason/TopologyAwareHintsDisabled Unable to allocate minimum required endpoints to each zone without exceeding overload threshold (5 endpoints, 3 zones), addressType: IPv4 From: 08:58:41Z To: 08:58:42Z
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-e2e-azure-sdn-techpreview-serial/1656207924667617280 (intervals)
The Intervals item under "Debug Tools" is a great way to see these charted in time, see the "interesting events" section.
test=[sig-arch] events should not repeat pathologically for namespace openshift-dns
Description of problem:
Not able to provision a new baremetalhost because ironic is not able to find a suitable virtual media device.
Version-Release number of selected component (if applicable):
How reproducible:
100% if you have a UCS Blade
Steps to Reproduce:
1. add the baremetalhost 2. wait for the error 3.
Actual results:
No suitable virtual media device found.
Expected results:
That the provisioning would succeed
Additional info:
I tried to insert an ISO using curl and I can do it on the virtualmedia[3] device, which is a virtual DVD. When I look at the metal3-ironic logs I can see the following entry: Received representation of VirtualMedia /redfish/v1/Managers/CIMC/VirtualMedia/3: {'_actions': {'eject_media': {'operation_apply_time_support': None, 'target_uri': '/redfish/v1/Managers/CIMC/VirtualMedia/3/Actions/VirtualMedia.EjectMedia'}, 'insert_media': {'operation_apply_time_support': None, 'target_uri': '/redfish/v1/Managers/CIMC/VirtualMedia/3/Actions/VirtualMedia.InsertMedia'}}, '_certificates_path': None, '_oem_vendors': ['Cisco'], 'connected_via': <ConnectedVia.URI: 'URI'>, 'identity': '3', 'image': None, 'image_name': None, 'inserted': False, 'links': None, 'media_types': [<VirtualMediaType.DVD: 'DVD'>], 'name': 'CIMC-Mapped vDVD', 'status': {'health': <Health.OK: 'OK'>, 'health_rollup': None, 'state': <State.DISABLED: 'Disabled'>}, 'transfer_method': None, 'user_name': None, 'verify_certificate': None, 'write_protected': False} I'm sure this is the correct device, and verified that I can insert vmedia using curl. Somehow metal3/ironic is not selecting this device. I suspect the reason is that "DVD" is not a valid media_type. When I look at [the ironic code](https://github.com/openstack/ironic/blob/b4f8209b99af32d8d2a646591af9b62436aad3d8/ironic/drivers/modules/redfish/boot.py#LL188C31-L188C31) I can see that there is a check for the media_type. I'm not able to see which values are accepted by metal3. I was able to validate the media_types for a rackmount server which works, and there I see the following values: "CD, DVD". This led me to believe that DVD is not an accepted value. Can you please confirm that this is the case and, if so, can we add DVD as a suitable device?
Description of problem:
The customer is facing console slowness when loading a workloads page with 300+ workloads.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Steps to Reproduce:
1. Log in to the OCP console 2. Workloads -> Projects -> Project -> Deployment Configs (300+) 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/openshift-state-metrics/pull/97
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
oc should not append the -x86_64 suffix when mirroring multi-arch payloads
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1.oc adm release mirror quay.io/openshift-release-dev/ocp-release:4.12.13-multi --keep-manifest-list=true --to=someregistry.io/somewhere/release 2. 3.
Actual results:
05-31 04:54:15.807 sha256:cd8639e34840833dd98d8323f1999b00ca06c73d7ae9ad8945f7b397450821ee -> 4.14.0-0.nightly-multi-2023-05-30-024840-x86_64-insights-operator 05-31 04:54:15.807 sha256:d0443f26968a2159e8b9590b33c428b6af7c0220ab6cc13633254d8843818cdf -> 4.14.0-0.nightly-multi-2023-05-30-024840-x86_64-keepalived-ipfailover 05-31 04:54:15.807 sha256:d2126187264d04f812068c03b59316547f043f97e90ec1a605ac24ab008c85a0 -> 4.14.0-0.nightly-multi-2023-05-30-024840-x86_64-agent-installer-orchestrator 05-31 04:54:15.807 sha256:d445a4ece53f0695f1b812920e4bbb8a73ceef582918a0f376c2c5950a3e050b -> 4.14.0-0.nightly-multi-2023-05-30-024840-x86_64-ovn-kubernetes 05-31 04:54:15.807 sha256:d4bfe3bac81d5bb758efced8706a400a4b1dad7feb2c9a9933257fde9f405866 -> 4.14.0-0.nightly-multi-2023-05-30-024840-x86_64-csi-snapshot-controller 05-31 04:54:15.807 sha256:d50c009e4b47bb6d93125c08c19c13bf7fd09ada197b5e0232549af558b25d19 -> 4.14.0-0.nightly-multi-2023-05-30-024840-x86_64-vsphere-csi-driver-operator 05-31 04:54:15.807 sha256:d844ecbbba99e64988f4d57de9d958172264e88b9c3bfc7b43e5ee19a1a2914e -> 4.14.0-0.nightly-multi-2023-05-30-024840-x86_64-ironic 05-31 04:54:15.807 sha256:d90b37357d4c2c0182787f6842f89f56aaebeab38a139c62f4a727126e036578 -> 4.14.0-0.nightly-multi-2023-05-30-024840-x86_64-baremetal-machine-controllers 05-31 04:54:15.807 sha256:d928536d8d9c4d4d078734004cc9713946da288b917f1953a8e7b1f2a8428a64 -> 4.14.0-0.nightly-multi-2023-05-30-024840-x86_64-azure-cloud-controller-manager 05-31 04:54:15.807 sha256:da049d5a453eeb7b453e870a0c52f70df046f2df149bca624248480ef83f2ac8 -> 4.14.0-0.nightly-multi-2023-05-30-024840-x86_64-cli-artifacts 05-31 04:54:15.807 sha256:db1cf013e3f845be74553eecc9245cc80106b8c70496bbbc0d63b497dcbb6556 -> 4.14.0-0.nightly-multi-2023-05-30-024840-x86_64-cluster-capi-controllers 05-31 04:54:15.807 sha256:dc7b1305c7fec48d29adc4d8b3318d3b1d1d12495fb2d0ddd49a33e3b6aed0cc -> 4.14.0-0.nightly-multi-2023-05-30-024840-x86_64-gcp-pd-csi-driver 05-31 04:54:15.807 sha256:de8753eb8b2ccec3474016cd5888d03eeeca7e0f23a171d85b4f9d76d91685a3 -> 4.14.0-0.nightly-multi-2023-05-30-024840-x86_64-baremetal-installer
Expected results:
No -x86_64 suffix added to the image tags
Additional info:
Description of problem:
Navigation:
Workloads -> Deployments -> Edit update strategy
'greater than pod' is in English
Version-Release number of selected component (if applicable):
4.11.0-0.nightly-2022-06-23-044003
How reproducible:
Always
Steps to Reproduce:
1.
2.
3.
Actual results:
Translation missing
Expected results:
Translation should appear
Additional info:
Description of the problem:
BE 2.16: the base domain allows a 1-character string. This results in a cluster address like clustername.r, but on the Networking page I get "DNS wildcard not configured".
How reproducible:
100%
Steps to reproduce:
1. Create a cluster with a 1-character string as the base domain (e.g. "c")
2. Move to the Networking page
3. Set all needed info (API + Ingress VIPs). The validation error "DNS wildcard not configured" is shown
Actual results:
Expected results:
Description of problem:
The 'KnativeServing' global configuration is missing after the user has successfully installed the 'Serverless' Operator
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-06-13-223353
How reproducible:
Always
Steps to Reproduce:
1. Install the 'Serverless' Operator, make sure the operator has been installed successfully, and that the Knative Serving instance is created without any error 2. Navigate to Administration -> Cluster Settings -> Global Configuration 3. Check if KnativeServing is listed in the Cluster Settings page
Actual results:
KnativeServing is missing
Expected results:
KnativeServing should be listed in the Global Configuration page
Additional info:
Description of problem:
When using --oci-registries-config, oc-mirror panics
Version-Release number of selected component (if applicable):
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.14.0-202308091944.p0.gdba4a0c.assembly.stream-dba4a0c", GitCommit:"dba4a0cfd0a9fd29c1e4b5bc1da737e1153cc679", GitTreeState:"clean", BuildDate:"2023-08-10T00:13:31Z", GoVersion:"go1.20.5 X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
always
Steps to Reproduce:
1. mirror to localhost : cat config.yaml apiVersion: mirror.openshift.io/v1alpha2 kind: ImageSetConfiguration mirror: operators: - catalog: oci:///home1/oci-414 packages: - name: cluster-logging oc-mirror --config config.yaml docker://localhost:5000 --dest-use-http 2. use oci-registries-config `oc-mirror --config config.yaml docker://localhost:5000 --dest-use-http --oci-registries-config /home1/registry.conf`
Actual results:
2. The oc-mirror will panic : oc-mirror --config config.yaml docker://ec2-18-117-165-30.us-east-2.compute.amazonaws.com:5000 --dest-use-http --oci-registries-config /home1/registry.conf Logging to .oc-mirror.log Checking push permissions for ec2-18-117-165-30.us-east-2.compute.amazonaws.com:5000 Found: oc-mirror-workspace/src/publish Found: oc-mirror-workspace/src/v2 Found: oc-mirror-workspace/src/charts Found: oc-mirror-workspace/src/release-signatures backend is not configured in config.yaml, using stateless mode backend is not configured in config.yaml, using stateless mode No metadata detected, creating new workspace panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x38 pc=0x2e8a774] goroutine 43 [running]: github.com/containers/image/v5/docker.(*dockerImageSource).Close(0x3?) /go/src/github.com/openshift/oc-mirror/vendor/github.com/containers/image/v5/docker/docker_image_src.go:170 +0x14 github.com/openshift/oc-mirror/pkg/cli/mirror.findFirstAvailableMirror.func1() /go/src/github.com/openshift/oc-mirror/pkg/cli/mirror/fbc_operators.go:449 +0x42 github.com/openshift/oc-mirror/pkg/cli/mirror.findFirstAvailableMirror({0x4c67b38, 0xc0004ca230}, {0xc00ad56000, 0x1, 0x40d19c0?}, {0xc00077e000, 0x94}, {0xc00ac0f6b0, 0x24}, {0x0, ...}) /go/src/github.com/openshift/oc-mirror/pkg/cli/mirror/fbc_operators.go:467 +0x6df github.com/openshift/oc-mirror/pkg/cli/mirror.(*MirrorOptions).addRelatedImageToMapping(0xc0001c0f00, {0x4c67b38, 0xc0004ca230}, 0xc00ac13480?, {{0xc0074a14e8?, 0x18?}, {0xc0076563f0?, 0x8b?}}, {0xc000c5b580, 0x36}) /go/src/github.com/openshift/oc-mirror/pkg/cli/mirror/fbc_operators.go:154 +0x3c5 github.com/openshift/oc-mirror/pkg/cli/mirror.(*OperatorOptions).plan.func3() /go/src/github.com/openshift/oc-mirror/pkg/cli/mirror/operator.go:570 +0x52 golang.org/x/sync/errgroup.(*Group).Go.func1() /go/src/github.com/openshift/oc-mirror/vendor/golang.org/x/sync/errgroup/errgroup.go:75 +0x64 created by golang.org/x/sync/errgroup.(*Group).Go /go/src/github.com/openshift/oc-mirror/vendor/golang.org/x/sync/errgroup/errgroup.go:72 +0xa5
Expected results:
Should not panic
Additional info:
Description of problem:
The fix for https://issues.redhat.com/browse/OCPBUGS-15947 seems to have introduced a problem in our keepalived-monitor logic. What I'm seeing is that at some point all of the apiservers became unavailable, which caused haproxy-monitor to drop the redirect firewall rule since it wasn't able to reach the API and we normally want to fall back to direct, un-loadbalanced API connectivity in that case.
However, due to the fix linked above we now short-circuit the keepalived-monitor update loop if we're unable to retrieve the node list, which is what will happen if the node holding the VIP has neither a local apiserver nor the HAProxy firewall rule. Because of this we will also skip updating the status of the firewall rule and thus the keepalived priority for the node won't be dropped appropriately.
Version-Release number of selected component (if applicable):
We backported the fix linked above to 4.11 so I expect this goes back at least that far.
How reproducible:
Unsure. It's clearly not happening every time, but I have a local dev cluster in this state so it can happen.
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
I think the solution here is just to move the firewall rule check earlier in the update loop so it will have run before we try to retrieve nodes. There's no dependency on the ordering of those two steps so I don't foresee any major issues. To workaround this I believe we can just bounce keepalived on the affected node until the VIP ends up on the node with a local apiserver.
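Purely illustrative Go pseudocode of the reordering described above; none of these function names exist in the real keepalived-monitor, it only shows recording the HAProxy firewall rule state before the node-list call that can short-circuit the loop:

package main

import "log"

// updateOnce sketches one iteration of a monitor loop in which the firewall
// rule state is recorded before anything that may fail and short-circuit.
func updateOnce(checkFirewallRule func() bool, listNodes func() ([]string, error), apply func(bool, []string)) {
    // 1. Check the redirect firewall rule first so its state (and therefore
    //    the keepalived priority) is always refreshed.
    ruleActive := checkFirewallRule()

    // 2. Only then try to list nodes; if this fails we still acted on the
    //    firewall state instead of skipping the whole iteration.
    nodes, err := listNodes()
    if err != nil {
        log.Printf("could not list nodes, applying firewall state only: %v", err)
        apply(ruleActive, nil)
        return
    }
    apply(ruleActive, nodes)
}

func main() {
    updateOnce(
        func() bool { return false },
        func() ([]string, error) { return []string{"master-0"}, nil },
        func(rule bool, nodes []string) { log.Printf("rule=%v nodes=%v", rule, nodes) },
    )
}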
Please review the following PR: https://github.com/openshift/kube-state-metrics/pull/94
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
mTLS connections do not work when using an intermediate CA apart from the root CA, both with a CRL defined.
The Intermediate CA Cert had a published CDP which directed to a CRL issued by the root CA.
The config map in the openshift-ingress namespace contains the CRL as issued by the root CA. The CRL issued by the Intermediate CA is not present since that CDP is in the user cert and so not in the bundle.
When attempting to connect using a user certificate issued by the Intermediate CA it fails with an error of unknown CA.
When attempting to connect using a user certificate issued by the Root CA, the connection is successful.
Version-Release number of selected component (if applicable):
4.10.24
How reproducible:
Always
Steps to Reproduce:
1. Configure CA and intermediate CA with CRL
2. Sign client certificate with the intermediate CA
3. Configure mtls in openshift-ingress
Actual results:
When attempting to connect using a user certificate issued by the Intermediate CA it fails with an error of unknown CA.
When attempting to connect using a user certificate issued by the Root CA, the connection is successful.
Expected results:
Be able to connect with client certificates signed by the intermediate CA
Additional info:
This is a clone of issue OCPBUGS-13034. The following is the description of the original issue:
—
Description of problem:
The cluster-api pod can't create events due to RBAC restrictions. We may miss useful events because of this.
E0503 07:20:44.925786 1 event.go:267] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"ad1-workers-f5f568855-vnzmn.175b911e43aa3f41", GenerateName:"", Namespace:"ocm-integration-23frm3gtnh3cf212daoe1a13su7buqk4-ad1", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Machine", Namespace:"ocm-integration-23frm3gtnh3cf212daoe1a13su7buqk4-ad1", Name:"ad1-workers-f5f568855-vnzmn", UID:"2b40a694-d36d-4b13-9afc-0b5daeecc509", APIVersion:"cluster.x-k8s.io/v1beta1", ResourceVersion:"144260357", FieldPath:""}, Reason:"DetectedUnhealthy", Message:"Machine ocm-integration-23frm3gtnh3cf212daoe1a13su7buqk4-ad1/ad1-workers/ad1-workers-f5f568855-vnzmn/ has unhealthy node ", Source:v1.EventSource{Component:"machinehealthcheck-controller", Host:""}, FirstTimestamp:time.Date(2023, time.May, 3, 7, 20, 44, 923289409, time.Local), LastTimestamp:time.Date(2023, time.May, 3, 7, 20, 44, 923289409, time.Local), Count:1, Type:"Normal", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'events is forbidden: User "system:serviceaccount:ocm-integration-23frm3gtnh3cf212daoe1a13su7buqk4-ad1:cluster-api" cannot create resource "events" in API group "" in the namespace "ocm-integration-23frm3gtnh3cf212daoe1a13su7buqk4-ad1"' (will not retry!)
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Always
Steps to Reproduce:
1. Create a hosted cluster
2. Check the cluster-api pod for some kind of error (e.g. slow node startup)
Actual results:
Error
Expected results:
Event generated
Additional info:
ClusterRole hypershift-cluster-api is created here https://github.com/openshift/hypershift/blob/e7eb32f259b2a01e5bbdddf2fe963b82b331180f/hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go#L2720
We should add create/patch/update verbs for events there.
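A sketch of the rule that would need to be appended to that ClusterRole, expressed with the Kubernetes RBAC API types (rbacv1 is "k8s.io/api/rbac/v1"; exactly where this lands in the hostedcluster controller is an assumption):

// Illustrative only: the events rule implied by the error above.
eventsRule := rbacv1.PolicyRule{
	APIGroups: []string{""}, // core API group
	Resources: []string{"events"},
	Verbs:     []string{"create", "patch", "update"},
}
role.Rules = append(role.Rules, eventsRule) // role is the hypershift-cluster-api ClusterRole being built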
Description of problem:
MetalLB does not work when traffic comes from a secondary nic. The root cause of this failure is net.ipv4.ip_forward flag change from 1 to 0. If we re-enable this flag everything works as expected.
Version-Release number of selected component (if applicable):
Server Version: 4.14.0-0.nightly-2023-07-05-191022
How reproducible:
Run any test case that tests metallb via secondary interface.
Steps to Reproduce:
1. 2. 3.
Actual results:
Test failed
Expected results:
Test Passed
Additional info:
Looks like this PR is the root cause: https://github.com/openshift/machine-config-operator/pull/3676/files#
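As an interim check or workaround on an affected node, the forwarding flag can be flipped back on directly; a minimal sketch (the real fix belongs in the machine-config change referenced above, this only illustrates the sysctl being re-enabled):

// Re-enable IPv4 forwarding by writing to the proc interface.
package main

import (
	"log"
	"os"
)

func main() {
	if err := os.WriteFile("/proc/sys/net/ipv4/ip_forward", []byte("1\n"), 0o644); err != nil {
		log.Fatalf("re-enabling net.ipv4.ip_forward: %v", err)
	}
}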
Description of problem:
when applying a CSV with the current label recommendation for STS, the following error occurs: error creating csv ack-s3-controller.v1.0.3: ClusterServiceVersion.operators.coreos.com "ack-s3-controller.v1.0.3" is invalid: metadata.annotations: Invalid value: "operators.openshift.io/infrastructure-features/token-auth/aws": a qualified name must consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyName', or 'my.name', or '123-abc', regex used for validation is '([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]') with an optional DNS subdomain prefix and '/' (e.g. 'example.com/MyName')
Version-Release number of selected component (if applicable):
4.14
How reproducible:
always
Steps to Reproduce:
1. Create a CSV with the annotation "operators.openshift.io/infrastructure-features/token-auth/aws: `false`"
2. Apply the CSV on the cluster
Actual results:
fails with the above error
Expected results:
should not fail
Additional info:
Description of problem:
The vsphereStorageDriver validation error message here is odd when I change LegacyDeprecatedInTreeDriver to "". I get:
Invalid value: "string": VSphereStorageDriver can not be changed once it is set to CSIWithMigrationDriver
There is no CSIWithMigrationDriver either in the old or new Storage CR.
Version-Release number of selected component (if applicable):
4.13 with this PR: https://github.com/openshift/api/pull/1433
Description of problem:
We have presubmit and periodic jobs failing on:

[sig-arch] events should not repeat pathologically for namespace openshift-monitoring
{ 2 events happened too frequently
event happened 21 times, something is wrong: ns/openshift-monitoring statefulset/prometheus-k8s hmsg/6f9bc9e1d7 - pathological/true reason/RecreatingFailedPod StatefulSet openshift-monitoring/prometheus-k8s is recreating failed Pod prometheus-k8s-1 From: 16:11:36Z To: 16:11:37Z result=reject
event happened 22 times, something is wrong: ns/openshift-monitoring statefulset/prometheus-k8s hmsg/ecfdd1d225 - pathological/true reason/SuccessfulDelete delete Pod prometheus-k8s-1 in StatefulSet prometheus-k8s successful From: 16:11:36Z To: 16:11:37Z result=reject }

The failure occurs when the event happens over 20 times. The RecreatingFailedPod reason shows up in 4.14 and Presubmits and does not show up in 4.13.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Run presubmits or periodics; here are latest examples:

2023-05-24 06:25:52.551883+00 | https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-e2e-aws-sdn-serial/1661210557367193600 | {aws,amd64,sdn,ha,serial}
2023-05-24 10:20:54.91883+00 | https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-gcp-sdn-serial/1661267817128792064 | {gcp,amd64,sdn,ha,serial}
2023-05-24 14:17:18.849402+00 | https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/27899/pull-ci-openshift-origin-master-e2e-gcp-ovn-upgrade/1661321663389634560 | {gcp,amd64,ovn,upgrade,upgrade-micro,ha}
2023-05-24 14:17:51.908405+00 | https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_kubernetes/1583/pull-ci-openshift-kubernetes-master-e2e-azure-ovn-upgrade/1661324100011823104 | {azure,amd64,ovn,upgrade,upgrade-micro,ha}
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
That event/reason should not show up as a failure in the pathological test
Additional info:
This table shows which variants are affected on 4.14 and Presubmits:
variants                                         | test_count
-------------------------------------------------+------------
{aws,amd64,ovn,upgrade,upgrade-micro,ha}         | 63
{gcp,amd64,ovn,upgrade,upgrade-micro,ha}         | 14
{gcp,amd64,sdn,ha,serial,techpreview}            | 12
{azure,amd64,sdn,ha,serial,techpreview}          | 7
{aws,amd64,sdn,upgrade,upgrade-micro,ha}         | 6
{aws,amd64,ovn,ha}                               | 6
{vsphere-ipi,amd64,ovn,upgrade,upgrade-micro,ha} | 5
{aws,amd64,sdn,ha,serial}                        | 5
{azure,amd64,ovn,upgrade,upgrade-micro,ha}       | 5
{metal-ipi,amd64,ovn,upgrade,upgrade-micro,ha}   | 5
{vsphere-ipi,amd64,ovn,ha,serial}                | 4
{gcp,amd64,sdn,ha,serial}                        | 3
{aws,amd64,ovn,single-node}                      | 3
{metal-ipi,amd64,ovn,ha,serial}                  | 2
{aws,amd64,ovn,ha,serial}                        | 2
{aws,amd64,upgrade,upgrade-micro,ha}             | 1
{aws,arm64,sdn,ha,serial}                        | 1
{aws,arm64,ovn,ha,serial,techpreview}            | 1
{vsphere-ipi,amd64,ovn,ha,serial,techpreview}    | 1
{aws,amd64,sdn,ha,serial,techpreview}            | 1
{libvirt,ppc64le,ovn,ha,serial}                  | 1
{amd64,upgrade,upgrade-micro,ha}                 | 1
Just for my record, I'm using this query to check 4.14 and Presubmits:
SELECT rt.created_at, url, variants
FROM prow_jobs pj
JOIN prow_job_runs r ON r.prow_job_id = pj.id
JOIN prow_job_run_tests rt ON rt.prow_job_run_id = r.id
JOIN prow_job_run_test_outputs o ON o.prow_job_run_test_id = rt.id
JOIN tests ON rt.test_id = tests.id
WHERE pj.release IN ('4.14', 'Presubmits')
  AND rt.status = 12
  AND tests.id = 65991
  AND o.output LIKE '%RecreatingFailedPod%'
ORDER BY rt.created_at, variants DESC;
And this query for checking 4.13:
SELECT rt.created_at, url, variants
FROM prow_jobs pj
JOIN prow_job_runs r ON r.prow_job_id = pj.id
JOIN prow_job_run_tests rt ON rt.prow_job_run_id = r.id
JOIN prow_job_run_test_outputs o ON o.prow_job_run_test_id = rt.id
JOIN tests ON rt.test_id = tests.id
WHERE pj.release IN ('4.13')
  AND rt.status = 12
  AND tests.id IN (65991, 244, 245)
  AND o.output LIKE '%RecreatingFailedPod%'
ORDER BY rt.created_at, variants DESC;
This shows jobs beginning on 4/13 to today.
Description of problem:
when viewing servicemonitor schema in YAML sidebar, for many fields whose type is Object, console doesn't have a 'View details' button to show more details
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-06-12-044657
How reproducible:
Always
Steps to Reproduce:
1. Go to any ServiceMonitor YAML page, open the Schema by clicking 'View sidebar', then click 'View details' of 'spec' -> 'View details' of 'endpoints'
2. Check the object and array type schema fields:
spec.endpoints.authorization
spec.endpoints.basicAuth
spec.endpoints.bearerTokenSecret
spec.endpoints.oauth2
spec.endpoints.params
spec.endpoints.tlsConfig
spec.endpoints.relabelings
Actual results:
2. There is no 'View details' button for these 'object' and 'array' type fields
Expected results:
2. We should provide a 'View details' link for 'object' and 'array' fields so that the user can view more details. For example:

$ oc explain servicemonitors.spec.endpoints.tlsConfig
KIND:     ServiceMonitor
VERSION:  monitoring.coreos.com/v1
RESOURCE: tlsConfig <Object>
DESCRIPTION:
     TLS configuration to use when scraping the endpoint
FIELDS:
   ca   <Object>
     Certificate authority used when verifying server certificates.
   caFile       <string>
     Path to the CA cert in the Prometheus container to use for the targets.
   cert <Object>
     Client certificate to present when doing client-authentication.
   certFile     <string>
     Path to the client cert file in the Prometheus container for the targets.
   insecureSkipVerify   <boolean>
     Disable target certificate validation.
   keyFile      <string>
     Path to the client key file in the Prometheus container for the targets.
   keySecret    <Object>
     Secret containing the client key file for the targets.
   serverName   <string>
     Used to verify the hostname for the targets.

$ oc explain servicemonitors.spec.endpoints.relabelings
KIND:     ServiceMonitor
VERSION:  monitoring.coreos.com/v1
RESOURCE: relabelings <[]Object>
DESCRIPTION:
     RelabelConfigs to apply to samples before scraping. Prometheus Operator automatically adds relabelings for a few standard Kubernetes fields. The original scrape job's name is available via the `__tmp_prometheus_job_name` label. More info: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#relabel_config
     RelabelConfig allows dynamic rewriting of the label set, being applied to samples before ingestion. It defines `<metric_relabel_configs>`-section of Prometheus configuration. More info: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#metric_relabel_configs
FIELDS:
   action       <string>
     Action to perform based on regex matching. Default is 'replace'. uppercase and lowercase actions require Prometheus >= 2.36.
   modulus      <integer>
     Modulus to take of the hash of the source label values.
   regex        <string>
     Regular expression against which the extracted value is matched. Default is '(.*)'
   replacement  <string>
     Replacement value against which a regex replace is performed if the regular expression matches. Regex capture groups are available. Default is '$1'
   separator    <string>
     Separator placed between concatenated source label values. default is ';'.
   sourceLabels <[]string>
     The source labels select values from existing labels. Their content is concatenated using the configured separator and matched against the configured regular expression for the replace, keep, and drop actions.
   targetLabel  <string>
     Label to which the resulting value is written in a replace action. It is mandatory for replace actions. Regex capture groups are available.
Additional info:
Description of problem:
`rprivate` default mount propagation in combination with `hostPath: path: /` breaks CSI driver relying on multipath
How reproducible:
Always
Steps to Reproduce (simplified):
1. ssh to the node
2. Mount a partition, for instance /dev/{s,v}da2, which on CoreOS is a UEFI FAT partition:
   $ sudo mount /dev/vda2 /mnt
3. Start a debug pod on that node (or any pod that does a hostPath mount of /, like the node tuning operand pod, the machine config operand, the file integrity operand):
   $ oc debug nodes/master-2.sharedocp4upi411ovn.lab.upshift.rdu2.redhat.com
4. Unmount the partition on the node
5. Notice the debug pod still has a reference to the filesystem:
   grep vda2 /proc/*/mountinfo
   /proc/3687945/mountinfo:11219 10837 252:2 / /host/var/mnt rw,relatime - vfat /dev/vda2 rw,fmask=0022,dmask=0022,codepage=437,iocharset=ascii,shortname=mixed,errors=remount-ro
6. On the node, although the mount is absent from /proc/mounts, the filesystem is still mounted, as shown by the dirty bit still being set on the FAT filesystem:
   sudo fsck -n /dev/vda2
   fsck from util-linux 2.32.1
   fsck.fat 4.1 (2017-01-24)
   0x25: Dirty bit is set. Fs was not properly unmounted and some data may be corrupt.
Expected results:
File system is unmounted in host and in container.
Additional info:
Although the steps above show the behaviour in a simple way, this becomes quite problematic when using multipath on a host mount.
We noticed in a customer environment that we cannot reschedule some pods from an old node to a new node using oc adm drain when these pods have a Persistent Volume mount created by the third-party CSI driver block.csi.ibm.com.
The CSI driver uses multipath from CoreOS to manage multipath block devices; however, the multipath daemon blocks the volume removal from the node (the multipath -f flushing calls from the CSI driver always return busy; flushing a multipath device means removing it from the device tree in /dev, in storage parlance).
The multipath flushes always fail because, although the multipath block device is unmounted on the host, the machine-config, file integrity, and node tuning pods do hostPath volume mounts of /, the host root filesystem, and thus get a copy of the mounts.
Due to that mount copy the kernel sees the filesystem as still in use, even though there are no file descriptors open on that filesystem, and considers it unsafe to remove the multipath block device. The node CSI driver therefore cannot finish the unmount of the volume, which blocks container creation on another node.
We can see this mount copies by looking at /proc/<container pid>/mountinfo:
$ grep mpathes proc/*/mountinfo
proc/3295781/mountinfo:56348 52693 253:42 / /var/lib/kubelet/plugins/kubernetes.io/csi/block.csi.ibm.com/12345/globalmount rw,relatime - xfs /dev/mapper/mpathes rw,seclabel,nouuid,attr2,inode64,logbufs=8,logbsize=32k,noquota
cri-o is doing this mount copy using `rprivate` mount propagation
( see https://github.com/cri-o/cri-o/blob/b098bec2d4d79bdf99c3ce89b0eeb16bfe8b5645/server/container_create_linux.go#L1030 )
The semantics of rprivate are mapped in `runc`
https://github.com/opencontainers/runc/blob/ba58ee9c3b9550c3e32b94802b0fb29761955290/libcontainer/specconv/spec_linux.go#L55
to mount flags passed to the mount(2) system call:
MS_REC (since Linux 2.4.11)
       Used in conjunction with MS_BIND to create a recursive bind mount, and in conjunction with the propagation type flags to recursively change the propagation type of all of the mounts in a subtree. See below for further details.
MS_PRIVATE
       Make this mount private. Mount and unmount events do not propagate into or out of this mount.
The key here is the MS_PRIVATE flag. The unmounting of the multipath block device is not propagated to the mount namespace of the containers, keeping the filesystem eternally mounted and preventing the flushing of the multipath device.
Maybe hostPath mounts should be done using `rslave` mount propagation when we detect an attempt to bind mount /var/lib?
It seems cri-dockerd does something similar, according to https://kubernetes.io/docs/concepts/storage/volumes/#mount-propagation
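For reference, at the pod-spec level the rslave behaviour corresponds to mountPropagation: HostToContainer on the volume mount. A minimal sketch in Go of what such a hostPath mount looks like (corev1 is "k8s.io/api/core/v1"; the volume name and path are just examples):

// Host unmount events propagate into the container with HostToContainer,
// which is the Kubernetes equivalent of rslave propagation.
prop := corev1.MountPropagationHostToContainer
hostVolumeMount := corev1.VolumeMount{
	Name:             "host",
	MountPath:        "/host",
	MountPropagation: &prop,
}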
We should be able to add a repository that supports basic auth.
Documentation Requirement: Yes/No (needs-docs|upstream-docs / no-doc)
Upstream: <Inputs/Requirement details>/ Not Applicable
Downstream: <Type: Doc defect/More inputs to doc>/ Not Applicable
Provide link to the relevant section
Provide doc inputs and details required
Release Notes Type: <New Feature/Enhancement/Known Issue/Bug
fix/Breaking change/Deprecated Functionality/Technology Preview>
LatencySensitive has been functionally equivalent to "" (Default) for several years. Code has forgotten that the featureset must be handled, and it's more efficacious to remove the featureset (with migration code) than to try to plug all the holes.
To ensure this is working, update a cluster to use LatencySensitive and see that the FeatureSet value is reset after two minutes.
Description of problem:
In 4.10 we added an option, REGISTRY_AUTH_PREFERENCE, to opt in to the podman registry auth file preference reading order. This is important for oc registry commands like oc registry login and oc image. https://github.com/openshift/oc/pull/893 We also started warning users that we will remove support for the docker order and default to the podman order, meaning we will check podman locations first and then fall back to docker locations.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
We should default to the podman auth file locations and remove the warning when using oc registry login or oc image commands without the REGISTRY_AUTH_PREFERENCE variable.
Additional info:
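A rough illustration of the intended podman-first lookup order (a sketch only; the paths are the documented podman and docker defaults, and the helper name is made up, not the actual oc code):

// Check the podman auth file location first, then fall back to docker's.
// Assumed imports: "os", "path/filepath"
func defaultAuthFile() string {
	candidates := []string{
		filepath.Join(os.Getenv("XDG_RUNTIME_DIR"), "containers", "auth.json"), // podman location, checked first
		filepath.Join(os.Getenv("HOME"), ".docker", "config.json"),             // docker location, fallback
	}
	for _, p := range candidates {
		if _, err := os.Stat(p); err == nil {
			return p
		}
	}
	return candidates[0] // default to the podman path when neither exists
}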
Description of problem:
During an operator installation with the Installation mode set to all namespaces, the "Installed Namespace" dropdown selection is restricted to "openshift-operators" or another specific namespace, if one is recommended by the operator owners.
With the recent* change to allow non-latest operator version installs, users should be allowed to select any namespace to install a globally installed operator.
Related info:
Operators can now be installed at non-latest versions with the merge of * https://github.com/openshift/console/pull/12743. They require a manual approval, and because of the way InstallPlan upgrades work, this affects all operators installed in that namespace.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-19411. The following is the description of the original issue:
—
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. oc -n openshift-machine-api get role/cluster-autoscaler-operator -o yaml
2. Observe the missing watch verb
3. Tail the cluster-autoscaler logs to see the error:
status.go:444] No ClusterAutoscaler. Reporting available.
I0919 16:40:52.877216 1 status.go:244] Operator status available: at version 4.14.0-rc.1
E0919 16:40:53.719592 1 reflector.go:148] github.com/openshift/client-go/config/informers/externalversions/factory.go:101: Failed to watch *v1.ClusterOperator: unknown (get clusteroperators.config.openshift.io)
Actual results:
Expected results:
Additional info:
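The reflector error above points at a missing "watch" verb on clusteroperators.config.openshift.io. A sketch of the kind of rule that would cover it, using the Kubernetes RBAC API types (rbacv1 is "k8s.io/api/rbac/v1"; where the role is actually defined for the operator is an assumption here):

// Illustrative only: the verbs the watch failure suggests are needed.
clusterOperatorRule := rbacv1.PolicyRule{
	APIGroups: []string{"config.openshift.io"},
	Resources: []string{"clusteroperators"},
	Verbs:     []string{"get", "list", "watch"},
}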
This is a clone of issue OCPBUGS-18439. The following is the description of the original issue:
—
Description of problem:
In the developer sandbox, the happy path to create operator-backed resources is broken. Users can only work in their assigned namespace. When doing so and attempting to create an Operator-backed resource from the Developer console, the user interface inadvertently switches the working namespace from the user's to the `openshift` one. The console shows an error message when the user clicks the "create" button.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Login to the Developer Sandbox
2. Choose the Developer view
3. Click Add+ -> Developer Catalog -> Operator Backed
4. Filter by "integration"
5. Notice the working namespace is still the user's one
6. Select "Integration" (Camel K operator)
7. Click "Create"
8. Notice the working namespace has switched to `openshift`
9. Notice the custom resource in YAML view includes `namespace: openshift`
10. Click "Create"
Actual results:
An error message shows: "Danger alert:An error occurredintegrations.camel.apache.org is forbidden: User "bmesegue" cannot create resource "integrations" in API group "camel.apache.org" in the namespace "openshift""
Expected results:
On step 8, the working namespace should remain the user's one.
On step 9, in the YAML view, the namespace should be the user's one, or none.
After step 10, the creation process should trigger the creation of a Camel K integration.
Additional info:
The code in our infrastructure test needs to be updated to make the test more accurate. Currently we are targeting gomock.Any() in many cases, which means that the tests are not as accurate as they could be.
Updates should be similar to MGMT-13918
Description of the problem:
In Staging, UI 2.18.6 - Enable DHCP and then switch to UMN --> BE response "User Managed Networking cannot be set with VIP DHCP Allocation"
How reproducible:
100%
Steps to reproduce:
1. In networking page - enable DHCP
2. Switch to UMN
3. BE response with "User Managed Networking cannot be set with VIP DHCP Allocation"
Actual results:
Expected results:
Description of problem:
Install the cert-manager operator version cert-manager-operator-bundle:v1.11.1-6 from the console; the version shown in the UI constantly flips back and forth between v1.11.1 and v1.10.2.
Version-Release number of selected component (if applicable):
cert-manager-operator-bundle:v1.11.1-6, 4.13.0-0.nightly-2023-05-18-195839
How reproducible:
Always. I tried a few times in different envs, double confirmed.
Steps to Reproduce:
1. Install the cert-manager operator version cert-manager-operator-bundle:v1.11.1-6 from the console
2. Watch the console
Actual results:
The version shown in the UI constantly flips back and forth between v1.11.1 and v1.10.2. See the attached video https://drive.google.com/drive/folders/1AFWquCK-pDCoQFMEOONQwGByBUg6tKR9?usp=sharing .
Expected results:
Should always show v1.11.1
Additional info:
Whether using the v4.13 index image brew.registry.redhat.io/rh-osbs/iib:500235 (taken from the email "[CVP] (SUCCESS) (cvp-redhatopenshiftcfe: cert-manager-operator-bundle-container-v1.11.1-6)") or brew.registry.redhat.io/rh-osbs/iib-pub-pending:v4.13, both reproduced it.
Description of problem: the per-node certificates should have a configurable duration
Description of problem:
When CNO is managed by Hypershift, its deployment has the "hypershift.openshift.io/release-image" template metadata annotation. The annotation's value is used to track the progress of cluster control plane version upgrades. But the multus-admission-controller created and managed by CNO does not have that annotation, so service providers are not able to track its version upgrades. The proposed solution is for CNO to propagate its "hypershift.openshift.io/release-image" annotation down to the multus-admission-controller deployment. For that, CNO needs "get" access to its own deployment manifest so it can read the deployment template metadata annotations. Hypershift needs a code change to assign CNO "get" permission on the CNO deployment object.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create OCP cluster using Hypershift
2. Check deployment template metadata annotations on multus-admission-controller
Actual results:
No "hypershift.openshift.io/release-image" deployment template metadata annotation exists
Expected results:
"hypershift.openshift.io/release-image" annotation must be present
Additional info:
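A hedged sketch of the propagation described above (the helper and variable names are illustrative, not the real CNO code; appsv1 is "k8s.io/api/apps/v1"):

// Copy the release-image annotation from the CNO deployment's pod template
// onto the multus-admission-controller deployment's pod template.
const releaseImageAnnotation = "hypershift.openshift.io/release-image"

func propagateReleaseImage(cno, multus *appsv1.Deployment) {
	val, ok := cno.Spec.Template.Annotations[releaseImageAnnotation]
	if !ok {
		return // nothing to propagate
	}
	if multus.Spec.Template.Annotations == nil {
		multus.Spec.Template.Annotations = map[string]string{}
	}
	multus.Spec.Template.Annotations[releaseImageAnnotation] = val
}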
Description of problem:
When setting no configuration for node-exporter in the CMO config, we did not see the 2 arguments collector.netclass.ignored-devices and collector.netdev.device-exclude in the node-exporter daemonset; full info: http://pastebin.test.redhat.com/1093428
Checked in 4.13.0-0.nightly-2023-02-27-101545 with no configuration for node-exporter, there is a collector.netclass.ignored-devices setting; see: http://pastebin.test.redhat.com/1093429
After disabling netdev/netclass on the bot cluster, we would see the collector.netclass.ignored-devices and collector.netdev.device-exclude settings in node-exporter. Since OCPBUGS-7282 is filed on 4.12, where disabling netdev/netclass is not supported, I don't think we should disable netdev/netclass.

$ oc -n openshift-monitoring get ds node-exporter -oyaml | grep collector
- --no-collector.wifi
- --collector.filesystem.mount-points-exclude=^/(dev|proc|sys|run/k3s/containerd/.+|var/lib/docker/.+|var/lib/kubelet/pods/.+)($|/)
- --collector.netclass.ignored-devices=^(veth.*|[a-f0-9]{15}|enP.*|ovn-k8s-mp[0-9]*|br-ex|br-int|br-ext|br[0-9]*|tun[0-9]*|cali[a-f0-9]*)$
- --collector.netdev.device-exclude=^(veth.*|[a-f0-9]{15}|enP.*|ovn-k8s-mp[0-9]*|br-ex|br-int|br-ext|br[0-9]*|tun[0-9]*|cali[a-f0-9]*)$
- --collector.cpu.info
- --collector.textfile.directory=/var/node_exporter/textfile
- --no-collector.cpufreq
- --no-collector.tcpstat
- --no-collector.netdev
- --no-collector.netclass
- --no-collector.buddyinfo
- '[[ ! -d /node_exporter/collectors/init ]] || find /node_exporter/collectors/init
Version-Release number of selected component (if applicable):
4.13
How reproducible:
Steps to Reproduce:
The 2 arguments are missing when booting up OCP with default configurations for CMO.
Actual results:
The 2 arguments collector.netclass.ignored-devices and collector.netdev.device-exclude are missing in node-exporter DaemonSet.
Expected results:
The 2 arguments collector.netclass.ignored-devices and collector.netdev.device-exclude are present in node-exporter DaemonSet.
Additional info:
Description of problem:
OpenShift Container Platform 4.12.5 installation with IPI installation method on Microsoft Azure is showing undesired behavior when trying to curl "https://api.<clustername>.<domain>:6443/readyz". When using `HostNetwork` it all works without any issues. But when doing the same request from a pod that does not have `HostNetwork` capabilties and therefore has an IP from the SDN range, a big portion of the requests is failing. $ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.12.5 True False 29m Cluster version is 4.12.5 $ oc get network cluster -o yaml apiVersion: config.openshift.io/v1 kind: Network metadata: creationTimestamp: "2023-03-10T13:12:06Z" generation: 2 name: cluster resourceVersion: "2975" uid: e1e9c464-526c-4ebf-ab84-0deedf092cac spec: clusterNetwork: - cidr: 10.128.0.0/14 hostPrefix: 23 externalIP: policy: {} networkType: OVNKubernetes serviceNetwork: - 172.30.0.0/16 status: clusterNetwork: - cidr: 10.128.0.0/14 hostPrefix: 23 clusterNetworkMTU: 1400 networkType: OVNKubernetes serviceNetwork: - 172.30.0.0/16 $ oc get infrastructure cluster -o yaml apiVersion: config.openshift.io/v1 kind: Infrastructure metadata: creationTimestamp: "2023-03-10T13:12:04Z" generation: 1 name: cluster resourceVersion: "430" uid: 5c260276-d901-40f7-a28c-172c492e81e6 spec: cloudConfig: key: config name: cloud-provider-config platformSpec: type: Azure status: apiServerInternalURI: https://api-int.clustername.domain.lab:6443 apiServerURL: https://api.clustername.domain.lab:6443 controlPlaneTopology: HighlyAvailable etcdDiscoveryDomain: "" infrastructureName: sreberazure-njj24 infrastructureTopology: HighlyAvailable platform: Azure platformStatus: azure: cloudName: AzurePublicCloud networkResourceGroupName: sreberazure-njj24-rg resourceGroupName: sreberazure-njj24-rg type: Azure $ oc project openshift-apiserver Already on project "openshift-apiserver" on server "https://api.clustername.domain.lab:6443". $ oc get pod NAME READY STATUS RESTARTS AGE apiserver-6f58784797-kq4kr 2/2 Running 0 41m apiserver-6f58784797-l69jr 2/2 Running 0 38m apiserver-6f58784797-nn6tn 2/2 Running 0 45m $ oc get pod -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES apiserver-6f58784797-kq4kr 2/2 Running 0 42m 10.130.0.21 sreberazure-njj24-master-0 <none> <none> apiserver-6f58784797-l69jr 2/2 Running 0 38m 10.129.0.29 sreberazure-njj24-master-2 <none> <none> apiserver-6f58784797-nn6tn 2/2 Running 0 45m 10.128.0.36 sreberazure-njj24-master-1 <none> <none> $ oc rsh apiserver-6f58784797-l69jr Defaulted container "openshift-apiserver" out of: openshift-apiserver, openshift-apiserver-check-endpoints, fix-audit-permissions (init) sh-4.4# while true; do curl -k --connect-timeout 1 https://api.clustername.domain.lab:6443/readyz; sleep 1; done curl: (28) Connection timed out after 1000 milliseconds okokokcurl: (28) Connection timed out after 1001 milliseconds okokcurl: (28) Connection timed out after 1003 milliseconds curl: (28) Connection timed out after 1001 milliseconds curl: (28) Connection timed out after 1001 milliseconds okokokokokokokokokcurl: (28) Connection timed out after 1001 milliseconds okokcurl: (28) Connection timed out after 1001 milliseconds curl: (28) Connection timed out after 1001 milliseconds ^C sh-4.4# exit exit command terminated with exit code 130 $ oc project openshift-kube-apiserver Now using project "openshift-kube-apiserver" on server "https://api.clustername.domain.lab:6443". 
$ oc get pod -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES apiserver-watcher-sreberazure-njj24-master-0 1/1 Running 0 55m 10.0.0.6 sreberazure-njj24-master-0 <none> <none> apiserver-watcher-sreberazure-njj24-master-1 1/1 Running 0 57m 10.0.0.8 sreberazure-njj24-master-1 <none> <none> apiserver-watcher-sreberazure-njj24-master-2 1/1 Running 0 57m 10.0.0.7 sreberazure-njj24-master-2 <none> <none> installer-2-sreberazure-njj24-master-2 0/1 Completed 0 51m 10.129.0.27 sreberazure-njj24-master-2 <none> <none> installer-3-sreberazure-njj24-master-2 0/1 Completed 0 50m 10.129.0.32 sreberazure-njj24-master-2 <none> <none> installer-4-sreberazure-njj24-master-2 0/1 Completed 0 49m 10.129.0.36 sreberazure-njj24-master-2 <none> <none> installer-5-sreberazure-njj24-master-2 0/1 Completed 0 46m 10.129.0.15 sreberazure-njj24-master-2 <none> <none> installer-6-sreberazure-njj24-master-0 0/1 Completed 0 37m 10.130.0.27 sreberazure-njj24-master-0 <none> <none> installer-6-sreberazure-njj24-master-1 0/1 Completed 0 39m 10.128.0.45 sreberazure-njj24-master-1 <none> <none> installer-6-sreberazure-njj24-master-2 0/1 Completed 0 36m 10.129.0.37 sreberazure-njj24-master-2 <none> <none> kube-apiserver-guard-sreberazure-njj24-master-0 1/1 Running 0 37m 10.130.0.29 sreberazure-njj24-master-0 <none> <none> kube-apiserver-guard-sreberazure-njj24-master-1 1/1 Running 0 38m 10.128.0.47 sreberazure-njj24-master-1 <none> <none> kube-apiserver-guard-sreberazure-njj24-master-2 1/1 Running 0 50m 10.129.0.31 sreberazure-njj24-master-2 <none> <none> kube-apiserver-sreberazure-njj24-master-0 5/5 Running 0 37m 10.0.0.6 sreberazure-njj24-master-0 <none> <none> kube-apiserver-sreberazure-njj24-master-1 5/5 Running 0 38m 10.0.0.8 sreberazure-njj24-master-1 <none> <none> kube-apiserver-sreberazure-njj24-master-2 5/5 Running 0 34m 10.0.0.7 sreberazure-njj24-master-2 <none> <none> revision-pruner-6-sreberazure-njj24-master-0 0/1 Completed 0 33m 10.130.0.35 sreberazure-njj24-master-0 <none> <none> revision-pruner-6-sreberazure-njj24-master-1 0/1 Completed 0 33m 10.128.0.56 sreberazure-njj24-master-1 <none> <none> revision-pruner-6-sreberazure-njj24-master-2 0/1 Completed 0 33m 10.129.0.39 sreberazure-njj24-master-2 <none> <none> $ oc rsh kube-apiserver-sreberazure-njj24-master-1 sh-4.4# while true; do curl -k --connect-timeout 1 https://api.clustername.domain.lab:6443/readyz; sleep 1; done okokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokok Also changing `--connect-timeout 1` from curl to `--connect-timeout 10` for example does not have any impact. It simply takes longer until the timeout is reached.
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.12 (also previous version were not tested)
How reproducible:
Always
Steps to Reproduce:
1. Install OpenShift Container Platform 4.12 on Azure using IPI install method and set the SDN to OVN-Kubernetes
2. Once successfully installed run `oc project openshift-apiserver`
3. rsh apiserver-<podID>
4. while true; do curl -k --connect-timeout 1 https://api.clustername.domain.lab:6443/readyz; sleep 1; done
Actual results:
sh-4.4# while true; do curl -k --connect-timeout 1 https://api.clustername.domain.lab:6443/readyz; sleep 1; done curl: (28) Connection timed out after 1000 milliseconds okokokcurl: (28) Connection timed out after 1001 milliseconds okokcurl: (28) Connection timed out after 1003 milliseconds curl: (28) Connection timed out after 1001 milliseconds curl: (28) Connection timed out after 1001 milliseconds okokokokokokokokokcurl: (28) Connection timed out after 1001 milliseconds okokcurl: (28) Connection timed out after 1001 milliseconds curl: (28) Connection timed out after 1001 milliseconds
Expected results:
sh-4.4# while true; do curl -k --connect-timeout 1 https://api.clustername.domain.lab:6443/readyz; sleep 1; done okokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokok
Additional info:
Follow up for https://issues.redhat.com/browse/HOSTEDCP-969
Create metrics and grafana panel in
https://github.com/openshift/hypershift/tree/main/contrib/metrics
for NodePool internal SLOs/SLIs:
Move existing metrics when possible from metrics loop into nodepool controller:
- nodePoolSize
Explore and discuss granular metrics to track NodePool lifecycle bottle necks, infra, ignition, node networking, available. Consolidate that with hostedClusterTransitionSeconds metrics and dashboard panels
Explore and discuss metrics for upgrade duration SLO for both HC and NodePool.
Description of problem:
OCP 4.13 uses a release candidate, v3.0.0-rc.1, of vsphere-csi-driver. We should ship OCP with a GA version.
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2023-03-17-161027
I tried to update my cluster from 4.12.0 to 4.12.2, and this resulted in a crashlooping state for both prometheus adapter pods. I tried to downgrade back to 4.12.0 and then upgrade to 4.12.4, but neither approach solved the situation.
What I can see in the logs of the adapters is the following:
I0216 15:24:59.144559 1 adapter.go:114] successfully using in-cluster auth
I0216 15:25:00.345620 1 request.go:601] Waited for 1.180640418s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/operators.coreos.com/v1alpha1?timeout=32s
I0216 15:25:10.345634 1 request.go:601] Waited for 11.180149045s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/triggers.tekton.dev/v1beta1?timeout=32s
I0216 15:25:20.346048 1 request.go:601] Waited for 2.597453714s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/apiextensions.k8s.io/v1?timeout=32s
I0216 15:25:30.347435 1 request.go:601] Waited for 12.598768922s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/machine.openshift.io/v1beta1?timeout=32s
I0216 15:25:40.545767 1 request.go:601] Waited for 22.797001115s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/samples.operator.openshift.io/v1?timeout=32s
I0216 15:25:50.546588 1 request.go:601] Waited for 32.797748538s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/metrics.k8s.io/v1beta1?timeout=32s
I0216 15:25:56.041594 1 secure_serving.go:210] Serving securely on [::]:6443
I0216 15:25:56.042265 1 dynamic_serving_content.go:132] "Starting controller" name="serving-cert::/etc/tls/private/tls.crt::/etc/tls/private/tls.key"
I0216 15:25:56.042971 1 dynamic_cafile_content.go:157] "Starting controller" name="request-header::/etc/tls/private/requestheader-client-ca-file"
I0216 15:25:56.043309 1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0216 15:25:56.043310 1 object_count_tracker.go:84] "StorageObjectCountTracker pruner is exiting"
I0216 15:25:56.043398 1 dynamic_serving_content.go:146] "Shutting down controller" name="serving-cert::/etc/tls/private/tls.crt::/etc/tls/private/tls.key"
I0216 15:25:56.043562 1 tlsconfig.go:255] "Shutting down DynamicServingCertificateController"
I0216 15:25:56.043606 1 dynamic_cafile_content.go:157] "Starting controller" name="client-ca-bundle::/etc/tls/private/client-ca-file"
I0216 15:25:56.043614 1 secure_serving.go:255] Stopped listening on [::]:6443
I0216 15:25:56.043621 1 dynamic_cafile_content.go:171] "Shutting down controller" name="client-ca-bundle::/etc/tls/private/client-ca-file"
I0216 15:25:56.043635 1 dynamic_cafile_content.go:171] "Shutting down controller" name="request-header::/etc/tls/private/requestheader-client-ca-file"
I also tried to search online for known issues and bugs and found this one that might be related:
https://github.com/kubernetes-sigs/metrics-server/issues/983
I also tried rebooting the server but it didn't help.
Need a workaround at least because at the moment the cluster is still in a pending stage.
Description of problem:
Following https://bugzilla.redhat.com/show_bug.cgi?id=2102765 respectively https://issues.redhat.com/browse/OCPBUGS-2140 problems with OpenID Group sync have been resolved. Yet the problem documented in https://bugzilla.redhat.com/show_bug.cgi?id=2102765 still does exist and we see that Groups that are being removed are still part of the chache in oauth-apiserver, causing a panic of the respective components and failures during login for potentially affected users. So in general, it looks like that oauth-apiserver cache is not properly refreshing or handling the OpenID Groups being synced. E1201 11:03:14.625799 1 runtime.go:76] Observed a panic: interface conversion: interface {} is nil, not *v1.Group goroutine 3706798 [running]: k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP.func1.1() k8s.io/apiserver@v0.22.2/pkg/server/filters/timeout.go:103 +0xb0 panic({0x1aeab00, 0xc001400390}) runtime/panic.go:838 +0x207 k8s.io/apiserver/pkg/endpoints/filters.WithAudit.func1.1.1() k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/audit.go:80 +0x2a k8s.io/apiserver/pkg/endpoints/filters.WithAudit.func1.1() k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/audit.go:89 +0x250 panic({0x1aeab00, 0xc001400390}) runtime/panic.go:838 +0x207 github.com/openshift/library-go/pkg/oauth/usercache.(*GroupCache).GroupsFor(0xc00081bf18?, {0xc000c8ac03?, 0xc001400360?}) github.com/openshift/library-go@v0.0.0-20211013122800-874db8a3dac9/pkg/oauth/usercache/groups.go:47 +0xe7 github.com/openshift/oauth-server/pkg/groupmapper.(*UserGroupsMapper).processGroups(0xc0002c8880, {0xc0005d4e60, 0xd}, {0xc000c8ac03, 0x7}, 0x1?) github.com/openshift/oauth-server/pkg/groupmapper/groupmapper.go:101 +0xb5 github.com/openshift/oauth-server/pkg/groupmapper.(*UserGroupsMapper).UserFor(0xc0002c8880, {0x20f3c40, 0xc000e18bc0}) github.com/openshift/oauth-server/pkg/groupmapper/groupmapper.go:83 +0xf4 github.com/openshift/oauth-server/pkg/oauth/external.(*Handler).login(0xc00022bc20, {0x20eebb0, 0xc00041b058}, 0xc0015d8200, 0xc001438140?, {0xc0000e7ce0, 0x150}) github.com/openshift/oauth-server/pkg/oauth/external/handler.go:209 +0x74f github.com/openshift/oauth-server/pkg/oauth/external.(*Handler).ServeHTTP(0xc00022bc20, {0x20eebb0, 0xc00041b058}, 0x0?) github.com/openshift/oauth-server/pkg/oauth/external/handler.go:180 +0x74a net/http.(*ServeMux).ServeHTTP(0x1c9dda0?, {0x20eebb0, 0xc00041b058}, 0xc0015d8200) net/http/server.go:2462 +0x149 github.com/openshift/oauth-server/pkg/server/headers.WithRestoreAuthorizationHeader.func1({0x20eebb0, 0xc00041b058}, 0xc0015d8200) github.com/openshift/oauth-server/pkg/server/headers/oauthbasic.go:27 +0x10f net/http.HandlerFunc.ServeHTTP(0x0?, {0x20eebb0?, 0xc00041b058?}, 0x0?) net/http/server.go:2084 +0x2f k8s.io/apiserver/pkg/endpoints/filterlatency.trackCompleted.func1({0x20eebb0, 0xc00041b058}, 0xc0015d8200) k8s.io/apiserver@v0.22.2/pkg/endpoints/filterlatency/filterlatency.go:103 +0x1a5 net/http.HandlerFunc.ServeHTTP(0xc0005e0280?, {0x20eebb0?, 0xc00041b058?}, 0x0?) net/http/server.go:2084 +0x2f k8s.io/apiserver/pkg/endpoints/filters.WithAuthorization.func1({0x20eebb0, 0xc00041b058}, 0xc0015d8200) k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/authorization.go:64 +0x498 net/http.HandlerFunc.ServeHTTP(0x0?, {0x20eebb0?, 0xc00041b058?}, 0x0?) 
net/http/server.go:2084 +0x2f k8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1({0x20eebb0, 0xc00041b058}, 0xc0015d8200) k8s.io/apiserver@v0.22.2/pkg/endpoints/filterlatency/filterlatency.go:79 +0x178 net/http.HandlerFunc.ServeHTTP(0x2f6cea0?, {0x20eebb0?, 0xc00041b058?}, 0x3?) net/http/server.go:2084 +0x2f k8s.io/apiserver/pkg/server/filters.WithMaxInFlightLimit.func1({0x20eebb0, 0xc00041b058}, 0xc0015d8200) k8s.io/apiserver@v0.22.2/pkg/server/filters/maxinflight.go:187 +0x2a4 net/http.HandlerFunc.ServeHTTP(0x0?, {0x20eebb0?, 0xc00041b058?}, 0x0?) net/http/server.go:2084 +0x2f k8s.io/apiserver/pkg/endpoints/filterlatency.trackCompleted.func1({0x20eebb0, 0xc00041b058}, 0xc0015d8200) k8s.io/apiserver@v0.22.2/pkg/endpoints/filterlatency/filterlatency.go:103 +0x1a5 net/http.HandlerFunc.ServeHTTP(0x11?, {0x20eebb0?, 0xc00041b058?}, 0x1aae340?) net/http/server.go:2084 +0x2f k8s.io/apiserver/pkg/endpoints/filters.WithImpersonation.func1({0x20eebb0, 0xc00041b058}, 0xc0015d8200) k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/impersonation.go:50 +0x21c net/http.HandlerFunc.ServeHTTP(0xc000d52120?, {0x20eebb0?, 0xc00041b058?}, 0x0?) net/http/server.go:2084 +0x2f k8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1({0x20eebb0, 0xc00041b058}, 0xc0015d8200) k8s.io/apiserver@v0.22.2/pkg/endpoints/filterlatency/filterlatency.go:79 +0x178 net/http.HandlerFunc.ServeHTTP(0x0?, {0x20eebb0?, 0xc00041b058?}, 0x0?) net/http/server.go:2084 +0x2f k8s.io/apiserver/pkg/endpoints/filterlatency.trackCompleted.func1({0x20eebb0, 0xc00041b058}, 0xc0015d8200) k8s.io/apiserver@v0.22.2/pkg/endpoints/filterlatency/filterlatency.go:103 +0x1a5 net/http.HandlerFunc.ServeHTTP(0xc0015d8100?, {0x20eebb0?, 0xc00041b058?}, 0xc000531930?) net/http/server.go:2084 +0x2f k8s.io/apiserver/pkg/endpoints/filters.WithAudit.func1({0x7fae682a40d8?, 0xc00041b048}, 0x9dbbaa?) k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/audit.go:111 +0x549 net/http.HandlerFunc.ServeHTTP(0xc00003def0?, {0x7fae682a40d8?, 0xc00041b048?}, 0x0?) net/http/server.go:2084 +0x2f k8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1({0x7fae682a40d8, 0xc00041b048}, 0xc0015d8100) k8s.io/apiserver@v0.22.2/pkg/endpoints/filterlatency/filterlatency.go:79 +0x178 net/http.HandlerFunc.ServeHTTP(0x0?, {0x7fae682a40d8?, 0xc00041b048?}, 0x0?) net/http/server.go:2084 +0x2f k8s.io/apiserver/pkg/endpoints/filterlatency.trackCompleted.func1({0x7fae682a40d8, 0xc00041b048}, 0xc0015d8100) k8s.io/apiserver@v0.22.2/pkg/endpoints/filterlatency/filterlatency.go:103 +0x1a5 net/http.HandlerFunc.ServeHTTP(0x20f0f58?, {0x7fae682a40d8?, 0xc00041b048?}, 0x20cfd00?) net/http/server.go:2084 +0x2f k8s.io/apiserver/pkg/endpoints/filters.withAuthentication.func1({0x7fae682a40d8, 0xc00041b048}, 0xc0015d8100) k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/authentication.go:80 +0x8b9 net/http.HandlerFunc.ServeHTTP(0x20f0f20?, {0x7fae682a40d8?, 0xc00041b048?}, 0x20cfc08?) net/http/server.go:2084 +0x2f k8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1({0x7fae682a40d8, 0xc00041b048}, 0xc000e69e00) k8s.io/apiserver@v0.22.2/pkg/endpoints/filterlatency/filterlatency.go:88 +0x46b net/http.HandlerFunc.ServeHTTP(0xc0019f5890?, {0x7fae682a40d8?, 0xc00041b048?}, 0xc000848764?) net/http/server.go:2084 +0x2f k8s.io/apiserver/pkg/server/filters.WithCORS.func1({0x7fae682a40d8, 0xc00041b048}, 0xc000e69e00) k8s.io/apiserver@v0.22.2/pkg/server/filters/cors.go:75 +0x10b net/http.HandlerFunc.ServeHTTP(0xc00149a380?, {0x7fae682a40d8?, 0xc00041b048?}, 0xc0008487d0?) 
net/http/server.go:2084 +0x2f k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP.func1() k8s.io/apiserver@v0.22.2/pkg/server/filters/timeout.go:108 +0xa2 created by k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP k8s.io/apiserver@v0.22.2/pkg/server/filters/timeout.go:94 +0x2cc goroutine 3706802 [running]: k8s.io/apimachinery/pkg/util/runtime.logPanic({0x19eb780?, 0xc001206e20}) k8s.io/apimachinery@v0.22.2/pkg/util/runtime/runtime.go:74 +0x99 k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0xc0016aec60, 0x1, 0x1560f26?}) k8s.io/apimachinery@v0.22.2/pkg/util/runtime/runtime.go:48 +0x75 panic({0x19eb780, 0xc001206e20}) runtime/panic.go:838 +0x207 k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP(0xc0005047c8, {0x20eecd0?, 0xc0010fae00}, 0xdf8475800?) k8s.io/apiserver@v0.22.2/pkg/server/filters/timeout.go:114 +0x452 k8s.io/apiserver/pkg/endpoints/filters.withRequestDeadline.func1({0x20eecd0, 0xc0010fae00}, 0xc000e69d00) k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/request_deadline.go:101 +0x494 net/http.HandlerFunc.ServeHTTP(0xc0016af048?, {0x20eecd0?, 0xc0010fae00?}, 0xc0000bc138?) net/http/server.go:2084 +0x2f k8s.io/apiserver/pkg/server/filters.WithWaitGroup.func1({0x20eecd0?, 0xc0010fae00}, 0xc000e69d00) k8s.io/apiserver@v0.22.2/pkg/server/filters/waitgroup.go:59 +0x177 net/http.HandlerFunc.ServeHTTP(0x20f0f58?, {0x20eecd0?, 0xc0010fae00?}, 0x7fae705daff0?) net/http/server.go:2084 +0x2f k8s.io/apiserver/pkg/endpoints/filters.WithAuditAnnotations.func1({0x20eecd0, 0xc0010fae00}, 0xc000e69c00) k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/audit_annotations.go:37 +0x230 net/http.HandlerFunc.ServeHTTP(0x20f0f58?, {0x20eecd0?, 0xc0010fae00?}, 0x20cfc08?) net/http/server.go:2084 +0x2f k8s.io/apiserver/pkg/endpoints/filters.WithWarningRecorder.func1({0x20eecd0?, 0xc0010fae00}, 0xc000e69b00) k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/warning.go:35 +0x2bb net/http.HandlerFunc.ServeHTTP(0x1c9dda0?, {0x20eecd0?, 0xc0010fae00?}, 0xd?) net/http/server.go:2084 +0x2f k8s.io/apiserver/pkg/endpoints/filters.WithCacheControl.func1({0x20eecd0, 0xc0010fae00}, 0x0?) k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/cachecontrol.go:31 +0x126 net/http.HandlerFunc.ServeHTTP(0x20f0f58?, {0x20eecd0?, 0xc0010fae00?}, 0x20cfc08?) net/http/server.go:2084 +0x2f k8s.io/apiserver/pkg/server/httplog.WithLogging.func1({0x20ef480?, 0xc001c20620}, 0xc000e69a00) k8s.io/apiserver@v0.22.2/pkg/server/httplog/httplog.go:103 +0x518 net/http.HandlerFunc.ServeHTTP(0x20f0f58?, {0x20ef480?, 0xc001c20620?}, 0x20cfc08?) net/http/server.go:2084 +0x2f k8s.io/apiserver/pkg/endpoints/filters.WithRequestInfo.func1({0x20ef480, 0xc001c20620}, 0xc000e69900) k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/requestinfo.go:39 +0x316 net/http.HandlerFunc.ServeHTTP(0x20f0f58?, {0x20ef480?, 0xc001c20620?}, 0xc0007c3f70?) net/http/server.go:2084 +0x2f k8s.io/apiserver/pkg/endpoints/filters.withRequestReceivedTimestampWithClock.func1({0x20ef480, 0xc001c20620}, 0xc000e69800) k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/request_received_time.go:38 +0x27e net/http.HandlerFunc.ServeHTTP(0x419e2c?, {0x20ef480?, 0xc001c20620?}, 0xc0007c3e40?) net/http/server.go:2084 +0x2f k8s.io/apiserver/pkg/server/filters.withPanicRecovery.func1({0x20ef480?, 0xc001c20620?}, 0xc0004ff600?) k8s.io/apiserver@v0.22.2/pkg/server/filters/wrap.go:74 +0xb1 net/http.HandlerFunc.ServeHTTP(0x1c05260?, {0x20ef480?, 0xc001c20620?}, 0x8?) 
net/http/server.go:2084 +0x2f k8s.io/apiserver/pkg/endpoints/filters.withAuditID.func1({0x20ef480, 0xc001c20620}, 0xc000e69600) k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/with_auditid.go:66 +0x40d net/http.HandlerFunc.ServeHTTP(0x1c9dda0?, {0x20ef480?, 0xc001c20620?}, 0xd?) net/http/server.go:2084 +0x2f github.com/openshift/oauth-server/pkg/server/headers.WithPreserveAuthorizationHeader.func1({0x20ef480, 0xc001c20620}, 0xc000e69600) github.com/openshift/oauth-server/pkg/server/headers/oauthbasic.go:16 +0xe8 net/http.HandlerFunc.ServeHTTP(0xc0016af9d0?, {0x20ef480?, 0xc001c20620?}, 0x16?) net/http/server.go:2084 +0x2f github.com/openshift/oauth-server/pkg/server/headers.WithStandardHeaders.func1({0x20ef480, 0xc001c20620}, 0x4d55c0?) github.com/openshift/oauth-server/pkg/server/headers/headers.go:30 +0x18f net/http.HandlerFunc.ServeHTTP(0x0?, {0x20ef480?, 0xc001c20620?}, 0xc0016afac8?) net/http/server.go:2084 +0x2f k8s.io/apiserver/pkg/server.(*APIServerHandler).ServeHTTP(0xc00098d622?, {0x20ef480?, 0xc001c20620?}, 0xc000401000?) k8s.io/apiserver@v0.22.2/pkg/server/handler.go:189 +0x2b net/http.serverHandler.ServeHTTP({0xc0019f5170?}, {0x20ef480, 0xc001c20620}, 0xc000e69600) net/http/server.go:2916 +0x43b net/http.(*conn).serve(0xc0002b1720, {0x20f0f58, 0xc0001e8120}) net/http/server.go:1966 +0x5d7 created by net/http.(*Server).Serve net/http/server.go:3071 +0x4db
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.11.13
How reproducible:
- Always
Steps to Reproduce:
1. Install OpenShift Container Platform 4.11
2. Configure OpenID Group Sync (as per https://docs.openshift.com/container-platform/4.11/authentication/identity_providers/configuring-oidc-identity-provider.html#identity-provider-oidc-CR_configuring-oidc-identity-provider)
3. Have users with hundreds of groups
4. Log in and, after a while, remove some Groups from the user in the IDP and from OpenShift Container Platform
5. Try to log in again and see the panic in oauth-apiserver
Actual results:
User is unable to login and oauth pods are reporting a panic as shown above
Expected results:
oauth-apiserver should invalidate the cache quickly to remove potentially invalid references to non-existing groups
Additional info:
Description of problem:
In certain cases, an AWS cluster running 4.12 doesn't automatically generate a controlplanemachineset when it's expected to. It looks like CPMS is looking for `infrastructure.Spec.PlatformSpec.Type` (https://github.com/openshift/cluster-control-plane-machine-set-operator/blob/2aeaaf9ec714ee75f933051c21a44f648d6ed42b/pkg/controllers/controlplanemachinesetgenerator/controller.go#L180) and, as a result, clusters born earlier than 4.5, when this field was introduced (https://github.com/openshift/installer/pull/3277), will not be able to generate a CPMS. I believe we should be looking at `infrastructure.Status.PlatformStatus.Type` instead.
Version-Release number of selected component (if applicable):
4.12.9
How reproducible:
Consistent
Steps to Reproduce:
1. Install a cluster on a version earlier than 4.5
2. Upgrade the cluster through to 4.12
3. Observe the "Unable to generate control plane machine set, unsupported platform" error message from the control-plane-machine-set-operator, as well as the missing CPMS object in the openshift-machine-api namespace
Actual results:
No generated CPMS is created, despite the platform being AWS
Expected results:
A generated CPMS existing in the openshift-machine-api namespace
Additional info:
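A sketch of reading the platform from the status rather than the spec, as suggested in the description (configv1 is "github.com/openshift/api/config/v1"; the helper name and fallback behaviour are illustrative, not the actual operator code):

// Prefer status.platformStatus.type, which is populated even on clusters
// born before the spec field existed, and fall back to the spec otherwise.
func platformType(infra *configv1.Infrastructure) configv1.PlatformType {
	if infra.Status.PlatformStatus != nil {
		return infra.Status.PlatformStatus.Type
	}
	return infra.Spec.PlatformSpec.Type
}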
Description of problem:
Running `yarn dev` results in the build running in a loop. This issue appears to be related to changes in https://github.com/openshift/console/pull/12821.
How reproducible:
Always
Steps to Reproduce:
1. Run `yarn dev`
2. Make changes to a file and save
3. Watch the terminal output of `yarn dev` and note the build is looping
Description of problem:
IHAC with OCP 4.9 who has configured the IngressControllers with a long httpLogFormat, and the routers print the following every time the configuration reloads:
I0927 13:29:45.495077 1 router.go:612] template "msg"="router reloaded" "output"="[WARNING] 269/132945 (9167) : config : truncating capture length to 63 bytes for frontend 'public'.\n[WARNING] 269/132945 (9167) : config : truncating capture length to 63 bytes for frontend 'fe_sni'.\n[WARNING] 269/132945 (9167) : config : truncating capture length to 63 bytes for frontend 'fe_no_sni'.\n - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
This is the Ingress Contoller configuration:
logging:
  access:
    destination:
      syslog:
        address: 10.X.X.X
        port: 10514
      type: Syslog
    httpCaptureCookies:
    - matchType: Exact
      maxLength: 128
      name: ITXSESSIONID
    httpCaptureHeaders:
      request:
      - maxLength: 128
        name: Host
      - maxLength: 128
        name: itxrequestid
    httpLogFormat: actconn="%ac",backend_name="%b",backend_queue="%bq",backend_source_ip="%bi",backend_source_port="%bp",beconn="%bc",bytes_read="%B",bytes_uploaded="%U",captrd_req_cookie="%CC",captrd_req_headers="%hr",captrd_res_cookie="%CS",captrd_res_headers="%hs",client_ip="%ci",client_port="%cp",cluster="ieec1ocp1",datacenter="ieec1",environment="pro",fe_name_transport="%ft",feconn="%fc",frontend_name="%f",hostname="%H",http_version="%HV",log_type="http",method="%HM",query_string="%HQ",req_date="%tr",request="%HP",res_time="%TR",retries="%rc",server_ip="%si",server_name="%s",server_port="%sp",srv_queue="%sq",srv_conn="%sc",srv_queue="%sq",status_code="%ST",Ta="%Ta",Tc="%Tc",tenant="bk",term_state="%tsc",tot_wait_q="%Tw",Tr="%Tr"
    logEmptyRequests: Ignore
Any way to avoid this truncate warning?
How reproducible:
For every reload of haproxy config
Steps to Reproduce:
You can reproduce easily with the following configuration in the default ingress controller:
logging:
  access:
    destination:
      type: Container
    httpCaptureCookies:
2022-10-18T14:13:53.068164+00:00 xxxx xxxxxx haproxy[38]: 10.39.192.203:40698 [18/Oct/2022:14:13:52.488] fe_sni~ be_secure:openshift-console:console/pod:console-5976495467-zxgxr:console:https:10.128.1.116:8443 0/0/0/10/580 200 1130598 _abck=B7EA642C9E828FA8210F329F80B7B2D80YAAQnVozuFVfkOaDAQAADk - --VN 78/37/33/33/0 0/0 "GET /api/kubernetes/openapi/v2 HTTP/1.1"
Description of problem:
Trying to deploy a HostedCluster using an IPv6 network, the control plane fails to start. These are the networking parameters for the HostedCluster:

networking:
  clusterNetwork:
  - cidr: fd01::/48
  networkType: OVNKubernetes
  serviceNetwork:
  - cidr: fd02::/112

When the control plane pods are created, the etcd pod will remain in CrashLoopBackOff. The error in the logs:

invalid value "https://fd01:0:0:3::4c:2380" for flag -listen-peer-urls: URL address does not have the form "host:port": https://fd01:0:0:3::4c:2380
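The error suggests the IPv6 address is not being bracketed when the listen URL is built. A minimal sketch of constructing such a URL safely with net.JoinHostPort (illustrative only, not the actual HyperShift/etcd wiring):

// net.JoinHostPort brackets IPv6 literals, producing
// https://[fd01:0:0:3::4c]:2380 instead of the malformed form above.
package main

import (
	"fmt"
	"net"
)

func main() {
	peerURL := "https://" + net.JoinHostPort("fd01:0:0:3::4c", "2380")
	fmt.Println(peerURL) // https://[fd01:0:0:3::4c]:2380
}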
Version-Release number of selected component (if applicable):
Any
How reproducible:
Always
Steps to Reproduce:
1. Create a HostedCluster with the networking parameters set to IPv6 networks.
2. The etcd pod will be created and will fail to start.
Actual results:
etcd crashes at start
Expected results:
etcd starts properly and the other control plane pods follow
Additional info:
N/A
Description of problem:
Selecting "Manual" for Update approval does not take effect.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
ose-gcp-pd-csi-driver fails to build: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=54433295 Error: /usr/lib/golang/pkg/tool/linux_amd64/link: running gcc failed: exit status 1 gcc: error: static: No such file or directory make: *** [Makefile:40: gce-pd-driver] Error 1
Version-Release number of selected component (if applicable):
4.14 / master
How reproducible:
run osbs build
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
When changing channels it's possible that multiple new conditional update risks will need to be evaluated. For instance, a cluster running 4.10.34 in a 4.10 channel today only has to evaluate `OpenStackNodeCreationFails`, but when the channel is changed to a 4.11 channel multiple new risks require evaluation, and the evaluation of new risks is throttled at one every 10 minutes. This means if there are three new risks it may take up to 30 minutes after the channel has changed for the full set of conditional updates to be computed. This leads to a perception that no update paths are recommended, because most users will not wait 30 minutes; they expect immediate feedback.
Version-Release number of selected component (if applicable):
4.10.z, 4.11.z, 4.12, 4.13
How reproducible:
100%
Steps to Reproduce:
1. Install 4.10.34 2. Switch from stable-4.10 to stable-4.11 3.
Actual results:
Observe no recommended updates for 10-20 minutes because all available paths to 4.11 have a risk associated with them
Expected results:
Risks are computed in a timely manner for an interactive UX, let's say < 10s
Additional info:
This was intentional in the design: we didn't want risks to continuously re-evaluate or overwhelm the monitoring stack. However, we didn't anticipate that we'd have a long-standing pile of risks, or realize how confusing the user experience would be. We intend to work around this in the deployed fleet by converting older risks from `type: promql` to `type: Always`, avoiding the evaluation period but preserving the notification. While this may lead customers to believe they're exposed to a risk they may not be, as long as the set of outstanding risks to the latest version is limited to no more than one, it's likely no one will notice. All 4.10 and 4.11 clusters currently have a clear path toward a relatively recent 4.10.z or 4.11.z with no more than one risk to be evaluated.
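For illustration only, a rough sketch of what converting a risk's matching rule could look like in the update graph data. The field layout is an assumption based on the blocked-edges style schema used by cincinnati-graph-data, and the PromQL expression is a placeholder, not a real risk:

matchingRules:        # before: risk evaluated via a throttled PromQL query
- type: promql
  promql:
    promql: |
      group(placeholder_metric{condition="affected"}) or 0 * group(placeholder_metric)

matchingRules:        # after: risk always shown, no evaluation period, notification preserved
- type: Always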
Description of problem:
Usually etcd pod is named "etcd-bootstrap" for multinode install. In bootstrap-in-place mode the only master is not started during bootstrap, so its useful to use the expected pod name during bootstrap. This would allow us to re-use the bootstrap-generated certificates on "real" master startup
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Adding an audit configuration to a HyperShift hosted cluster does not work as expected.
Version-Release number of selected component (if applicable):
# oc get clusterversions.config.openshift.io NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.13.0-0.nightly-2023-05-04-090524 True False 15m Cluster version is 4.13.0-0.nightly-2023-05-04-090524
How reproducible:
Always
Steps to Reproduce:
1. Get the HyperShift hosted cluster detail from the management cluster:
# hostedcluster=$( oc get -n clusters hostedclusters -o json | jq -r .items[].metadata.name)

2. Apply an audit profile for the HyperShift hosted cluster:
# oc patch HostedCluster $hostedcluster -n clusters -p '{"spec": {"configuration": {"apiServer": {"audit": {"profile": "WriteRequestBodies"}}}}}' --type merge
hostedcluster.hypershift.openshift.io/85ea85757a5a14355124 patched
# oc get HostedCluster $hostedcluster -n clusters -ojson | jq .spec.configuration.apiServer.audit
{
  "profile": "WriteRequestBodies"
}

3. Check whether pods or the operator restart to apply the configuration changes:
# oc get pods -l app=kube-apiserver -n clusters-${hostedcluster}
NAME                              READY   STATUS    RESTARTS   AGE
kube-apiserver-7c98b66949-9z6rw   5/5     Running   0          36m
kube-apiserver-7c98b66949-gp5rx   5/5     Running   0          36m
kube-apiserver-7c98b66949-wmk8x   5/5     Running   0          36m
# oc get pods -l app=openshift-apiserver -n clusters-${hostedcluster}
NAME                                  READY   STATUS    RESTARTS   AGE
openshift-apiserver-dc4c84ff4-566z9   3/3     Running   0          29m
openshift-apiserver-dc4c84ff4-99zq9   3/3     Running   0          29m
openshift-apiserver-dc4c84ff4-9xdrz   3/3     Running   0          30m

4. Check the generated audit log:
# NOW=$(date -u "+%s"); echo "$NOW"; echo "$NOW" > now
1683711189
# kaspod=$(oc get pods -l app=kube-apiserver -n clusters-${hostedcluster} --no-headers -o=jsonpath={.items[0].metadata.name})
# oc logs $kaspod -c audit-logs -n clusters-${hostedcluster} > kas-audit.log
# cat kas-audit.log | grep -iE '"verb":"(get|list|watch)","user":.*(requestObject|responseObject)' | jq -c 'select (.requestReceivedTimestamp | .[0:19] + "Z" | fromdateiso8601 > '"`cat now`)" | wc -l
0
# cat kas-audit.log | grep -iE '"verb":"(create|delete|patch|update)","user":.*(requestObject|responseObject)' | jq -c 'select (.requestReceivedTimestamp | .[0:19] + "Z" | fromdateiso8601 > '"`cat now`)" | wc -l
0

None of these results should be zero. On the backend, the configuration should be applied, or the pods/operator should restart after the configuration changes.
Actual results:
Config changes are not applied on the backend; neither the operator nor the pods restart.
Expected results:
Configuration should be applied, and the pods and operator should restart after config changes.
Additional info:
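One additional check worth doing (illustrative only; the resource and field path here are assumptions based on the HyperShift API, not verified in this report) is whether the audit profile was propagated from the HostedCluster to the HostedControlPlane in the control plane namespace:
# oc get hostedcontrolplane -n clusters-${hostedcluster} -o jsonpath='{.items[0].spec.configuration.apiServer.audit}'
If this is empty while the HostedCluster shows the profile, the propagation step itself is the likely culprit.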
In many cases, the /dev/disk/by-path symlink is the only way to stably identify a disk without having prior knowledge of the hardware from some external source (e.g. a spreadsheet of disk serial numbers). It should be possible to specify this path in the root device hints.
Metal³ now allows these paths in the `name` hint (see OCPBUGS-13080), so the IPI installer's implementation using terraform must be changed to match.
Description of problem:
When a MCCPoolAlert is fired and we fix the problem that caused this alert, the alert is not removed.
Version-Release number of selected component (if applicable):
$ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.14.0-0.nightly-2023-06-06-212044 True False 114m Cluster version is 4.14.0-0.nightly-2023-06-06-212044
How reproducible:
Always
Steps to Reproduce:
1. Create a custom MCP:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: infra
spec:
  machineConfigSelector:
    matchExpressions:
    - {key: machineconfiguration.openshift.io/role, operator: In, values: [master,infra]}
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/infra: ""

2. Label a master node so that it is included in the new custom MCP:

$ oc label node $(oc get nodes -l node-role.kubernetes.io/master -ojsonpath="{.items[0].metadata.name}") node-role.kubernetes.io/infra=""

3. Verify that the alert is fired:

alias thanosalerts='curl -s -k -H "Authorization: Bearer $(oc -n openshift-monitoring create token prometheus-k8s)" https://$(oc get route -n openshift-monitoring thanos-querier -o jsonpath={.spec.host})/api/v1/alerts | jq '
$ thanosalerts | grep alertname
....
"alertname": "MCCPoolAlert",

4. Remove the label from the node to fix the problem:

$ oc label node $(oc get nodes -l node-role.kubernetes.io/master -ojsonpath="{.items[0].metadata.name}") node-role.kubernetes.io/infra-
Actual results:
The alert is not removed. When we have a look at the mcc_pool_alert metric we find 2 values with 2 different "alert" fields. alias thanosquery='function __lgb() { unset -f __lgb; oc rsh -n openshift-monitoring prometheus-k8s-0 curl -s -k -H "Authorization: Bearer $(oc -n openshift-monitoring create token prometheus-k8s)" --data-urlencode "query=$1" https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query | jq -c | jq; }; __lgb' $ thanosquery mcc_pool_alert { "status": "success", "data": { "resultType": "vector", "result": [ { "metric": { "__name__": "mcc_pool_alert", "alert": "Applying custom label for pool", "container": "oauth-proxy", "endpoint": "metrics", "instance": "10.130.0.86:9001", "job": "machine-config-controller", "namespace": "openshift-machine-config-operator", "node": "ip-10-0-129-20.us-east-2.compute.internal", "pod": "machine-config-controller-76dbddff49-75ggr", "pool": "infra", "prometheus": "openshift-monitoring/k8s", "service": "machine-config-controller" }, "value": [ 1686137977.158, "0" ] }, { "metric": { "__name__": "mcc_pool_alert", "alert": "Given both master and custom pools. Defaulting to master: custom infra", "container": "oauth-proxy", "endpoint": "metrics", "instance": "10.130.0.86:9001", "job": "machine-config-controller", "namespace": "openshift-machine-config-operator", "node": "ip-10-0-129-20.us-east-2.compute.internal", "pod": "machine-config-controller-76dbddff49-75ggr", "pool": "infra", "prometheus": "openshift-monitoring/k8s", "service": "machine-config-controller" }, "value": [ 1686137977.158, "1" ] } ] } }
Expected results:
The alert should be removed.
Additional info:
If we remove the MCO controller pod, a new mcc_pool_alert data is generated with the right value and the other values are removed. If we execute this workaround the alert is removed.
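A minimal sketch of that workaround, assuming the controller pods carry the k8s-app=machine-config-controller label (the label name is an assumption, not taken from this report):
$ oc delete pod -n openshift-machine-config-operator -l k8s-app=machine-config-controller
The deployment then recreates the controller pod, and the stale mcc_pool_alert series disappears.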
This is a clone of issue OCPBUGS-18754. The following is the description of the original issue:
—
Description of problem:
After a control plane release upgrade, the 'tuned' pod in the guest cluster uses the control plane release image
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. create a cluster in 4.14.0-0.ci-2023-09-06-180503 2. control plane release upgrade to 4.14-2023-09-07-180503 3. in the guest cluster check container image in pod tuned
Actual results:
pod tuned uses control plane release image 4.14-2023-09-07-180503
Expected results:
pod tuned uses release image 4.14.0-0.ci-2023-09-06-180503
Additional info:
After controlplane release upgrade, in control plane namespace, cluster-node-tuning-operator uses control plane release image: jiezhao-mac:hypershift jiezhao$ oc get pods cluster-node-tuning-operator-6dc549ffdf-jhj2k -n clusters-jie-test -ojsonpath='{.spec.containers[].name}{"\n"}' cluster-node-tuning-operator jiezhao-mac:hypershift jiezhao$ oc get pods cluster-node-tuning-operator-6dc549ffdf-jhj2k -n clusters-jie-test -ojsonpath='{.spec.containers[].image}{"\n"}' registry.ci.openshift.org/ocp/4.14-2023-09-07-180503@sha256:60bd6e2e8db761fb4b3b9d68c1da16bf0371343e3df8e72e12a2502640173990
Description of problem:
Stop option for pipelinerun is not working
Version-Release number of selected component (if applicable):
Openshift Pipelines 1.9.x
How reproducible:
Always
Steps to Reproduce:
1. Create a pipeline and start it 2. From Actions dropdown select stop option
Actual results:
Pipelinerun is not getting cancelled
Expected results:
Pipelinerun should get cancelled
Additional info:
Description of problem:
4.13.0-RC.6 enters Cluster status: error while trying to install a cluster with the agent-based installer. After the read-disk stage, the cluster status turns to "error".
Version-Release number of selected component (if applicable):
How reproducible:
Create an image with the attached install config and agent config file and boot a node with this image
Steps to Reproduce:
1. Create image with the attached install config and agent config file and boot node with this images
Actual results:
Cluster status: error
Expected results:
Should continue with cluster status: installing
Additional info:
Description of problem:
In the HyperShift context: operands managed by operators running in the hosted control plane namespace in the management cluster do not honour affinity opinions (https://hypershift-docs.netlify.app/how-to/distribute-hosted-cluster-workloads/ and https://github.com/openshift/hypershift/blob/main/support/config/deployment.go#L263-L265). These operands running management-side should honour the same affinity, tolerations, node selector and priority rules as the operator. This could be done by looking at the operator deployment itself or at the HCP resource. Affected operands: multus-admission-controller, cloud-network-config-controller, ovnkube-master. A sketch of the expected scheduling constraints is included under "Additional info" below.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create a hypershift cluster. 2. Check affinity rules and node selector of the operands above. 3.
Actual results:
Operands are missing affinity rules and node selector
Expected results:
Operands have the same affinity rules and node selector as the operator
Additional info:
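For illustration, a minimal sketch of the kind of scheduling constraints these operand Deployments would be expected to pick up, using the label and taint keys described in the distribute-hosted-cluster-workloads documentation linked above (the exact values propagated by HyperShift may differ and should be copied from the operator Deployment or derived from the HostedControlPlane resource rather than hard-coded):

  nodeSelector:
    hypershift.openshift.io/control-plane: "true"
  tolerations:
  - key: hypershift.openshift.io/control-plane
    operator: Equal
    value: "true"
    effect: NoSchedule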
Description of problem:
Pod status overlapping in the sidebar status is breaking the UI when the status is CreateContainerConfigError
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always when the status is CreateContainerConfigError
Steps to Reproduce:
1. Create a Pod that gives CreateContainerConfigError
Sample YAML:
apiVersion: v1
kind: Pod
metadata:
  name: example
  labels:
    app: httpd
  namespace: avik
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: httpd
    image: docker.io/httpd:latest
    ports:
    - containerPort: 80
    securityContext:
      allowPrivilegeEscalation: true
      capabilities:
        drop:
        - ALL
Actual results:
The pod status overlaps in the sidebar when the status text is long.
Expected results:
The Pod Status should not overlap. Also, this error status should look like the other error statuses.
Additional info:
Description of problem:
Not able to import a repository with a .tekton directory and a func.yaml file present; the import fails with the error `Cannot read properties of undefined (reading 'filter')`.
Version-Release number of selected component (if applicable):
4.13, Pipeline and Serverless is installed
How reproducible:
Steps to Reproduce:
1. In the Import from Git form, enter the Git URL: https://github.com/Lucifergene/oc-pipe-func 2. Pipeline is checked and the PAC option is selected by default; even if the user unchecks the Pipeline option, the same error occurs 3. Click the Create button
Actual results:
Not able to import and getting this error `Cannot read properties of undefined (reading 'filter')`
Expected results:
Should be able to import without any error
Additional info:
Description of problem:
An uninstall was started, however it failed due to the hosted-cluster-config-operator being unable to clean up the default ingresscontroller
Version-Release number of selected component (if applicable):
4.12.18
How reproducible:
Unsure - though definitely not 100%
Steps to Reproduce:
1. Uninstall a HyperShift cluster
Actual results:
❯ k logs -n ocm-staging-2439occi66vhbj0pee3s4d5jpi4vpm54-mshen-dr2 hosted-cluster-config-operator-5ccdbfcc4c-9mxfk --tail 10 -f
{"level":"info","ts":"2023-06-06T16:57:21Z","msg":"Image registry is removed","controller":"resources","object":{"name":""},"namespace":"","name":"","reconcileID":"3a8e4485-3d0a-41b7-b82c-ff0a7f0040e6"}
{"level":"info","ts":"2023-06-06T16:57:21Z","msg":"Ensuring ingress controllers are removed","controller":"resources","object":{"name":""},"namespace":"","name":"","reconcileID":"3a8e4485-3d0a-41b7-b82c-ff0a7f0040e6"}
{"level":"info","ts":"2023-06-06T16:57:21Z","msg":"Ensuring load balancers are removed","controller":"resources","object":{"name":""},"namespace":"","name":"","reconcileID":"3a8e4485-3d0a-41b7-b82c-ff0a7f0040e6"}
{"level":"info","ts":"2023-06-06T16:57:21Z","msg":"Load balancers are removed","controller":"resources","object":{"name":""},"namespace":"","name":"","reconcileID":"3a8e4485-3d0a-41b7-b82c-ff0a7f0040e6"}
{"level":"info","ts":"2023-06-06T16:57:21Z","msg":"Ensuring persistent volumes are removed","controller":"resources","object":{"name":""},"namespace":"","name":"","reconcileID":"3a8e4485-3d0a-41b7-b82c-ff0a7f0040e6"}
{"level":"info","ts":"2023-06-06T16:57:21Z","msg":"There are no more persistent volumes. Nothing to cleanup.","controller":"resources","object":{"name":""},"namespace":"","name":"","reconcileID":"3a8e4485-3d0a-41b7-b82c-ff0a7f0040e6"}
{"level":"info","ts":"2023-06-06T16:57:21Z","msg":"Persistent volumes are removed","controller":"resources","object":{"name":""},"namespace":"","name":"","reconcileID":"3a8e4485-3d0a-41b7-b82c-ff0a7f0040e6"}
After manually connecting to the hostedcluster and deleting the ingresscontroller, the uninstall progressed and succeeded.
Expected results:
The hosted cluster can cleanup the ingresscontrollers successfully and progress the uninstall
Additional info:
HyperShift dump: https://drive.google.com/file/d/1qqjkG4F_mSUCVMz3GbN-lEoqbshPvQcU/view?usp=sharing
Description of problem:
While trying to deploy OCP on GCP, the installer gets stuck on the very first step, trying to list all of the projects that the GCP service account used to deploy OCP can see
Version-Release number of selected component (if applicable):
4.13.3 but also happening on 4.12.5 and I presume other releases as well
How reproducible:
Every time
Steps to Reproduce:
1. Use openshift-install to create a cluster in GCP
Actual results:
$ ./openshift-install-4.13.3 create cluster --dir gcp-doha/ --log-level debug DEBUG OpenShift Installer 4.13.3 DEBUG Built from commit 90bb61f38881d07ce94368f0b34089d152ffa4ef DEBUG Fetching Metadata... DEBUG Loading Metadata... DEBUG Loading Cluster ID... DEBUG Loading Install Config... DEBUG Loading SSH Key... DEBUG Loading Base Domain... DEBUG Loading Platform... DEBUG Loading Cluster Name... DEBUG Loading Base Domain... DEBUG Loading Platform... DEBUG Loading Networking... DEBUG Loading Platform... DEBUG Loading Pull Secret... DEBUG Loading Platform... INFO Credentials loaded from environment variable "GOOGLE_CREDENTIALS", file "/home/mak/.gcp/aos-serviceaccount.json" ERROR failed to fetch Metadata: failed to load asset "Install Config": failed to create install config: platform.gcp.project: Internal error: context deadline exceeded
Expected results:
The cluster should be deployed with no issues
Additional info:
The GCP user used to deploy OCP has visibility of a very large number of projects (over 150,000):
> gcloud projects list | wc -l
152793
CNO should respect the `nodeSelector` setting in the HostedControlPlane:
Affinity and tolerations support is handled here: https://issues.redhat.com/browse/OCPBUGS-8692
Tracker issue for bootimage bump in 4.14. This issue should block issues which need a bootimage bump to fix.
The previous bump was OCPBUGS-10738.
Description of problem:
Test failed: Auth test: logs in as 'test' user via htpasswd identity provider
CI-search
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
API documentation for HostedCluster states that the webhook kubeconfig field is only supported for IBM Cloud. It should be supported for all platforms.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Review API documentation at https://hypershift-docs.netlify.app/reference/api/
Actual results:
Expected results:
Additional info:
While running the e2e test locally with Hypershift cluster from cluster-bot I noticed that it fails on step waiting for 2 prometheus instances.
“wait for prometheus-k8s: expected 2 Prometheus instances but got: 1: timed out waiting for the condition”
Since HyperShift clusters from cluster-bot have a single worker node, the test will always fail, because main_test.go checks that there should always be 2 instances.
Ideally we need to check the infrastructureTopology field and adjust the test if the infrastructure is “SingleReplica”.
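For reference, the topology can be read from the cluster-scoped Infrastructure object, for example:
$ oc get infrastructure cluster -o jsonpath='{.status.infrastructureTopology}'
This returns "SingleReplica" on single-worker clusters and "HighlyAvailable" otherwise, so the expected Prometheus replica count in the test could be derived from that value.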
Description of problem:
ControlPlaneMachineSet Machines are considered Ready once the underlying MAPI machine is Running. This should not be a sufficient condition, as the Node linked to that Machine should also be Ready for the overall CPMS Machine to be considered Ready.
Version-Release number of selected component (if applicable):
4.13
How reproducible:
Always
Please review the following PR: https://github.com/openshift/cluster-monitoring-operator/pull/1914
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
oc debug fails with the error "container "container-00" in pod "xiyuan24-f3-h4264-master-0-debug" is waiting to start: ContainerCreating". The error happens when run via automation; running it locally does not have this issue, and when extra time is added around the command in the automation script it works fine without any issues.
Version-Release number of selected component (if applicable):
03-24 17:57:54.649 [12:27:48] INFO> Shell Commands: oc version -o yaml --client --kubeconfig=/tmp/kubeconfig20230324-374-gt1vvm 03-24 17:57:54.649 clientVersion: 03-24 17:57:54.649 buildDate: "2023-03-17T23:32:35Z" 03-24 17:57:54.649 compiler: gc 03-24 17:57:54.649 gitCommit: eed143055ede731029931ad204b19cd2f565ef1a 03-24 17:57:54.649 gitTreeState: clean 03-24 17:57:54.649 gitVersion: 4.13.0-202303172327.p0.geed1430.assembly.stream-eed1430 03-24 17:57:54.649 goVersion: go1.19.4 03-24 17:57:54.649 major: "" 03-24 17:57:54.649 minor: "" 03-24 17:57:54.649 platform: linux/amd64 03-24 17:57:54.649 kustomizeVersion: v4.5.7 03-24 17:57:54.649 [12:27:49] INFO> Exit Status: 0
How reproducible:
Always
Steps to Reproduce:
1.Install latest 4.13 cluster 2. Run script https://github.com/openshift/verification-tests/blob/master/features/upgrade/security_compliance/fips.feature#L66
Actual results:
Test fails with error mentioned in the description
Expected results:
Test should not fail
Additional info:
Adding a link to the conversation which i had with maciej about this issue https://redhat-internal.slack.com/archives/GK58XC2G2/p1679655589922729 Run log with --loglevel=9 -> https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Runner/770180/console
Seen in 4.13.0-rc.2, mcc_drain_err is being served for nodes that have been deleted, causing un-actionable MCDDrainError.
At least 4.13.0-rc.2. Further exposure unclear.
At least four nodes on build01. Possibly all nodes that are removed while suffering from drain issues on 4.13.0-rc.2.
Unclear.
The machine-config controller continues to serve mcc_drain_err for the removed nodes.
The machine-config controller never serves mcc_drain_err for non-existent nodes.
Description of problem:
Bump Kubernetes to 0.27.1 and bump dependencies
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Backport support for the new GCP region europe-west12, starting in 4.12.z.
Version-Release number of selected component (if applicable):
4.12.z and 4.13.z
How reproducible:
Always
Steps to Reproduce:
1. Use openshift-install to deploy OCP in europe-west12
Actual results:
europe-west12 is not available as a supported region in the user survey
Expected results:
europe-west12 to be available as a supported region in the user survey
Additional info:
Description of problem:
On clusters without the TechPreview feature set enabled, machines are failing to delete due to an attempt to list an IPAM that is not installed.
Version-Release number of selected component (if applicable):
4.14 nightly
How reproducible:
consistently
Steps to Reproduce:
1. Create a platform vSphere cluster 2. Scale down a machine
Actual results:
Machine fails to delete
Expected results:
Machine should delete
Additional info:
Fails with unable to list IPAddressClaims: failed to get API group resources: unable to retrieve the complete list of server APIs: ipam.cluster.x-k8s.io/v1alpha1: the server could not find the requested resource
Description of problem:
After the installation of a cluster based on the agent installer ISO is completed, the assisted-installer-controller job remains up
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Generate a valid ISO image using the agent installer. All kinds of topologies (compact/ha/sno) and configurations are affected by this problem
Steps to Reproduce:
1. 2. 3.
Actual results:
$ oc get jobs -n assisted-installer
NAME                            COMPLETIONS   DURATION   AGE
assisted-installer-controller   0/1           102m       102m
Expected results:
oc get jobs -n assisted-installer should not return any job
Additional info:
It looks like the assisted-installer-controller has been designed assuming that Assisted Service (AS) was always available and reachable. This is not necessarily true when using the agent installer, since the AS initially running on the rendezvous node will not be available after the node is rebooted. The assisted-installer-controller performs a number of different tasks internally, and from the logs not all of them complete successfully (a condition to terminate the job). It could be useful to perform deeper troubleshooting on the ApproveCsrs task, as it is one that does not terminate properly.
Observing CI Hypershift failures in 4.14.0-0.ci-2023-06-16-074926
Payload includes image-registry/pull/370 which is the current suspected source of the regression
Please review the following PR: https://github.com/openshift/cluster-kube-scheduler-operator/pull/478
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Viewing OperatorHub details page will return error page
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2023-03-28-180259
How reproducible:
Always on Hypershift Guest cluster
Steps to Reproduce:
1. Visit OperatorHub details page via Administration -> Cluster Settings -> Configuration -> OperatorHub 2. 3.
Actual results:
Cannot read properties of undefined (reading 'sources')
Expected results:
page can be loaded successfully
Additional info:
screenshot one: https://drive.google.com/file/d/12cgpChKYuen2v6DWvmMrir273wONo5oY/view?usp=share_link screenshot two: https://drive.google.com/file/d/1vVsczu7ScIqznoKNsR8V0w4k9bF1xWhB/view?usp=share_link
Description of problem:
When a (recommended/conditional) release image is provided with --to-image='', the specified image name is not preserved in the ClusterVersion object.
Version-Release number of selected component (if applicable):
How reproducible:
100% with oc >4.9
Steps to Reproduce:
$ oc version Client Version: 4.12.2 Kustomize Version: v4.5.7 Server Version: 4.12.2 Kubernetes Version: v1.25.4+a34b9e9 $ oc get clusterversion/version -o jsonpath='{.status.desired}'|jq { "channels": [ "candidate-4.12", "candidate-4.13", "eus-4.12", "fast-4.12", "stable-4.12" ], "image": "quay.io/openshift-release-dev/ocp-release@sha256:31c7741fc7bb73ff752ba43f5acf014b8fadd69196fc522241302de918066cb1", "url": "https://access.redhat.com/errata/RHSA-2023:0569", "version": "4.12.2" } $ oc adm release info 4.12.3 -o jsonpath='{.image}' quay.io/openshift-release-dev/ocp-release@sha256:382f271581b9b907484d552bd145e9a5678e9366330059d31b007f4445d99e36 $ skopeo copy docker://quay.io/openshift-release-dev/ocp-release@sha256:382f271581b9b907484d552bd145e9a5678e9366330059d31b007f4445d99e36 docker://quay.example.com/playground/release-images Getting image source signatures Copying blob 64096b96a7b0 done Copying blob 0e0550faf8e0 done Copying blob 97da74cc6d8f skipped: already exists Copying blob d8190195889e skipped: already exists Copying blob 17997438bedb done Copying blob fdbb043b48dc done Copying config b49bc8b603 done Writing manifest to image destination Storing signatures $ skopeo inspect docker://quay.example.com/playground/release-images@sha256:382f271581b9b907484d552bd145e9a5678e9366330059d31b007f4445d99e36|jq '.Name,.Digest' "quay.example.com/playground/release-images" "sha256:382f271581b9b907484d552bd145e9a5678e9366330059d31b007f4445d99e36" $ oc adm upgrade --to-image=quay.example.com/playground/release-images@sha256:382f271581b9b907484d552bd145e9a5678e9366330059d31b007f4445d99e36 Requesting update to 4.12.3
Actual results:
$ oc get clusterversion/version -o jsonpath='{.status.desired}'|jq { "channels": [ "candidate-4.12", "candidate-4.13", "eus-4.12", "fast-4.12", "stable-4.12" ], "image": "quay.io/openshift-release-dev/ocp-release@sha256:382f271581b9b907484d552bd145e9a5678e9366330059d31b007f4445d99e36", <--- not quay.example.com "url": "https://access.redhat.com/errata/RHSA-2023:0728", "version": "4.12.3" } $ oc get clusterversion/version -o jsonpath='{.status.history}'|jq [ { "completionTime": null, "image": "quay.io/openshift-release-dev/ocp-release@sha256:382f271581b9b907484d552bd145e9a5678e9366330059d31b007f4445d99e36", <--- not quay.example.com "startedTime": "2023-04-28T07:39:11Z", "state": "Partial", "verified": true, "version": "4.12.3" }, { "completionTime": "2023-04-27T14:48:06Z", "image": "quay.io/openshift-release-dev/ocp-release@sha256:31c7741fc7bb73ff752ba43f5acf014b8fadd69196fc522241302de918066cb1", "startedTime": "2023-04-27T14:24:29Z", "state": "Completed", "verified": false, "version": "4.12.2" } ]
Expected results:
$ oc get clusterversion/version -o jsonpath='{.status.desired}'|jq { "channels": [ "candidate-4.12", "candidate-4.13", "eus-4.12", "fast-4.12", "stable-4.12" ], "image": "quay.example.com/playground/release-images@sha256:382f271581b9b907484d552bd145e9a5678e9366330059d31b007f4445d99e36 ", "url": "https://access.redhat.com/errata/RHSA-2023:0728", "version": "4.12.3" }$ oc get clusterversion/version -o jsonpath='{.status.history}'|jq [ { "completionTime": null, "image": "quay.example.com/playground/release-images@sha256:382f271581b9b907484d552bd145e9a5678e9366330059d31b007f4445d99e36 ", "startedTime": "2023-04-28T07:39:11Z", "state": "Partial", "verified": true, "version": "4.12.3" }, { "completionTime": "2023-04-27T14:48:06Z", "image": "quay.io/openshift-release-dev/ocp-release@sha256:31c7741fc7bb73ff752ba43f5acf014b8fadd69196fc522241302de918066cb1", "startedTime": "2023-04-27T14:24:29Z", "state": "Completed", "verified": false, "version": "4.12.2" } ]
Additional info:
While in earlier versions (<4.10) we used to preserve the specified image [1], we now (as of 4.10) store the public image as the desired version [2]. [1] https://github.com/openshift/oc/blob/88cfeb4aa2d74ee5f5598c571661622c0034081b/pkg/cli/admin/upgrade/upgrade.go#L278 [2] https://github.com/openshift/oc/blob/5711859fac135177edf07161615bdabe3527e659/pkg/cli/admin/upgrade/upgrade.go#L278
Description of the problem:
Proliant Gen 11 always reports the serial number "PCA_number.ACC", causing all hosts to register with the same UUID.
How reproducible:
100%
Steps to reproduce:
1. Boot two Proliant Gen 11 hosts
2. See that both hosts are updating a single host entry in the service
Actual results:
All hosts with this hardware are assigned the same UUID
Expected results:
Each host should have a unique UUID
Description of problem:
Hypershift does not utilize existing liveness and readiness probes on openshift-route-controller-manager and openshift-controller-manager.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1.Create OCP cluster using Hypershift 2.Look at openshift-route-controller-manager and openshift-controller-manager yaml manifests
Actual results:
No probes defined for pods of those two deployments
Expected results:
Probes should be defined because the services implement them
Additional info:
This is the result of a security review for 4.12 Hypershift, original investigation can be found https://github.ibm.com/alchemy-containers/armada-update/issues/4117#issuecomment-53149378
Description of problem:
Any FBC enabled OLM Catalog displays the Channels in a random order.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create a catalog source for icr.io/cpopen/ibm-operator-catalog:latest 2. Navigate to OperatorHub 3. Click on the `ibm-mq` operator 4. Click on the Install button.
Actual results:
The list of channels is in random order. The order changes with each page refresh.
Expected results:
The list of channels should be in lexicographical ascending order as it was for SQLITE based catalogs.
Additional info:
See related operator-registry upstream issue: https://github.com/operator-framework/operator-registry/issues/1069#top Note: I think both `operator-registry` and the OperatorHub should provide deterministic sorting of these channels.
New regions are added all the time, so it's best to keep it up-to-date.
The goal is to collect metrics about the number of LIST and WATCH requests to the apiserver, because this will allow us to measure the deployment progress of the API streaming feature. The new feature will replace the use of LIST requests with WATCH.
apiserver_list_watch_request_total:rate:sum
apiserver_list_watch_request_total:rate:sum represents the rate of change for the LIST and WATCH requests over a 5 minute period.
Labels
The cardinality of the metric is at most 2.
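For illustration only, a minimal sketch of what a recording rule for this metric could look like; the source metric, label set, and rule placement are assumptions, not the actual cluster-monitoring-operator rule:

- record: apiserver_list_watch_request_total:rate:sum
  expr: sum by (verb) (rate(apiserver_request_total{verb=~"LIST|WATCH"}[5m]))

With only the verb label kept (values LIST and WATCH), the recorded series stays at a cardinality of at most 2, matching the description above.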
Description of problem:
This is a clone for https://issues.redhat.com/browse/CNV-26608
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Update to use Jenkins 4.13 images to address CVEs
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The info message below the "Git access token" field for creating the Pipelines Repository under the Pipelines section on the Import from Git page falls back to the default text instead of showing the curated message for each Git provider. The info messages are curated for each of the Git providers when the Repository is created from the Pipelines page.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Go to the Import from Git Page 2. Add a Git URL with PAC ( https://github.com/Lucifergene/oc-pipe ) 3. Check the text under the "Git access token" Field
Actual results:
Use your Git Personal token. Create a token with repo, public_repo & admin:repo_hook scopes and give your token an expiration, i.e 30d.
Expected results:
Use your GitHub Personal token. Use this link to create a token with repo, public_repo & admin:repo_hook scopes and give your token an expiration, i.e 30d.
Additional info:
This issue has been reported multiple times over the years with no resolution
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.13-e2e-vsphere-zones/1655633815252504576
kubeconfig received!
waiting for api to be available
level=error
level=error msg=Error: failed to parse ovf: failed to parse ovf: XML syntax error on line 1: illegal character code U+0000
level=error
level=error msg= with vsphereprivate_import_ova.import[0],
level=error msg= on main.tf line 70, in resource "vsphereprivate_import_ova" "import":
level=error msg= 70: resource "vsphereprivate_import_ova" "import" {
level=error
level=error
level=error msg=Error: failed to parse ovf: failed to parse ovf: XML syntax error on line 1: illegal character code U+0000
https://issues.redhat.com/browse/OCPQE-13219
https://issues.redhat.com/browse/TRT-741
Description of problem:
On OpenShift Container Platform, the etcd Pod is showing messages like the following: 2023-06-19T09:10:30.817918145Z {"level":"warn","ts":"2023-06-19T09:10:30.817Z","caller":"fileutil/purge.go:72","msg":"failed to lock file","path":"/var/lib/etcd/member/wal/000000000000bc4b-00000000183620a4.wal","error":"fileutil: file already locked"} This is described in KCS https://access.redhat.com/solutions/7000327
Version-Release number of selected component (if applicable):
any currently supported version (> 4.10) running with 3.5.x
How reproducible:
always
Steps to Reproduce:
happens after running etcd for a while
This has been discussed in https://github.com/etcd-io/etcd/issues/15360
It's not a harmful error message, it merely indicates that some WALs have not been included in snapshots yet.
This was caused by changing default numbers: https://github.com/etcd-io/etcd/issues/13889
This was fixed in https://github.com/etcd-io/etcd/pull/15408/files but never backported to 3.5.
To mitigate that error and stop confusing people, we should also supply that argument when starting etcd in: https://github.com/openshift/cluster-etcd-operator/blob/master/bindata/etcd/pod.yaml#L170-L187
That way we're not surprised by changes of the default values upstream.
Description of problem:
The agent-tui should show only before the installation, but it shows again during the installation, and when it quits again the installation fails to go on.
Version-Release number of selected component (if applicable):
4.13.0-0.ci-2023-03-14-045458
How reproducible:
always
Steps to Reproduce:
1. Make sure the primary check pass, and boot the agent.x86_64.iso file, we can see the agent-tui show before the installation 2. Tracking installation by both wait-for output and console output 3. The agent-tui show again during the installation, wait for the agent-tui quit automatically without any user interruption, the installation quit with failure, and we have the following wait-for output: DEBUG asset directory: . DEBUG Loading Agent Config... ... DEBUG Agent Rest API never initialized. Bootstrap Kube API never initialized INFO Waiting for cluster install to initialize. Sleeping for 30 seconds DEBUG Agent Rest API Initialized INFO Cluster is not ready for install. Check validations DEBUG Cluster validation: The pull secret is set. WARNING Cluster validation: The cluster has hosts that are not ready to install. DEBUG Cluster validation: The cluster has the exact amount of dedicated control plane nodes. DEBUG Cluster validation: API virtual IPs are not required: User Managed Networking DEBUG Cluster validation: API virtual IPs are not required: User Managed Networking DEBUG Cluster validation: The Cluster Network CIDR is defined. DEBUG Cluster validation: The base domain is defined. DEBUG Cluster validation: Ingress virtual IPs are not required: User Managed Networking DEBUG Cluster validation: Ingress virtual IPs are not required: User Managed Networking DEBUG Cluster validation: The Machine Network CIDR is defined. DEBUG Cluster validation: The Cluster Machine CIDR is not required: User Managed Networking DEBUG Cluster validation: The Cluster Network prefix is valid. DEBUG Cluster validation: The cluster has a valid network type DEBUG Cluster validation: Same address families for all networks. DEBUG Cluster validation: No CIDRS are overlapping. DEBUG Cluster validation: No ntp problems found DEBUG Cluster validation: The Service Network CIDR is defined. 
DEBUG Cluster validation: cnv is disabled DEBUG Cluster validation: lso is disabled DEBUG Cluster validation: lvm is disabled DEBUG Cluster validation: odf is disabled DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Valid inventory exists for the host DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Sufficient CPU cores DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Sufficient minimum RAM DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Sufficient disk capacity DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Sufficient CPU cores for role master DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Sufficient RAM for role master DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Hostname openshift-qe-049.arm.eng.rdu2.redhat.com is unique in cluster DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Hostname openshift-qe-049.arm.eng.rdu2.redhat.com is allowed DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Speed of installation disk has not yet been measured DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Host is compatible with cluster platform none DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: VSphere disk.EnableUUID is enabled for this virtual machine DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Host agent compatibility checking is disabled DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: No request to skip formatting of the installation disk DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: All disks that have skipped formatting are present in the host inventory DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Host is connected DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Media device is connected DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: No Machine Network CIDR needed: User Managed Networking DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Host belongs to all machine network CIDRs DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Host has connectivity to the majority of hosts in the cluster DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Platform PowerEdge R740 is allowed WARNING Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Host couldn't synchronize with any NTP server DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Host clock is synchronized with service DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: All required container images were either pulled successfully or no attempt was made to pull them DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Network latency requirement has been satisfied. DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Packet loss requirement has been satisfied. DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Host has been configured with at least one default route. 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Domain name resolution for the api.zniusno.arm.eng.rdu2.redhat.com domain was successful or not required DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Domain name resolution for the api-int.zniusno.arm.eng.rdu2.redhat.com domain was successful or not required DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Domain name resolution for the *.apps.zniusno.arm.eng.rdu2.redhat.com domain was successful or not required DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Host subnets are not overlapping DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: No IP collisions were detected by host 7a9649d8-4167-a1f9-ad5f-385c052e2744 DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: cnv is disabled DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: lso is disabled DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: lvm is disabled DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: odf is disabled WARNING Host openshift-qe-049.arm.eng.rdu2.redhat.com: updated status from discovering to insufficient (Host cannot be installed due to following failing validation(s): Host couldn't synchronize with any NTP server) INFO Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Host NTP is synced INFO Host openshift-qe-049.arm.eng.rdu2.redhat.com: updated status from insufficient to known (Host is ready to be installed) INFO Cluster is ready for install INFO Cluster validation: All hosts in the cluster are ready to install. INFO Preparing cluster for installation INFO Host openshift-qe-049.arm.eng.rdu2.redhat.com: updated status from known to preparing-for-installation (Host finished successfully to prepare for installation) INFO Host openshift-qe-049.arm.eng.rdu2.redhat.com: New image status registry.ci.openshift.org/ocp/4.13-2023-03-14-045458@sha256:b0d518907841eb35adbc05962d4b2e7d45abc90baebc5a82d0398e1113ec04d0. result: success. time: 1.35 seconds; size: 401.45 Megabytes; download rate: 312.54 MBps INFO Host openshift-qe-049.arm.eng.rdu2.redhat.com: updated status from preparing-for-installation to preparing-successful (Host finished successfully to prepare for installation) INFO Cluster installation in progress INFO Host openshift-qe-049.arm.eng.rdu2.redhat.com: updated status from preparing-successful to installing (Installation is in progress) INFO Host: openshift-qe-049.arm.eng.rdu2.redhat.com, reached installation stage Starting installation: bootstrap INFO Host: openshift-qe-049.arm.eng.rdu2.redhat.com, reached installation stage Installing: bootstrap INFO Host: openshift-qe-049.arm.eng.rdu2.redhat.com, reached installation stage Failed: failed executing nsenter [--target 1 --cgroup --mount --ipc --pid -- podman run --net host --pid=host --volume /:/rootfs:rw --volume /usr/bin/rpm-ostree:/usr/bin/rpm-ostree --privileged --entrypoint /usr/bin/machine-config-daemon registry.ci.openshift.org/ocp/4.13-2023-03-14-045458@sha256:f85a278868035dc0a40a66ea7eaf0877624ef9fde9fc8df1633dc5d6d1ad4e39 start --node-name localhost --root-mount /rootfs --once-from /opt/install-dir/bootstrap.ign --skip-reboot], Error exit status 255, LastOutput "... to initialize single run daemon: error initializing rpm-ostree: Error while ensuring access to kublet config.json pull secrets: symlink /var/lib/kubelet/config.json /run/ostree/auth.json: file exists" INFO Cluster has hosts in error INFO cluster has stopped installing... 
working to recover installation INFO cluster has stopped installing... working to recover installation INFO cluster has stopped installing... working to recover installation INFO cluster has stopped installing... working to recover installation INFO cluster has stopped installing... working to recover installation INFO cluster has stopped installing... working to recover installation INFO cluster has stopped installing... working to recover installation INFO cluster has stopped installing... working to recover installation 4. During the installation, we had NetworkManager-wait-online.service for a while: -- Logs begin at Wed 2023-03-15 03:06:29 UTC, end at Wed 2023-03-15 03:27:30 UTC. -- Mar 15 03:18:52 openshift-qe-049.arm.eng.rdu2.redhat.com systemd[1]: Starting Network Manager Wait Online... Mar 15 03:19:55 openshift-qe-049.arm.eng.rdu2.redhat.com systemd[1]: NetworkManager-wait-online.service: Main process exited, code=exited, status=1/FAILURE Mar 15 03:19:55 openshift-qe-049.arm.eng.rdu2.redhat.com systemd[1]: NetworkManager-wait-online.service: Failed with result 'exit-code'. Mar 15 03:19:55 openshift-qe-049.arm.eng.rdu2.redhat.com systemd[1]: Failed to start Network Manager Wait Online.
Expected results:
The TUI should only show once before the installation.
Description of problem:
The following tests broke the payload for CI and nightly [sig-network][Feature:MultiNetworkPolicy][Serial] should enforce a network policies on secondary network IPv6 [Suite:openshift/conformance/serial] [sig-network][Feature:MultiNetworkPolicy][Serial] should enforce a network policies on secondary network IPv4 [Suite:openshift/conformance/serial]
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Test Panicked: runtime error: invalid memory address or nil pointer dereference
Expected results:
Additional info:
Original PR that broke the payload https://github.com/openshift/origin/pull/27795 Revert to get payloads back to normal https://github.com/openshift/origin/pull/27926 Broken payloads and related jobs and sippy link for additional info https://amd64.ocp.releases.ci.openshift.org/releasestream/4.14.0-0.ci/release/4.14.0-0.ci-2023-05-17-212447 https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-e2e-aws-sdn-serial/1659065324743430144 https://amd64.ocp.releases.ci.openshift.org/releasestream/4.14.0-0.nightly/release/4.14.0-0.nightly-2023-05-18-040905 https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-sdn-serial/1659088328617627648 https://sippy.dptools.openshift.org/sippy-ng/tests/4.14?filters=%257B%2522items%2522%253A%255B%257B%2522columnField%2522%253A%2522current_runs%2522%252C%2522operatorValue%2522%253A%2522%253E%253D%2522%252C%2522value%2522%253A%25227%2522%257D%252C%257B%2522columnField%2522%253A%2522variants%2522%252C%2522not%2522%253Atrue%252C%2522operatorValue%2522%253A%2522contains%2522%252C%2522value%2522%253A%2522never-stable%2522%257D%252C%257B%2522columnField%2522%253A%2522variants%2522%252C%2522not%2522%253Atrue%252C%2522operatorValue%2522%253A%2522contains%2522%252C%2522value%2522%253A%2522aggregated%2522%257D%252C%257B%2522id%2522%253A99%252C%2522columnField%2522%253A%2522name%2522%252C%2522operatorValue%2522%253A%2522contains%2522%252C%2522value%2522%253A%2522%255Bsig-network%255D%255BFeature%253AMultiNetworkPolicy%255D%255BSerial%255D%2520should%2520enforce%2520a%2520network%2520policies%2520on%2520secondary%2520network%2520IPv6%2520%255BSuite%253Aopenshift%252Fconformance%252Fserial%255D%2522%257D%255D%252C%2522linkOperator%2522%253A%2522and%2522%257D&sort=asc&sortField=current_working_percentage
Description of problem:
2023-02-20T16:27:58.107800612Z + oc observe pods -n openshift-sdn --listen-addr= -l app=sdn -a '{ .status.hostIP }' -- /var/run/add_iptables.sh 2023-02-20T16:27:58.181727766Z Flag --argument has been deprecated, and will be removed in a future release. Use --template instead.
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2023-02-17-090603
How reproducible:
Always
Steps to Reproduce:
1. Deploy Azure OpenShiftSDN cluster 2. Check drop-icmp container logs oc logs -n openshift-sdn -c drop-icmp -l app=sdn --previous 3.
Actual results:
+ true + iptables -F AZURE_ICMP_ACTION + iptables -A AZURE_ICMP_ACTION -j LOG + iptables -A AZURE_ICMP_ACTION -j DROP + oc observe pods -n openshift-sdn --listen-addr= -l app=sdn -a '{ .status.hostIP }' -- /var/run/add_iptables.sh Flag --argument has been deprecated, and will be removed in a future release. Use --template instead. E0220 16:27:07.553592 27842 memcache.go:238] couldn't get current server API group list: Get "https://172.30.0.1:443/api?timeout=32s": dial tcp 172.30.0.1:443: connect: connection refused E0220 16:27:07.553913 27842 memcache.go:238] couldn't get current server API group list: Get "https://172.30.0.1:443/api?timeout=32s": dial tcp 172.30.0.1:443: connect: connection refused The connection to the server 172.30.0.1:443 was refused - did you specify the right host or port? Error from server (BadRequest): previous terminated container "drop-icmp" in pod "sdn-v7gqq" not found
Expected results:
No deprecation warning
Additional info:
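For reference, a likely replacement invocation (an untested sketch that simply substitutes the deprecated -a/--argument flag with --template, as the warning suggests; template syntax compatibility is assumed):
oc observe pods -n openshift-sdn --listen-addr= -l app=sdn --template='{ .status.hostIP }' -- /var/run/add_iptables.sh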
Description of problem:
In the web console Administrator view, the items under "Observe" in the side navigation menu are duplicated.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is happening because those menu items are now provided by the `monitoring-plugin` dynamic plugin, so we need to remove them from the web console codebase.
Description of problem:
1. CR.status.LastSyncTimestamp should also be updated in the "else" code branch: https://github.com/openshift/cloud-credential-operator/blob/4cb9faca62c31ebea9a11b55f7af764be4ee2cd8/pkg/operator/credentialsrequest/credentialsrequest_controller.go#L1054
2. r.Client.Status().Update is not called on the CR object in memory after this line: https://github.com/openshift/cloud-credential-operator/blob/4cb9faca62c31ebea9a11b55f7af764be4ee2cd8/pkg/operator/credentialsrequest/credentialsrequest_controller.go#L713 So CR.status.conditions are not updated.
Steps to Reproduce:
This results from a static code check.
Please review the following PR: https://github.com/openshift/cluster-api-provider-alibaba/pull/41
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
After upgrading a cluster from 4.10.47 to 4.11.25, an issue is observed with the egress router pod; the pods are stuck in Pending state.
Version-Release number of selected component (if applicable):
4.11.25
How reproducible:
Steps to Reproduce:
1. Upgrade from 4.10.47 to 4.11.25 2. Check if co network is in Managed state 3. Verify that egress pods are not created with errors like : 55s Warning FailedCreatePodSandBox pod/****** (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox *******_d6918859-a4e9-4e5b-ba44-acc70499fa7c_0(9c464935ebaeeeab7be0b056c3f7ed1b7279e21445b9febea29eb280f7ee7429): error adding pod ****** to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [ns/pod/d6918859-a4e9-4e5b-ba44-acc70499fa7c:openshift-sdn]: error adding container to network "openshift-sdn": CNI request failed with status 400: 'could not open netns "/var/run/netns/503fb77f-3b96-4f23-8356-43e7ae1e1b49": unknown FS magic on "/var/run/netns/503fb77f-3b96-4f23-8356-43e7ae1e1b49": 1021994
Actual results:
Egress router pods in pending state with error message as below: $ omg get events ... 49s Warning FailedCreatePodSandBox pod/xxxx (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_xxxx_379fa7ec-4702-446c-9162-55c2f76989f6_0(86f8c76e9724216143bef024996cb14a7614d3902dcf0d3b7ea858298766630c): error adding pod xxx to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [xxxx/xxxx/379fa7ec-4702-446c-9162-55c2f76989f6:openshift-sdn]: error adding container to network "openshift-sdn": CNI request failed with status 400: 'could not open netns "/var/run/netns/0d39f378-29fd-4858-a947-51c5c06f1598": unknown FS magic on "/var/run/netns/0d39f378-29fd-4858-a947-51c5c06f1598": 1021994
Expected results:
Egress router pods in running state
Additional info:
Workaround from https://access.redhat.com/solutions/6986283 works: edit the sdn DaemonSet in the openshift-sdn namespace:
- mountPath: /host/var/run/netns   <<<<< /var/run/netns
  mountPropagation: HostToContainer
  name: host-run-netns
  readOnly: true
Dependencies for the ironic containers are quite old; we need to upgrade them to the latest available to keep up with upstream requirements.
Description of problem:
Placeholder bug to backport common latency failures
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Error message seen during testing: 2023-03-23T22:33:02.507Z ERROR operator.dns_controller dns/controller.go:348 failed to publish DNS record to zone {"record": {"dnsName":"*.example.com","targets":["34.67.189.132"],"recordType":"A","recordTTL":30,"dnsManagementPolicy":"Managed"}, "dnszone": {"id":"ci-ln-95xvtb2-72292-9jj4w-private-zone"}, "error": "googleapi: Error 400: Invalid value for 'entity.change.additions[*.example.com][A].name': '*.example.com', invalid"}
Version-Release number of selected component (if applicable):
4.13
How reproducible:
Steps to Reproduce:
1. Setup 4.13 gcp cluster, install OSSM using http://pastebin.test.redhat.com/1092754 2. Run gateway api e2e against cluster (or create gateway with listener hostname *.example.com) 3. Check ingress operator logs
Actual results:
DNS record not published, and a continuous error in the log.
Expected results:
Should publish DNS record to zone without errors
Additional info:
Miciah: The controller should check ManageDNSForDomain when calling EnsureDNSRecord.
Description of the problem:
vSphere vCenter cluster field is missing description
How reproducible:
always
Steps to reproduce:
1. install OCP on vSphere platform
2. Go to Overview -> vSphere, configure
Actual results:
vCenter cluster field is missing description
Expected results:
Description is present
Please review the following PR: https://github.com/operator-framework/operator-marketplace/pull/515
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
In case the [appsDomain|https://docs.openshift.com/container-platform/4.13/networking/ingress-operator.html#nw-ingress-configuring-application-domain_configuring-ingress] is specified and a cluster-admin is deleting accidentally all routes on a cluster, the route canary in the namespace openshift-ingress-canary is created with the domain specified in the .spec.appsDomain instead of .spec.domain of the definition in Ingress.config.openshift.io. Additionally the docs are a bit confusing. On one page (https://docs.openshift.com/container-platform/4.13/networking/ingress-operator.html#nw-ingress-configuring-application-domain_configuring-ingress) it's defined as {code:none} As a cluster administrator, you can specify an alternative to the default cluster domain for user-created routes by configuring the appsDomain field. The appsDomain field is an optional domain for OpenShift Container Platform to use instead of the default, which is specified in the domain field. If you specify an alternative domain, it overrides the default cluster domain for the purpose of determining the default host for a new route. For example, you can use the DNS domain for your company as the default domain for routes and ingresses for applications running on your cluster.
In the API spec (https://docs.openshift.com/container-platform/4.11/rest_api/config_apis/ingress-config-openshift-io-v1.html#spec) the correct behaviour is explained
appsDomain is an optional domain to use instead of the one specified in the domain field when a Route is created without specifying an explicit host. If appsDomain is nonempty, this value is used to generate default host values for Route. Unlike domain, appsDomain may be modified after installation. This assumes a new ingresscontroller has been setup with a wildcard certificate.
It would be nice if the wording could be adjusted, as `you can specify an alternative to the default cluster domain for user-created routes by configuring` does not fit well: more or less all newly created routes (operator-created and so on) get created with the appsDomain.
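As a concrete illustration of the two fields (domain values are examples only):
  apiVersion: config.openshift.io/v1
  kind: Ingress
  metadata:
    name: cluster
  spec:
    domain: apps.example-cluster.example.com   # set at installation time; the canary route should use this
    appsDomain: apps.mycompany.example.com     # optional override; with this bug the recreated canary route uses it instead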
Version-Release number of selected component (if applicable):
OpenShift 4.12.22
How reproducible:
see steps below
Steps to Reproduce:
1. Install OpenShift 2. define .spec.appsDomain in Ingress.config.openshift.io 3. oc delete route canary -n openshift-ingress-canary 4. wait some seconds to get the route recreated and check cluster-operator
Actual results:
Ingress Operator degraded and route recreated with wrong domain (.spec.appsDomain)
Expected results:
Ingress Operator not degraded and route recreated with the correct domain (.spec.domain)
Additional info:
Please see screenshot
Description of problem:
The PowerVS installer will have code which creates a new service instance during installation. Therefore, we need to delete that service instance upon cluster deletion.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Create cluster 2. Delete cluster
Actual results:
Expected results:
No leftover service instance
Additional info:
DoD:
Add more conditions to hypershift_hostedclusters_failure_conditions so metrics provide more info
Description of problem:
This may be something we want to either add a validation for or document. It was initially found at a customer site but I've also confirmed it happens with just a Compact config with no workers. They created an agent-config.yaml with 2 worker nodes but did not set the replicas in install-config.yaml, i.e. they did not set compute: - hyperthreading: Enabled name: worker replicas: {{ num_workers }} This resulted in an install failure as by default 3 worker replicas are created if not defined https://github.com/openshift/installer/blob/master/pkg/types/defaults/machinepools.go#L11 See the attached console screenshot showing that the expected number of hosts doesn't match the actual. I've also duplicated this with a compact config. We can see that the install failed as start-cluster-installation.sh is looking for 6 hosts. [core@master-0 ~]$ sudo systemctl status start-cluster-installation.service ● start-cluster-installation.service - Service that starts cluster installation Loaded: loaded (/etc/systemd/system/start-cluster-installation.service; enabled; vendor preset: enabled) Active: activating (start) since Wed 2023-03-15 14:40:04 UTC; 3min 41s ago Main PID: 3365 (start-cluster-i) Tasks: 5 (limit: 101736) Memory: 1.7M CGroup: /system.slice/start-cluster-installation.service ├─3365 /bin/bash /usr/local/bin/start-cluster-installation.sh ├─5124 /bin/bash /usr/local/bin/start-cluster-installation.sh ├─5132 /bin/bash /usr/local/bin/start-cluster-installation.sh └─5138 diff /tmp/tmp.vIq1jH9Vf2 /etc/issue.d/90_start-install.issueMar 15 14:42:54 master-0 start-cluster-installation.sh[3365]: Waiting for hosts to become ready for cluster installation... Mar 15 14:43:04 master-0 start-cluster-installation.sh[4746]: Hosts known and ready for cluster installation (3/6) Mar 15 14:43:04 master-0 start-cluster-installation.sh[3365]: Waiting for hosts to become ready for cluster installation... Mar 15 14:43:15 master-0 start-cluster-installation.sh[4980]: Hosts known and ready for cluster installation (3/6) Mar 15 14:43:15 master-0 start-cluster-installation.sh[3365]: Waiting for hosts to become ready for cluster installation... Mar 15 14:43:25 master-0 start-cluster-installation.sh[5026]: Hosts known and ready for cluster installation (3/6) Mar 15 14:43:25 master-0 start-cluster-installation.sh[3365]: Waiting for hosts to become ready for cluster installation... Mar 15 14:43:35 master-0 start-cluster-installation.sh[5079]: Hosts known and ready for cluster installation (3/6) Mar 15 14:43:35 master-0 start-cluster-installation.sh[3365]: Waiting for hosts to become ready for cluster installation... Mar 15 14:43:45 master-0 start-cluster-installation.sh[5124]: Hosts known and ready for cluster installation (3/6) Since the compute section in install-config.yaml is optional we can't assume that it will be there https://github.com/openshift/installer/blob/master/pkg/types/installconfig.go#L126
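For illustration, an install-config.yaml compute stanza that matches a 2-worker agent-config.yaml would look like this (a sketch; values are examples):
  compute:
  - architecture: amd64
    hyperthreading: Enabled
    name: worker
    replicas: 2   # must match the number of worker hosts in agent-config.yaml; defaults to 3 when the section is omitted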
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Steps to Reproduce:
1. Remove the compute section from install-config.yaml 2. Do an install 3. See the failure
Actual results:
Expected results:
Additional info:
After https://issues.redhat.com//browse/HOSTEDCP-1062, the `olm-collect-profiles` CronJob pods did not get the NeedManagementKASAccessLabel label and thus fail:
# oc logs olm-collect-profiles-28171952-2v8gn
Error: Get "https://172.29.0.1:443/api?timeout=32s": dial tcp 172.29.0.1:443: i/o timeout
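A minimal sketch of the kind of fix implied here, assuming the label key behind NeedManagementKASAccessLabel is hypershift.openshift.io/need-management-kas-access (an assumption, not confirmed by this report):
  apiVersion: batch/v1
  kind: CronJob
  metadata:
    name: olm-collect-profiles
  spec:
    jobTemplate:
      spec:
        template:
          metadata:
            labels:
              # assumed label key/value; the constant referenced above is NeedManagementKASAccessLabel
              hypershift.openshift.io/need-management-kas-access: "true"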
Description of the problem:
Staging, BE v2.17.3 - Trying to install an OCP 4.13 Nutanix cluster and getting a "no ingress for host" error. Igal saw that the error is
Warning FailedScheduling 98m default-scheduler 0/5 nodes are available: 2 node(s) didn't match pod anti-affinity rules, 3 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}. preemption: 0/5 nodes are available: 2 node(s) didn't match pod anti-affinity rules, 3 Preemption is not helpful for scheduling..
Which comes from
removeUninitializedTaint := false
if cluster.Platform != nil && *cluster.Platform.Type == models.PlatformTypeVsphere {
    removeUninitializedTaint = true
}
How reproducible:
Steps to reproduce:
1.
2.
3.
Actual results:
Expected results:
Description of problem:
When deploying a whereabouts-IPAM-based additional network through the cluster-network-operator, the whereabouts-reconciler daemonset is not deployed on non-amd64 clusters due to a hard-coded nodeSelector introduced by https://github.com/openshift/cluster-network-operator/commit/be095d8c378e177d625a92aeca4e919ed0b5a14f
Version-Release number of selected component (if applicable):
4.13+
How reproducible:
Always. Tested on a connected arm64 AWS cluster using the openshift-sdn network
Steps to Reproduce:
1. oc new-project test1 2. oc patch networks.operator.openshift.io/cluster -p '{"spec":{"additionalNetworks":[{"name":"tertiary-net2","namespace":"test1","rawCNIConfig":"{\n \"cniVersion\": \"0.3.1\",\n \"name\": \"test\",\n \"type\": \"macvlan\",\n \"master\": \"bond0.100\",\n \"ipam\": {\n \"type\": \"whereabouts\",\n \"range\": \"10.10.10.0/24\"\n }\n}","type":"Raw"}],"useMultiNetworkPolicy":true}}' --type=merge 3. oc get daemonsets -n openshift-multus
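For readability, the patch in step 2 corresponds roughly to this spec on networks.operator.openshift.io/cluster (rawCNIConfig shown unescaped):
  spec:
    useMultiNetworkPolicy: true
    additionalNetworks:
    - name: tertiary-net2
      namespace: test1
      type: Raw
      rawCNIConfig: |
        {
          "cniVersion": "0.3.1",
          "name": "test",
          "type": "macvlan",
          "master": "bond0.100",
          "ipam": {
            "type": "whereabouts",
            "range": "10.10.10.0/24"
          }
        }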
Actual results:
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE whereabouts-reconciler 0 0 0 0 0 kubernetes.io/arch=amd64 7m27s
Expected results:
No kubernetes.io/arch=amd64 set, so that non-amd64 and multi-arch compute clusters can schedule the daemonset on each node, regardless of the architecture.
Additional info:
Same problem on s390x
https://github.com/openshift/hypershift/pull/2437 created a coupling between the HO and the CPO: a CPO that contains this PR crashes when deployed by an HO that does not contain it.
The reason appears to be related to the absence of the OPENSHIFT_IMG_OVERRIDES envvar on the CPO deployment.
{"level":"info","ts":"2023-06-06T16:36:21Z","logger":"setup","msg":"Using CPO image","image":"registry.ci.openshift.org/ocp/4.14-2023-06-06-102645@sha256:2d81c28856f5c0a73e55e7cb6fbc208c738fb3ca7c200cc7eb46efb40c8e10d2"} panic: runtime error: index out of range [1] with length 1 goroutine 1 [running]: github.com/openshift/hypershift/support/util.ConvertImageRegistryOverrideStringToMap({0x0, 0x0}) /hypershift/support/util/util.go:237 +0x454 main.NewStartCommand.func1(0xc000d80000, {0xc000a71180, 0x0, 0x8}) /hypershift/control-plane-operator/main.go:345 +0x2225
containers:
- args:
  - run
  - --namespace
  - $(MY_NAMESPACE)
  - --deployment-name
  - control-plane-operator
  - --metrics-addr
  - 0.0.0.0:8080
  - --enable-ci-debug-output=false
  - --registry-overrides==
  command:
  - /usr/bin/control-plane-operator
Description of problem:
Sometimes the oc-mirror command leaves large amounts of data under the /tmp dir and runs out of disk space.
Version-Release number of selected component (if applicable):
oc mirror version
4.12/4.13
How reproducible:
Always
Steps to Reproduce:
1. The exact steps are unclear, but the following logs are seen when running the oc-mirror command:
Actual results:
[root@preserve-fedora36 588]# oc-mirror --config config.yaml docker://yinzhou-133.mirror-registry.qe.gcp.devcluster.openshift.com:5000 --dest-skip-tls Checking push permissions for yinzhou-133.mirror-registry.qe.gcp.devcluster.openshift.com:5000 Creating directory: oc-mirror-workspace/src/publish Creating directory: oc-mirror-workspace/src/v2 Creating directory: oc-mirror-workspace/src/charts Creating directory: oc-mirror-workspace/src/release-signatures No metadata detected, creating new workspace The rendered catalog is invalid. Run "oc-mirror list operators --catalog CATALOG-NAME --package PACKAGE-NAME" for more information. error: error rendering new refs: render reference "registry.redhat.io/redhat/redhat-operator-index:v4.11": write /tmp/render-unpack-2866670795/tmp/cache/cache/red-hat-camel-k_latest_red-hat-camel-k-operator.v1.6.0.json: no space left on device [root@preserve-fedora36 588]# cd /tmp/ [root@preserve-fedora36 tmp]# ls imageset-catalog-registry-333402727 render-unpack-2230547823
Expected results:
The data created under /tmp should always be deleted, no matter at which stage the command fails.
Additional info:
Description of problem:
Tests like lint and vet used to be run within a container engine by default if an engine was detected, both locally and in CI. Up until now no container engine was detected in CI, so tests would run natively there. Now that the base image we use in CI has started shipping `podman`, a container engine is detected by default and tests are run within podman by default. But creating nested containers doesn't work in CI at the moment, which results in a test failure. As such we are switching the default behaviour for tests (both locally and in CI): by default no container engine is used to run tests, even if one is detected; instead tests run natively unless otherwise specified.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
We merged a change into origin to modify a test so that `/readyz` would be used as the health check path. It turns out this makes things worse because we want to use kube-proxy's health probe endpoint to monitor the node health, and kube-proxy only exposes `/healthz` which is the default path anyway. We should remove the annotation added to change the path and go back to the defaults.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
In an IPv6 environment using DHCP, it may not be possible to configure a rendezvousIP that matches the actual address. This is because by default NetworkManager uses DUID-UUIDs for Client ID in the IPv6 DHCP Soliciation (see https://datatracker.ietf.org/doc/html/rfc6355) which are machine dependent. As a result, the DHCPv6 server cannot be configured with a pre-determined Client ID/IPv6 Address pair that matches the rendezvousIP and the nodes will be assigned random IPv6 addresses from the pool of DHCP addresses. We can see the flow here (the DUID-UUID has a 00:04 prefix) DHCPSOLICIT(ostestbm) 00:04:56:d2:b1:0b:ba:ef:8c:1a:00:58:3f:ed:e5:d3:5f:85 The DHCP server therefore assigns a new address from the pool, fd2e:6f44:5dd8:c956::32 in this case: DHCPREPLY(ostestbm) fd2e:6f44:5dd8:c956::32 00:04:56:d2:b1:0b:ba:ef:8c:1a:00:58:3f:ed:e5:d3:5f:85 NetworkManager needs to be configured to use a deterministic Client ID so that a reliable Client ID/IPv6 address can be added to a DHCP server. The best way to do this is to configure NM for dhcp-duid=ll so that it uses a DUID-LL which based on the interface mac address. This is the approach taken by Baremetal IPI in https://github.com/openshift/machine-config-operator/pull/1395
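A sketch of the kind of change this implies, assuming the setting is delivered as a NetworkManager conf.d drop-in (shown here wrapped in a MachineConfig purely for illustration; for the agent ISO the file would need to be baked into the image, and the file name and path are illustrative):
  apiVersion: machineconfiguration.openshift.io/v1
  kind: MachineConfig
  metadata:
    name: 99-master-dhcp-duid                 # illustrative name
    labels:
      machineconfiguration.openshift.io/role: master
  spec:
    config:
      ignition:
        version: 3.2.0
      storage:
        files:
        - path: /etc/NetworkManager/conf.d/01-dhcp-duid.conf   # illustrative path
          mode: 0644
          contents:
            # URL-encoded "[connection]" + newline + "ipv6.dhcp-duid=ll"
            source: data:,%5Bconnection%5D%0Aipv6.dhcp-duid=ll%0A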
Version-Release number of selected component (if applicable):
4.14.0
How reproducible:
Every time
Steps to Reproduce:
1. In an IPv6 environment set up agent-config.yaml with an expected IPv6 address and create the ISO 2. It's not possible to configure the DHCP server to assign this address since the Client ID that Node0 will use is unknown 3. Boot the nodes using the created ISO. The nodes will get IPv6 addresses from the DHCP server but its not possible to access the RendezvousIP
Actual results:
Expected results:
Additional info:
It is possible, due to the way that the UI is currently implemented, that a user may be able to submit a manifest with no content.
We need to filter manifests before they are applied to ensure that any manifests that are empty (lack at least one key/value) are not applied.
A good suggested location to look at might be
https://github.com/openshift/assisted-service/blob/master/internal/ignition/ignition.go#L402-L409
Description of problem:
When installing OCP in a disconnected network that doesn't have access to the public registry, bootkube.service fails.
Version-Release number of selected component (if applicable):
from 4.14.0-0.nightly-2023-04-29-153308
How reproducible:
Always
Steps to Reproduce:
1. Prepare a VPC that doesn't have access to the Internet, set up a mirror registry inside the VPC, and set the related imageContentSources in the install-config 2. Start the installation 3.
Actual results:
Failed when provisioning masters as it couldn’t get master ignition from bootstrap May 04 07:31:56 maxu-az-dis-6d74v-bootstrap bootkube.sh[246724]: error: unable to read image registry.ci.openshift.org/ocp/release@sha256:227a73d8ff198a55ca0d3314d8fa94835d90769981d1c951ac741b82285f99fc: Get "https://registry.ci.openshift.org/v2/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers) May 04 07:31:56 maxu-az-dis-6d74v-bootstrap systemd[1]: bootkube.service: Main process exited, code=exited, status=1/FAILUREMay 04 07:31:56 maxu-az-dis-6d74v-bootstrap systemd[1]: bootkube.service: Failed with result 'exit-code'.
Expected results:
Installation succeeded.
Additional info:
In a disconnected install, we're using the ICSP/imageContentSources to pull images from the mirror registry, but bootkube.service was still trying to access the public registry. Checking the change log of bootkube.sh.template, this appears to be a regression from https://github.com/openshift/installer/pull/6990, which uses “oc adm release info -o 'jsonpath={.metadata.version}' "${RELEASE_IMAGE_DIGEST}"” to get the current OCP version in this scenario.
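For reference, the install-config stanza mentioned above has roughly this shape (the mirror registry name is an example):
  imageContentSources:
  - mirrors:
    - mirror.registry.example.com:5000/ocp/release   # example mirror inside the VPC
    source: quay.io/openshift-release-dev/ocp-release
  - mirrors:
    - mirror.registry.example.com:5000/ocp/release
    source: registry.ci.openshift.org/ocp/release     # the repository bootkube tried to reach directly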
Description of problem:
After configuring a custom toleration on the DNS pods (one that does not tolerate the master node taint), the new DNS pod is stuck in the Pending state.
Version-Release number of selected component (if applicable):
How reproducible:
https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-41050
Steps to Reproduce:
1.melvinjoseph@mjoseph-mac Downloads % oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.14.0-0.nightly-2023-05-03-163151 True False 4h5m Cluster version is 4.14.0-0.nightly-2023-05-03-163151 2.check default dns pods placement melvinjoseph@mjoseph-mac Downloads % ouf5M-5AVBm-Taoxt-aIgPmoc -n openshift-dns get pod -owide melvinjoseph@mjoseph-mac Downloads % oc -n openshift-dns get pod -owide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES dns-default-6cv9k 2/2 Running 0 4h12m 10.131.0.8 shudi-gcp4h-whdkl-worker-a-qnvjw.c.openshift-qe.internal <none> <none> dns-default-8g2w8 2/2 Running 0 4h12m 10.129.2.5 shudi-gcp4h-whdkl-worker-c-b8qrq.c.openshift-qe.internal <none> <none> dns-default-df7zj 2/2 Running 0 4h18m 10.128.0.40 shudi-gcp4h-whdkl-master-1.c.openshift-qe.internal <none> <none> dns-default-kmv4c 2/2 Running 0 4h18m 10.130.0.9 shudi-gcp4h-whdkl-master-2.c.openshift-qe.internal <none> <none> dns-default-lxxkt 2/2 Running 0 4h18m 10.129.0.11 shudi-gcp4h-whdkl-master-0.c.openshift-qe.internal <none> <none> dns-default-mjrnx 2/2 Running 0 4h11m 10.128.2.4 shudi-gcp4h-whdkl-worker-b-scqdh.c.openshift-qe.internal <none> <none> node-resolver-5bnjv 1/1 Running 0 4h12m 10.0.128.3 shudi-gcp4h-whdkl-worker-a-qnvjw.c.openshift-qe.internal <none> <none> node-resolver-7ns8b 1/1 Running 0 4h18m 10.0.0.4 shudi-gcp4h-whdkl-master-1.c.openshift-qe.internal <none> <none> node-resolver-bz7k5 1/1 Running 0 4h12m 10.0.128.2 shudi-gcp4h-whdkl-worker-c-b8qrq.c.openshift-qe.internal <none> <none> node-resolver-c67mw 1/1 Running 0 4h18m 10.0.0.3 shudi-gcp4h-whdkl-master-2.c.openshift-qe.internal <none> <none> node-resolver-d8h65 1/1 Running 0 4h12m 10.0.128.4 shudi-gcp4h-whdkl-worker-b-scqdh.c.openshift-qe.internal <none> <none> node-resolver-rgb92 1/1 Running 0 4h18m 10.0.0.5 shudi-gcp4h-whdkl-master-0.c.openshift-qe.internal <none> <none> 3.oc -n openshift-dns get ds/dns-default -oyaml tolerations: - key: node-role.kubernetes.io/master operator: Exists melvinjoseph@mjoseph-mac Downloads % oc get dns.operator default -oyaml apiVersion: operator.openshift.io/v1 kind: DNS metadata: creationTimestamp: "2023-05-08T00:39:00Z" finalizers: - dns.operator.openshift.io/dns-controller generation: 1 name: default resourceVersion: "22893" uid: ae53e756-42a3-4c9d-8284-524df006382d spec: cache: negativeTTL: 0s positiveTTL: 0s logLevel: Normal nodePlacement: {} operatorLogLevel: Normal upstreamResolvers: policy: Sequential transportConfig: {} upstreams: - port: 53 type: SystemResolvConf status: clusterDomain: cluster.local clusterIP: 172.30.0.10 conditions: - lastTransitionTime: "2023-05-08T00:46:20Z" message: Enough DNS pods are available, and the DNS service has a cluster IP address. reason: AsExpected status: "False" type: Degraded - lastTransitionTime: "2023-05-08T00:46:20Z" message: All DNS and node-resolver pods are available, and the DNS service has a cluster IP address. reason: AsExpected status: "False" type: Progressing - lastTransitionTime: "2023-05-08T00:39:25Z" message: The DNS daemonset has available pods, and the DNS service has a cluster IP address. reason: AsExpected status: "True" type: Available - lastTransitionTime: "2023-05-08T00:39:01Z" message: DNS Operator can be upgraded reason: AsExpected status: "True" type: Upgradeable 4. 
config custom tolerations of dns pod (to not tolerate master node taints) $ oc edit dns.operator default spec: nodePlacement: tolerations: - effect: NoExecute key: my-dns-test operators: Equal value: abc tolerationSeconds: 3600 melvinjoseph@mjoseph-mac Downloads % oc edit dns.operator default Warning: unknown field "spec.nodePlacement.tolerations[0].operators" dns.operator.openshift.io/default edited melvinjoseph@mjoseph-mac Downloads % oc -n openshift-dns get pod -owide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES dns-default-6cv9k 2/2 Running 0 5h16m 10.131.0.8 shudi-gcp4h-whdkl-worker-a-qnvjw.c.openshift-qe.internal <none> <none> dns-default-8g2w8 2/2 Running 0 5h16m 10.129.2.5 shudi-gcp4h-whdkl-worker-c-b8qrq.c.openshift-qe.internal <none> <none> dns-default-df7zj 2/2 Running 0 5h22m 10.128.0.40 shudi-gcp4h-whdkl-master-1.c.openshift-qe.internal <none> <none> dns-default-kmv4c 2/2 Running 0 5h22m 10.130.0.9 shudi-gcp4h-whdkl-master-2.c.openshift-qe.internal <none> <none> dns-default-lxxkt 2/2 Running 0 5h22m 10.129.0.11 shudi-gcp4h-whdkl-master-0.c.openshift-qe.internal <none> <none> dns-default-mjrnx 2/2 Running 0 5h16m 10.128.2.4 shudi-gcp4h-whdkl-worker-b-scqdh.c.openshift-qe.internal <none> <none> dns-default-xqxr9 0/2 Pending 0 7s <none> <none> <none> <none> node-resolver-5bnjv 1/1 Running 0 5h17m 10.0.128.3 shudi-gcp4h-whdkl-worker-a-qnvjw.c.openshift-qe.internal <none> <none> node-resolver-7ns8b 1/1 Running 0 5h22m 10.0.0.4 shudi-gcp4h-whdkl-master-1.c.openshift-qe.internal <none> <none> node-resolver-bz7k5 1/1 Running 0 5h16m 10.0.128.2 shudi-gcp4h-whdkl-worker-c-b8qrq.c.openshift-qe.internal <none> <none> node-resolver-c67mw 1/1 Running 0 5h22m 10.0.0.3 shudi-gcp4h-whdkl-master-2.c.openshift-qe.internal <none> <none> node-resolver-d8h65 1/1 Running 0 5h16m 10.0.128.4 shudi-gcp4h-whdkl-worker-b-scqdh.c.openshift-qe.internal <none> <none> node-resolver-rgb92 1/1 Running 0 5h22m 10.0.0.5 shudi-gcp4h-whdkl-master-0.c.openshift-qe.internal <none> <none> The dns pod stuck in pending state melvinjoseph@mjoseph-mac Downloads % oc -n openshift-dns get ds/dns-default -oyaml <-----snip---> tolerations: - effect: NoExecute key: my-dns-test tolerationSeconds: 3600 value: abc volumes: - configMap: defaultMode: 420 items: - key: Corefile path: Corefile name: dns-default name: config-volume - name: metrics-tls secret: defaultMode: 420 secretName: dns-default-metrics-tls updateStrategy: rollingUpdate: maxSurge: 10% maxUnavailable: 0 type: RollingUpdate status: currentNumberScheduled: 3 desiredNumberScheduled: 3 numberAvailable: 3 numberMisscheduled: 3 numberReady: 3 observedGeneration: 2 melvinjoseph@mjoseph-mac Downloads % oc get dns.operator default -oyaml apiVersion: operator.openshift.io/v1 kind: DNS metadata: creationTimestamp: "2023-05-08T00:39:00Z" finalizers: - dns.operator.openshift.io/dns-controller generation: 2 name: default resourceVersion: "125435" uid: ae53e756-42a3-4c9d-8284-524df006382d spec: cache: negativeTTL: 0s positiveTTL: 0s logLevel: Normal nodePlacement: tolerations: - effect: NoExecute key: my-dns-test tolerationSeconds: 3600 value: abc operatorLogLevel: Normal upstreamResolvers: policy: Sequential transportConfig: {} upstreams: - port: 53 type: SystemResolvConf status: clusterDomain: cluster.local clusterIP: 172.30.0.10 conditions: - lastTransitionTime: "2023-05-08T00:46:20Z" message: Enough DNS pods are available, and the DNS service has a cluster IP address. 
reason: AsExpected status: "False" type: Degraded - lastTransitionTime: "2023-05-08T06:01:52Z" message: Have 0 up-to-date DNS pods, want 3. reason: Reconciling status: "True" type: Progressing - lastTransitionTime: "2023-05-08T00:39:25Z" message: The DNS daemonset has available pods, and the DNS service has a cluster IP address. reason: AsExpected status: "True" type: Available - lastTransitionTime: "2023-05-08T00:39:01Z" message: DNS Operator can be upgraded reason: AsExpected status: "True" type: Upgradeable melvinjoseph@mjoseph-mac Downloads % oc -n openshift-dns get pod NAME READY STATUS RESTARTS AGE dns-default-6cv9k 2/2 Running 0 5h18m dns-default-8g2w8 2/2 Running 0 5h18m dns-default-df7zj 2/2 Running 0 5h25m dns-default-kmv4c 2/2 Running 0 5h25m dns-default-lxxkt 2/2 Running 0 5h25m dns-default-mjrnx 2/2 Running 0 5h18m dns-default-xqxr9 0/2 Pending 0 2m12s node-resolver-5bnjv 1/1 Running 0 5h19m node-resolver-7ns8b 1/1 Running 0 5h25m node-resolver-bz7k5 1/1 Running 0 5h19m node-resolver-c67mw 1/1 Running 0 5h25m node-resolver-d8h65 1/1 Running 0 5h19m node-resolver-rgb92 1/1 Running 0 5h25m
Actual results:
The DNS pod dns-default-xqxr9 is stuck in the Pending state.
Expected results:
The DNS pods should be rolled out again with the new tolerations.
Additional info:
melvinjoseph@mjoseph-mac Downloads % oc describe po/dns-default-xqxr9 -n openshift-dns Name: dns-default-xqxr9 Namespace: openshift-dns Priority: 2000001000 <----snip---> Node-Selectors: kubernetes.io/os=linux Tolerations: my-dns-test=abc:NoExecute for 3600s node.kubernetes.io/disk-pressure:NoSchedule op=Exists node.kubernetes.io/memory-pressure:NoSchedule op=Exists node.kubernetes.io/not-ready:NoExecute op=Exists node.kubernetes.io/pid-pressure:NoSchedule op=Exists node.kubernetes.io/unreachable:NoExecute op=Exists node.kubernetes.io/unschedulable:NoSchedule op=Exists Events: Type Reason Age From Message ---- ------ ---- ---- ------- Warning FailedScheduling 3m45s default-scheduler 0/6 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/6 nodes are available: 1 Preemption is not helpful for scheduling, 2 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 3 node(s) didn't match Pod's node affinity/selector..
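For comparison, a nodePlacement that keeps the DNS pods schedulable on masters would also carry the default master toleration, roughly as follows (note the field name is operator, not operators):
  spec:
    nodePlacement:
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists            # keeps DNS pods schedulable on master nodes
      - key: my-dns-test
        operator: Equal
        value: abc
        effect: NoExecute
        tolerationSeconds: 3600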
This bug is created to get CNV bugzilla bug https://bugzilla.redhat.com/show_bug.cgi?id=2164836 fix into MCO repo.
Description of problem:
The Upgrade Helm Release tab in the OpenShift GUI Developer console does not refresh with updated values.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
100%
Steps to Reproduce:
1. Add below Helm chart repository from CLI ~~~ apiVersion: helm.openshift.io/v1beta1 kind: HelmChartRepository metadata: name: prometheus-community spec: connectionConfig: url: 'https://prometheus-community.github.io/helm-charts' name: prometheus-community ~~~ 2. Goto GUI and select Developer console --> +Add --> Developer Catalog --> Helm Chart --> Select Prometheus Helm chart --> Install Helm chart --> From dropdown of chart version select 22.3.0 --> Install 3. You will see the image tag as v0.63.0 ~~~ image: digest: '' pullPolicy: IfNotPresent repository: quay.io/prometheus-operator/prometheus-config-reloader tag: v0.63.0 ~~~ 4. Once that is installed Goto Helm --> Helm Releases --> Prometheus --> Upgrade --> From dropdown of chart version select 22.4.0 --> the page does not refresh with new value of the tag. ~~~ image: digest: '' pullPolicy: IfNotPresent repository: quay.io/prometheus-operator/prometheus-config-reloader tag: v0.63.0 ~~~ NOTE: The same steps before installing the helm chart, when we select different versions the value is being updated. Goto GUI and select Developer console --> +Add --> Developer Catalog --> Helm Chart --> Select Prometheus Helm chart --> Install Helm chart --> From dropdown of chart version select 22.3.0 --> Now select different chart version like 22.7.0 or 22.4.0
Actual results:
The YAML view of the Upgrade Helm Release tab shows the values of the older chart version.
Expected results:
The YAML view of the Upgrade Helm Release tab should contain the latest values for the selected chart version.
Additional info:
Description of problem:
Customer upgraded an AWS cluster from 4.8 to 4.9. The upgrade went well, but when checking co/storage.status.versions, the AWSEBSCSIDriverOperator version is still listed with the previous version: $ oc get co storage -o json | jq .status.versions [ { "name": "operator", "version": "4.9.50" }, { "name": "AWSEBSCSIDriverOperator", "version": "4.8.48" } ] From 4.9, CSO no longer reports the CSIDriverOperator version, so the stale CSIDriverOperator version should be cleaned up in this case.
Version-Release number of selected component (if applicable):
upgrade from 4.8.48 to 4.9.50
How reproducible:
Always
Steps to Reproduce:
1. Install AWS cluster with 4.8 2. Upgrade cluster to 4.9 3. Check co/storage.status.versions
Actual results:
[ { "name": "operator", "version": "4.9.50" }, { "name": "AWSEBSCSIDriverOperator", "version": "4.8.48" } ]
Expected results:
From 4.9, CSO no longer reports the CSIDriverOperator version, so the stale CSIDriverOperator version should be cleaned up.
Additional info:
Description of problem:
Bump Kubernetes to 0.27.1 and bump dependencies
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
https://search.ci.openshift.org/?search=error%3A+tag+latest+failed%3A+Internal+error+occurred%3A+registry.centos.org&maxAge=48h&context=1&type=build-log&name=okd&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Version-Release number of selected component (if applicable):
all currently tested versions
How reproducible:
~ 9% of jobs fail on this test
! error: Import failed (InternalError): Internal error occurred: registry.centos.org/dotnet/dotnet-31-runtime-centos7:latest: Get "https://registry.centos.org/v2/": dial tcp: lookup registry.centos.org on 172.30.0.10:53: no such host 782 31 minutes ago
Description of problem:
Customer used the Agent-based installer to install 4.13.8 in their CID env, but during the install process the bootstrap machine had an OOM issue; checking the sosreport shows the init container had an OOM issue.
NOTE: The issue is not seen when testing with 4.13.6, per the customer.
initContainers:
We found in the sosreport that the dmesg and crio logs show the machine-config-controller container being OOM-killed; the kill came from the cgroup limit, so it looks like the 50M limit is too small.
The customer used a physical machine that had 100GB of memory
The customer had some network config in the assisted-installer YAML file; maybe the issue is related to their NIC config?
log files:
1. sosreport
https://attachments.access.redhat.com/hydra/rest/cases/03578865/attachments/b5501734-60be-4de4-adcf-da57e22cbb8e?usePresignedUrl=true
2. assisted installer yaml file
https://attachments.access.redhat.com/hydra/rest/cases/03578865/attachments/a32635cf-112d-49ed-828c-4501e95a0e7a?usePresignedUrl=true
3. bootstrap machine oom screenshot
https://attachments.access.redhat.com/hydra/rest/cases/03578865/attachments/eefe2e57-cd23-4abd-9e0b-dd45f20a34d2?usePresignedUrl=true
Description of problem:
Machine creation should fail when the availabilityZone and subnet ID mismatch; currently the machine is created successfully when they mismatch, and the CPMS cannot be recreated after deletion. In contrast, when the subnet is specified via a filter, a mismatched availabilityZone does make machine creation fail.
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2023-01-31-072358
How reproducible:
always
Steps to Reproduce:
1.Create a machineset whose availabilityZone and subnet id is mismatch, for example, availabilityZone is us-east-2a, but the subnet id is for us-east-2b placement: availabilityZone: us-east-2a region: us-east-2 securityGroups: - filters: - name: tag:Name values: - huliu-aws1w-nk5xd-worker-sg subnet: id: subnet-0107b4d7cfa35eb9b 2.Machine created successfully in us-east-2b zone liuhuali@Lius-MacBook-Pro huali-test % oc get machine NAME PHASE TYPE REGION ZONE AGE huliu-aws1w-nk5xd-master-0 Running m6i.xlarge us-east-2 us-east-2a 62m huliu-aws1w-nk5xd-master-1 Running m6i.xlarge us-east-2 us-east-2b 62m huliu-aws1w-nk5xd-master-2 Running m6i.xlarge us-east-2 us-east-2a 62m huliu-aws1w-nk5xd-windows-worker-us-east-2a-689vq Running m5a.large us-east-2 us-east-2b 37m huliu-aws1w-nk5xd-windows-worker-us-east-2a-nf9dl Running m5a.large us-east-2 us-east-2b 37m huliu-aws1w-nk5xd-worker-us-east-2a-8kpht Running m6i.xlarge us-east-2 us-east-2a 59m huliu-aws1w-nk5xd-worker-us-east-2a-dmtlc Running m6i.xlarge us-east-2 us-east-2a 59m huliu-aws1w-nk5xd-worker-us-east-2b-kdn75 Running m6i.xlarge us-east-2 us-east-2b 59m liuhuali@Lius-MacBook-Pro huali-test % oc get machine -o yaml |grep "id: subnet" id: subnet-0fef0e9e255742f3a id: subnet-0107b4d7cfa35eb9b id: subnet-0fef0e9e255742f3a id: subnet-0107b4d7cfa35eb9b id: subnet-0107b4d7cfa35eb9b id: subnet-0fef0e9e255742f3a id: subnet-0fef0e9e255742f3a id: subnet-0107b4d7cfa35eb9b
Actual results:
The machine is created successfully in the zone that the subnet ID belongs to; in this case it was created in us-east-2b: huliu-aws1w-nk5xd-windows-worker-us-east-2a-689vq Running m5a.large us-east-2 us-east-2b 37m huliu-aws1w-nk5xd-windows-worker-us-east-2a-nf9dl Running m5a.large us-east-2 us-east-2b 37m
Expected results:
Machine creation should fail because the availabilityZone and subnet ID mismatch.
Additional info:
1. For the subnet is filter, if availabilityZone and filter is mismatch, the machine will create failed. huliu-aws1w2-x2tnx-worker-2-m4r8m Failed 4s liuhuali@Lius-MacBook-Pro huali-test % oc get machine huliu-aws1w2-x2tnx-worker-2-m4r8m -o yaml … placement: availabilityZone: us-east-2a region: us-east-2 securityGroups: - filters: - name: tag:Name values: - huliu-aws1w2-x2tnx-worker-sg spotMarketOptions: {} subnet: filters: - name: tag:Name values: - huliu-aws1w2-x2tnx-private-us-east-2c tags: - name: kubernetes.io/cluster/huliu-aws1w2-x2tnx value: owned userDataSecret: name: worker-user-data status: conditions: - lastTransitionTime: "2023-02-01T02:45:52Z" status: "True" type: Drainable - lastTransitionTime: "2023-02-01T02:45:52Z" message: Instance has not been created reason: InstanceNotCreated severity: Warning status: "False" type: InstanceExists - lastTransitionTime: "2023-02-01T02:45:52Z" status: "True" type: Terminable errorMessage: 'error getting subnet IDs: no subnet IDs were found' errorReason: InvalidConfiguration lastUpdated: "2023-02-01T02:45:53Z" phase: Failed providerStatus: conditions: - lastTransitionTime: "2023-02-01T02:45:53Z" message: 'error getting subnet IDs: no subnet IDs were found' reason: MachineCreationFailed status: "False" type: MachineCreation 2.For this case, machine create successfully when availabilityZone and subnet id is mismatch, the cpms cannot be recreated after deleting. liuhuali@Lius-MacBook-Pro huali-test % oc delete controlplanemachineset cluster controlplanemachineset.machine.openshift.io "cluster" deleted liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset No resources found in openshift-machine-api namespace. I0201 02:11:07.850022 1 http.go:143] controller-runtime/webhook/webhooks "msg"="wrote response" "UID"="12f118c4-fafe-45f9-bd24-876abdb8ba83" "allowed"=false "code"=403 "reason"="spec.template.machines_v1beta1_machine_openshift_io.failureDomains: Forbidden: no control plane machine is using specified failure domain(s) [AWSFailureDomain{AvailabilityZone:us-east-2a, Subnet:{Type:ID, Value:subnet-0107b4d7cfa35eb9b}}], failure domain(s) [AWSFailureDomain{AvailabilityZone:us-east-2a, Subnet:{Type:ID, Value:subnet-0fef0e9e255742f3a}}] are duplicated within the control plane machines, please correct failure domains to match control plane machines" "webhook"="/validate-machine-openshift-io-v1-controlplanemachineset" I0201 02:11:07.850787 1 controller.go:144] "msg"="Finished reconciling control plane machine set" "controller"="controlplanemachinesetgenerator" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="767c4631-ed83-47da-b316-29a21cdba245" E0201 02:11:07.850828 1 controller.go:326] "msg"="Reconciler error" "error"="error reconciling control plane machine set: unable to create control plane machine set: unable to create control plane machine set: admission webhook \"controlplanemachineset.machine.openshift.io\" denied the request: spec.template.machines_v1beta1_machine_openshift_io.failureDomains: Forbidden: no control plane machine is using specified failure domain(s) [AWSFailureDomain{AvailabilityZone:us-east-2a, Subnet:{Type:ID, Value:subnet-0107b4d7cfa35eb9b}}], failure domain(s) [AWSFailureDomain{AvailabilityZone:us-east-2a, Subnet:{Type:ID, Value:subnet-0fef0e9e255742f3a}}] are duplicated within the control plane machines, please correct failure domains to match control plane machines" "controller"="controlplanemachinesetgenerator" "reconcileID"="767c4631-ed83-47da-b316-29a21cdba245"
Description of problem:
With the recent update in the logic for considering a CPMS replica Ready only when both the backing Machine is running and the backing Node is Ready: https://github.com/openshift/cluster-control-plane-machine-set-operator/pull/171, we now need to watch nodes at all times to detect nodes transitioning in readiness.
The majority of occurrences of this issue have been fixed with https://github.com/openshift/cluster-control-plane-machine-set-operator/pull/177 (https://issues.redhat.com//browse/OCPBUGS-10032), but we also need to watch the control plane nodes at steady state (when they are already Ready) to notice if they go NotReady at any point, as relying on control plane machine events is not enough (a Machine might be Running while the Node has transitioned to NotReady).
Version-Release number of selected component (if applicable):
4.13, 4.14
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The Topology UI doesn't recognize a Serverless Rust function, so it doesn't show the proper UI icon.
Version-Release number of selected component (if applicable):
4.12.0
How reproducible:
Always
Steps to Reproduce:
1. Deploy 3 Knative/Serverless functions: Quarkus, Spring Boot, Rust 2. Observe in the Topology UI that specific icons are used only for Quarkus and Spring Boot, while for Rust the generic OpenShift icon is shown 3. Check each of the presented UI snippets/rectangles and find the following labels:
For Quarkus: app.openshift.io/runtime=quarkus function.knative.dev/runtime=rust
For Spring Boot: app.openshift.io/runtime=spring-boot function.knative.dev/runtime=springboot
For Rust: function.knative.dev/runtime=rust (no app.openshift.io/runtime=rust present for it)
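A sketch of the labels that would give the Rust function a matching icon, assuming the Topology icon is keyed off the app.openshift.io/runtime label as it is for the other runtimes:
  apiVersion: serving.knative.dev/v1
  kind: Service
  metadata:
    name: rust-function                  # example name
    labels:
      function.knative.dev/runtime: rust
      app.openshift.io/runtime: rust     # missing today; this is what the icon lookup would need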
Actual results:
No specific UI icon for Rust function
Expected results:
Specific UI icon for Rust function
Additional info:
Description of problem:
Currently: Hypershift is squashing any user configured proxy configuration based on this line: https://github.com/openshift/hypershift/blob/main/support/globalconfig/proxy.go#L21-L28, https://github.com/openshift/hypershift/blob/release-4.11/control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go#L487-L493. Because of this any user changes to the cluster-wide proxy configuration documented here: https://docs.openshift.com/container-platform/4.12/networking/enable-cluster-wide-proxy.html are squashed and not valid for more than a few seconds. That blocks some functionality in the openshift cluster from working including application builds from the openshift samples provided in the cluster.
Version-Release number of selected component (if applicable):
4.13 4.12 4.11
How reproducible:
100%
Steps to Reproduce:
1. Make a change to the Proxy object in the cluster with kubectl edit proxy cluster 2. Save the change 3. Wait a few seconds
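For context, the object being edited is the cluster-wide Proxy config, for example (values are illustrative):
  apiVersion: config.openshift.io/v1
  kind: Proxy
  metadata:
    name: cluster
  spec:
    httpProxy: http://proxy.example.com:3128    # user-provided values that get squashed back by the operator
    httpsProxy: http://proxy.example.com:3128
    noProxy: .cluster.local,.svc,10.0.0.0/16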
Actual results:
HostedClusterConfig operator will go in and squash the value
Expected results:
The value the user provides remains in the configuration and is not squashed to an empty value
Additional info:
Description of problem:
In the awsendpointservice CR, AWSEndpointAvailable is still True when the endpoint is deleted in the AWS console, and AWSEndpointServiceAvailable is still True when the endpoint service is deleted in the AWS console.
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Create a PublicAndPrivate or Private cluster, wait for cluster to come up 2. Check conditions in awsendpointservice cr, status of AWSEndpointAvailable and AWSEndpointServiceAvailable should be True 3. On AWS console delete endpoint 4. In awsendpointservice cr, check if condition AWSEndpointAvailable is changed to false 5. On AWS console delete endpoint service 6. In awsendpointservice cr, check if condition AWSEndpointServiceAvailable is changed to false
Actual results:
status of AWSEndpointAvailable and AWSEndpointServiceAvailable is True
Expected results:
status of AWSEndpointAvailable and AWSEndpointServiceAvailable should be False
Additional info:
Since resource type option has been moved to an advanced option in both the Deploy Image and Import from Git flows, there is confusion for some existing customers who are using the feature.
The UI no longer provides transparency of the type of resource which is being created.
1.
2.
3.
Remove Resource type from Advanced Options and place it back where it was previously. Resource type selection is now a dropdown, so it will go back in its previous spot, but it will use a different component than in 4.11.
Description of problem:
clusteroperator/network is degraded after running FEATURES_ENVIRONMENT="ci" make feature-deploy-on-ci from openshift-kni/cnf-features-deploy against IPI clusters with OCP 4.13 and 4.14 in CI jobs from Telco 5G DevOps/CI. Details for a 4.13 job: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/42141/rehearse-42141-periodic-ci-openshift-release-master-nightly-4.13-e2e-telco5g/1689935408508440576 Details for a 4.14 job: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/42141/rehearse-42141-periodic-ci-openshift-release-master-nightly-4.14-e2e-telco5g/1689935408541995008 For example, got to artifacts/e2e-telco5g/telco5g-gather-pao/build-log.txt and it will report: Error from server (BadRequest): container "container-00" in pod "cnfdu5-worker-0-debug" is waiting to start: ContainerCreating Running gather-pao for T5CI_VERSION=4.13 Running for CNF_BRANCH=master Running PAO must-gather with tag pao_mg_tag=4.12 [must-gather ] OUT Using must-gather plug-in image: quay.io/openshift-kni/performance-addon-operator-must-gather:4.12-snapshot When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information: ClusterID: 60503edf-ecc6-48f7-b6a6-f4dc34842803 ClusterVersion: Stable at "4.13.0-0.nightly-2023-08-10-021434" ClusterOperators: clusteroperator/network is degraded because DaemonSet "/openshift-multus/dhcp-daemon" rollout is not making progress - pod dhcp-daemon-7lmlq is in CrashLoopBackOff State DaemonSet "/openshift-multus/dhcp-daemon" rollout is not making progress - pod dhcp-daemon-95tzb is in CrashLoopBackOff State DaemonSet "/openshift-multus/dhcp-daemon" rollout is not making progress - pod dhcp-daemon-hfxkd is in CrashLoopBackOff State DaemonSet "/openshift-multus/dhcp-daemon" rollout is not making progress - pod dhcp-daemon-mhwtp is in CrashLoopBackOff State DaemonSet "/openshift-multus/dhcp-daemon" rollout is not making progress - pod dhcp-daemon-q7gfb is in CrashLoopBackOff State DaemonSet "/openshift-multus/dhcp-daemon" rollout is not making progress - last change 2023-08-11T10:54:10Z
Version-Release number of selected component (if applicable):
branch release-4.13 from https://github.com/openshift-kni/cnf-features-deploy.git for OCP 4.13 branch master from https://github.com/openshift-kni/cnf-features-deploy.git for OCP 4.14
How reproducible:
Always.
Steps to Reproduce:
1. Install OCP 4.13 or OCP 4.14 with IPI on 3x masters, 2x workers. 2. Clone https://github.com/openshift-kni/cnf-features-deploy.git 3. FEATURES_ENVIRONMENT="ci" make feature-deploy-on-ci 4. oc wait nodes --all --for=condition=Ready=true --timeout=10m 5. oc wait clusteroperators --all --for=condition=Progressing=false --timeout=10m
Actual results:
See above.
Expected results:
All clusteroperators have finished progressing.
Additional info:
Without 'FEATURES_ENVIRONMENT="ci" make feature-deploy-on-ci' the steps to reproduce above work as expected.
This is a clone of issue OCPBUGS-18517. The following is the description of the original issue:
—
Description of problem:
Installation with Kuryr is failing because multiple components are attempting to connect to the API and fail with the following error: failed checking apiserver connectivity: Get "https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-service-ca/leases/service-ca-controller-lock": tls: failed to verify certificate: x509: cannot validate certificate for 172.30.0.1 because it doesn't contain any IP SANs $ oc get po -A -o wide |grep -v Running |grep -v Pending |grep -v Completed NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES openshift-apiserver-operator openshift-apiserver-operator-559d855c56-c2rdr 0/1 CrashLoopBackOff 42 (2m28s ago) 3h44m 10.128.16.86 kuryr-5sxhw-master-2 <none> <none> openshift-apiserver apiserver-6b9f5d48c4-bj6s6 0/2 CrashLoopBackOff 92 (4m25s ago) 3h36m 10.128.70.10 kuryr-5sxhw-master-2 <none> <none> openshift-cluster-csi-drivers manila-csi-driver-operator-75b64d8797-fckf5 0/1 CrashLoopBackOff 42 (119s ago) 3h41m 10.128.56.21 kuryr-5sxhw-master-0 <none> <none> openshift-cluster-csi-drivers openstack-cinder-csi-driver-operator-84dfd8d89f-kgtr8 0/1 CrashLoopBackOff 42 (82s ago) 3h41m 10.128.56.9 kuryr-5sxhw-master-0 <none> <none> openshift-cluster-node-tuning-operator cluster-node-tuning-operator-7fbb66545c-kh6th 0/1 CrashLoopBackOff 46 (3m5s ago) 3h44m 10.128.6.40 kuryr-5sxhw-master-2 <none> <none> openshift-cluster-storage-operator cluster-storage-operator-5545dfcf6d-n497j 0/1 CrashLoopBackOff 42 (2m23s ago) 3h44m 10.128.21.175 kuryr-5sxhw-master-2 <none> <none> openshift-cluster-storage-operator csi-snapshot-controller-ddb9469f9-bc4bb 0/1 CrashLoopBackOff 45 (2m17s ago) 3h41m 10.128.20.106 kuryr-5sxhw-master-1 <none> <none> openshift-cluster-storage-operator csi-snapshot-controller-operator-6d7b66dbdd-xdwcs 0/1 CrashLoopBackOff 42 (92s ago) 3h44m 10.128.21.220 kuryr-5sxhw-master-2 <none> <none> openshift-config-operator openshift-config-operator-c5d5d964-2w2bv 0/1 CrashLoopBackOff 80 (3m39s ago) 3h44m 10.128.43.39 kuryr-5sxhw-master-2 <none> <none> openshift-controller-manager-operator openshift-controller-manager-operator-754d748cf7-rzq6f 0/1 CrashLoopBackOff 42 (3m6s ago) 3h44m 10.128.25.166 kuryr-5sxhw-master-2 <none> <none> openshift-etcd-operator etcd-operator-76ddc94887-zqkn7 0/1 CrashLoopBackOff 49 (30s ago) 3h44m 10.128.32.146 kuryr-5sxhw-master-2 <none> <none> openshift-ingress-operator ingress-operator-9f76cf75b-cjx9t 1/2 CrashLoopBackOff 39 (3m24s ago) 3h44m 10.128.9.108 kuryr-5sxhw-master-2 <none> <none> openshift-insights insights-operator-776cd7cfb4-8gzz7 0/1 CrashLoopBackOff 46 (4m21s ago) 3h44m 10.128.15.102 kuryr-5sxhw-master-2 <none> <none> openshift-kube-apiserver-operator kube-apiserver-operator-64f4db777f-7n9jv 0/1 CrashLoopBackOff 42 (113s ago) 3h44m 10.128.18.199 kuryr-5sxhw-master-2 <none> <none> openshift-kube-apiserver installer-5-kuryr-5sxhw-master-1 0/1 Error 0 3h35m 10.128.68.176 kuryr-5sxhw-master-1 <none> <none> openshift-kube-controller-manager-operator kube-controller-manager-operator-746497b-dfbh5 0/1 CrashLoopBackOff 42 (2m23s ago) 3h44m 10.128.13.162 kuryr-5sxhw-master-2 <none> <none> openshift-kube-controller-manager installer-4-kuryr-5sxhw-master-0 0/1 Error 0 3h35m 10.128.65.186 kuryr-5sxhw-master-0 <none> <none> openshift-kube-scheduler-operator openshift-kube-scheduler-operator-695fb4449f-j9wqx 0/1 CrashLoopBackOff 42 (63s ago) 3h44m 10.128.44.194 kuryr-5sxhw-master-2 <none> <none> openshift-kube-scheduler installer-5-kuryr-5sxhw-master-0 0/1 Error 0 3h35m 10.128.60.44 
kuryr-5sxhw-master-0 <none> <none> openshift-kube-storage-version-migrator-operator kube-storage-version-migrator-operator-6c5cd46578-qpk5z 0/1 CrashLoopBackOff 42 (2m18s ago) 3h44m 10.128.4.120 kuryr-5sxhw-master-2 <none> <none> openshift-machine-api cluster-autoscaler-operator-7b667675db-tmlcb 1/2 CrashLoopBackOff 46 (2m53s ago) 3h45m 10.128.28.146 kuryr-5sxhw-master-2 <none> <none> openshift-machine-api machine-api-controllers-fdb99649c-ldb7t 3/7 CrashLoopBackOff 184 (2m55s ago) 3h40m 10.128.29.90 kuryr-5sxhw-master-0 <none> <none> openshift-route-controller-manager route-controller-manager-d8f458684-7dgjm 0/1 CrashLoopBackOff 43 (100s ago) 3h36m 10.128.55.11 kuryr-5sxhw-master-2 <none> <none> openshift-service-ca-operator service-ca-operator-654f68c77f-g4w55 0/1 CrashLoopBackOff 42 (2m2s ago) 3h45m 10.128.22.30 kuryr-5sxhw-master-2 <none> <none> openshift-service-ca service-ca-5f584b7d75-mxllm 0/1 CrashLoopBackOff 42 (45s ago) 3h42m 10.128.49.250 kuryr-5sxhw-master-0 <none> <none>
$ oc get svc -A |grep 172.30.0.1 default kubernetes ClusterIP 172.30.0.1 <none> 443/TCP 3h50m
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of the problem:
In staging, BE 2.18.0 - Trying to set all validation IDs to be ignored with:
curl -X 'PUT' 'https://api.stage.openshift.com/api/assisted-install/v2/clusters/26a69b99-06a3-441b-be40-73cadbac6b6a/ignored-validations' --header "Authorization: Bearer $(ocm token)" -H 'accept: application/json' -H 'Content-Type: application/json' -d '{ "host-validation-ids": "[]", "cluster-validation-ids": "[\"all\"]" }'
Getting this response:
{"code":"400","href":"","id":400,"kind":"Error","reason":"cannot proceed due to the following errors: Validation ID 'all' is not a known cluster validation"}
How reproducible:
100%
Steps to reproduce:
1.
2.
3.
Actual results:
Expected results:
All ignorable validations should be added to the ignore list.
Description of problem:
This came out of the investigation of https://issues.redhat.com/browse/OCPBUGS-11691 . The nested node configs used to support dual stack VIPs do not correctly respect the EnableUnicast setting. This is causing issues on EUS upgrades where the unicast migration cannot happen until all nodes are on 4.12. This is blocking both the workaround and the eventual proper fix.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Always
Steps to Reproduce:
1. Deploy 4.11 with unicast explicitly disabled (via MCO patch) 2. Write /etc/keepalived/monitor-user.conf to suppress unicast migration 3. Upgrade to 4.12
Actual results:
Nodes come up in unicast mode
Expected results:
Nodes remain in multicast mode until monitor-user.conf is removed
Additional info:
Description of problem:
In the Reliability (loaded long-run) test, the memory of the ovnkube-node-xxx pods on all 6 nodes keeps increasing; within 24 hours it increased to about 1.6G. I did not see this issue in previous releases.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-06-27-000502
How reproducible:
This is the first time I have hit this issue.
Steps to Reproduce:
1. Install a AWS OVN cluster with 3 masters, 3 workers, vm_type are all m5.xlarge. 2. Run reliability-v2 test https://github.com/openshift/svt/tree/master/reliability-v2 with config: 1 admin, 15 dev-test, 1 dev-prod. The test will long run the configured tasks. 3. Monitor the test failures in and performance dashboard. Test failures slack notification: https://redhat-internal.slack.com/archives/C0266JJ4XM5/p1687944463913769 Performance dashboard:http://dittybopper-dittybopper.apps.qili-414-haproxy.qe-lrc.devcluster.openshift.com/d/IgK5MW94z/openshift-performance?orgId=1&from=1687944452000&to=now&refresh=1h
Actual results:
The memory of the ovnkube-node-xxx pods on all 6 nodes keeps increasing; within 24 hours it increased to about 1.6G.
Expected results:
The memory of the ovnkube-node-xxx pods should remain stable rather than continuously increasing.
Additional info:
% oc adm top pod -n openshift-ovn-kubernetes | grep node ovnkube-node-4t282 146m 1862Mi ovnkube-node-9p462 41m 1847Mi ovnkube-node-b6rqj 46m 2032Mi ovnkube-node-fp2gn 72m 2107Mi ovnkube-node-hxf95 11m 2359Mi ovnkube-node-ql9fx 38m 2089Mi
I took a pprof heap profile on one of the pods and uploaded it as heap-ovnkube-node-4t282.out
Must-gather is uploaded to must-gather.local.1315176578017655774.tar.gz
performance dashboard screenshot for ovnkube-node-memory.png
This is a clone of issue OCPBUGS-17906. The following is the description of the original issue:
—
Description of problem:
On a HyperShift (guest) cluster, the EFS driver pod is stuck in the ContainerCreating state
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-08-11-055332
How reproducible:
Always
Steps to Reproduce:
1. Create Hypershift cluster. Flexy template: aos-4_14/ipi-on-aws/versioned-installer-ovn-hypershift-ci 2. Try to install EFS operator and driver from yaml file/web console as mentioned in below steps. a) Create iam role from ccoctl tool and will get ROLE ARN value from the output b) Install EFS operator using the above ROLE ARN value. c) Check EFS operator, node, controller pods are up and running // og-sub-hcp.yaml apiVersion: operators.coreos.com/v1 kind: OperatorGroup metadata: generateName: openshift-cluster-csi-drivers- namespace: openshift-cluster-csi-drivers spec: namespaces: - "" --- apiVersion: operators.coreos.com/v1alpha1 kind: Subscription metadata: name: aws-efs-csi-driver-operator namespace: openshift-cluster-csi-drivers spec: channel: stable name: aws-efs-csi-driver-operator source: qe-app-registry sourceNamespace: openshift-marketplace config: env: - name: ROLEARN value: arn:aws:iam::301721915996:role/hypershift-ci-16666-openshift-cluster-csi-drivers-aws-efs-cloud- // driver.yaml apiVersion: operator.openshift.io/v1 kind: ClusterCSIDriver metadata: name: efs.csi.aws.com spec: logLevel: TraceAll managementState: Managed operatorLogLevel: TraceAll
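For readability, the Subscription fragment above, unflattened (the ROLEARN value comes from the ccoctl output and is truncated in this report, so a placeholder is shown):
  apiVersion: operators.coreos.com/v1alpha1
  kind: Subscription
  metadata:
    name: aws-efs-csi-driver-operator
    namespace: openshift-cluster-csi-drivers
  spec:
    channel: stable
    name: aws-efs-csi-driver-operator
    source: qe-app-registry
    sourceNamespace: openshift-marketplace
    config:
      env:
      - name: ROLEARN
        value: arn:aws:iam::<account-id>:role/<efs-csi-role>   # placeholder; use the ARN printed by ccoctl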
Actual results:
aws-efs-csi-driver-controller-699664644f-dkfdk 0/4 ContainerCreating 0 87m
Expected results:
EFS controller pods should be up and running
Additional info:
oc -n openshift-cluster-csi-drivers logs aws-efs-csi-driver-operator-6758c5dc46-b75hb E0821 08:51:25.160599 1 base_controller.go:266] "AWSEFSDriverCredentialsRequestController" controller failed to sync "key", err: cloudcredential.operator.openshift.io "cluster" not found Discussion: https://redhat-internal.slack.com/archives/GK0DA0JR5/p1692606247221239 Installation steps epic: https://issues.redhat.com/browse/STOR-1421
Description of problem:
Set custom security group IDs in the following fields of install-config.yaml installconfig.controlPlane.platform.aws.additionalSecurityGroupIDs installconfig.compute.platform.aws.additionalSecurityGroupIDs such as: apiVersion: v1 controlPlane: architecture: amd64 hyperthreading: Enabled name: master platform: aws: additionalSecurityGroupIDs: - sg-0d2f88b2980aa5547 - sg-01f1d2f60a3b4cf6d replicas: 3 compute: - architecture: amd64 hyperthreading: Enabled name: worker platform: aws: additionalSecurityGroupIDs: - sg-03418b6e2f68e1f63 - sg-0376fc68fd4b834a4 replicas: 3 After installation, check the Security Groups attached to master and worker, master doesn’t have the specified custom security groups attached while workers have. For one of the masters: [root@preserve-gpei-worker ~]# aws ec2 describe-instances --instance-ids i-0cd007cca57c86ee9 --region us-west-2 --query 'Reservations[*].Instances[*].SecurityGroups[*]' --output json [ [ [ { "GroupName": "terraform-20230713031140984600000002", "GroupId": "sg-05495718555950f77" } ] ] ] For one of the workers: [root@preserve-gpei-worker ~]# aws ec2 describe-instances --instance-ids i-0572b7bde8ff07ac4 --region us-west-2 --query 'Reservations[*].Instances[*].SecurityGroups[*]' --output json [ [ [ { "GroupName": "gpei-0613a-worker-2", "GroupId": "sg-0376fc68fd4b834a4" }, { "GroupName": "gpei-0613a-worker-1", "GroupId": "sg-03418b6e2f68e1f63" }, { "GroupName": "terraform-20230713031140982700000001", "GroupId": "sg-0ce73044e426fe249" } ] ] ] Also checked the master’s controlplanemachineset, it does have the custom security groups configured, but they’re not attached to the master instance in the end. [root@preserve-gpei-worker k_files]# oc get controlplanemachineset -n openshift-machine-api cluster -o yaml |yq .spec.template.machines_v1beta1_machine_openshift_io.spec.providerSpec.value.securityGroups - filters: - name: tag:Name values: - gpei-0613a-pzjbk-master-sg - id: sg-01f1d2f60a3b4cf6d - id: sg-0d2f88b2980aa5547
Version-Release number of selected component (if applicable):
registry.ci.openshift.org/ocp/release:4.14.0-0.nightly-2023-07-11-092038
How reproducible:
Always
Steps to Reproduce:
1. As mentioned above 2. 3.
Actual results:
Masters don't have the custom security groups attached
Expected results:
Masters should have the custom security groups attached, like the workers
Additional info:
In Hypershift CI, we see nil deref panic
I0801 06:35:38.203019 1 controller.go:182] Assigning key: ip-10-0-132-175.ec2.internal to node workqueue
E0801 06:35:38.567021 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 195 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x28103a0?, 0x47a6400})
/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc00088f260?})
/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x75
panic({0x28103a0, 0x47a6400})
/usr/lib/golang/src/runtime/panic.go:884 +0x213
github.com/openshift/cloud-network-config-controller/pkg/cloudprovider.(*AWS).getSubnet(0xc000c05220, 0xc000d760b0)
/go/src/github.com/openshift/cloud-network-config-controller/pkg/cloudprovider/aws.go:266 +0x24a
github.com/openshift/cloud-network-config-controller/pkg/cloudprovider.(*AWS).GetNodeEgressIPConfiguration(0x0?, 0x31b8490?, {0x0, 0x0, 0x0})
/go/src/github.com/openshift/cloud-network-config-controller/pkg/cloudprovider/aws.go:200 +0x185
github.com/openshift/cloud-network-config-controller/pkg/controller/node.(*NodeController).SyncHandler(0xc000d526e0, {0xc00005d7e0, 0x1c})
/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/node/node_controller.go:129 +0x44f
github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).processNextWorkItem.func1(0xc00071f740, {0x25ff720?, 0xc00088f260?})
/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:152 +0x11c
github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).processNextWorkItem(0xc00071f740)
/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:162 +0x46
github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).runWorker(...)
/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:113
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x0?, {0x318e140, 0xc0005aa1e0}, 0x1, 0xc0000c4ba0)
/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x0?, 0x0?)
/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(0x0?, 0x0?, 0x0?)
/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161 +0x25
created by github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).Run
/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:99 +0x3aa
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x236d14a]
The code does an unprotected dereference of `networkInterface.SubnetId`, which appears to be `nil`; that nil value is probably also why multiple subnets are returned in the first place.
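For illustration only, a minimal sketch of the kind of nil guard that would avoid this panic when iterating EC2 network interfaces. It uses aws-sdk-go's `ec2.NetworkInterface` type, where `SubnetId` is a `*string`; the surrounding function and variable names are assumptions, not the actual cloud-network-config-controller code.

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// subnetIDsOf collects the subnet IDs of the given interfaces, skipping any
// interface whose SubnetId is nil instead of dereferencing it blindly.
func subnetIDsOf(interfaces []*ec2.NetworkInterface) []string {
	var ids []string
	for _, ni := range interfaces {
		if ni == nil || ni.SubnetId == nil {
			// An interface without a subnet ID cannot be matched; skip it
			// rather than panicking with a nil pointer dereference.
			continue
		}
		ids = append(ids, aws.StringValue(ni.SubnetId))
	}
	return ids
}

func main() {
	ifaces := []*ec2.NetworkInterface{
		{SubnetId: aws.String("subnet-0123456789abcdef0")},
		{SubnetId: nil}, // this is the case that crashed the controller
	}
	fmt.Println(subnetIDsOf(ifaces))
}
```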
Description of problem:
MCO has duplicate feature flags set for the kubelet, causing errors on bringup:

I0421 15:32:04.308472 2135 codec.go:98] "Using lenient decoding as strict decoding failed" err=<
Apr 21 15:32:04 ip-10-0-156-156 kubenswrapper[2135]: strict decoding error: yaml: unmarshal errors:
Apr 21 15:32:04 ip-10-0-156-156 kubenswrapper[2135]:   line 29: key "RotateKubeletServerCertificate" already set in map
Apr 21 15:32:04 ip-10-0-156-156 kubenswrapper[2135]: >
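A minimal sketch, assuming the root cause is that the same feature-gate key is emitted twice into the rendered kubelet configuration: merge the gate maps so each key appears exactly once before serializing. The function and variable names are illustrative, not the actual MCO template code.

```go
package main

import "fmt"

// mergeFeatureGates merges several feature-gate maps into one, so each key
// (e.g. "RotateKubeletServerCertificate") appears exactly once in the output.
// Later maps win on conflict, matching the usual "last writer wins" rule for
// overrides.
func mergeFeatureGates(sources ...map[string]bool) map[string]bool {
	merged := map[string]bool{}
	for _, src := range sources {
		for gate, enabled := range src {
			merged[gate] = enabled
		}
	}
	return merged
}

func main() {
	defaults := map[string]bool{"RotateKubeletServerCertificate": true}
	overrides := map[string]bool{"RotateKubeletServerCertificate": true, "SomeOtherGate": false}
	// The merged map can then be rendered once into the kubelet config,
	// instead of concatenating two YAML fragments that both set the key.
	fmt.Println(mergeFeatureGates(defaults, overrides))
}
```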
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-19018. The following is the description of the original issue:
—
Using metal-ipi on 4.14, the cluster is failing to come up:
the network cluster-operator is failing to start, and the sdn pod shows the error
bash: RHEL_VERSION: unbound variable
Description of problem:
Create a new host-and-cluster folder qe-cluster under the datacenter, and move the cluster workloads into that folder.
$ govc find -type r /OCP-DC/host/qe-cluster/workloads
Use the install-config.yaml file below to create a single-zone cluster:
apiVersion: v1
baseDomain: qe.devcluster.openshift.com
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    vsphere:
      cpus: 4
      memoryMB: 8192
      osDisk:
        diskSizeGB: 60
      zones:
      - us-east-1
  replicas: 2
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform:
    vsphere:
      cpus: 4
      memoryMB: 16384
      osDisk:
        diskSizeGB: 60
      zones:
      - us-east-1
  replicas: 3
metadata:
  name: jima-permission
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineNetwork:
  - cidr: 10.19.46.0/24
  networkType: OVNKubernetes
  serviceNetwork:
  - 172.30.0.0/16
platform:
  vsphere:
    apiVIP: 10.19.46.99
    cluster: qe-cluster/workloads
    datacenter: OCP-DC
    defaultDatastore: my-nfs
    ingressVIP: 10.19.46.98
    network: "VM Network"
    username: administrator@vsphere.local
    password: xxx
    vCenter: xxx
    vcenters:
    - server: xxx
      user: administrator@vsphere.local
      password: xxx
      datacenters:
      - OCP-DC
    failureDomains:
    - name: us-east-1
      region: us-east
      zone: us-east-1a
      topology:
        datacenter: OCP-DC
        computeCluster: /OCP-DC/host/qe-cluster/workloads
        networks:
        - "VM Network"
        datastore: my-nfs
        server: xxx
pullSecret: xxx
installer get error:
$ ./openshift-install create cluster --dir ipi5 --log-level debug
DEBUG Generating Platform Provisioning Check...
DEBUG Fetching Common Manifests...
DEBUG Reusing previously-fetched Common Manifests
DEBUG Generating Terraform Variables...
FATAL failed to fetch Terraform Variables: failed to generate asset "Terraform Variables": failed to get vSphere network ID: could not find vSphere cluster at /OCP-DC/host//OCP-DC/host/qe-cluster/workloads: cluster '/OCP-DC/host//OCP-DC/host/qe-cluster/workloads' not found
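The doubled `/OCP-DC/host/` prefix in the error suggests the installer prepends its default cluster path even when the configured computeCluster is already an absolute inventory path. A minimal sketch of the kind of guard that avoids the double prefix; the function name and the exact path-joining rule are assumptions, not the installer's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// clusterInventoryPath returns the full vSphere inventory path of a compute
// cluster. If the configured value is already an absolute path (starts with
// "/"), it is used as-is; otherwise the conventional
// /<datacenter>/host/<cluster> path is built.
func clusterInventoryPath(datacenter, cluster string) string {
	if strings.HasPrefix(cluster, "/") {
		return cluster
	}
	return fmt.Sprintf("/%s/host/%s", datacenter, cluster)
}

func main() {
	// Relative value from platform.vsphere.cluster: the prefix is added.
	fmt.Println(clusterInventoryPath("OCP-DC", "qe-cluster/workloads"))
	// Absolute value from failureDomains[].topology.computeCluster:
	// without the guard this would become /OCP-DC/host//OCP-DC/host/...
	fmt.Println(clusterInventoryPath("OCP-DC", "/OCP-DC/host/qe-cluster/workloads"))
}
```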
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-10-05-053337
How reproducible:
always
Steps to Reproduce:
1. Create a new host/cluster folder under the datacenter, and move the vSphere cluster into that folder
2. Prepare the install-config with zone configuration
3. Deploy the cluster
Actual results:
Cluster creation fails
Expected results:
Cluster creation succeeds
Additional info:
Description of problem:
In the control plane machine set operator we perform e2e periodic tests that check the ability to do a rolling update of an entire OCP control plane.
This is quite an involved test: we need to drain and replace all the master machines/nodes, which means altering operators, waiting for machines to come up and bootstrap, and waiting for nodes to drain and move their workloads elsewhere while respecting PDBs and etcd quorum.
As such we need to make sure we are robust to transient issues, occasional slow-downs, and network errors.
We have investigated these timeout issues and identified some common culprits that we want to address, see: https://redhat-internal.slack.com/archives/GE2HQ9QP4/p1678966522151799
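A minimal sketch of the sort of retry-with-timeout wrapper such an e2e test can use so that a single transient API error or slow rollout does not fail the run. The condition function, interval, and timeout are placeholders, not the actual CPMS test code.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForRollingUpdate polls checkFn until it reports the control plane
// rollout as complete, tolerating transient errors by retrying instead of
// failing the test immediately.
func waitForRollingUpdate(ctx context.Context, checkFn func(context.Context) (bool, error)) error {
	return wait.PollUntilContextTimeout(ctx, 1*time.Second, 5*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			done, err := checkFn(ctx)
			if err != nil {
				// Log and retry: a transient API/network error should not
				// abort a long rolling-update test.
				fmt.Printf("transient error, will retry: %v\n", err)
				return false, nil
			}
			return done, nil
		})
}

func main() {
	// Toy condition that "completes" after a few polls.
	polls := 0
	_ = waitForRollingUpdate(context.Background(), func(context.Context) (bool, error) {
		polls++
		return polls >= 3, nil
	})
	fmt.Println("rolling update observed as complete")
}
```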
Description of problem:
CPO reconciliation loop hangs after "Reconciling infrastructure status"
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Frequently
Steps to Reproduce:
1. Create a HostedCluster with a recent 4.14 release image
2. Watch CPO logs
3.
Actual results:
Reconcile gets stuck
Expected results:
Reconcile happens fairly quickly
Additional info:
Description of problem:
Cluster upgrade failure has been affecting three consecutive nightly payloads:

https://amd64.ocp.releases.ci.openshift.org/releasestream/4.14.0-0.nightly/release/4.14.0-0.nightly-2023-05-20-041508
https://amd64.ocp.releases.ci.openshift.org/releasestream/4.14.0-0.nightly/release/4.14.0-0.nightly-2023-05-21-120836
https://amd64.ocp.releases.ci.openshift.org/releasestream/4.14.0-0.nightly/release/4.14.0-0.nightly-2023-05-22-035713

In all three cases, the upgrade seems to fail waiting on network. Take this job as an example: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-sdn-upgrade/1660495736527130624

The cluster version operator complains that the network operator has not finished upgrading:

I0522 07:12:58.540244 1 sync_worker.go:1149] Update error 684 of 845: ClusterOperatorUpdating Cluster operator network is updating versions (*errors.errorString: cluster operator network is available and not degraded but has not finished updating to target version)

This log can be seen in https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-sdn-upgrade/1660495736527130624/artifacts/e2e-aws-sdn-upgrade/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-5565f87cc6-6sjqf_cluster-version-operator.log

The network operator keeps waiting with the following log, and this lasted over 2 hours:

I0522 07:12:58.563312 1 connectivity_check_controller.go:166] ConnectivityCheckController is waiting for transition to desired version (4.14.0-0.nightly-2023-05-22-035713) to be completed.

The log can be seen in https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-sdn-upgrade/1660495736527130624/artifacts/e2e-aws-sdn-upgrade/gather-extra/artifacts/pods/openshift-network-operator_network-operator-6975b7b8ff-pdxzk_network-operator.log

Compared with a working job, there seems to be an error getting *v1alpha1.PodNetworkConnectivityCheck in openshift-network-diagnostics_network-check-source:

W0522 04:34:18.527315 1 reflector.go:424] k8s.io/client-go@v12.0.0+incompatible/tools/cache/reflector.go:169: failed to list *v1alpha1.PodNetworkConnectivityCheck: the server could not find the requested resource (get podnetworkconnectivitychecks.controlplane.operator.openshift.io)
E0522 04:34:18.527391 1 reflector.go:140] k8s.io/client-go@v12.0.0+incompatible/tools/cache/reflector.go:169: Failed to watch *v1alpha1.PodNetworkConnectivityCheck: failed to list *v1alpha1.PodNetworkConnectivityCheck: the server could not find the requested resource (get podnetworkconnectivitychecks.controlplane.operator.openshift.io)

It is not clear whether this is really relevant. Also worth mentioning: every time this problem happens, machine-config and dns are also stuck on the older version.

This has affected the 4.14 nightly payload three times. If it shows more consistency, we might have to increase the severity of the bug. Please ping TRT if any more info is needed.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
During installation:

level=error msg=Error: reading Security Group (sg-0f07c871bdbd6379f) Rules: UnauthorizedOperation: You are not authorized to perform this operation.
level=error msg= status code: 403, request id: f3e18ac0-f2fc-471f-8055-7194112c8225

Users are unable to create the security groups for the bootstrap node.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Warning/Error should come up when the permission does not exist.
Additional info:
Starting with https://amd64.origin.releases.ci.openshift.org/releasestream/4.13.0-0.okd/release/4.13.0-0.okd-2023-02-28-170012 multiple storage tests are failing:
[sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic PV (block volmode)] volumes should store data [Skipped:NoOptionalCapabilities] [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic PV (block volmode)] provisioning should provision storage with pvc data source [Skipped:NoOptionalCapabilities] [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic PV (block volmode)] provisioning should provision storage with snapshot data source [Feature:VolumeSnapshotDataSource] [Skipped:NoOptionalCapabilities] [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: block] [Testpattern: Pre-provisioned PV (block volmode)] volumes should store data [Skipped:NoOptionalCapabilities] [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-storage] In-tree Volumes [Driver: aws] [Testpattern: Dynamic PV (block volmode)] volumes should store data [Skipped:NoOptionalCapabilities] [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-storage] In-tree Volumes [Driver: aws] [Testpattern: Pre-provisioned PV (block volmode)] volumes should store data [Skipped:NoOptionalCapabilities] [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-storage] PersistentVolumes-local [Volume type: block] One pod requesting one prebound PVC should be able to mount volume and write from pod1 [Skipped:NoOptionalCapabilities] [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-storage] PersistentVolumes-local [Volume type: block] One pod requesting one prebound PVC should be able to mount volume and read from pod1 [Skipped:NoOptionalCapabilities] [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-storage] PersistentVolumes-local [Volume type: block] Two pods mounting a local volume at the same time should be able to write from pod1 and read from pod2 [Skipped:NoOptionalCapabilities] [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-storage] PersistentVolumes-local [Volume type: block] Two pods mounting a local volume one after the other should be able to write from pod1 and read from pod2 [Skipped:NoOptionalCapabilities] [Suite:openshift/conformance/parallel] [Suite:k8s]
cc Hemant Kumar
When we try to create a cluster with --secret-creds (an MCE AWS k8s secret that includes aws-creds, the pull secret, and the base domain), the binary should not ask for a pull secret. However, it does now, after switching from the hypershift CLI to hcp.
Adding the pull-secret parameter allows the command to continue as expected, though I would think the whole point of --secret-creds is to reuse what already exists.
/usr/local/bin/hcp create cluster aws --name acmqe-hc-ad5b1f645d93464c --secret-creds test1-cred --region us-east-1 --node-pool-replicas 1 --namespace local-cluster --instance-type m6a.xlarge --release-image quay.io/openshift-release-dev/ocp-release:4.14.0-ec.4-multi --generate-ssh

Output:
Error: required flag(s) "pull-secret" not set
required flag(s) "pull-secret" not set
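The 'required flag(s) "pull-secret" not set' message is cobra's standard error for a flag marked required. A minimal sketch, assuming the fix is to validate the flag conditionally in PreRunE instead of marking it required unconditionally, so --secret-creds can stand in for --pull-secret; the flag names follow the command above, everything else is illustrative.

```go
package main

import (
	"fmt"

	"github.com/spf13/cobra"
)

func newCreateClusterAWSCommand() *cobra.Command {
	var pullSecret, secretCreds string

	cmd := &cobra.Command{
		Use: "aws",
		// Validate conditionally instead of cmd.MarkFlagRequired("pull-secret"):
		// --secret-creds already carries the pull secret, so only require
		// --pull-secret when no credentials secret is given.
		PreRunE: func(cmd *cobra.Command, args []string) error {
			if secretCreds == "" && pullSecret == "" {
				return fmt.Errorf("either --pull-secret or --secret-creds must be set")
			}
			return nil
		},
		RunE: func(cmd *cobra.Command, args []string) error {
			fmt.Println("creating cluster...")
			return nil
		},
	}
	cmd.Flags().StringVar(&pullSecret, "pull-secret", "", "path to the pull secret file")
	cmd.Flags().StringVar(&secretCreds, "secret-creds", "", "name of an MCE secret with aws-creds, pull secret, and base domain")
	return cmd
}

func main() {
	cmd := newCreateClusterAWSCommand()
	cmd.SetArgs([]string{"--secret-creds", "test1-cred"})
	if err := cmd.Execute(); err != nil {
		fmt.Println("error:", err)
	}
}
```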
2.4.0-DOWNANDBACK-2023-08-31-13-34-02 or mce 2.4.0-137
hcp version openshift/hypershift: 8b4b52925d47373f3fe4f0d5684c88dc8a93368a. Latest supported OCP: 4.14.0
always
Tracker issue for bootimage bump in 4.14. This issue should block issues which need a bootimage bump to fix.
The previous bump was OCPBUGS-13061.
Description of problem:
When a fresh normal user visits the BuildConfigs page of the 'default' project, an error page is shown.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-07-05-191022
How reproducible:
Always
Steps to Reproduce:
1. A normal user without any projects logs in to the console
2. Switch to the Admin perspective
3. Visit the workloads pages for the 'default' project, for example:
   /k8s/ns/default/route.openshift.io~v1~Route
   /k8s/ns/default/core~v1~Service
   /k8s/ns/default/apps~v1~Deployment
   /k8s/ns/default/build.openshift.io~v1~BuildConfig
Actual results:
3. We can see an error page when visiting BuildConfigs page
Expected results:
3. no error should be shown and show consistent info with other workloads page
Additional info:
Description of problem:
Repository creation in the console asks for a mandatory secret and does not allow creating a repository even for a public git URL, which is odd. It works fine with the OCP CLI, however.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create repository crd via openshift console 2. 3.
Actual results:
It does not allow me to create the repository
Expected results:
We should be able to create repository crd
Additional info:
slack thread: https://redhat-internal.slack.com/archives/C6A3NV5J9/p1691057766516119
Description of problem:
With 120+ node clusters, we are seeing an O(10)-times larger rate of node patch requests coming from node service accounts. This higher rate of updates causes "nodes" watchers to be terminated, which in turn causes a storm of watch requests that increases CPU load on the cluster. What I see is that node resourceVersions increment rapidly and in large bursts, and watchers are terminated as a result.
Version-Release number of selected component (if applicable):
4.14.0-ec.4 4.14.0-0.nightly-2023-08-08-222204 4.13.0-0.nightly-2023-08-10-021434
How reproducible:
Repeatable
Steps to Reproduce:
1. Create a 4.14 cluster with 120 nodes, with an m5.8xlarge control plane and c5.4xlarge workers.
2. Run `oc get nodes -w -o custom-columns='NAME:.metadata.name,RV:.metadata.resourceVersion'` (or the equivalent client-go watch sketched below).
3. Wait for a big chunk of nodes to be updated and observe the watch terminate.
4. Optionally run `kube-burner ocp node-density-cni --pods-per-node=100` to generate some load.
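For reference, a minimal client-go sketch that does roughly what step 2 does with oc: watch Nodes, print each event's resourceVersion, and report when the watch channel closes (i.e. the watch is terminated). It assumes a kubeconfig at the default location; it is an observation aid, not part of any fix.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Start a watch on Nodes; the apiserver closes the channel when the
	// watcher is terminated (see apiserver_terminated_watchers_total).
	w, err := client.CoreV1().Nodes().Watch(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for event := range w.ResultChan() {
		if obj, ok := event.Object.(interface{ GetResourceVersion() string }); ok {
			fmt.Printf("%s resourceVersion=%s\n", event.Type, obj.GetResourceVersion())
		}
	}
	fmt.Println("watch channel closed: the nodes watch was terminated")
}
```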
Actual results:
kube-apiserver audit events show >1500 node patch requests from a single node SA in a certain amount of time:

1678 ["system:node:ip-10-0-69-142.us-west-2.compute.internal",null]
1679 ["system:node:ip-10-0-33-131.us-west-2.compute.internal",null]
1709 ["system:node:ip-10-0-41-44.us-west-2.compute.internal",null]

Observe that apiserver_terminated_watchers_total{resource="nodes"} starts to increment before the 120-node scaleup is even complete.
Expected results:
Patch requests in the same amount of time are more aligned with what we see on the 4.13 08-10 nightly:

57 ["system:node:ip-10-0-247-122.us-west-2.compute.internal",null]
62 ["system:node:ip-10-0-239-217.us-west-2.compute.internal",null]
63 ["system:node:ip-10-0-165-255.us-west-2.compute.internal",null]
64 ["system:node:ip-10-0-136-122.us-west-2.compute.internal",null]

Observe that apiserver_terminated_watchers_total{resource="nodes"} does not increment. Observe that the rate of mutating node requests levels off after the nodes are created.
Additional info:
We suspect these updates coming from nodes could be a response to the MCO controllerconfigs resource being updated every few minutes or more frequently. This is one of the suspected causes in the investigation of increased kube-apiserver CPU usage with OVN-IC.
An upstream partial fix to logging means that the BMO log now contains a mixture of structured and unstructured logs, making it impossible to read with the structured log parsing tool (bmo-log-parse) we use for debugging customer issues.
This is fixed upstream by https://github.com/metal3-io/baremetal-operator/pull/1249, which will get picked up automatically in 4.14 but which needs to be backported to 4.13.
Description of problem:
Currently, only one ServerGroup is created in OpenStack when 3 masters are deployed across 3 AZs, while 3 should have been created (one per AZ). With the work on CPMS, we made the decision to create only one ServerGroup for the masters. However, this requires a change in the installer to reflect that decision: when AZs were specified, each master machine referenced its own ServerGroup, while only one actually existed in OpenStack. This was a mistake, but instead of fixing that bug, we will change the behaviour to have only one ServerGroup for the masters.
Version-Release number of selected component (if applicable):
latest (4.14)
How reproducible: deploy a control plane with 3 failure domains:
controlPlane:
  name: master
  platform:
    openstack:
      type: m1.xlarge
      failureDomains:
      - computeAvailabilityZone: az0
      - computeAvailabilityZone: az1
      - computeAvailabilityZone: az2
Steps to Reproduce:
1. Deploy the control plane in 3 AZs
2. List the OpenStack Compute Server Groups
Actual results:
+--------------------------------------+--------------------------+--------------------+
| ID                                   | Name                     | Policy             |
+--------------------------------------+--------------------------+--------------------+
| 0750c579-d2cf-41b3-9e88-003dcbcad0c5 | refarch-jkn8g-master-az0 | soft-anti-affinity |
| 05715c08-ac2b-439d-9bd5-5803ac40c322 | refarch-jkn8g-worker     | soft-anti-affinity |
+--------------------------------------+--------------------------+--------------------+
Expected results without our work on CPMS:
refarch-jkn8g-master-az1 and refarch-jkn8g-master-az2 should have been created.
This expectation is purely for documentation, QE should ignore it.
Expected results with our work on CPMS (which should be taken in account by QE when testing CPMS):
refarch-jkn8g-master-az0 should not exist, and the ServerGroup should be named refarch-jkn8g-master. All the masters should use that ServerGroup in both the Nova instance properties and in the MachineSpec once the machines are enrolled by CCPMSO.
Description of problem:
In a 4.14 nightly HyperShift hosted cluster, aws-pod-identity does not work: pods are not injected with the env vars AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE.
In a 4.13 HyperShift hosted cluster it works well; see Additional info.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-08-11-055332
How reproducible:
Always
Steps to Reproduce:
1.
$ export KUBECONFIG=/path/to/hypershift-hosted-cluster/kubeconfig
$ ogcv
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2023-08-11-055332   True        False         8h      Cluster version is 4.14.0-0.nightly-2023-08-11-055332
$ oc get mutatingwebhookconfigurations --context admin
NAME               WEBHOOKS   AGE
aws-pod-identity   1          6h5m
$ oc get --raw=/.well-known/openid-configuration | jq -r '.issuer'
https://xxxx.s3.us-east-2.amazonaws.com/hypershift-xxxx

2.
$ oc new-project xxia-proj
$ oc create sa aws-provider
serviceaccount/aws-provider created

3.
$ ccoctl aws create-iam-roles --name=xxia --region=$REGION --credentials-requests-dir=credentialsrequest-dir-aws --identity-provider-arn=arn:aws:iam::xxxx:oidc-provider/xxxx.s3.us-east-2.amazonaws.com/hypershift-xxxx --output-dir=credrequests-ccoctl-output
2023/08/24 17:54:32 Role arn:aws:iam::xxxx:role/xxia-xxia-proj-aws-creds created
2023/08/24 17:54:32 Saved credentials configuration to: credrequests-ccoctl-output/manifests/xxia-proj-aws-creds-credentials.yaml
2023/08/24 17:54:32 Updated Role policy for Role xxia-xxia-proj-aws-creds

4.
$ oc annotate sa/aws-provider eks.amazonaws.com/role-arn="arn:aws:iam::xxxx:role/xxia-xxia-proj-aws-creds"
$ oc create deployment aws-cli --image=amazon/aws-cli --dry-run=client -o yaml -- sleep 360d | sed "/containers/i \ serviceAccountName: aws-provider" | oc create -f -
deployment.apps/aws-cli created
$ oc get po
NAME                       READY   STATUS    RESTARTS   AGE
aws-cli-5c4f6d7d5b-g6d5v   1/1     Running   0          18s

5.
$ oc rsh aws-cli-5c4f6d7d5b-g6d5v
sh-4.2$ env | grep AWS
sh-4.2$ ls /var/run/secrets/eks.amazonaws.com/serviceaccount/token
ls: cannot access /var/run/secrets/eks.amazonaws.com/serviceaccount/token: No such file or directory
sh-4.2$ exit
command terminated with exit code 1
Actual results:
5. No AWS env vars.
Expected results:
5. Should have AWS env vars.
Additional info:
In 4.13 HyperShift hosted cluster, it works well:
1.
$ ogcv
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-0.nightly-2023-08-11-101506   True        False         10h     Cluster version is 4.13.0-0.nightly-2023-08-11-101506
$ oc get --raw=/.well-known/openid-configuration | jq -r '.issuer'
https://aos-xxxx.s3.us-east-2.amazonaws.com/xxxx
$ oc get no
NAME                                        STATUS   ROLES    AGE   VERSION
ip-10-0-139-76.us-east-2.compute.internal   Ready    worker   10h   v1.26.6+6bf3f75
...
$ REGION=us-east-2

2.
$ oc new-project xxia-proj
$ oc create sa aws-provider

3.
$ ccoctl aws create-iam-roles --name=xxia-test --region=$REGION --credentials-requests-dir=credentialsrequest-dir-aws --identity-provider-arn=arn:aws:iam::xxxx:oidc-provider/aos-xxxx.s3.us-east-2.amazonaws.com/xxxx --output-dir=credrequests-ccoctl-output
2023/08/24 20:06:53 Role arn:aws:iam::xxxx:role/xxia-test-xxia-proj-aws-creds created
2023/08/24 20:06:53 Saved credentials configuration to: credrequests-ccoctl-output/manifests/xxia-proj-aws-creds-credentials.yaml
2023/08/24 20:06:53 Updated Role policy for Role xxia-test-xxia-proj-aws-creds

4.
$ oc annotate sa/aws-provider eks.amazonaws.com/role-arn="arn:aws:iam::xxxx:role/xxia-test-xxia-proj-aws-creds"
$ oc create deployment aws-cli --image=amazon/aws-cli --dry-run=client -o yaml -- sleep 360d | sed "/containers/i \ serviceAccountName: aws-provider" | oc create -f -
$ oc get pod
NAME                       READY   STATUS    RESTARTS   AGE
aws-cli-84875995cc-svszl   1/1     Running   0          16s

5.
$ oc rsh aws-cli-84875995cc-svszl
sh-4.2$ env | grep AWS
AWS_ROLE_ARN=arn:aws:iam::xxxx:role/xxia-test-xxia-proj-aws-creds
AWS_WEB_IDENTITY_TOKEN_FILE=/var/run/secrets/eks.amazonaws.com/serviceaccount/token
AWS_DEFAULT_REGION=us-east-2
AWS_REGION=us-east-2
Description of problem:
When upgrading a 4.11.33 cluster to 4.12.21, the Cluster Version Operator is stuck waiting for the Network Operator to update:

$ omc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.43   True        True          14m     Working towards 4.12.21: 672 of 831 done (80% complete), waiting on network

The CVO pod log states:

2023-06-16T12:07:22.596127142Z I0616 12:07:22.596023 1 metrics.go:490] ClusterOperator network is not setting the 'operator' version

Indeed the NO version is empty:

$ omc get co network -o json | jq '.status.versions'
null

However, its status is available, not progressing, and not degraded:

$ omc get co network
NAME      VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
network             True        False         False      19m

The Network Operator pod log states:

2023-06-16T12:08:56.542287546Z I0616 12:08:56.542271 1 connectivity_check_controller.go:138] ConnectivityCheckController is waiting for transition to desired version (4.12.21) to be completed.
2023-06-16T12:04:40.584407589Z I0616 12:04:40.584349 1 ovn_kubernetes.go:1437] OVN-Kubernetes master and node already at release version 4.12.21; no changes required

The Network Operator pod, however, has the version correctly:

$ omc get pods -n openshift-network-operator -o jsonpath='{.items[].spec.containers[0].env[?(@.name=="RELEASE_VERSION")]}' | jq
{
  "name": "RELEASE_VERSION",
  "value": "4.12.21"
}

Restarts of the related pods had no effect. I have trace logs of the Network Operator available. It looked like it might be related to https://github.com/openshift/cluster-network-operator/pull/1818, but that looks to be code introduced in 4.14.
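For context, the CVO considers an operator updated only once its ClusterOperator reports an operand version named "operator" that matches the target release (that is what the "is not setting the 'operator' version" log above refers to). A minimal sketch of what that status entry looks like when populated, using the openshift/api types; this is illustrative, not the cluster-network-operator's actual status-reporting code.

```go
package main

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
)

func main() {
	co := &configv1.ClusterOperator{}
	co.Name = "network"

	// The CVO reads status.versions and waits for the entry named "operator"
	// to equal the target release; if it is missing (null, as in the omc
	// output above), the upgrade reports "waiting on network" indefinitely.
	co.Status.Versions = []configv1.OperandVersion{
		{Name: "operator", Version: "4.12.21"},
	}

	for _, v := range co.Status.Versions {
		fmt.Printf("%s=%s\n", v.Name, v.Version)
	}
}
```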
Version-Release number of selected component (if applicable):
How reproducible:
I have not reproduced.
Steps to Reproduce:
1. Cluster version began at stable 4.10.56
2. Upgraded to 4.11.43 successfully
3. Upgraded to 4.12.21 and is stuck
Actual results:
CVO Stuck waiting on NO to complete, NO
Expected results:
NO to update its version so the CVO can continue.
Additional info:
Bare Metal IPI cluster with OVN Networking.
This is a clone of issue OCPBUGS-18396. The following is the description of the original issue:
—
CI is almost perma failing on mtu migration in 4.14 (both SDN and OVN-Kubernetes):
Looks like the common issue is that waiting for MCO times out:

+ echo '[2023-08-31T03:58:16+00:00] Waiting for final Machine Controller Config...'
[2023-08-31T03:58:16+00:00] Waiting for final Machine Controller Config...
+ timeout 900s bash
migration field is not cleaned by MCO
migration field is not cleaned by MCO
migration field is not cleaned by MCO
(the same "migration field is not cleaned by MCO" line repeats until the 900s timeout expires)
...
Description of problem:
[vmware csi driver] vsphere-syncer does not retry populating the CSINodeTopology with topology information when registration fails. When the syncer starts it watches for node events, but it does not retry if registration fails, and in the meanwhile CSINodeTopology requests might not get served because the VM is not found.
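A minimal sketch of the kind of retry-with-backoff the description asks for: keep re-attempting node registration instead of giving up after one failure. wait.ExponentialBackoff comes from k8s.io/apimachinery; registerNode is a placeholder for the syncer's real registration call, not the actual driver code.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// registerNode is a stand-in for the syncer's real registration call, which
// can fail transiently (e.g. the VM is not found yet).
func registerNode(name string) error {
	return fmt.Errorf("virtual machine %q not found yet", name)
}

// registerNodeWithRetry retries registration with exponential backoff instead
// of failing once and leaving CSINodeTopology requests unserved.
func registerNodeWithRetry(name string) error {
	backoff := wait.Backoff{
		Duration: 2 * time.Second, // first retry delay
		Factor:   2.0,             // double the delay each attempt
		Steps:    5,               // give up after 5 attempts
	}
	return wait.ExponentialBackoff(backoff, func() (bool, error) {
		if err := registerNode(name); err != nil {
			fmt.Printf("registration of %s failed, will retry: %v\n", name, err)
			return false, nil // retry
		}
		return true, nil // done
	})
}

func main() {
	if err := registerNodeWithRetry("compute-2"); err != nil {
		fmt.Println("giving up:", err)
	}
}
```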
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2023-05-04-090524
How reproducible:
Randomly
Steps to Reproduce:
1. Install an OCP cluster by UPI with encryption
2. Check that the cluster storage operator does not degrade
Actual results:
The cluster storage operator degrades with:

VSphereCSIDriverOperatorCRProgressing: VMwareVSphereDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods

...
2023-05-09T06:06:22.146861934Z I0509 06:06:22.146850 1 main.go:183] ServeMux listening at "0.0.0.0:10300"
2023-05-09T06:07:00.283007138Z E0509 06:07:00.282912 1 main.go:64] failed to establish connection to CSI driver: context canceled
2023-05-09T06:07:07.283109412Z W0509 06:07:07.283061 1 connection.go:173] Still connecting to unix:///csi/csi.sock
...

# Many error logs in the CSI driver related to: timed out while waiting for topology labels to be updated in "compute-2" CSINodeTopology instance.

...
2023-05-09T06:19:16.499856730Z {"level":"error","time":"2023-05-09T06:19:16.499687071Z","caller":"k8sorchestrator/topology.go:837","msg":"timed out while waiting for topology labels to be updated in \"compute-2\" CSINodeTopology instance.","TraceId":"b8d9305e-9681-4eba-a8ac-330383227a23","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/common/commonco/k8sorchestrator.(*nodeVolumeTopology).GetNodeTopologyLabels\n\t/go/src/github.com/kubernetes-sigs/vsphere-csi-driver/pkg/csi/service/common/commonco/k8sorchestrator/topology.go:837\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service.(*vsphereCSIDriver).NodeGetInfo\n\t/go/src/github.com/kubernetes-sigs/vsphere-csi-driver/pkg/csi/service/node.go:429\ngithub.com/container-storage-interface/spec/lib/go/csi._Node_NodeGetInfo_Handler\n\t/go/src/github.com/kubernetes-sigs/vsphere-csi-driver/vendor/github.com/container-storage-interface/spec/lib/go/csi/csi.pb.go:6231\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/src/github.com/kubernetes-sigs/vsphere-csi-driver/vendor/google.golang.org/grpc/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/src/github.com/kubernetes-sigs/vsphere-csi-driver/vendor/google.golang.org/grpc/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/src/github.com/kubernetes-sigs/vsphere-csi-driver/vendor/google.golang.org/grpc/server.go:922"}
...
Expected results:
Install vsphere ocp cluster succeed and the cluster storage operator is healthy
Additional info:
Version:
$ openshift-install version
./openshift-install 4.11.0-0.nightly-2022-07-13-131410
built from commit cdb9627de7efb43ad7af53e7804ddd3434b0dc58
release image registry.ci.openshift.org/ocp/release@sha256:c5413c0fdd0335e5b4063f19133328fee532cacbce74105711070398134bb433
release architecture amd64
Platform:
What happened?
When one creates an IPI Azure cluster with an `internal` publishing method, it creates a standard load balancer with an empty definition. This load balancer doesn't serve a purpose as far as I can tell since the configuration is completely empty. Because it doesn't have a public IP address and backend pools it's not providing any outbound connectivity, and there are no frontend IP configurations for ingress connectivity to the cluster.
Below is the ARM template that is deployed by the installer (through terraform)
```
{
"$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
"parameters": {
"loadBalancers_mgahagan411_7p82n_name":
},
"variables": {},
"resources": [
{
"type": "Microsoft.Network/loadBalancers",
"apiVersion": "2020-11-01",
"name": "[parameters('loadBalancers_mgahagan411_7p82n_name')]",
"location": "northcentralus",
"sku":
,
"properties":
}
]
}
```
What did you expect to happen?
How to reproduce it (as minimally and precisely as possible)?
1. Create an IPI cluster with the `publish` installation config set to `Internal` and the `outboundType` set to `UserDefinedRouting`.
```
apiVersion: v1
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform:
    azure: {}
  replicas: 3
compute:
```

2. Show that the JSON content of the standard load balancer is completely empty:
`az network lb show -g myResourceGroup -n myLbName`
```
{
"name": "mgahagan411-7p82n",
"id": "/subscriptions/00000000-0000-0000-00000000/resourceGroups/mgahagan411-7p82n-rg/providers/Microsoft.Network/loadBalancers/mgahagan411-7p82n",
"etag": "W/\"40468fd2-e56b-4429-b582-6852348b6a15\"",
"type": "Microsoft.Network/loadBalancers",
"location": "northcentralus",
"tags": {},
"properties":
,
"sku":
}
```
As a developer, I would like to make sure we are using the latest versions of the dependencies we utilize in the /hack/tools/go.mod file.
Description of problem:
On a 4.12.0-0.nightly-2022-09-08-114806 AWS cluster, "remote error: tls: bad certificate" appears in the prometheus-operator-admission-webhook logs. This should be a regression: there is no such issue in 4.11. The defect does not block the function; it seems to come from AWS.
$ oc -n openshift-monitoring get pod | grep prometheus-operator-admission-webhook
prometheus-operator-admission-webhook-7d8fd8b5bb-kjh4f 1/1 Running 0 3h
prometheus-operator-admission-webhook-7d8fd8b5bb-whl5n 1/1 Running 0 3h

$ oc -n openshift-monitoring logs prometheus-operator-admission-webhook-7d8fd8b5bb-kjh4f
level=info ts=2022-09-08T23:32:53.782445094Z caller=main.go:130 address=[::]:8443 msg="Starting TLS enabled server"
ts=2022-09-08T23:33:09.057366056Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:52820: remote error: tls: bad certificate"
ts=2022-09-08T23:33:10.071639453Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:52830: remote error: tls: bad certificate"
ts=2022-09-08T23:33:12.07959313Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:52842: remote error: tls: bad certificate"
ts=2022-09-08T23:33:31.729332249Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:39188: remote error: tls: bad certificate"
ts=2022-09-08T23:33:32.7374936Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:39196: remote error: tls: bad certificate"
ts=2022-09-08T23:33:34.745945871Z caller=stdlib.go:105 caller=server.go:3195 msg="http: TLS handshake error from 10.128.0.9:39206: remote error: tls: bad certificate"
... (the same "http: TLS handshake error from 10.128.0.9:<port>: remote error: tls: bad certificate" message repeats every few minutes, from 2022-09-08T23:33 through 2022-09-09T02:21)
Version-Release number of selected component (if applicable):
"remote error: tls: bad certificate" is in prometheus-operator-admission-webhook logs
How reproducible:
always
Steps to Reproduce:
1. check prometheus-operator-admission-webhook logs.
Actual results:
"remote error: tls: bad certificate" is in prometheus-operator-admission-webhook logs
Expected results:
no error logs
Additional info:
Description of problem:
We are facing the same issue as JIRA[1] in OCP 4.12, and this bug is for backporting that fix to OCP 4.12.

JIRA[1]: https://issues.redhat.com/browse/OCPBUGS-14064

Port 9447 is exposed from the cluster on one of the control nodes and is using weak ciphers and TLS 1.0 / TLS 1.1, which is incompatible with the security standards for our product release. Either we should be able to disable this port, or the cipher suites and TLS version should be updated to meet the security standards; as you are aware, TLS 1.0 and TLS 1.1 are old and already deprecated.

We confirmed that FIPS was enabled during cluster deployment by passing the key-value pair in the config file:

fips: true

On JIRA[1] it is suggested to open a separate bug for backporting.
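For illustration, a minimal sketch of the kind of server-side TLS settings the description asks for: a minimum of TLS 1.2 and a restricted cipher list, using Go's crypto/tls. The port, certificate paths, and cipher selection here are examples, not the actual configuration of the component listening on 9447.

```go
package main

import (
	"crypto/tls"
	"log"
	"net/http"
)

func main() {
	tlsCfg := &tls.Config{
		// Refuse TLS 1.0/1.1 handshakes entirely.
		MinVersion: tls.VersionTLS12,
		// Allow only strong AEAD cipher suites for TLS 1.2
		// (TLS 1.3 suites are not configurable and are always strong).
		CipherSuites: []uint16{
			tls.TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,
			tls.TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,
			tls.TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,
			tls.TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,
		},
	}

	srv := &http.Server{
		Addr:      ":9447", // example port from the report
		TLSConfig: tlsCfg,
		Handler:   http.NewServeMux(),
	}
	// Certificate and key paths are placeholders.
	log.Fatal(srv.ListenAndServeTLS("tls.crt", "tls.key"))
}
```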
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
controller: Drop noisy log message about certificates
I often turn to the controller pod logs to debug issues, and
this log message is repeated very often. While it was
probably useful at the time the feature was being developed/tested
I doubt it will be necessary in the future.
In the end, the status really is the debugging frontend I believe.
controller: Drop noisy BaseOSContainerImage log message
In general we should avoid logging unless something changed.
I don't believe we need this log message; we can detect OS
changes from e.g. the MCD logs.
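A minimal sketch of the "log only when something changed" pattern suggested above; this is not the MCO controller's actual code, and the function name is hypothetical.
```go
package controller

import "k8s.io/klog/v2"

// logBaseOSImageChange logs the OS image only when it differs from the value
// seen on the previous reconcile, instead of on every sync.
func logBaseOSImageChange(oldImage, newImage string) {
	if oldImage == newImage {
		// Nothing changed between reconciles, so stay quiet.
		return
	}
	klog.Infof("BaseOSContainerImage changed from %q to %q", oldImage, newImage)
}
```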
Description of problem:
The HyperShift KubeVirt (OpenShift Virtualization) platform has worker nodes that are hosted by KubeVirt virtual machines. The worker node's internal IP address is determined by inspecting the KubeVirt VMI's vmi.status.interfaces field. Because that field sources its information from the qemu guest agent, it is not guaranteed to remain static in some scenarios, such as a soft reboot or when the qemu guest agent is temporarily unavailable. During these situations, the interfaces list will be empty. When the interfaces list on the VMI is empty, the HyperShift-related components (cloud-provider-kubevirt and cluster-api-provider-kubevirt) strip the worker node's internal IP. This stripping of the node's internal IP causes unpredictable behavior that results in connectivity failures from the KAS to the worker node kubelets. To address this, the HyperShift-related KubeVirt components need to only update the internal IP of worker nodes when the vmi.status.interfaces list has an IP for the default interface. Otherwise these components should use the last known internal IP address rather than stripping the internal IP address from the node.
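A minimal sketch of the behaviour described above, not the actual cloud-provider-kubevirt code; the helper name is hypothetical, and it is simplified to use the first interface reporting an IP rather than matching the default interface by name.
```go
package kubevirtprovider

import (
	corev1 "k8s.io/api/core/v1"
	kubevirtv1 "kubevirt.io/api/core/v1"
)

// internalIP takes the internal IP from the VMI only when an interface actually
// reports one; otherwise it keeps the node's last known internal IP instead of
// stripping it.
func internalIP(vmi *kubevirtv1.VirtualMachineInstance, node *corev1.Node) string {
	for _, iface := range vmi.Status.Interfaces {
		if iface.IP != "" {
			return iface.IP
		}
	}
	// The interfaces list is empty (soft reboot, guest agent briefly
	// unavailable, ...): fall back to the address the node already has.
	for _, addr := range node.Status.Addresses {
		if addr.Type == corev1.NodeInternalIP {
			return addr.Address
		}
	}
	return ""
}
```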
Version-Release number of selected component (if applicable):
4.14
How reproducible:
100% given enough time and the right environment.
Steps to Reproduce:
1. create a hypershift kubevirt guest cluster 2. run the csi conformance test suite in a loop (this test suite causes the vmi.status.interfaces list to become unstable briefly at times)
Actual results:
the csi test suite will eventually begin failing due to inability to pod exec into worker node pods. This is caused by the node's internal IP being removed.
Expected results:
csi conformance should pass reliably
Additional info:
We have occasional cases where admins attempt a rollback, despite long-standing docs:
Only upgrading to a newer version is supported. Reverting or rolling back your cluster to a previous version is not supported. If your update fails, contact Red Hat support.
Deeper history for that content here, here, and here. We could refuse to accept rollbacks unless the administrator sets Force to waive our guards.
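A minimal sketch of what such a guard could look like (hypothetical function, not the CVO's actual code); it assumes the github.com/blang/semver/v4 module for version comparison.
```go
package updateguard

import (
	"fmt"

	"github.com/blang/semver/v4"
)

// validateTarget refuses a desired version older than the current one unless
// the administrator explicitly sets force to waive the guard.
func validateTarget(current, desired string, force bool) error {
	cur, err := semver.Parse(current)
	if err != nil {
		return fmt.Errorf("parsing current version: %w", err)
	}
	tgt, err := semver.Parse(desired)
	if err != nil {
		return fmt.Errorf("parsing desired version: %w", err)
	}
	if tgt.LT(cur) && !force {
		return fmt.Errorf("rolling back from %s to %s is not supported; set force to override", current, desired)
	}
	return nil
}
```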
From wking:
$ git --no-pager grep OCPBUGS-10218
test/e2e/nodepool_test.go: // TODO: (csrwng) Re-enable when https://issues.redhat.com/browse/OCPBUGS-10218 is fixed
test/e2e/nodepool_test.go: // TODO: (jparrill) Re-enable when https://issues.redhat.com/browse/OCPBUGS-10218 is fixed
but https://issues.redhat.com/browse/OCPBUGS-10218 was closed as a dup of https://issues.redhat.com/browse/OCPBUGS-10485 , and OCPBUGS-10485 is Verified with happy sounds for both 4.13 and 4.14 nightlies
Please review the following PR: https://github.com/openshift/cloud-provider-ibm/pull/48
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When working with the HorizontalNav component, it doesn't re-render when the location changes; currently it only updates itself when basePath changes. The re-render on location change was previously triggered by the withRouter HoC, which was recently removed.
Version-Release number of selected component (if applicable):
4.13
How reproducible:
1/1
Steps to Reproduce:
1. Go to Storage -> ODF (version 4.13-pre-release) 2. Click on Storage System Tab and then Topology tab 3.
Actual results:
The selected tab doesn't get highlighted as active tab.
Expected results:
The selected tab should have the active blue color.
Additional info:
This is a clone of issue OCPBUGS-18498. The following is the description of the original issue:
—
Description of problem:
If the cluster is installed without the Build and DeploymentConfig capabilities, using `oc new-app registry.redhat.io/<namespace>/<image>:<tag>` creates a Deployment with an empty spec.containers[0].image, and the Deployment fails to start its pod.
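The generated Deployment (shown in the actual results below) carries an image-trigger annotation and a blank container image, presumably because nothing resolves the trigger when the capability is disabled (an assumption, not a confirmed root cause). A minimal, hypothetical sketch of the expected end state: if the trigger cannot be resolved, the image should be set directly rather than left blank.
```go
package newapp

import (
	"strings"

	appsv1 "k8s.io/api/apps/v1"
)

// fillMissingImages sets resolvedRef (e.g. registry.redhat.io/ubi8/httpd-24:latest)
// on any container whose image is empty or whitespace-only, which is what the
// generated Deployment currently contains.
func fillMissingImages(d *appsv1.Deployment, resolvedRef string) {
	for i := range d.Spec.Template.Spec.Containers {
		c := &d.Spec.Template.Spec.Containers[i]
		if strings.TrimSpace(c.Image) == "" {
			c.Image = resolvedRef
		}
	}
}
```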
Version-Release number of selected component (if applicable):
oc version Client Version: 4.14.0-0.nightly-2023-08-22-221456 Kustomize Version: v5.0.1 Server Version: 4.14.0-0.nightly-2023-09-02-132842 Kubernetes Version: v1.27.4+2c83a9f
How reproducible:
Always
Steps to Reproduce:
1. Installed cluster without build/deploymentconfig function Set "baselineCapabilitySet: None" in install-config 2.Create a deploy using 'new-app' cmd oc new-app registry.redhat.io/ubi8/httpd-24:latest 3.
Actual results:
2. $oc new-app registry.redhat.io/ubi8/httpd-24:latest --> Found container image c412709 (11 days old) from registry.redhat.io for "registry.redhat.io/ubi8/httpd-24:latest" Apache httpd 2.4 ---------------- Apache httpd 2.4 available as container, is a powerful, efficient, and extensible web server. Apache supports a variety of features, many implemented as compiled modules which extend the core functionality. These can range from server-side programming language support to authentication schemes. Virtual hosting allows one Apache installation to serve many different Web sites. Tags: builder, httpd, httpd-24 * An image stream tag will be created as "httpd-24:latest" that will track this image--> Creating resources ... imagestream.image.openshift.io "httpd-24" created deployment.apps "httpd-24" created service "httpd-24" created --> Success Application is not exposed. You can expose services to the outside world by executing one or more of the commands below: 'oc expose service/httpd-24' Run 'oc status' to view your app 3. oc get deploy -o yaml apiVersion: v1 items: - apiVersion: apps/v1 kind: Deployment metadata: annotations: deployment.kubernetes.io/revision: "1" image.openshift.io/triggers: '[{"from":{"kind":"ImageStreamTag","name":"httpd-24:latest"},"fieldPath":"spec.template.spec.containers[?(@.name==\"httpd-24\")].image"}]' openshift.io/generated-by: OpenShiftNewApp creationTimestamp: "2023-09-04T07:44:01Z" generation: 1 labels: app: httpd-24 app.kubernetes.io/component: httpd-24 app.kubernetes.io/instance: httpd-24 name: httpd-24 namespace: wxg resourceVersion: "115441" uid: 909d0c4e-180c-4f88-8fb5-93c927839903 spec: progressDeadlineSeconds: 600 replicas: 1 revisionHistoryLimit: 10 selector: matchLabels: deployment: httpd-24 strategy: rollingUpdate: maxSurge: 25% maxUnavailable: 25% type: RollingUpdate template: metadata: annotations: openshift.io/generated-by: OpenShiftNewApp creationTimestamp: null labels: deployment: httpd-24 spec: containers: - image: ' ' imagePullPolicy: IfNotPresent name: httpd-24 ports: - containerPort: 8080 protocol: TCP - containerPort: 8443 protocol: TCP resources: {} terminationMessagePath: /dev/termination-log terminationMessagePolicy: File dnsPolicy: ClusterFirst restartPolicy: Always schedulerName: default-scheduler securityContext: {} terminationGracePeriodSeconds: 30 status: conditions: - lastTransitionTime: "2023-09-04T07:44:01Z" lastUpdateTime: "2023-09-04T07:44:01Z" message: Created new replica set "httpd-24-7f6b55cc85" reason: NewReplicaSetCreated status: "True" type: Progressing - lastTransitionTime: "2023-09-04T07:44:01Z" lastUpdateTime: "2023-09-04T07:44:01Z" message: Deployment does not have minimum availability. reason: MinimumReplicasUnavailable status: "False" type: Available - lastTransitionTime: "2023-09-04T07:44:01Z" lastUpdateTime: "2023-09-04T07:44:01Z" message: 'Pod "httpd-24-7f6b55cc85-pvvgt" is invalid: spec.containers[0].image: Invalid value: " ": must not have leading or trailing whitespace' reason: FailedCreate status: "True" type: ReplicaFailure observedGeneration: 1 unavailableReplicas: 1 kind: List metadata:
Expected results:
Should set spec.containers[0].image to registry.redhat.io/ubi8/httpd-24:latest
Additional info:
Currently the upgrade feature agent is disabled by default and enabled explicitly only for the SaaS environment. This ticket is about enabling it by default also for ACM.
Deploying a Helm chart whose values.schema.json uses either the 2019-09 or the 2020-12 (latest) revision of JSON Schema results in the UI hanging on Create with a three-dot loading indicator. This is not the case if the YAML view is used, since I suppose that view is not trying to be clever and lets Helm itself validate the chart values against the schema.
Reproduced in 4.13, probably affects other versions as well.
100%
1. Go to Helm tab.
2. Click create in top right and select Repository
3. Paste following into YAML view and click Create:
apiVersion: helm.openshift.io/v1beta1 kind: ProjectHelmChartRepository metadata: name: reproducer spec: connectionConfig: url: 'https://raw.githubusercontent.com/tumido/helm-backstage/blog2'
4. Go to the Helm tab again (if redirected elsewhere)
5. Click create in top right and select Helm Release
6. In catalog filter select Chart repositories: Reproducer
7. Click on the single tile available (Backstage) and click Create
8. Switch to Form view
9. Leave default values and click Create
10. Stare at the always loading screen that never proceeds further.
It installs and deploys the chart
This is caused by the values.schema.json containing a $schema key that declares which revision of the JSON Schema standard should be used:
{ "$schema": "https://json-schema.org/draft/2020-12/schema", }
I've managed to trace this back to this react-jsonschema-form issue:
https://github.com/rjsf-team/react-jsonschema-form/issues/2241
It seems the library used here for validation supports neither the 2019-09 draft nor the most current 2020-12 revision.
It happens only if the chart follows the JSON Schema standard and declares the revision properly.
Workarounds:
IMO best solution:
Helm form renderer should NOT do any validation, since it can't handle the schema properly. Instead, it should leave this job to the Helm backend, which validates the values against the schema when installing the chart anyway. The YAML view already does no validation and seems to do the job properly.
Currently, there is no formal requirement for charts admitted to the curated Helm catalog warning authors that the newest supported JSON Schema revision is 4 years old and that the two later revisions (2019-09 and 2020-12) are not supported.
Also, the Form UI should not just hang on submit. Instead, it should at least fail gracefully.
Related to:
https://github.com/janus-idp/helm-backstage/issues/64#issuecomment-1587678319
CI is flaky because of test failures such as the following:
{ fail [github.com/openshift/origin/test/extended/oauth/requestheaders.go:218]: full response header: HTTP/1.1 403 Forbidden Content-Length: 192 Audit-Id: f6026f9b-06c5-4b4a-9414-8dc5c681b45a Cache-Control: no-cache, no-store, max-age=0, must-revalidate Content-Type: application/json Date: Tue, 08 Aug 2023 11:26:35 GMT Expires: 0 Pragma: no-cache Referrer-Policy: strict-origin-when-cross-origin X-Content-Type-Options: nosniff X-Dns-Prefetch-Control: off X-Frame-Options: DENY X-Xss-Protection: 1; mode=block {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"forbidden: User \"system:anonymous\" cannot get path \"/metrics\"","reason":"Forbidden","details":{},"code":403} Expected <string>: 403 Forbidden to contain substring <string>: 401 Unauthorized Ginkgo exit error 1: exit with code 1}
This particular failure comes from https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_openshift-apiserver/380/pull-ci-openshift-openshift-apiserver-master-e2e-aws-ovn-serial/1688848417708576768. Search.ci has other similar failures.
I have seen this in 4.14 CI jobs and 4.13 CI jobs.
Presently, search.ci shows the following stats for the past 14 days:
Found in 2.41% of runs (4.36% of failures) across 1078 total runs and 58 jobs (55.38% failed)
pull-ci-openshift-openshift-apiserver-master-e2e-aws-ovn-serial (all) - 25 runs, 40% failed, 20% of failures match = 8% impact
openshift-cluster-network-operator-1874-nightly-4.14-e2e-aws-ovn-serial (all) - 42 runs, 67% failed, 14% of failures match = 10% impact
pull-ci-openshift-kubernetes-master-e2e-aws-ovn-serial (all) - 59 runs, 54% failed, 6% of failures match = 3% impact
pull-ci-openshift-origin-master-e2e-aws-ovn-serial (all) - 434 runs, 66% failed, 2% of failures match = 1% impact
pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-serial (all) - 55 runs, 49% failed, 7% of failures match = 4% impact
pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-ovn-serial (all) - 60 runs, 58% failed, 3% of failures match = 2% impact
pull-ci-operator-framework-operator-marketplace-master-e2e-aws-ovn-serial (all) - 24 runs, 38% failed, 22% of failures match = 8% impact
pull-ci-openshift-cluster-network-operator-master-e2e-aws-ovn-serial (all) - 81 runs, 58% failed, 4% of failures match = 2% impact
pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-ovn-serial (all) - 35 runs, 46% failed, 13% of failures match = 6% impact
rehearse-41872-pull-ci-openshift-ovn-kubernetes-release-4.14-e2e-aws-ovn-serial (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial (all) - 72 runs, 49% failed, 3% of failures match = 1% impact
pull-ci-openshift-cluster-kube-apiserver-operator-release-4.13-e2e-aws-ovn-serial (all) - 4 runs, 75% failed, 33% of failures match = 25% impact
pull-ci-openshift-cluster-dns-operator-master-e2e-aws-ovn-serial (all) - 19 runs, 63% failed, 8% of failures match = 5% impact
1. Post a PR and have bad luck.
2. Check search.ci using the link above.
CI fails.
CI passes, or fails on some other test failure.
Context:
In 4.14, the kubelet config from the MCO payload comes with --cloud-provider=external, which means the node.cloudprovider.kubernetes.io/uninitialized taint is set, preventing workloads from being scheduled until it is cleaned up by the external cloud provider.
This is a result of AWS's in-tree provider implementation being removed for Kubernetes 1.27.
DoD:
We need to let the CPO run the AWS external cloud provider.
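For context, the taint in question is visible on the Node object until a cloud controller manager initializes the node. A minimal sketch (hypothetical helper, not CPO code) of checking for it:
```go
package externalprovider

import corev1 "k8s.io/api/core/v1"

// Key set by kubelet when started with --cloud-provider=external; the external
// cloud controller manager removes the taint once it has initialized the node.
const uninitializedTaintKey = "node.cloudprovider.kubernetes.io/uninitialized"

// hasUninitializedTaint reports whether workloads are still blocked from
// scheduling on the node because no cloud provider has initialized it.
func hasUninitializedTaint(node *corev1.Node) bool {
	for _, t := range node.Spec.Taints {
		if t.Key == uninitializedTaintKey {
			return true
		}
	}
	return false
}
```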
Description of problem:
023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health [-] Component KuryrPortHandler is dead. Last caught exception below: openstack.exceptions.InvalidRequest: Request requires an ID but none was found 2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health Traceback (most recent call last): 2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/controller/handlers/kuryrport.py", line 169, in on_finalize 2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health pod = self.k8s.get(f"{constants.K8S_API_NAMESPACES}" 2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/k8s_client.py", line 121, in get 2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health self._raise_from_response(response) 2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/k8s_client.py", line 99, in _raise_from_response 2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health raise exc.K8sResourceNotFound(response.text) 2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health kuryr_kubernetes.exceptions.K8sResourceNotFound: Resource not found: '{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \\"mygov-tuo-microservice-dev2-59fffbc58c-l5b79\\" not found","reason":"NotFound","details":{"name":"mygov-tuo-microservice-dev2-59fffbc58c-l5b79","kind":"pods"},"code":404}\n' 2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health 2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health During handling of the above exception, another exception occurred: 2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health 2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health Traceback (most recent call last): 2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/handlers/logging.py", line 38, in __call__ 2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health self._handler(event, *args, **kwargs) 2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/handlers/retry.py", line 85, in __call__ 2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health self._handler(event, *args, retry_info=info, **kwargs) 2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/handlers/k8s_base.py", line 98, in __call__ 2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health self.on_finalize(obj, *args, **kwargs) 2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/controller/handlers/kuryrport.py", line 184, in on_finalize 2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health pod = self._mock_cleanup_pod(kuryrport_crd) 2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/controller/handlers/kuryrport.py", line 160, in _mock_cleanup_pod 2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health host_ip = 
utils.get_parent_port_ip(port_id) 2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/utils.py", line 830, in get_parent_port_ip 2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health parent_port = os_net.get_port(port_id) 2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health File "/usr/lib/python3.6/site-packages/openstack/network/v2/_proxy.py", line 1987, in get_port 2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health return self._get(_port.Port, port) 2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health File "/usr/lib/python3.6/site-packages/openstack/proxy.py", line 48, in check 2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health return method(self, expected, actual, *args, **kwargs) 2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health File "/usr/lib/python3.6/site-packages/openstack/proxy.py", line 513, in _get 2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health resource_type=resource_type.__name__, value=value)) 2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health File "/usr/lib/python3.6/site-packages/openstack/resource.py", line 1472, in fetch 2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health base_path=base_path) 2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health File "/usr/lib/python3.6/site-packages/openstack/network/v2/_base.py", line 26, in _prepare_request 2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health base_path=base_path, params=params) 2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health File "/usr/lib/python3.6/site-packages/openstack/resource.py", line 1156, in _prepare_request 2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health "Request requires an ID but none was found") 2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health openstack.exceptions.InvalidRequest: Request requires an ID but none was found 2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health 2023-04-20 02:08:09.918 1 INFO kuryr_kubernetes.controller.service [-] Service 'KuryrK8sService' stopping 2023-04-20 02:08:09.919 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/apis/openstack.org/v1/kuryrnetworks' 2023-04-20 02:08:10.026 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/apis/machine.openshift.io/v1beta1/machines' 2023-04-20 02:08:10.152 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/api/v1/pods' 2023-04-20 02:08:10.174 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/apis/networking.k8s.io/v1/networkpolicies' 2023-04-20 02:08:10.857 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/api/v1/namespaces' 2023-04-20 02:08:10.877 1 WARNING kuryr_kubernetes.controller.drivers.utils [-] Namespace dev-health-air-ids not yet ready: kuryr_kubernetes.exceptions.K8sResourceNotFound: Resource not found: '{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"kuryrnetworks.openstack.org \\"dev-health-air-ids\\" not found","reason":"NotFound","details":{"name":"dev-health-air-ids","group":"openstack.org","kind":"kuryrnetworks"},"code":404}\n' 2023-04-20 02:08:11.024 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/api/v1/services' 2023-04-20 02:08:11.078 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/api/v1/endpoints' 
2023-04-20 02:08:11.170 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/apis/openstack.org/v1/kuryrports' 2023-04-20 02:08:11.344 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/apis/openstack.org/v1/kuryrnetworkpolicies' 2023-04-20 02:08:11.475 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/apis/openstack.org/v1/kuryrloadbalancers' 2023-04-20 02:08:11.475 1 INFO kuryr_kubernetes.watcher [-] No remaining active watchers, Exiting... 2023-04-20 02:08:11.475 1 INFO kuryr_kubernetes.controller.service [-] Service 'KuryrK8sService' stopping
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create a pod. 2. Stop kuryr-controller. 3. Delete the pod and the finalizer on it. 4. Delete pod's subport. 5. Start the controller.
Actual results:
Crash
Expected results:
Port cleaned up normally.
Additional info:
Please review the following PR: https://github.com/openshift/vmware-vsphere-csi-driver/pull/75
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
In https://github.com/openshift/cluster-baremetal-operator/blob/master/provisioning/utils.go#L65 we reference the .PlatformStatus.BareMetal.APIServerInternalIP attribute from the config API. Meanwhile, a recent change https://github.com/openshift/api/commit/51f399230d604fa013c2bb341040c4ad126e7309 deprecated this field in favour of .APIServerInternalIPs (note the plural); this was done to better suit the dual-stack case. We need to update the code (trivial) along with vendor dependencies (openshift/api needs a bump to a version equal to or later than the one including the commit referenced above). Likely there will be code changes required in CBO to adapt to the newer API package. Slack threads for reference: https://app.slack.com/client/T027F3GAJ/C01RJHA6BRC/thread/C01RJHA6BRC-1661416223.353009 (vendor dependency update); openshift/api change: https://coreos.slack.com/archives/C01RJHA6BRC/p1660573560434409?thread_ts=1660229723.998839&cid=C01RJHA6BRC IMPORTANT NOTE: there is an in-flight PR which is making changes to the CBO code fetching the VIP: https://github.com/openshift/cluster-baremetal-operator/pull/285. Work done to address this bug needs to be stacked on top of it to avoid duplication of effort (the easiest way is to work on the code from the in-flight PR 285 and merge once PR 285 merges).
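A minimal sketch of the adaptation (not necessarily how CBO will implement it): prefer the new plural field and fall back to the deprecated singular one so older Infrastructure objects keep working.
```go
package provisioning

import configv1 "github.com/openshift/api/config/v1"

// apiServerInternalIPs returns the API VIPs from the BareMetal platform status,
// preferring the plural APIServerInternalIPs field introduced for dual stack
// and falling back to the deprecated singular field.
func apiServerInternalIPs(status *configv1.BareMetalPlatformStatus) []string {
	if status == nil {
		return nil
	}
	if len(status.APIServerInternalIPs) > 0 {
		return status.APIServerInternalIPs
	}
	if status.APIServerInternalIP != "" {
		return []string{status.APIServerInternalIP}
	}
	return nil
}
```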
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/openshift-state-metrics/pull/95
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Bugs are required for all 4.14 merges right now due to instability. We need to bump the version of the CVO so that it is consistent with the cluster being installed.
After running several scale tests on a large cluster (252 workers), etcd ran out of space and became unavailable.
These tests consisted of running our node-density workload (creates more than 50k pause pods) and cluster-density 4k several times (creates 4k namespaces with https://github.com/cloud-bulldozer/e2e-benchmarking/tree/master/workloads/kube-burner#cluster-density-variables).
The actions above led the etcd peers to run out of free space in their 4GiB PVCs, producing the following error trace:
{"level":"warn","ts":"2023-03-31T09:50:57.532Z","caller":"rafthttp/http.go:271","msg":"failed to save incoming database snapshot","local-member-id":"b14198cd7f0eebf1","remote-snapshot-sender-id":"a4e894c3f4af1379","incoming-snapshot-index ":19490191,"error":"write /var/lib/data/member/snap/tmp774311312: no space left on device"}
Etcd uses 4GiB PVCs to store its data, which seems to be insufficient for this scenario. In addition, unlike non-HyperShift clusters, we are not applying any periodic database defragmentation (on standalone clusters this is done by cluster-etcd-operator), which can lead to a larger database size.
The graph below represents the metrics etcd_mvcc_db_total_size_in_bytes and etcd_mvcc_db_total_size_in_use_in_bytes.
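For reference, this is roughly what a periodic defragmentation pass looks like with the etcd v3 client. It is a minimal sketch only (the endpoint is a placeholder and TLS configuration is omitted), not HyperShift's or cluster-etcd-operator's actual code.
```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://etcd-0.etcd-client:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	for _, ep := range cli.Endpoints() {
		ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
		// Defragment reclaims free pages inside the bolt database file, shrinking
		// the on-disk size reported by etcd_mvcc_db_total_size_in_bytes.
		if _, err := cli.Defragment(ctx, ep); err != nil {
			log.Printf("defragment %s failed: %v", ep, err)
		}
		cancel()
	}
}
```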
Description of problem:
In our IBM Cloud use-case of RHCOS, we are seeing 4.13 RHCOS nodes failing to properly bootstrap to a HyperShift 4.13 control plane. RHCOS worker node kubelet is failing with "failed to construct kubelet dependencies: unable to load client CA file /etc/kubernetes/kubelet-ca.crt: open /etc/kubernetes/kubelet-ca.crt: no such file or directory".
Version-Release number of selected component (if applicable):
4.13.0-rc.6
How reproducible:
100%
Steps to Reproduce:
1. Create a HyperShift 4.13 control plane 2. Boot a RHCOS host outside of cluster 3. After initial RHCOS boot, fetch ignition from control plane 4. Attempt to bootstrap to cluster via `machine-config-daemon firstboot-complete-machineconfig`
Actual results:
Kubelet service fails with "failed to construct kubelet dependencies: unable to load client CA file /etc/kubernetes/kubelet-ca.crt: open /etc/kubernetes/kubelet-ca.crt: no such file or directory".
Expected results:
RHCOS worker node to properly bootstrap to HyperShift control plane. This has been the supported bootstrapping flow for releases <4.13.
Additional info:
References: - https://redhat-internal.slack.com/archives/C01C8502FMM/p1682968210631419 - https://github.com/openshift/machine-config-operator/pull/3575 - https://github.com/openshift/machine-config-operator/pull/3654
This is a clone of issue OCPBUGS-18907. The following is the description of the original issue:
—
Description of problem:
From on to https://issues.redhat.com/browse/OCPBUGS-17827 jiezhao-mac:hypershift jiezhao$ oc get hostedcluster -n clusters NAME VERSION KUBECONFIG PROGRESS AVAILABLE PROGRESSING MESSAGE jie-test 4.14.0-0.nightly-2023-09-12-024050 jie-test-admin-kubeconfig Completed True False The hosted control plane is available jiezhao-mac:hypershift jiezhao$ jiezhao-mac:hypershift jiezhao$ oc get pods -n clusters-jie-test | grep router router-78d47f4c69-2mvbp 1/1 Running 0 68m jiezhao-mac:hypershift jiezhao$ jiezhao-mac:hypershift jiezhao$ oc get pods router-78d47f4c69-2mvbp -n clusters-jie-test -ojsonpath='{.metadata.labels}' | jq { "app": "private-router", "hypershift.openshift.io/hosted-control-plane": "clusters-jie-test", "hypershift.openshift.io/request-serving-component": "true", "pod-template-hash": "78d47f4c69" } jiezhao-mac:hypershift jiezhao$ oc get networkpolicy management-kas -n clusters-jie-test NAME POD-SELECTOR AGE management-kas !hypershift.openshift.io/need-management-kas-access,name notin (aws-ebs-csi-driver-operator) 76m jiezhao-mac:hypershift jiezhao$ jiezhao-mac:hypershift jiezhao$ oc get networkpolicy management-kas -n clusters-jie-test -o yaml apiVersion: networking.k8s.io/v1 kind: NetworkPolicy metadata: annotations: hypershift.openshift.io/cluster: clusters/jie-test creationTimestamp: "2023-09-12T14:43:13Z" generation: 1 name: management-kas namespace: clusters-jie-test resourceVersion: "54049" uid: 72288fed-a1f6-4dc9-bb63-981d7cdd479f spec: egress: - to: - podSelector: {} - to: - ipBlock: cidr: 0.0.0.0/0 except: - 10.0.46.47/32 - 10.0.7.159/32 - 10.0.77.20/32 - 10.128.0.0/14 - ports: - port: 5353 protocol: UDP - port: 5353 protocol: TCP to: - namespaceSelector: matchLabels: kubernetes.io/metadata.name: openshift-dns podSelector: matchExpressions: - key: hypershift.openshift.io/need-management-kas-access operator: DoesNotExist - key: name operator: NotIn values: - aws-ebs-csi-driver-operator policyTypes: - Egress status: {} jiezhao-mac:hypershift jiezhao$ jiezhao-mac:hypershift jiezhao$ oc get endpoints -n default kubernetes NAME ENDPOINTS AGE kubernetes 10.0.46.47:6443,10.0.7.159:6443,10.0.77.20:6443 150m jiezhao-mac:hypershift jiezhao$ jiezhao-mac:hypershift jiezhao$ oc get endpoints -n default kubernetes -o yaml apiVersion: v1 kind: Endpoints metadata: creationTimestamp: "2023-09-12T13:32:47Z" labels: endpointslice.kubernetes.io/skip-mirror: "true" name: kubernetes namespace: default resourceVersion: "31961" uid: bc170a67-018f-4490-a18c-811ebd3f3676 subsets: - addresses: - ip: 10.0.46.47 - ip: 10.0.7.159 - ip: 10.0.77.20 ports: - name: https port: 6443 protocol: TCP jiezhao-mac:hypershift jiezhao$ jiezhao-mac:hypershift jiezhao$ oc get endpoints -n default kubernetes -ojsonpath='{.subsets[].addresses[].ip}{"\n"}' 10.0.46.47 jiezhao-mac:hypershift jiezhao$ jiezhao-mac:hypershift jiezhao$ oc get endpoints -n default kubernetes -ojsonpath='{.subsets[].ports[].port}{"\n"}' 6443 jiezhao-mac:hypershift jiezhao$ jiezhao-mac:hypershift jiezhao$ oc project clusters-jie-test Now using project "clusters-jie-test" on server "https://api.jiezhao-091201.qe.devcluster.openshift.com:6443". jiezhao-mac:hypershift jiezhao$ jiezhao-mac:hypershift jiezhao$ oc -n clusters-jie-test rsh pod/router-78d47f4c69-2mvbp curl --connect-timeout 2 -Iks https://10.0.46.47:6443 -v * Rebuilt URL to: https://10.0.46.47:6443/ * Trying 10.0.46.47... 
* TCP_NODELAY set * Connected to 10.0.46.47 (10.0.46.47) port 6443 (#0) * ALPN, offering h2 * ALPN, offering http/1.1 * successfully set certificate verify locations: * CAfile: /etc/pki/tls/certs/ca-bundle.crt CApath: none * TLSv1.3 (OUT), TLS handshake, Client hello (1): * TLSv1.3 (IN), TLS handshake, Server hello (2): * TLSv1.3 (IN), TLS handshake, [no content] (0): * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8): * TLSv1.3 (IN), TLS handshake, [no content] (0): * TLSv1.3 (IN), TLS handshake, Request CERT (13): * TLSv1.3 (IN), TLS handshake, [no content] (0): * TLSv1.3 (IN), TLS handshake, Certificate (11): * TLSv1.3 (IN), TLS handshake, [no content] (0): * TLSv1.3 (IN), TLS handshake, CERT verify (15): * TLSv1.3 (IN), TLS handshake, [no content] (0): * TLSv1.3 (IN), TLS handshake, Finished (20): * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1): * TLSv1.3 (OUT), TLS handshake, [no content] (0): * TLSv1.3 (OUT), TLS handshake, Certificate (11): * TLSv1.3 (OUT), TLS handshake, [no content] (0): * TLSv1.3 (OUT), TLS handshake, Finished (20): * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256 * ALPN, server accepted to use h2 * Server certificate: * subject: CN=172.30.0.1 * start date: Sep 12 13:35:51 2023 GMT * expire date: Oct 12 13:35:52 2023 GMT * issuer: OU=openshift; CN=kube-apiserver-service-network-signer * SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway. * Using HTTP2, server supports multi-use * Connection state changed (HTTP/2 confirmed) * Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0 * TLSv1.3 (OUT), TLS app data, [no content] (0): * TLSv1.3 (OUT), TLS app data, [no content] (0): * TLSv1.3 (OUT), TLS app data, [no content] (0): * Using Stream ID: 1 (easy handle 0x55c5c46cb990) * TLSv1.3 (OUT), TLS app data, [no content] (0): > HEAD / HTTP/2 > Host: 10.0.46.47:6443 > User-Agent: curl/7.61.1 > Accept: */* > * TLSv1.3 (IN), TLS handshake, [no content] (0): * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4): * TLSv1.3 (IN), TLS app data, [no content] (0): * Connection state changed (MAX_CONCURRENT_STREAMS == 2000)! * TLSv1.3 (OUT), TLS app data, [no content] (0): * TLSv1.3 (IN), TLS app data, [no content] (0): * TLSv1.3 (IN), TLS app data, [no content] (0): * TLSv1.3 (IN), TLS app data, [no content] (0): < HTTP/2 403 HTTP/2 403 < audit-id: 82d5f3f7-6e5b-4bb5-b846-54df09aefb54 audit-id: 82d5f3f7-6e5b-4bb5-b846-54df09aefb54 < cache-control: no-cache, private cache-control: no-cache, private < content-type: application/json content-type: application/json < strict-transport-security: max-age=31536000; includeSubDomains; preload strict-transport-security: max-age=31536000; includeSubDomains; preload < x-content-type-options: nosniff x-content-type-options: nosniff < x-kubernetes-pf-flowschema-uid: 6edd6532-2d15-4d8d-9cea-4dcce99b881f x-kubernetes-pf-flowschema-uid: 6edd6532-2d15-4d8d-9cea-4dcce99b881f < x-kubernetes-pf-prioritylevel-uid: 4115bb59-a78d-42ab-9136-37529cf107e1 x-kubernetes-pf-prioritylevel-uid: 4115bb59-a78d-42ab-9136-37529cf107e1 < content-length: 218 content-length: 218 < date: Tue, 12 Sep 2023 16:05:02 GMT date: Tue, 12 Sep 2023 16:05:02 GMT < * Connection #0 to host 10.0.46.47 left intact jiezhao-mac:hypershift jiezhao$
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-19059. The following is the description of the original issue:
—
Description of problem:
baremetal 4.14.0-rc.0 IPv6 SNO cluster; after logging in to the admin console as an admin user, there is no Observe menu on the left navigation bar, see picture: https://drive.google.com/file/d/13RAXPxtKhAElN9xf8bAmLJa0GI8pP0fH/view?usp=sharing. The monitoring-plugin status is Failed, see: https://drive.google.com/file/d/1YsSaGdLT4bMn-6E-WyFWbOpwvDY4t6na/view?usp=sharing, the error is
Failed to get a valid plugin manifest from /api/plugins/monitoring-plugin/ r: Bad Gateway
Checked the console logs; port 9443 reports connect: connection refused:
$ oc -n openshift-console logs console-6869f8f4f4-56mbj ... E0915 12:50:15.498589 1 handlers.go:164] GET request for "monitoring-plugin" plugin failed: Get "https://monitoring-plugin.openshift-monitoring.svc.cluster.local:9443/plugin-manifest.json": dial tcp [fd02::f735]:9443: connect: connection refused 2023/09/15 12:50:15 http: panic serving [fd01:0:0:1::2]:39156: runtime error: invalid memory address or nil pointer dereference goroutine 183760 [running]: net/http.(*conn).serve.func1() /usr/lib/golang/src/net/http/server.go:1854 +0xbf panic({0x3259140, 0x4fcc150}) /usr/lib/golang/src/runtime/panic.go:890 +0x263 github.com/openshift/console/pkg/plugins.(*PluginsHandler).proxyPluginRequest(0xc0003b5760, 0x2?, {0xc0009bc7d1, 0x11}, {0x3a41fa0, 0xc0002f6c40}, 0xb?) /go/src/github.com/openshift/console/pkg/plugins/handlers.go:165 +0x582 github.com/openshift/console/pkg/plugins.(*PluginsHandler).HandlePluginAssets(0xaa00000000000010?, {0x3a41fa0, 0xc0002f6c40}, 0xc0001f7500) /go/src/github.com/openshift/console/pkg/plugins/handlers.go:147 +0x26d github.com/openshift/console/pkg/server.(*Server).HTTPHandler.func23({0x3a41fa0?, 0xc0002f6c40?}, 0x7?) /go/src/github.com/openshift/console/pkg/server/server.go:604 +0x33 net/http.HandlerFunc.ServeHTTP(...) /usr/lib/golang/src/net/http/server.go:2122 github.com/openshift/console/pkg/server.authMiddleware.func1(0xc0001f7500?, {0x3a41fa0?, 0xc0002f6c40?}, 0xd?) /go/src/github.com/openshift/console/pkg/server/middleware.go:25 +0x31 github.com/openshift/console/pkg/server.authMiddlewareWithUser.func1({0x3a41fa0, 0xc0002f6c40}, 0xc0001f7500) /go/src/github.com/openshift/console/pkg/server/middleware.go:81 +0x46c net/http.HandlerFunc.ServeHTTP(0x5120938?, {0x3a41fa0?, 0xc0002f6c40?}, 0x7ffb6ea27f18?) /usr/lib/golang/src/net/http/server.go:2122 +0x2f net/http.StripPrefix.func1({0x3a41fa0, 0xc0002f6c40}, 0xc0001f7400) /usr/lib/golang/src/net/http/server.go:2165 +0x332 net/http.HandlerFunc.ServeHTTP(0xc001102c00?, {0x3a41fa0?, 0xc0002f6c40?}, 0xc000655a00?) /usr/lib/golang/src/net/http/server.go:2122 +0x2f net/http.(*ServeMux).ServeHTTP(0x34025e0?, {0x3a41fa0, 0xc0002f6c40}, 0xc0001f7400) /usr/lib/golang/src/net/http/server.go:2500 +0x149 github.com/openshift/console/pkg/server.securityHeadersMiddleware.func1({0x3a41fa0, 0xc0002f6c40}, 0x3305040?) /go/src/github.com/openshift/console/pkg/server/middleware.go:128 +0x3af net/http.HandlerFunc.ServeHTTP(0x0?, {0x3a41fa0?, 0xc0002f6c40?}, 0x11db52e?) /usr/lib/golang/src/net/http/server.go:2122 +0x2f net/http.serverHandler.ServeHTTP({0xc0008201e0?}, {0x3a41fa0, 0xc0002f6c40}, 0xc0001f7400) /usr/lib/golang/src/net/http/server.go:2936 +0x316 net/http.(*conn).serve(0xc0009b4120, {0x3a43e70, 0xc001223500}) /usr/lib/golang/src/net/http/server.go:1995 +0x612 created by net/http.(*Server).Serve /usr/lib/golang/src/net/http/server.go:3089 +0x5ed I0915 12:50:24.267777 1 handlers.go:118] User settings ConfigMap "user-settings-4b4c2f4d-159c-4358-bba3-3d87f113cd9b" already exist, will return existing data. I0915 12:50:24.267813 1 handlers.go:118] User settings ConfigMap "user-settings-4b4c2f4d-159c-4358-bba3-3d87f113cd9b" already exist, will return existing data. 
E0915 12:50:30.155515 1 handlers.go:164] GET request for "monitoring-plugin" plugin failed: Get "https://monitoring-plugin.openshift-monitoring.svc.cluster.local:9443/plugin-manifest.json": dial tcp [fd02::f735]:9443: connect: connection refused 2023/09/15 12:50:30 http: panic serving [fd01:0:0:1::2]:42990: runtime error: invalid memory address or nil pointer dereference
Port 9443 connection is refused:
$ oc -n openshift-monitoring get pod -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES alertmanager-main-0 6/6 Running 6 3d22h fd01:0:0:1::564 sno-2 <none> <none> cluster-monitoring-operator-6cb777d488-nnpmx 1/1 Running 4 7d16h fd01:0:0:1::12 sno-2 <none> <none> kube-state-metrics-dc5f769bc-p97m7 3/3 Running 12 7d16h fd01:0:0:1::3b sno-2 <none> <none> monitoring-plugin-85bfb98485-d4g5x 1/1 Running 4 7d16h fd01:0:0:1::55 sno-2 <none> <none> node-exporter-ndnnj 2/2 Running 8 7d16h 2620:52:0:165::41 sno-2 <none> <none> openshift-state-metrics-78df59b4d5-j6r5s 3/3 Running 12 7d16h fd01:0:0:1::3a sno-2 <none> <none> prometheus-adapter-6f86f7d8f5-ttflf 1/1 Running 0 4h23m fd01:0:0:1::b10c sno-2 <none> <none> prometheus-k8s-0 6/6 Running 6 3d22h fd01:0:0:1::566 sno-2 <none> <none> prometheus-operator-7c94855989-csts2 2/2 Running 8 7d16h fd01:0:0:1::39 sno-2 <none> <none> prometheus-operator-admission-webhook-7bb64b88cd-bvq8m 1/1 Running 4 7d16h fd01:0:0:1::37 sno-2 <none> <none> thanos-querier-5bbb764599-vlztq 6/6 Running 6 3d22h fd01:0:0:1::56a sno-2 <none> <none> $ oc -n openshift-monitoring get svc monitoring-plugin NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE monitoring-plugin ClusterIP fd02::f735 <none> 9443/TCP 7d16h $ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -v 'https://monitoring-plugin.openshift-monitoring.svc.cluster.local:9443/plugin-manifest.json' | jq * Trying fd02::f735... * TCP_NODELAY set * connect to fd02::f735 port 9443 failed: Connection refused * Failed to connect to monitoring-plugin.openshift-monitoring.svc.cluster.local port 9443: Connection refused * Closing connection 0 curl: (7) Failed to connect to monitoring-plugin.openshift-monitoring.svc.cluster.local port 9443: Connection refused command terminated with exit code 7
No such issue on another 4.14.0-rc.0 IPv4 cluster, but the issue is reproduced on another 4.14.0-rc.0 IPv6 cluster.
4.14.0-rc.0 IPv4 cluster:
$ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.14.0-rc.0 True False 20m Cluster version is 4.14.0-rc.0 $ oc -n openshift-monitoring get pod -o wide | grep monitoring-plugin monitoring-plugin-85bfb98485-nh428 1/1 Running 0 4m 10.128.0.107 ci-ln-pby4bj2-72292-l5q8v-master-0 <none> <none> $ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k 'https://monitoring-plugin.openshift-monitoring.svc.cluster.local:9443/plugin-manifest.json' | jq ... { "name": "monitoring-plugin", "version": "1.0.0", "displayName": "OpenShift console monitoring plugin", "description": "This plugin adds the monitoring UI to the OpenShift web console", "dependencies": { "@console/pluginAPI": "*" }, "extensions": [ { "type": "console.page/route", "properties": { "exact": true, "path": "/monitoring", "component": { "$codeRef": "MonitoringUI" } } }, ...
Hit the "9443: Connection refused" issue on a 4.14.0-rc.0 IPv6 cluster (launched cluster-bot cluster: launch 4.14.0-rc.0 metal,ipv6) after logging in to the console:
$ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.14.0-rc.0 True False 44m Cluster version is 4.14.0-rc.0 $ oc -n openshift-monitoring get pod -o wide | grep monitoring-plugin monitoring-plugin-bd6ffdb5d-b5csk 1/1 Running 0 53m fd01:0:0:4::b worker-0.ostest.test.metalkube.org <none> <none> monitoring-plugin-bd6ffdb5d-vhtpf 1/1 Running 0 53m fd01:0:0:5::9 worker-2.ostest.test.metalkube.org <none> <none> $ oc -n openshift-monitoring get svc monitoring-plugin NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE monitoring-plugin ClusterIP fd02::402d <none> 9443/TCP 59m $ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -v 'https://monitoring-plugin.openshift-monitoring.svc.cluster.local:9443/plugin-manifest.json' | jq * Trying fd02::402d... * TCP_NODELAY set * connect to fd02::402d port 9443 failed: Connection refused * Failed to connect to monitoring-plugin.openshift-monitoring.svc.cluster.local port 9443: Connection refused * Closing connection 0 curl: (7) Failed to connect to monitoring-plugin.openshift-monitoring.svc.cluster.local port 9443: Connection refused command terminated with exit code 7$ oc -n openshift-console get pod | grep console console-5cffbc7964-7ljft 1/1 Running 0 56m console-5cffbc7964-d864q 1/1 Running 0 56m$ oc -n openshift-console logs console-5cffbc7964-7ljft ... E0916 14:34:16.330117 1 handlers.go:164] GET request for "monitoring-plugin" plugin failed: Get "https://monitoring-plugin.openshift-monitoring.svc.cluster.local:9443/plugin-manifest.json": dial tcp [fd02::402d]:9443: connect: connection refused 2023/09/16 14:34:16 http: panic serving [fd01:0:0:4::2]:37680: runtime error: invalid memory address or nil pointer dereference goroutine 3985 [running]: net/http.(*conn).serve.func1() /usr/lib/golang/src/net/http/server.go:1854 +0xbf panic({0x3259140, 0x4fcc150}) /usr/lib/golang/src/runtime/panic.go:890 +0x263 github.com/openshift/console/pkg/plugins.(*PluginsHandler).proxyPluginRequest(0xc0008f6780, 0x2?, {0xc000665211, 0x11}, {0x3a41fa0, 0xc0009221c0}, 0xb?) /go/src/github.com/openshift/console/pkg/plugins/handlers.go:165 +0x582 github.com/openshift/console/pkg/plugins.(*PluginsHandler).HandlePluginAssets(0xfe00000000000010?, {0x3a41fa0, 0xc0009221c0}, 0xc000d8d600) /go/src/github.com/openshift/console/pkg/plugins/handlers.go:147 +0x26d github.com/openshift/console/pkg/server.(*Server).HTTPHandler.func23({0x3a41fa0?, 0xc0009221c0?}, 0x7?) /go/src/github.com/openshift/console/pkg/server/server.go:604 +0x33 net/http.HandlerFunc.ServeHTTP(...) /usr/lib/golang/src/net/http/server.go:2122 github.com/openshift/console/pkg/server.authMiddleware.func1(0xc000d8d600?, {0x3a41fa0?, 0xc0009221c0?}, 0xd?) /go/src/github.com/openshift/console/pkg/server/middleware.go:25 +0x31 github.com/openshift/console/pkg/server.authMiddlewareWithUser.func1({0x3a41fa0, 0xc0009221c0}, 0xc000d8d600) /go/src/github.com/openshift/console/pkg/server/middleware.go:81 +0x46c net/http.HandlerFunc.ServeHTTP(0xc000653830?, {0x3a41fa0?, 0xc0009221c0?}, 0x7f824506bf18?) /usr/lib/golang/src/net/http/server.go:2122 +0x2f net/http.StripPrefix.func1({0x3a41fa0, 0xc0009221c0}, 0xc000d8d500) /usr/lib/golang/src/net/http/server.go:2165 +0x332 net/http.HandlerFunc.ServeHTTP(0xc00007e800?, {0x3a41fa0?, 0xc0009221c0?}, 0xc000b2da00?) 
/usr/lib/golang/src/net/http/server.go:2122 +0x2f net/http.(*ServeMux).ServeHTTP(0x34025e0?, {0x3a41fa0, 0xc0009221c0}, 0xc000d8d500) /usr/lib/golang/src/net/http/server.go:2500 +0x149 github.com/openshift/console/pkg/server.securityHeadersMiddleware.func1({0x3a41fa0, 0xc0009221c0}, 0x3305040?) /go/src/github.com/openshift/console/pkg/server/middleware.go:128 +0x3af net/http.HandlerFunc.ServeHTTP(0x0?, {0x3a41fa0?, 0xc0009221c0?}, 0x11db52e?) /usr/lib/golang/src/net/http/server.go:2122 +0x2f net/http.serverHandler.ServeHTTP({0xc000db9b00?}, {0x3a41fa0, 0xc0009221c0}, 0xc000d8d500) /usr/lib/golang/src/net/http/server.go:2936 +0x316 net/http.(*conn).serve(0xc000653680, {0x3a43e70, 0xc000676f30}) /usr/lib/golang/src/net/http/server.go:1995 +0x612 created by net/http.(*Server).Serve /usr/lib/golang/src/net/http/server.go:3089 +0x5ed
Version-Release number of selected component (if applicable):
baremetal 4.14.0-rc.0 ipv6 sno cluster, $ token=`oc create token prometheus-k8s -n openshift-monitoring` $ $ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=virt_platform' | jq { "status": "success", "data": { "resultType": "vector", "result": [ { "metric": { "__name__": "virt_platform", "baseboard_manufacturer": "Dell Inc.", "baseboard_product_name": "01J4WF", "bios_vendor": "Dell Inc.", "bios_version": "1.10.2", "container": "kube-rbac-proxy", "endpoint": "https", "instance": "sno-2", "job": "node-exporter", "namespace": "openshift-monitoring", "pod": "node-exporter-ndnnj", "prometheus": "openshift-monitoring/k8s", "service": "node-exporter", "system_manufacturer": "Dell Inc.", "system_product_name": "PowerEdge R750", "system_version": "Not Specified", "type": "none" }, "value": [ 1694785092.664, "1" ] } ] } }
How reproducible:
only seen on this cluster
Steps to Reproduce:
1. see the description 2. 3.
Actual results:
no Observe menu on admin console, monitoring-plugin is failed
Expected results:
no error
Description of problem:
In a 7-day reliability test, kube-apiserver's memory usage keeps increasing; the max is over 3GB. In our 4.12 test results, kube-apiserver's memory usage was stable at around 1.7GB and did not keep increasing. I'll redo the test on a 4.12.0 build to see if I can reproduce this issue, and run a test longer than 7 days to see how high the memory can grow. About the Reliability Test: https://github.com/openshift/svt/tree/master/reliability-v2
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2023-03-14-053612
How reproducible:
Always
Steps to Reproduce:
1. Install an AWS cluster with m5.xlarge type 2. Run reliability test for 7 days Reliability Test Configuration example: https://github.com/openshift/svt/tree/master/reliability-v2#groups-and-tasks-1 Config used in this test: admin: 1 user dev-test: 15 users dev-prod: 1 user 3. Use dittybopper dashboard to monitor the kube-apiserver's memory usage
Actual results:
kube-apiserver's memory usage keeps increasing; the max is over 3GB
Expected results:
kube-apiserver's memory usage should not keep increasing
Additional info:
Screenshots are uploaded to shared folder OCPBUGS-10829 - Google Drive
413-kube-apiserver-memory.png 413-api-performance-last2d.png - test was stopped on [2023-03-24 04:21:10 UTC] 412-kube-apiserver-memory.png must-gather.local.525817950490593011.tar.gz - 4.13 cluster's must gather
Console UI is broken due to the patternfly/react-core version changing from 4.276.8 to 4.276.11.
Description of problem:
The hypershift_hostedclusters_failure_conditions metric produced by the HyperShift operator does not report a value of 0 for conditions that no longer apply. The result is that if a hostedcluster had a failure condition at a given point, but that condition has gone away, the metric still reports a count for that condition.
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Create a HostedCluster, watch the hypershift_hostedclusters_failure_conditions metric as failure conditions occur. 2. 3.
Actual results:
A cluster count of 1 with a failure condition is reported even if the failure condition no longer applies.
Expected results:
Once failure conditions no longer apply, 0 clusters with those conditions should be reported.
Additional info:
The metric should report an accurate count for each possible failure condition of all clusters at any given time.
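A minimal sketch of the expected reporting pattern with a Prometheus GaugeVec (variable and function names are hypothetical, not the HyperShift operator's actual code): every known condition gets a sample on each pass, so a condition that clears drops back to 0 instead of keeping its last non-zero value.
```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

var failureConditions = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "hypershift_hostedclusters_failure_conditions",
	Help: "Number of HostedClusters per failure condition",
}, []string{"condition"})

// reportFailureConditions writes one sample per known condition on every sync,
// explicitly setting 0 when no cluster currently has that condition.
func reportFailureConditions(counts map[string]int, knownConditions []string) {
	for _, cond := range knownConditions {
		failureConditions.WithLabelValues(cond).Set(float64(counts[cond]))
	}
}
```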
Description of problem:
When adding a repository URL that contains hyphens in the <owner> part of the URL (https://github.com/<owner>/<repo>, e.g. https://github.com/redhat-developer/s2i-dotnetcore-ex.git), the Create button stays disabled and no validation errors are presented in the UI.
Version-Release number of selected component (if applicable):
4.9
How reproducible:
Always
Steps to Reproduce:
1. Go to Developer -> Add -> Import from Git page
2. use the repo url https://github.com/redhat-developer/s2i-dotnetcore-ex.git
3. add `/app` in the context dir under advanced git options.
Actual results:
Once the builder image is detected, the Create button is disabled but no errors are shown in the form. When the user touches the Name field, a name validation error message is shown even though the suggested name is valid.
Expected results:
After detecting the builder image, the create button should be enabled.
Additional info:
Description of problem:
Authorization by OpenShift Container Platform 4 is not working as expected, when using system:serviceaccounts Group in the ClusterRoleBinding. Here, one would assume that every serviceAccount would be granted the permissions to access the defined resources but actually access is denied. $ curl -k -X POST -H "Content-Type: application/json" -H "Authorization: Bearer <token>" --data "@/tmp/post.json" https://api.<url>:6443/apis/authorization.k8s.io/v1/subjectaccessreviews { "kind": "SubjectAccessReview", "apiVersion": "authorization.k8s.io/v1", "metadata": { "creationTimestamp": null, "managedFields": [ { "manager": "curl", "operation": "Update", "apiVersion": "authorization.k8s.io/v1", "time": "2023-03-13T09:17:45Z", "fieldsType": "FieldsV1", "fieldsV1": { "f:spec": { "f:resourceAttributes": { ".": {}, "f:group": {}, "f:name": {}, "f:namespace": {}, "f:resource": {}, "f:verb": {} }, "f:user": {} } } } ] }, "spec": { "resourceAttributes": { "namespace": "project-100", "verb": "use", "group": "sharedresource.openshift.io", "resource": "sharedsecrets", "name": "shared-subscription" }, "user": "system:serviceaccount:project-100:builder" }, "status": { "allowed": false } } When specifying the serviceAccount in the ClusterRoleBinding access is granted: $ oc get clusterrolebinding shared-secret-cluster-role-binding -o yaml apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRoleBinding metadata: annotations: kubectl.kubernetes.io/last-applied-configuration: | {"apiVersion":"rbac.authorization.k8s.io/v1","kind":"ClusterRoleBinding","metadata":{"annotations":{},"name":"shared-secret-cluster-role-binding"},"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"ClusterRole","name":"shared-secret-cluster-role"},"subjects":[{"apiGroup":"rbac.authorization.k8s.io","kind":"Group","name":"system:serviceaccounts"}]} creationTimestamp: "2023-03-13T08:59:46Z" name: shared-secret-cluster-role-binding resourceVersion: "1575464" uid: dd11825d-834a-4807-ab82-30dc0a415985 roleRef: apiGroup: rbac.authorization.k8s.io kind: ClusterRole name: shared-secret-cluster-role subjects: - apiGroup: rbac.authorization.k8s.io kind: Group name: system:serviceaccounts - kind: ServiceAccount name: builder namespace: project-101 $ curl -k -X POST -H "Content-Type: application/json" -H "Authorization: Bearer <token>" --data "@/tmp/post.json" https://api.<url>:6443/apis/authorization.k8s.io/v1/subjectaccessreviews { "kind": "SubjectAccessReview", "apiVersion": "authorization.k8s.io/v1", "metadata": { "creationTimestamp": null, "managedFields": [ { "manager": "curl", "operation": "Update", "apiVersion": "authorization.k8s.io/v1", "time": "2023-03-13T09:16:47Z", "fieldsType": "FieldsV1", "fieldsV1": { "f:spec": { "f:resourceAttributes": { ".": {}, "f:group": {}, "f:name": {}, "f:namespace": {}, "f:resource": {}, "f:verb": {} }, "f:user": {} } } } ] }, "spec": { "resourceAttributes": { "namespace": "project-101", "verb": "use", "group": "sharedresource.openshift.io", "resource": "sharedsecrets", "name": "shared-subscription" }, "user": "system:serviceaccount:project-101:builder" }, "status": { "allowed": true, "reason": "RBAC: allowed by ClusterRoleBinding \"shared-secret-cluster-role-binding\" of ClusterRole \"shared-secret-cluster-role\" to ServiceAccount \"builder/project-101\"" } } Both namespaces exist and have the serviceAccount automatically created. 
$ oc get sa -n project-100 NAME SECRETS AGE builder 1 11m default 1 11m deployer 1 11m $ oc get sa -n project-101 NAME SECRETS AGE builder 1 4m1s default 1 4m1s deployer 1 4m The difference is only how authorization is granted. For project-101 the serviceAccount is explicitly granted while for project-100 authorization should be granted via Group called system:serviceaccounts
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.12.5
How reproducible:
Always
Steps to Reproduce:
1. Install OpenShift Container Platform 4.12 2. Create SharedSecret CRD using oc apply -f https://raw.githubusercontent.com/openshift/api/master/sharedresource/v1alpha1/0000_10_sharedsecret.crd.yaml 3. Create SharedSecret resource: $ oc get sharedsecret shared-subscription -o yaml apiVersion: sharedresource.openshift.io/v1alpha1 kind: SharedSecret metadata: annotations: kubectl.kubernetes.io/last-applied-configuration: | {"apiVersion":"sharedresource.openshift.io/v1alpha1","kind":"SharedSecret","metadata":{"annotations":{},"name":"shared-subscription"},"spec":{"secretRef":{"name":"etc-pki-entitlement","namespace":"openshift-config-managed"}}} creationTimestamp: "2023-03-13T08:54:48Z" generation: 1 name: shared-subscription resourceVersion: "1567499" uid: 15c350aa-0de1-4a02-b876-9b822ba0afe5 spec: secretRef: name: etc-pki-entitlement namespace: openshift-config-managed 4. Create ClusterRole to grant access to SharedSecret: $ oc get clusterrole shared-secret-cluster-role -o yaml apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRole metadata: annotations: kubectl.kubernetes.io/last-applied-configuration: | {"apiVersion":"rbac.authorization.k8s.io/v1","kind":"ClusterRole","metadata":{"annotations":{},"name":"shared-secret-cluster-role"},"rules":[{"apiGroups":["sharedresource.openshift.io"],"resourceNames":["shared-subscription"],"resources":["sharedsecrets"],"verbs":["use"]}]} creationTimestamp: "2023-03-13T08:57:24Z" name: shared-secret-cluster-role resourceVersion: "1568481" uid: 99324722-ac62-4bb8-a7fe-7ac915393e19 rules: - apiGroups: - sharedresource.openshift.io resourceNames: - shared-subscription resources: - sharedsecrets verbs: - use 5. Create ClusterRoleBinding to access SharedSecret $ oc get clusterrolebinding shared-secret-cluster-role-binding -o yaml apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRoleBinding metadata: annotations: kubectl.kubernetes.io/last-applied-configuration: | {"apiVersion":"rbac.authorization.k8s.io/v1","kind":"ClusterRoleBinding","metadata":{"annotations":{},"name":"shared-secret-cluster-role-binding"},"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"ClusterRole","name":"shared-secret-cluster-role"},"subjects":[{"apiGroup":"rbac.authorization.k8s.io","kind":"Group","name":"system:serviceaccounts"}]} creationTimestamp: "2023-03-13T08:59:46Z" name: shared-secret-cluster-role-binding resourceVersion: "1575464" uid: dd11825d-834a-4807-ab82-30dc0a415985 roleRef: apiGroup: rbac.authorization.k8s.io kind: ClusterRole name: shared-secret-cluster-role subjects: - apiGroup: rbac.authorization.k8s.io kind: Group name: system:serviceaccounts - kind: ServiceAccount name: builder namespace: project-101 6. 
Run SubjectAccessReview call to validate authorization: $ curl -k -X POST -H "Content-Type: application/json" -H "Authorization: Bearer <token>" --data "@/tmp/post.json" https://api.<url>:6443/apis/authorization.k8s.io/v1/subjectaccessreviews { "kind": "SubjectAccessReview", "apiVersion": "authorization.k8s.io/v1", "metadata": { "creationTimestamp": null, "managedFields": [ { "manager": "curl", "operation": "Update", "apiVersion": "authorization.k8s.io/v1", "time": "2023-03-13T09:17:45Z", "fieldsType": "FieldsV1", "fieldsV1": { "f:spec": { "f:resourceAttributes": { ".": {}, "f:group": {}, "f:name": {}, "f:namespace": {}, "f:resource": {}, "f:verb": {} }, "f:user": {} } } } ] }, "spec": { "resourceAttributes": { "namespace": "project-100", "verb": "use", "group": "sharedresource.openshift.io", "resource": "sharedsecrets", "name": "shared-subscription" }, "user": "system:serviceaccount:project-100:builder" }, "status": { "allowed": false } }
Actual results:
$ curl -k -X POST -H "Content-Type: application/json" -H "Authorization: Bearer <token>" --data "@/tmp/post.json" https://api.<url>:6443/apis/authorization.k8s.io/v1/subjectaccessreviews { "kind": "SubjectAccessReview", "apiVersion": "authorization.k8s.io/v1", "metadata": { "creationTimestamp": null, "managedFields": [ { "manager": "curl", "operation": "Update", "apiVersion": "authorization.k8s.io/v1", "time": "2023-03-13T09:17:45Z", "fieldsType": "FieldsV1", "fieldsV1": { "f:spec": { "f:resourceAttributes": { ".": {}, "f:group": {}, "f:name": {}, "f:namespace": {}, "f:resource": {}, "f:verb": {} }, "f:user": {} } } } ] }, "spec": { "resourceAttributes": { "namespace": "project-100", "verb": "use", "group": "sharedresource.openshift.io", "resource": "sharedsecrets", "name": "shared-subscription" }, "user": "system:serviceaccount:project-100:builder" }, "status": { "allowed": false } }
Expected results:
$ curl -k -X POST -H "Content-Type: application/json" -H "Authorization: Bearer <token>" --data "@/tmp/post.json" https://api.<url>:6443/apis/authorization.k8s.io/v1/subjectaccessreviews { "kind": "SubjectAccessReview", "apiVersion": "authorization.k8s.io/v1", "metadata": { "creationTimestamp": null, "managedFields": [ { "manager": "curl", "operation": "Update", "apiVersion": "authorization.k8s.io/v1", "time": "2023-03-13T09:16:47Z", "fieldsType": "FieldsV1", "fieldsV1": { "f:spec": { "f:resourceAttributes": { ".": {}, "f:group": {}, "f:name": {}, "f:namespace": {}, "f:resource": {}, "f:verb": {} }, "f:user": {} } } } ] }, "spec": { "resourceAttributes": { "namespace": "project-101", "verb": "use", "group": "sharedresource.openshift.io", "resource": "sharedsecrets", "name": "shared-subscription" }, "user": "system:serviceaccount:project-101:builder" }, "status": { "allowed": true, "reason": "RBAC: allowed by ClusterRoleBinding \"shared-secret-cluster-role-binding\" of ClusterRole \"shared-secret-cluster-role\" to ServiceAccount \"builder/project-101\"" } }
Additional info:
The goal is to use the Group "system:serviceaccounts" to authorize all serviceAccounts to access the given resources to avoid listing all namespaces specifically and thus have the need to create a controller that needs to update a list or similar.
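For reference, the /tmp/post.json used in the commands above is not included in the report; a hypothetical reconstruction, inferred from the spec echoed back in the responses, would be:
~~~
# Hypothetical reconstruction of /tmp/post.json, inferred from the "spec"
# echoed back in the SubjectAccessReview responses above.
cat > /tmp/post.json <<'EOF'
{
  "kind": "SubjectAccessReview",
  "apiVersion": "authorization.k8s.io/v1",
  "spec": {
    "resourceAttributes": {
      "namespace": "project-100",
      "verb": "use",
      "group": "sharedresource.openshift.io",
      "resource": "sharedsecrets",
      "name": "shared-subscription"
    },
    "user": "system:serviceaccount:project-100:builder"
  }
}
EOF

# Same call as in the report; <token> and <url> are placeholders.
curl -k -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  --data "@/tmp/post.json" \
  "https://api.<url>:6443/apis/authorization.k8s.io/v1/subjectaccessreviews"
~~~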
Description of problem:
When creating an image for arm, i.e. using: architecture: arm64 and running $ ./bin/openshift-install agent create image --dir ./cluster-manifests/ --log-level debug the output indicates that the correct base ISO was extracted from the release: INFO Extracting base ISO from release payload DEBUG Using mirror configuration DEBUG Fetching image from OCP release (oc adm release info --image-for=machine-os-images --insecure=true --icsp-file=/tmp/icsp-file347546417 registry.ci.openshift.org/origin/release:4.13) DEBUG extracting /coreos/coreos-aarch64.iso to /home/bfournie/.cache/agent/image_cache, oc image extract --path /coreos/coreos-aarch64.iso:/home/bfournie/.cache/agent/image_cache --confirm --icsp-file=/tmp/icsp-file3609464443 registry.ci.openshift.org/origin/4.13-2023-03-09-142410@sha256:e3c4445cabe16ca08c5b874b7a7c9d378151eb825bacc90e240cfba9339a828c INFO Base ISO obtained from release and cached at /home/bfournie/.cache/agent/image_cache/coreos-aarch64.iso DEBUG Extracted base ISO image /home/bfournie/.cache/agent/image_cache/coreos-aarch64.iso from release payload When in fact the ISO was not extracted from the release image and the command failed: ERROR failed to write asset (Agent Installer ISO) to disk: cannot generate ISO image due to configuration errors FATAL failed to fetch Agent Installer ISO: failed to generate asset "Agent Installer ISO": provided device /home/bfournie/.cache/agent/image_cache/coreos-aarch64.iso does not exist
Version-Release number of selected component (if applicable):
4.13
How reproducible:
every time
Steps to Reproduce:
1. Set architecture: arm64 for all hosts in install-config.yaml 2. Run the openshift-install command as above 3. See the log messages and the command fails
Actual results:
Invalid messages are logged and command fails
Expected results:
Command succeeds
Additional info:
Description of problem:
During the documentation writing phase, we have received suggestions to improve texts in the vSphere Connection modal. We should address them. https://docs.google.com/document/d/1jLnHuJyOR5nyuFTpSO6LcuHDVrVGUSs2EMpLFey1qDQ/edit
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Deploy OCP cluster on the vSphere platform 2. On the homepage of the Console, see VCenter status plugin 3.
Actual results:
Expected results:
Additional info:
It's about rephrasing only.
Description of problem:
When doing an IPv6-only agent-based install on bare metal, this fails if the rendezvousIP value is not canonical.
Version-Release number of selected component (if applicable):
OCP 4.12
How reproducible:
Every time.
Steps to Reproduce:
1. Configure the agent through agent-config.yaml for an IPv6-only install. 2. Set rendezvousIP to something that is correct, but not canonical, for example: rendezvousIP: 2a00:8a00:4000:020c:0000:0000:0018:143c 3. Generate the discovery ISO and boot the nodes.
Actual results:
Installation fails because the set-node-zero.sh script fails to discover that it is running on node zero.
Expected results:
Installation completes.
Additional info:
The code that detects whether a host is node-zero uses this: is_rendezvous_host=$(ip -j address | jq "[.[].addr_info] | flatten | map(.local==\"$NODE_ZERO_IP\") | any") This fails in unexpected ways with IPv6 addresses that are not canonical, as the output of ip address is always canonical, but in this case the value for $NODE_ZERO_IP wasn't.
We did test this on the node itself: [root@slabnode2290 bin]# ip -j address | jq '[.[].addr_info] | flatten | map(.local=="2a00:8a00:4000:020c:0000:0000:0018:143c") | any' false [root@slabnode2290 bin]# ip -j address | jq '[.[].addr_info] | flatten | map(.local=="2a00:8a00:4000:20c::18:143c") | any' true A solution may be to use a tool like ipcalc, once available, to do this test and make it less strict. In the meantime a note in the docs would be a good idea.
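As a rough sketch of the workaround idea, the rendezvous IP could be normalized to its canonical (compressed) form before the jq comparison. This assumes python3 is available on the host; it is not the actual set-node-zero.sh fix.
~~~
# Sketch only: normalize NODE_ZERO_IP to its canonical (compressed) form
# before comparing against the output of `ip -j address`, which is always
# canonical. Assumes python3 is available; ipcalc (as suggested above)
# would be an alternative once available.
NODE_ZERO_IP="2a00:8a00:4000:020c:0000:0000:0018:143c"

CANONICAL_IP=$(python3 -c 'import ipaddress, sys; print(ipaddress.ip_address(sys.argv[1]).compressed)' "$NODE_ZERO_IP")

ip -j address | jq "[.[].addr_info] | flatten | map(.local==\"$CANONICAL_IP\") | any"
# prints "true" on node zero, "false" elsewhere
~~~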
This is a clone of issue OCPBUGS-18990. The following is the description of the original issue:
—
Description of problem:
The script refactoring from https://github.com/openshift/cluster-etcd-operator/pull/1057 introduced a regression. Since the static pod list variable was renamed, it is now empty and won't restore the non-etcd pod yamls anymore.
Version-Release number of selected component (if applicable):
4.14 and later
How reproducible:
always
Steps to Reproduce:
1. create a cluster 2. restore using cluster-restore.sh
Actual results:
the apiserver and other static pods are not immediately restored. The script only outputs this log: removing previous backup /var/lib/etcd-backup/member Moving etcd data-dir /var/lib/etcd/member to /var/lib/etcd-backup starting restore-etcd static pod
Expected results:
the non-etcd static pods should be immediately restored by moving them into the manifest directory again. You can see this by the log output: Moving etcd data-dir /var/lib/etcd/member to /var/lib/etcd-backup starting restore-etcd static pod starting kube-apiserver-pod.yaml static-pod-resources/kube-apiserver-pod-7/kube-apiserver-pod.yaml starting kube-controller-manager-pod.yaml static-pod-resources/kube-controller-manager-pod-7/kube-controller-manager-pod.yaml starting kube-scheduler-pod.yaml static-pod-resources/kube-scheduler-pod-8/kube-scheduler-pod.yaml
Additional info:
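As a rough illustration of what the expected behavior corresponds to (this is not the actual cluster-restore.sh, and the paths are assumptions based on the revisioned filenames in the expected log output), the non-etcd restore step amounts to copying the latest revision of each static pod manifest back into the kubelet manifest directory:
~~~
# Illustrative sketch only, not the real script. Paths assumed from the
# expected log output above.
MANIFEST_DIR=/etc/kubernetes/manifests
STATIC_POD_RESOURCES=/etc/kubernetes/static-pod-resources

for pod in kube-apiserver-pod kube-controller-manager-pod kube-scheduler-pod; do
  # pick the highest revision directory, e.g. kube-apiserver-pod-7
  latest=$(ls -d "${STATIC_POD_RESOURCES}/${pod}"-* 2>/dev/null | sort -V | tail -n1)
  if [ -n "$latest" ]; then
    echo "starting ${pod}.yaml ${latest}/${pod}.yaml"
    cp "${latest}/${pod}.yaml" "${MANIFEST_DIR}/"
  fi
done
~~~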
Description of problem:
Pods are being terminated on kubelet restart if they consume any device. In the case of CNV these Pods are carrying VMs, and the assumption is that the kubelet will not terminate the Pod in this case.
Version-Release number of selected component (if applicable):
4.14 / 4.13.z / 4.12.z
How reproducible:
This should be reproducible with any device plugin, as far as my understanding goes.
Steps to Reproduce:
1. Create Pod requesting device plugin 2. Restart Kubelet 3.
Actual results:
Admission error -> Pod terminates
Expected results:
No error -> Existing & Running Pods will continue running after Kubelet restart
Additional info:
The culprit seems to be https://github.com/kubernetes/kubernetes/pull/116376
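A minimal reproducer for step 1 might look like the sketch below; the extended resource name example.com/device and the image are placeholders for whatever device plugin is actually installed on the cluster.
~~~
# Illustrative reproducer for step 1 above; replace example.com/device with
# a resource advertised by the device plugin you are testing.
cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: device-consumer
spec:
  containers:
  - name: app
    image: registry.access.redhat.com/ubi9/ubi-minimal
    command: ["sleep", "infinity"]
    resources:
      limits:
        example.com/device: "1"   # any device-plugin-provided resource
EOF
~~~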
Description of problem:
When the oc-mirror command runs, the generated ImageContentSourcePolicy.yaml should not include mirrors for the mirrored operator catalogs, but currently it does.
This should be the case for registry-located catalogs and OCI FBC catalogs (located on disk).
Jennifer Power, Alex Flom can you help us confirm this is the expected behavior?
Version-Release number of selected component (if applicable):
4.13
How reproducible:
Always
Steps to Reproduce:
1.Run the oc mirror command mirroring the catalog /bin/oc-mirror --config imageSetConfig.yaml docker://localhost:5000 --use-oci-feature --dest-use-http --dest-skip-tls with imagesetconfig: kind: ImageSetConfiguration apiVersion: mirror.openshift.io/v1alpha2 storageConfig: local: path: /tmp/storageBackend mirror: operators: - catalog: oci:///home/user/catalogs/rhop4.12 # copied from registry.redhat.io/redhat/redhat-operator-index:v4.12 targetCatalog: "mno/redhat-operator-index" targetVersion: "v4.12" packages: - name: aws-load-balancer-operator
Actual results:
Catalog is included in the imageContentSourcePolicy.yaml apiVersion: operators.coreos.com/v1alpha1 kind: CatalogSource metadata: name: redhat-operator-index namespace: openshift-marketplace spec: image: localhost:5000/mno/redhat-operator-index:v4.12 sourceType: grpc --- apiVersion: operator.openshift.io/v1alpha1 kind: ImageContentSourcePolicy metadata: labels: operators.openshift.org/catalog: "true" name: operator-0 spec: repositoryDigestMirrors: - mirrors: - localhost:5000/albo source: registry.redhat.io/albo - mirrors: - localhost:5000/mno source: mno - mirrors: - localhost:5000/openshift4 source: registry.redhat.io/openshift4
Expected results:
No catalog should be included in the imageContentSourcePolicy.yaml apiVersion: operators.coreos.com/v1alpha1 kind: CatalogSource metadata: name: redhat-operator-index namespace: openshift-marketplace spec: image: localhost:5000/mno/redhat-operator-index:v4.12 sourceType: grpc --- apiVersion: operator.openshift.io/v1alpha1 kind: ImageContentSourcePolicy metadata: labels: operators.openshift.org/catalog: "true" name: operator-0 spec: repositoryDigestMirrors: - mirrors: - localhost:5000/albo source: registry.redhat.io/albo - mirrors: - localhost:5000/openshift4 source: registry.redhat.io/openshift4
Additional info:
Description of problem:
Looking at the telemetry data for Nutanix I noticed that the “host_type” for clusters installed with platform nutanix shows as “virt-unknown”. Do you know what needs to happen in the code to tell telemetry about host type being Nutanix? The problem is that we can’t track those installations with platform none, just IPI. Refer to the slack thread https://redhat-internal.slack.com/archives/C0211848DBN/p1687864857228739.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
Create an OCP Nutanix cluster
Actual results:
The telemetry data for Nutanix shows the “host_type” for the nutanix cluster as “virt-unknown”.
Expected results:
The telemetry data for Nutanix shows the “host_type” for the nutanix cluster as "nutanix".
Additional info:
Description of problem:
The link to the OpenShift Route from the Service breaks because of a hardcoded targetPort value. If the targetPort gets changed, the Route still points to the old port value because it is hardcoded.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Install the latest available version of OpenShift Pipelines 2. Create the pipeline and triggerbinding using the attached files 3. Add trigger to the created pipeline from devconsole UI, select the above created triggerbinding while adding trigger 4. Trigger an event using the curl command curl -X POST -d '{ "url": "https://www.github.com/VeereshAradhya/cli" }' -H 'Content-Type: application/json' <route> and make sure that the pipelinerun gets started 5. Update the targetPort in the svc from 8080 to 8000 6. Again use the above curl command to trigger one more event
Actual results:
The curl command throws an error
Expected results:
The curl command should be successful and the pipelinerun should get started successfully
Additional info:
Error: curl -X POST -d '{ "url": "https://www.github.com/VeereshAradhya/cli" }' -H 'Content-Type: application/json' http://el-event-listener-3o9zcv-test-devconsole.apps.ve412psi.psi.ospqa.com <html> <head> <meta name="viewport" content="width=device-width, initial-scale=1"> <style type="text/css"> body { font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; line-height: 1.66666667; font-size: 16px; color: #333; background-color: #fff; margin: 2em 1em; } h1 { font-size: 28px; font-weight: 400; } p { margin: 0 0 10px; } .alert.alert-info { background-color: #F0F0F0; margin-top: 30px; padding: 30px; } .alert p { padding-left: 35px; } ul { padding-left: 51px; position: relative; } li { font-size: 14px; margin-bottom: 1em; } p.info { position: relative; font-size: 20px; } p.info:before, p.info:after { content: ""; left: 0; position: absolute; top: 0; } p.info:before { background: #0066CC; border-radius: 16px; color: #fff; content: "i"; font: bold 16px/24px serif; height: 24px; left: 0px; text-align: center; top: 4px; width: 24px; } @media (min-width: 768px) { body { margin: 6em; } } </style> </head> <body> <div> <h1>Application is not available</h1> <p>The application is currently not serving requests at this endpoint. It may not have been started or is still starting.</p> <div class="alert alert-info"> <p class="info"> Possible reasons you are seeing this page: </p> <ul> <li> <strong>The host doesn't exist.</strong> Make sure the hostname was typed correctly and that a route matching this hostname exists. </li> <li> <strong>The host exists, but doesn't have a matching path.</strong> Check if the URL path was typed correctly and that the route was created using the desired path. </li> <li> <strong>Route and path matches, but all pods are down.</strong> Make sure that the resources exposed by this route (pods, services, deployment configs, etc) have at least one pod running. </li> </ul> </div> </div> </body> </html>
Note:
The above scenario works fine if we create triggers using the YAML files instead of the devconsole UI.
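For illustration only (the resource names and the port name are placeholders, not taken from the attached files), a Route that references a named Service port instead of a hardcoded number keeps working when the Service's targetPort changes:
~~~
# Sketch: point the Route at a *named* service port rather than 8080/8000.
cat <<'EOF' | oc apply -f -
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: el-event-listener
  namespace: test-devconsole
spec:
  to:
    kind: Service
    name: el-event-listener
  port:
    targetPort: http-listener   # named port on the Service, not a number
EOF
~~~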
Description of the problem:
EnsureOperatorPrerequisite uses the cluster CPU architecture, while on a multi-arch cluster the CPU architecture will always be multi. On cluster update, EnsureOperatorPrerequisite will not prevent the cluster from being updated but will fail on the next update request.
Steps to reproduce:
1. Register multi arch cluster (P or Z)
2. Update cluster with ODF operator
3. Update any cluster field
Actual results:
Cluster fails to update the second time
Expected results:
The update should not fail
Description of problem:
These alerts fire without a namespace label:
* KubeStateMetricsListErrors
* KubeStateMetricsWatchErrors
* KubeletPlegDurationHigh
* KubeletTooManyPods
* KubeNodeReadinessFlapping
* KubeletPodStartUpLatencyHigh
Alerting rules without a namespace label make it harder for cluster admins to route the alerts.
Version-Release number of selected component (if applicable):
4.13
How reproducible:
Always
Steps to Reproduce:
1. Check the definitions of the said alerting rules.
Actual results:
The PromQL expressions aggregate away the namespace label and there's no static namespace label either.
Expected results:
Static namespace label in the rule definition.
Additional info:
https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#style-guide Alerts SHOULD include a namespace label indicating the source of the alert. Many alerts will include this by virtue of the fact that their PromQL expressions result in a namespace label. Others may require a static namespace label
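As an illustration of the style-guide requirement (this is not the actual rule definition shipped in the monitoring stack; the expression is an approximation of the upstream kube-state-metrics rule and the label value is an assumption), a static namespace label looks like this:
~~~
# Sketch of a rule carrying a static namespace label when its PromQL
# expression aggregates the label away.
cat <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kube-state-metrics-rules
  namespace: openshift-monitoring
spec:
  groups:
  - name: kube-state-metrics
    rules:
    - alert: KubeStateMetricsListErrors
      expr: |
        (sum(rate(kube_state_metrics_list_total{job="kube-state-metrics",result="error"}[5m]))
          /
         sum(rate(kube_state_metrics_list_total{job="kube-state-metrics"}[5m]))) > 0.01
      for: 15m
      labels:
        severity: warning
        namespace: openshift-monitoring   # static namespace label
EOF
~~~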
Description of problem:
4.14 cluster installation failed with TECH_PREVIEW featuregate
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-04-03-002631
How reproducible:
Always on GCP and Azure platform
Steps to Reproduce:
1. Install 4.14 cluster with TECH_PREVIEW featuregate
Actual results:
Cluster installation failed and shows the error below
oc get pod -n openshift-kube-apiserver -l apiserver --show-labels
E0404 18:13:56.266461 73688 memcache.go:238] couldn't get current server API group list: Get "https://api.maxu-az-tp1.qe.azure.devcluster.openshift.com:6443/api?timeout=32s": dial tcp 20.253.227.131:6443: i/o timeout
E0404 18:14:26.270883 73688 memcache.go:238] couldn't get current server API group list: Get "https://api.maxu-az-tp1.qe.azure.devcluster.openshift.com:6443/api?timeout=32s": dial tcp 20.253.227.131:6443: i/o timeout
E0404 18:14:56.269363 73688 memcache.go:238] couldn't get current server API group list: Get "https://api.maxu-az-tp1.qe.azure.devcluster.openshift.com:6443/api?timeout=32s": dial tcp 20.253.227.131:6443: i/o timeout
E0404 18:14:58.075111 73688 memcache.go:255] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0404 18:14:58.302392 73688 memcache.go:255] couldn't get resource list for security.openshift.io/v1: the server is currently unable to handle the request
E0404 18:14:58.309541 73688 memcache.go:255] couldn't get resource list for template.openshift.io/v1: the server is currently unable to handle the request
E0404 18:14:58.313497 73688 memcache.go:255] couldn't get resource list for packages.operators.coreos.com/v1: the server is currently unable to handle the request
NAME READY STATUS RESTARTS AGE LABELS
kube-apiserver-maxu-az-tp1-86n5v-master-2 4/5 CrashLoopBackOff 7 (2m41s ago) 16m apiserver=true,app=openshift-kube-apiserver,revision=16
Expected results:
Cluster installation should succeed and not show any errors
Additional info:
https://issues.redhat.com/browse/OCPQE-14686
https://drive.google.com/file/d/1EHVuPFaSJA50R2k8uVVUVDvGDCfG9ZYN/view?usp=sharing
https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/?job=*4.14*-tp-*
https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/?job=*4.14*-techpreview*
Description of problem:
When testing AWS on-prem BM expansion, the BMO is not able to reach the IRONIC_ENDPOINT
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-08-10-021647
How reproducible:
100%
Steps to Reproduce:
1. Install IPI AWS 3-node-compact cluster 2. Deploy BMO via YAML 3. Connect AWS against external on-prem env via VPN (out of scope) 4. Create BMH using "preprovisioningNetworkDataName" to push static IP and routes.
Actual results:
BMO is not able to reach the Ironic endpoint with the following error: ~~~ 2023-08-10T16:09:22.216778289Z {"level":"info","ts":"2023-08-10T16:09:22Z","logger":"provisioner.ironic","msg":"error caught while checking endpoint","host":"openshift-machine-api~openshift-qe-065","endpoint":"https://metal3-state.openshift-machine-api.svc.cluster.local:6385/v1/","error":"Get \"https://metal3-state.openshift-machine-api.svc.cluster.local:6385/v1\": dial tcp 172.30.19.119:6385: i/o timeout"} ~~~
Expected results:
Standard deploy
Additional info:
Must-gather provided separately
Description of problem:
OpenShift Console does not filter the SecretList when displaying the ServiceAccount details page When reviewing the details page of an OpenShift ServiceAccount, at the bottom of the page there is a SecretsList which is intended to display all of the relevant Secrets that are attached to the ServiceAccount. In OpenShift 4.8.X, this SecretList only displayed the relevant Secrets. In OpenShift 4.9+ the SecretList now displays all Secrets within the entire Namespace.
Version-Release number of selected component (if applicable):
4.8.57 < Most recent release without issue 4.9.0 < First release with issue 4.10.46 < Issue is still present
How reproducible:
Every time
Steps to Reproduce:
1. Deploy a cluster with OpenShift 4.8.57 (or replace the OpenShift Console image with `sha256:9dd115a91a4261311c44489011decda81584e1d32982533bf69acf3f53e17540` ) 2. Access the ServiceAccounts Page ( User Management -> ServiceAccounts) 3. Click a ServiceAccount to display the Details page 4. Scroll down and review the Secrets section 5. Repeat steps with an OpenShift 4.9 release (or check using image `sha256:fc07081f337a51f1ab957205e096f68e1ceb6a5b57536ea6fc7fbcea0aaaece0` )
Actual results:
All Secrets in the Namespace are displayed
Expected results:
Only Secrets associated with the ServiceAccount are displayed
Additional info:
Lightly reviewing the code, the following links might be a good start: - https://github.com/openshift/console/blob/master/frontend/public/components/secret.jsx#L126 - https://github.com/openshift/console/blob/master/frontend/public/components/service-account.jsx#L151:L151
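For comparison, the Secrets actually referenced by a ServiceAccount (what the details page is expected to show) can be listed from the CLI; the name and namespace below are placeholders:
~~~
# Secrets referenced by the ServiceAccount object itself (mountable secrets
# and image pull secrets), which is the set the console page should display.
oc get sa builder -n example-namespace \
  -o jsonpath='{range .secrets[*]}{.name}{"\n"}{end}'
oc get sa builder -n example-namespace \
  -o jsonpath='{range .imagePullSecrets[*]}{.name}{"\n"}{end}'
~~~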
Description of problem:
On Azure, after deleting a master, the old machine is stuck in Deleting and some pods in the cluster are in ImagePullBackOff. Checking from the Azure console, the new master was not added into the load balancer backend, which seems to leave the machine with no internet connection.
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2023-02-12-024338
How reproducible:
Always
Steps to Reproduce:
1. Set up a cluster on Azure, networkType ovn 2. Delete a master 3. Check master and pod
Actual results:
Old machine stuck in Deleting, some pods are in ImagePullBackOff. $ oc get machine NAME PHASE TYPE REGION ZONE AGE zhsunaz2132-5ctmh-master-0 Deleting Standard_D8s_v3 westus 160m zhsunaz2132-5ctmh-master-1 Running Standard_D8s_v3 westus 160m zhsunaz2132-5ctmh-master-2 Running Standard_D8s_v3 westus 160m zhsunaz2132-5ctmh-master-flqqr-0 Running Standard_D8s_v3 westus 105m zhsunaz2132-5ctmh-worker-westus-dhwfz Running Standard_D4s_v3 westus 152m zhsunaz2132-5ctmh-worker-westus-dw895 Running Standard_D4s_v3 westus 152m zhsunaz2132-5ctmh-worker-westus-xlsgm Running Standard_D4s_v3 westus 152m $ oc describe machine zhsunaz2132-5ctmh-master-flqqr-0 -n openshift-machine-api |grep -i "Load Balancer" Internal Load Balancer: zhsunaz2132-5ctmh-internal Public Load Balancer: zhsunaz2132-5ctmh $ oc get node NAME STATUS ROLES AGE VERSION zhsunaz2132-5ctmh-master-0 Ready control-plane,master 165m v1.26.0+149fe52 zhsunaz2132-5ctmh-master-1 Ready control-plane,master 165m v1.26.0+149fe52 zhsunaz2132-5ctmh-master-2 Ready control-plane,master 165m v1.26.0+149fe52 zhsunaz2132-5ctmh-master-flqqr-0 NotReady control-plane,master 109m v1.26.0+149fe52 zhsunaz2132-5ctmh-worker-westus-dhwfz Ready worker 152m v1.26.0+149fe52 zhsunaz2132-5ctmh-worker-westus-dw895 Ready worker 152m v1.26.0+149fe52 zhsunaz2132-5ctmh-worker-westus-xlsgm Ready worker 152m v1.26.0+149fe52 $ oc describe node zhsunaz2132-5ctmh-master-flqqr-0 Warning ErrorReconcilingNode 3m5s (x181 over 108m) controlplane [k8s.ovn.org/node-chassis-id annotation not found for node zhsunaz2132-5ctmh-master-flqqr-0, macAddress annotation not found for node "zhsunaz2132-5ctmh-master-flqqr-0" , k8s.ovn.org/l3-gateway-config annotation not found for node "zhsunaz2132-5ctmh-master-flqqr-0"] $ oc get po --all-namespaces | grep ImagePullBackOf openshift-cluster-csi-drivers azure-disk-csi-driver-node-l8ng4 0/3 Init:ImagePullBackOff 0 113m openshift-cluster-csi-drivers azure-file-csi-driver-node-99k82 0/3 Init:ImagePullBackOff 0 113m openshift-cluster-node-tuning-operator tuned-bvvh7 0/1 ImagePullBackOff 0 113m openshift-dns node-resolver-2p4zq 0/1 ImagePullBackOff 0 113m openshift-image-registry node-ca-vxv87 0/1 ImagePullBackOff 0 113m openshift-machine-config-operator machine-config-daemon-crt5w 1/2 ImagePullBackOff 0 113m openshift-monitoring node-exporter-mmjsm 0/2 Init:ImagePullBackOff 0 113m openshift-multus multus-4cg87 0/1 ImagePullBackOff 0 113m openshift-multus multus-additional-cni-plugins-mc6vx 0/1 Init:ImagePullBackOff 0 113m openshift-ovn-kubernetes ovnkube-master-qjjsv 0/6 ImagePullBackOff 0 113m openshift-ovn-kubernetes ovnkube-node-k8w6j 0/6 ImagePullBackOff 0 113m
Expected results:
Master replacement is successful
Additional info:
Tested payload 4.13.0-0.nightly-2023-02-03-145213, same result. Before we have tested in 4.13.0-0.nightly-2023-01-27-165107, all works well.
Description of problem:
If the HyperShift operator is installed onto a cluster, it creates VPC Endpoint Services fronting the hosted Kubernetes API Server for downstream HyperShift clusters to connect to. These VPC Endpoint Services are tagged such that the uninstaller would attempt to action them: "kubernetes.io/cluster/${ID}: owned" However they cannot be deleted until all active VPC Endpoint Connections are rejected - the uninstaller should be able to do this.
Version-Release number of selected component (if applicable):
4.12 (but shouldn't be version-specific)
How reproducible:
100%
Steps to Reproduce:
1. Create an NLB + VPC Endpoint Service in the same VPC as a cluster 2. Tag it accordingly and create a VPC Endpoint connection to it
Actual results:
The uninstaller will not be able to delete the VPC Endpoint Service + the NLB that the VPC Endpoint Service is fronting
Expected results:
The VPC Endpoint Service can be completely cleaned up, which would allow the NLB to be cleaned up
Additional info:
Description of problem:
When clicking on "Duplicate RoleBinding" in the OpenShift Container Platform Web Console, users are taken to a form where they can review the duplicated RoleBinding. When the RoleBinding has a ServiceAccount as a subject, clicking "Create" leads to the following error: An error occurred Error "Unsupported value: "rbac.authorization.k8s.io": supported values: """ for field "subjects[0].apiGroup". The root cause seems to be that the field "subjects[0].apiGroup" is set to "rbac.authorization.k8s.io" even for "kind: ServiceAccount" subjects. For "kind: ServiceAccount" subjects, this field is not necessary but the "namespace" field should be set instead. The functionality works as expected for User and Group subjects.
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.12.19
How reproducible:
Always
Steps to Reproduce:
1. In the OpenShift Container Platform Web Console, click on "User Management" => "Role Bindings" 2. Search for a RoleBinding that has a "ServiceAccount" as the subject. On the far right, click on the dots and choose "Duplicate RoleBinding" 3. Review the fields, set a new name for the duplicated RoleBinding, click "Create"
Actual results:
Duplicating fails with the following error message being shown: An error occurred Error "Unsupported value: "rbac.authorization.k8s.io": supported values: """ for field "subjects[0].apiGroup".
Expected results:
RoleBinding is duplicated without an error message
Additional info:
Reproduced with OpenShift Container Platform 4.12.18 and 4.12.19
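For reference, a RoleBinding subject of kind ServiceAccount must omit apiGroup and carry a namespace field instead, which is what the duplicated object would need to look like (the names below are placeholders):
~~~
# Sketch of a valid RoleBinding with a ServiceAccount subject: no apiGroup,
# but a namespace. User/Group subjects keep apiGroup: rbac.authorization.k8s.io.
cat <<'EOF' | oc apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: example-rolebinding-copy
  namespace: example-namespace
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: example-role
subjects:
- kind: ServiceAccount          # no apiGroup for ServiceAccount subjects
  name: example-serviceaccount
  namespace: example-namespace
EOF
~~~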
Description of problem:
The readme.md of builder is just a one-liner overview of the project. It would be helpful to add some additional details for new contributors/visitors to the project.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Install an IPI cluster where all nodes are provisioned from an Azure marketplace image with a purchase plan. install-config.yaml: --------------------------- platform: azure: region: eastus baseDomainResourceGroupName: os4-common defaultMachinePlatform: osImage: publisher: Redhat <---- contains uppercase letter offer: rh-ocp-worker sku: rh-ocp-worker version: 4.8.2021122100 plan: WithPurchasePlan As some marketplace images are free without a plan, the publisher in install-config should come from the output of `az vm image list`: # az vm image list --offer rh-ocp-worker --all -otable Architecture Offer Publisher Sku Urn Version -------------- ------------- -------------- ------------------ -------------------------------------------------------------- -------------- x64 rh-ocp-worker redhat-limited rh-ocp-worker redhat-limited:rh-ocp-worker:rh-ocp-worker:4.8.2021122100 4.8.2021122100 x64 rh-ocp-worker RedHat rh-ocp-worker RedHat:rh-ocp-worker:rh-ocp-worker:4.8.2021122100 4.8.2021122100 x64 rh-ocp-worker redhat-limited rh-ocp-worker-gen1 redhat-limited:rh-ocp-worker:rh-ocp-worker-gen1:4.8.2021122100 4.8.2021122100 x64 rh-ocp-worker RedHat rh-ocp-worker-gen1 RedHat:rh-ocp-worker:rh-ocp-worker-gen1:4.8.2021122100 4.8.2021122100 The image plan is as below; its publisher is lowercase. # az vm image show --urn RedHat:rh-ocp-worker:rh-ocp-worker:4.8.2021122100 --query plan { "name": "rh-ocp-worker", "product": "rh-ocp-worker", "publisher": "redhat" } In the installer (https://github.com/openshift/installer/blob/master/data/data/azure/bootstrap/main.tf#L243-L246), the publisher property in the image plan comes from the publisher we set in install-config.yaml; the installer should instead use the publisher property from the image plan output. But the image plan is case-sensitive, so in such a case the bootstrap instance provisioning fails with the below error. Unable to deploy from the Marketplace image or a custom image sourced from Marketplace image. The part number in the purchase information for VM '/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima15image1-flg24-rg/providers/Microsoft.Compute/virtualMachines/jima15image1-flg24-bootstrap' is not as expected. Beware that the Plan object's properties are case-sensitive. Learn more about common virtual machine error codes. Similar errors occur when provisioning worker instances from this image, where the image publisher contains upper case but the publisher in its plan is all lowercase. worker machineset: ---------------------------- Spec: Lifecycle Hooks: Metadata: Provider ID: azure:///subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/ci-op-cc5g2rw8-55267-q66k7-rg/providers/Microsoft.Compute/virtualMachines/ci-op-cc5g2rw8-55267-q66k7-worker-southcentralus1-dq6sp Provider Spec: Value: Accelerated Networking: true API Version: machine.openshift.io/v1beta1 Credentials Secret: Name: azure-cloud-credentials Namespace: openshift-machine-api Diagnostics: Boot: Storage Account Type: AzureManaged Image: Offer: rh-ocp-worker Publisher: RedHat Resource ID: Sku: rh-ocp-worker Type: WithPurchasePlan Version: 4.8.2021122100 Kind: AzureMachineProviderSpec Location: southcentralus Managed Identity: ci-op-cc5g2rw8-55267-q66k7-identity Error when provisioning a worker instance: Unable to deploy from the Marketplace image or a custom image sourced from Marketplace image.
The part number in the purchase information for VM '/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/ci-op-cc5g2rw8-55267-q66k7-rg/providers/Microsoft.Compute/virtualMachines/ci-op-cc5g2rw8-55267-q66k7-worker-southcentralus1-mmr2h' is not as expected. Beware that the Plan object's properties are case-sensitive. Learn more about common virtual machine error codes.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-08-11-055332
How reproducible:
Always on 4.14 for bootstrap/masters; always on 4.11+ for workers
Steps to Reproduce:
1. Configure osImage for all nodes in install-config, setting publisher to RedHat 2. Install the cluster. 3.
Actual results:
Bootstrap instance provisioning failed.
Expected results:
installation is successful.
Additional info:
Installation is successful when setting publisher to "redhat"
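As a workaround sketch based on the note above, the plan's publisher (the case-sensitive value) can be looked up and used verbatim in install-config.yaml. The fragment below reuses the values from this report and is illustrative only:
~~~
# Look up the purchase plan's publisher; the plan is case-sensitive.
az vm image show --urn RedHat:rh-ocp-worker:rh-ocp-worker:4.8.2021122100 --query plan
# => "publisher": "redhat"

cat <<'EOF'
# install-config.yaml fragment (illustrative)
platform:
  azure:
    defaultMachinePlatform:
      osImage:
        publisher: redhat          # matches the plan's publisher casing
        offer: rh-ocp-worker
        sku: rh-ocp-worker
        version: 4.8.2021122100
        plan: WithPurchasePlan
EOF
~~~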
Description of problem:
A build which works on 4.12 errored out on 4.13.
Version-Release number of selected component (if applicable):
oc --context build02 get clusterversion version NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.13.0-ec.3 True False 4d2h Cluster version is 4.13.0-ec.3
How reproducible:
Always
Steps to Reproduce:
1. oc new-project hongkliu-test 2. oc create is test-is --as system:admin 3. oc apply -f test-bc.yaml # the file is in the attachment
Actual results:
oc --context build02 logs test-bc-5-build Defaulted container "docker-build" out of: docker-build, manage-dockerfile (init) time="2023-02-20T19:13:38Z" level=info msg="Not using native diff for overlay, this may cause degraded performance for building images: kernel has CONFIG_OVERLAY_FS_REDIRECT_DIR enabled" I0220 19:13:38.405163 1 defaults.go:112] Defaulting to storage driver "overlay" with options [mountopt=metacopy=on]. Caching blobs under "/var/cache/blobs".Pulling image image-registry.openshift-image-registry.svc:5000/ci/html-proofer@sha256:684aae4e929e596f7042c34a3604c81137860187305f775c2380774bda4b6b08 ... Trying to pull image-registry.openshift-image-registry.svc:5000/ci/html-proofer@sha256:684aae4e929e596f7042c34a3604c81137860187305f775c2380774bda4b6b08... Getting image source signatures Copying blob sha256:aa8ae8202b42d1c70c3a7f65680eabc1c562a29227549b9a1b33dc03943b20d2 Copying blob sha256:31326f32ac37d5657248df0a6aa251ec6a416dab712ca1236ea40ca14322a22c Copying blob sha256:b21786fe7c0d7561a5b89ca15d8a1c3e4ea673820cd79f1308bdfd8eb3cb7142 Copying blob sha256:68296e6645b26c3af42fa29b6eb7f5befa3d8131ef710c25ec082d6a8606080d Copying blob sha256:6b1c37303e2d886834dab68eb5a42257daeca973bbef3c5d04c4868f7613c3d3 Copying blob sha256:cbdbe7a5bc2a134ca8ec91be58565ec07d037386d1f1d8385412d224deafca08 Copying blob sha256:46cf6a1965a3b9810a80236b62c42d8cdcd6fb75f9b58d1b438db5736bcf2669 Copying config sha256:9aefe4e59d3204741583c5b585d4d984573df8ff751c879c8a69379c168cb592 Writing manifest to image destination Storing signatures Adding transient rw bind mount for /run/secrets/rhsm STEP 1/4: FROM image-registry.openshift-image-registry.svc:5000/ci/html-proofer@sha256:684aae4e929e596f7042c34a3604c81137860187305f775c2380774bda4b6b08 STEP 2/4: RUN apk add --no-cache bash fetch http://dl-cdn.alpinelinux.org/alpine/v3.11/main/x86_64/APKINDEX.tar.gz fetch http://dl-cdn.alpinelinux.org/alpine/v3.11/community/x86_64/APKINDEX.tar.gz (1/1) Installing bash (5.0.11-r1) Executing bash-5.0.11-r1.post-install ERROR: bash-5.0.11-r1.post-install: script exited with error 127 Executing busybox-1.31.1-r9.trigger ERROR: busybox-1.31.1-r9.trigger: script exited with error 127 1 error; 21 MiB in 40 packages error: build error: building at STEP "RUN apk add --no-cache bash": while running runtime: exit status 1
Expected results:
Additional info:
Run the build on build01 (4.12.4) and it works fine. oc --context build01 get clusterversion version NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.12.4 True False 2d11h Cluster version is 4.12.4
Please review the following PR: https://github.com/openshift/machine-api-provider-aws/pull/64
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Following doc [1] to assign a custom role with minimum permissions for destroying a cluster to the installer Service Principal. As read permission on the public DNS zone and private DNS zone is missing from that doc for destroying an IPI cluster, the public DNS records cannot be removed. But installer destroy completes without any warning message. $ ./openshift-install destroy cluster --dir ipi --log-level debug DEBUG OpenShift Installer 4.13.0-0.nightly-2023-02-16-120330 DEBUG Built from commit c0bf49ca9e83fd00dfdfbbdddd47fbe6b5cdd510 INFO Credentials loaded from file "/home/fedora/.azure/osServicePrincipal.json" DEBUG deleting public records DEBUG deleting resource group INFO deleted resource group=jima-ipi-role-l7qgz-rg DEBUG deleting application registrations DEBUG Purging asset "Metadata" from disk DEBUG Purging asset "Master Ignition Customization Check" from disk DEBUG Purging asset "Worker Ignition Customization Check" from disk DEBUG Purging asset "Terraform Variables" from disk DEBUG Purging asset "Kubeconfig Admin Client" from disk DEBUG Purging asset "Kubeadmin Password" from disk DEBUG Purging asset "Certificate (journal-gatewayd)" from disk DEBUG Purging asset "Cluster" from disk INFO Time elapsed: 6m16s INFO Uninstallation complete! $ az network dns record-set a list --resource-group os4-common --zone-name qe.azure.devcluster.openshift.com -o table| grep jima-ipi-role *.apps.jima-ipi-role os4-common 30 A kubernetes.io_cluster.jima-ipi-role-l7qgz="owned" $ az network dns record-set cname list --resource-group os4-common --zone-name qe.azure.devcluster.openshift.com -o table| grep jima-ipi-role api.jima-ipi-role os4-common 300 CNAME kubernetes.io_cluster.jima-ipi-role-l7qgz="owned" [1] https://docs.google.com/document/d/1iEs7T09Opj0iMXvpKeSatsAyPoda_gWQvFKQuWA3QdM/edit#
Version-Release number of selected component (if applicable):
4.13 nightly build
How reproducible:
always
Steps to Reproduce:
1. Create custom role with limited permission for destroying cluster, without read permission on public dns zone and private dns zone. 2. Assign the custom role to Service Principal 3. Use this SP to destroy cluster
Actual results:
Although some permissions are missing, the installer destroy completed without any warning.
Expected results:
The installer should show a warning message indicating the leftover resources and the specific reason, so that the user can process them further.
Additional info:
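For manual cleanup of the leftover records shown above, something like the following should work (zone and record names taken from this report; the commands prompt for confirmation):
~~~
# Remove the orphaned public DNS records left behind by the destroy.
az network dns record-set a delete \
  --resource-group os4-common \
  --zone-name qe.azure.devcluster.openshift.com \
  --name '*.apps.jima-ipi-role'

az network dns record-set cname delete \
  --resource-group os4-common \
  --zone-name qe.azure.devcluster.openshift.com \
  --name api.jima-ipi-role
~~~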
Description of problem:
When creating a hosted cluster on a management cluster that has an imagecontentsourcepolicy that does not include openshift-release-dev or ocp/release images, the control plane operator fails reconciliation with an error: {"level":"error","ts":"2023-08-22T18:26:07Z","msg":"Reconciler error","controller":"hostedcontrolplane","controllerGroup":"hypershift.openshift.io","controllerKind":"HostedControlPlane","HostedControlPlane":{"name":"jiezhao-test","namespace":"clusters-jiezhao-test"},"namespace":"clusters-jiezhao-test","name":"jiezhao-test","reconcileID":"9b3c101b-b4d2-4d9e-b71c-ede9e0b55374","error":"failed to update control plane: failed to reconcile ignition server: failed to parse private registry hosted control plane image reference \"\": repository name must have at least one component","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:326\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234"}
Version-Release number of selected component (if applicable):
4.14
How reproducible:
always
Steps to Reproduce:
1. Create an ImageContentSourcePolicy on a management cluster: apiVersion: operator.openshift.io/v1alpha1 kind: ImageContentSourcePolicy metadata: name: brew-registry resourceVersion: "31794" uid: 7231c634-da35-4c56-b2ef-be48c2571a9c spec: repositoryDigestMirrors: - mirrors: - brew.registry.redhat.io source: registry.redhat.io - mirrors: - brew.registry.redhat.io source: registry.stage.redhat.io - mirrors: - brew.registry.redhat.io source: registry-proxy.engineering.redhat.com 2. Install the latest hypershift operator and create a hosted cluster with the latest 4.14 ci build
Actual results:
The hostedcluster never creates machines and never gets to a Complete state
Expected results:
The hostedcluster comes up and gets to a Complete state
Additional info:
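For illustration, an ICSP entry that does cover the release payload repositories mentioned in the description would look roughly like the sketch below; the mirror host is a placeholder and this is not presented as the official fix for the controller behavior:
~~~
# Sketch of an additional ICSP covering the release payload repositories.
cat <<'EOF' | oc apply -f -
apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: release-images
spec:
  repositoryDigestMirrors:
  - mirrors:
    - mirror.example.com/openshift-release-dev/ocp-release
    source: quay.io/openshift-release-dev/ocp-release
  - mirrors:
    - mirror.example.com/openshift-release-dev/ocp-v4.0-art-dev
    source: quay.io/openshift-release-dev/ocp-v4.0-art-dev
EOF
~~~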
Description of problem:
When trying to delete a BMH object that is unmanaged, Metal3 cannot delete it. The BMH object is unmanaged because it does not provide information about the BMC (neither address nor credentials).
In this case Metal3 tries to delete but fails and never finalizes. The BMH deletion gets stuck.
This is the log from Metal3:
{"level":"info","ts":1676531586.4898946,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/worker-0.el8k-ztp-1.hpecloud.org"} {"level":"info","ts":1676531586.4980938,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/master-1.el8k-ztp-1.hpecloud.org"} {"level":"info","ts":1676531586.5050912,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/master-2.el8k-ztp-1.hpecloud.org"} {"level":"info","ts":1676531586.5105371,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/worker-0.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged","requeue":true,"after":600} {"level":"info","ts":1676531586.51569,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/master-0.el8k-ztp-1.hpecloud.org"} {"level":"info","ts":1676531586.5191178,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/master-1.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged","requeue":true,"after":600} {"level":"info","ts":1676531586.525755,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/master-2.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged","requeue":true,"after":600} {"level":"info","ts":1676531586.5356712,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/master-0.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged","requeue":true,"after":600} {"level":"info","ts":1676532186.5117555,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/worker-0.el8k-ztp-1.hpecloud.org"} {"level":"info","ts":1676532186.5195107,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/master-1.el8k-ztp-1.hpecloud.org"} {"level":"info","ts":1676532186.526355,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/master-2.el8k-ztp-1.hpecloud.org"} {"level":"info","ts":1676532186.5317476,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/worker-0.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged","requeue":true,"after":600} {"level":"info","ts":1676532186.5361836,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/master-0.el8k-ztp-1.hpecloud.org"} {"level":"info","ts":1676532186.5404322,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/master-1.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged","requeue":true,"after":600} {"level":"info","ts":1676532186.5482726,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/master-2.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged","requeue":true,"after":600} {"level":"info","ts":1676532186.555394,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/master-0.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged","requeue":true,"after":600} {"level":"info","ts":1676532532.3448665,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/worker-1.el8k-ztp-1.hpecloud.org"} {"level":"info","ts":1676532532.344922,"logger":"controllers.BareMetalHost","msg":"hardwareData is ready to be deleted","baremetalhost":"openshift-machine-api/worker-1.el8k-ztp-1.hpecloud.org"} {"level":"info","ts":1676532532.3656478,"logger":"controllers.BareMetalHost","msg":"Initiating 
host deletion","baremetalhost":"openshift-machine-api/worker-1.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged"} {"level":"error","ts":1676532532.3656952,"msg":"Reconciler error","controller":"baremetalhost","controllerGroup":"metal3.io","controllerKind":"BareMetalHost","bareMetalHost":{"name":"worker-1.el8k-ztp-1.hpecloud.org","namespace":"openshift-machine-api"}, "namespace":"openshift-machine-api","name":"worker-1.el8k-ztp-1.hpecloud.org","reconcileID":"525a5b7d-077d-4d1e-a618-33d6041feb33","error":"action \"unmanaged\" failed: failed to determine current provisioner capacity: failed to parse BMC address informa tion: missing BMC address","errorVerbose":"missing BMC address\ngithub.com/metal3-io/baremetal-operator/pkg/hardwareutils/bmc.NewAccessDetails\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/github.com/metal3-io/baremetal-operator/pkg/hardwareu tils/bmc/access.go:145\ngithub.com/metal3-io/baremetal-operator/pkg/provisioner/ironic.(*ironicProvisioner).bmcAccess\n\t/go/src/github.com/metal3-io/baremetal-operator/pkg/provisioner/ironic/ironic.go:112\ngithub.com/metal3-io/baremetal-operator/pkg/pro visioner/ironic.(*ironicProvisioner).HasCapacity\n\t/go/src/github.com/metal3-io/baremetal-operator/pkg/provisioner/ironic/ironic.go:1922\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ensureCapacity\n\t/go/src/githu b.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:83\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).updateHostStateFrom\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/meta l3.io/host_state_machine.go:106\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState.func1\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:175\ngithub.com/metal 3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:186\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareM etalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:226\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/src/github.com/metal3-io/baremet al-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/contr oller-runtime/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/contro ller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234\nruntime.goexit\ n\t/usr/lib/golang/src/runtime/asm_amd64.s:1594\nfailed to parse BMC address information\ngithub.com/metal3-io/baremetal-operator/pkg/provisioner/ironic.(*ironicProvisioner).bmcAccess\n\t/go/src/github.com/metal3-io/baremetal-operator/pkg/provisioner/iro 
nic/ironic.go:114\ngithub.com/metal3-io/baremetal-operator/pkg/provisioner/ironic.(*ironicProvisioner).HasCapacity\n\t/go/src/github.com/metal3-io/baremetal-operator/pkg/provisioner/ironic/ironic.go:1922\ngithub.com/metal3-io/baremetal-operator/controlle rs/metal3%2eio.(*hostStateMachine).ensureCapacity\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:83\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).updateHostStateFrom\n \t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:106\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState.func1\n\t/go/src/github.com/metal3-io/baremetal-operator /controllers/metal3.io/host_state_machine.go:175\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:186\ngithu b.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:226\nsigs.k8s.io/controller-runtime/pkg/internal/controll er.(*Controller).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/sr c/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal- operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller- runtime/pkg/internal/controller/controller.go:234\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1594\nfailed to determine current provisioner capacity\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ensur eCapacity\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:85\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).updateHostStateFrom\n\t/go/src/github.com/metal3-io/baremetal -operator/controllers/metal3.io/host_state_machine.go:106\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState.func1\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machin e.go:175\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:186\ngithub.com/metal3-io/baremetal-operator/contr ollers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:226\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/src/gi 
thub.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operato r/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-r untime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controll er.go:234\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1594\naction \"unmanaged\" failed\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operato r/controllers/metal3.io/baremetalhost_controller.go:230\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/contr oller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller -runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller. (*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1594","stacktrace":"sigs.k8s.io/cont roller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/contr oller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234"}
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Provide a BMH object with no BMC credentials. The BMH is set unmanaged.
Steps to Reproduce:
1. Delete the object 2. It gets stuck 3.
Actual results:
The deletion gets stuck
Expected results:
Metal3 detects that the BMH is unmanaged and does not try to do deprovisioning.
Additional info:
Description of problem:
APIServer service not selected correctly for PublicAndPrivate when external-dns isn't configured. Image: 4.14 Hypershift operator + OCP 4.14.0-0.nightly-2023-03-23-050449 jiezhao-mac:hypershift jiezhao$ oc get hostedcluster/jz-test -n clusters -ojsonpath='{.spec.platform.aws.endpointAccess}{"\n"}' PublicAndPrivate - lastTransitionTime: "2023-03-24T15:13:15Z" message: Cluster operators console, dns, image-registry, ingress, insights, kube-storage-version-migrator, monitoring, openshift-samples, service-ca are not available observedGeneration: 3 reason: ClusterOperatorsNotAvailable status: "False" type: ClusterVersionSucceeding services: - service: APIServer servicePublishingStrategy: type: LoadBalancer - service: OAuthServer servicePublishingStrategy: type: Route - service: Konnectivity servicePublishingStrategy: type: Route - service: Ignition servicePublishingStrategy: type: Route - service: OVNSbDb servicePublishingStrategy: type: Route jiezhao-mac:hypershift jiezhao$ oc get service -n clusters-jz-test | grep kube-apiserver kube-apiserver LoadBalancer 172.30.211.131 aa029c422933444139fb738257aedb86-9e9709e3fa1b594e.elb.us-east-2.amazonaws.com 6443:32562/TCP 34m kube-apiserver-private LoadBalancer 172.30.161.79 ab8434aa316e845c59690ca0035332f0-d818b9434f506178.elb.us-east-2.amazonaws.com 6443:32100/TCP 34m jiezhao-mac:hypershift jiezhao$ jiezhao-mac:hypershift jiezhao$ cat hostedcluster.kubeconfig | grep server server: https://ab8434aa316e845c59690ca0035332f0-d818b9434f506178.elb.us-east-2.amazonaws.com:6443 jiezhao-mac:hypershift jiezhao$ jiezhao-mac:hypershift jiezhao$ oc get node --kubeconfig=hostedcluster.kubeconfig E0324 11:17:44.003589 95300 memcache.go:238] couldn't get current server API group list: Get "https://ab8434aa316e845c59690ca0035332f0-d818b9434f506178.elb.us-east-2.amazonaws.com:6443/api?timeout=32s": dial tcp 10.0.129.24:6443: i/o timeout
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create a PublicAndPrivate cluster without external-dns 2. Access the guest cluster (it should fail) 3.
Actual results:
unable to access the guest cluster via 'oc get node --kubeconfig=<guest cluster kubeconfig>', some guest cluster co are not available
Expected results:
The cluster is up and running, the guest cluster can be accessed via 'oc get node --kubeconfig=<guest cluster kubeconfig>'
Additional info:
Dummy bug to track adding the test to openshift/origin.
Description of problem:
Reported upstream in https://github.com/kubernetes/cloud-provider-openstack/issues/2217 Not specifically reproduced in OpenShift, but I have no reason to think we would not be affected, and I know we have users with strict proxy requirements. The user's configuration requires all OpenStack API requests from the tenant network to go through a proxy. They have configured a proxy 'globally' in their cluster in a manner which also affects the CSI driver. Attempting to attach a volume to a pod fails. Inspecting the logs we see that cinder attempted to attach the volume to the proxy server, not the node hosting the pod. The reason for this is that the metadata request was also proxied, meaning the returned values relate to the proxy server, not the local server.
Version-Release number of selected component (if applicable):
4.13, but likely all versions since we enabled CSI
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
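The report above points at the cluster-wide proxy also capturing the link-local metadata request, so the returned metadata describes the proxy host rather than the local node. The following is a minimal, hedged sketch (not the cloud-provider-openstack implementation) of how an HTTP client can keep a global proxy for OpenStack API calls while exempting the metadata endpoint; the proxy URL is hypothetical.

package main

import (
	"fmt"
	"net/http"
	"os"
)

func main() {
	// Keep a cluster-wide proxy for API traffic (hypothetical proxy address)...
	os.Setenv("HTTPS_PROXY", "http://proxy.example.com:3128")
	// ...but exempt the link-local metadata endpoint so its answers describe the
	// local server, not the proxy server.
	os.Setenv("NO_PROXY", "169.254.169.254")

	client := &http.Client{
		Transport: &http.Transport{Proxy: http.ProxyFromEnvironment},
	}

	resp, err := client.Get("http://169.254.169.254/openstack/latest/meta_data.json")
	if err != nil {
		fmt.Println("metadata request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("metadata status:", resp.Status)
}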
Description of problem:
Ever since the introduction of the latest invariants feature in origin, MicroShift is unable to run the conformance tests. Failing invariants include load balancer, image registry and kube-apiserver (https://github.com/openshift/origin/blob/master/pkg/defaultinvariants/types.go#L48-L52) and they are tested for disruptions. These tests don't apply in MicroShift because some of those components don't exist, and none of them are HA. Requiring the invariants without checking the platform breaks conformance testing in MicroShift.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Run `openshift-tests run openshift/conformance --provider none` with MicroShift kubeconfig.
Steps to Reproduce:
1. 2. 3.
Actual results:
KUBECONFIG=~/.kube/config ./openshift-tests run openshift/conformance -v 2 --provider none Aug 3 11:37:39.859: INFO: MicroShift cluster with version: 4.14.0_0.nightly_2023_06_30_131338_20230703175041_1b2a630fc I0803 11:37:39.859929 9250 test_setup.go:94] Extended test version v4.1.0-6883-g6ee9dc5 openshift-tests version: v4.1.0-6883-g6ee9dc5 Aug 3 11:37:39.898: INFO: Enabling in-tree volume drivers Attempting to pull tests from external binary... Falling back to built-in suite, failed reading external test suites: unable to extract k8s-tests binary: failed reading ClusterVersion/version: the server could not find the requested resource (get clusterversions.config.openshift.io version) W0803 11:37:40.849399 9250 warnings.go:70] unknown field "spec.tls.externalCertificate" Suite run returned error: [namespaces "openshift-image-registry" not found, the server could not find the requested resource (get infrastructures.config.openshift.io cluster)] No manifest filename passed error running options: [namespaces "openshift-image-registry" not found, the server could not find the requested resource (get infrastructures.config.openshift.io cluster)]error: [namespaces "openshift-image-registry" not found, the server could not find the requested resource (get infrastructures.config.openshift.io cluster)]
Expected results:
Tests running to completion.
Additional info:
A nice addition would be additional presubmits in origin that run MicroShift conformance, to catch these issues earlier.
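A hedged sketch (not the openshift/origin code) of the kind of platform guard the report asks for: before wiring up the load balancer, image registry, and kube-apiserver disruption invariants, check via discovery whether the config.openshift.io resources they depend on are served, and skip them on topologies such as MicroShift where they are not. The function names and kubeconfig path below are illustrative.

package main

import (
	"fmt"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/clientcmd"
)

func clusterHasOpenShiftConfigAPI(kubeconfig string) (bool, error) {
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		return false, err
	}
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		return false, err
	}
	// On MicroShift the OpenShift config API group is not served, so this lookup fails.
	// A real implementation would distinguish "not found" from transport errors.
	if _, err := dc.ServerResourcesForGroupVersion("config.openshift.io/v1"); err != nil {
		return false, nil
	}
	return true, nil
}

func main() {
	ok, err := clusterHasOpenShiftConfigAPI("/home/user/.kube/config") // hypothetical path
	if err != nil {
		panic(err)
	}
	if !ok {
		fmt.Println("skipping load balancer / image registry / kube-apiserver invariants")
		return
	}
	fmt.Println("registering full invariant set")
}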
Adding Dependabot to manage the Go module dependencies of the HyperShift repository.
Description of the problem:
Day-2 host stuck in insufficient
How reproducible:
100%
Steps to reproduce:
1. See CI job
Actual results:
Day-2 host stuck in insufficient
Expected results:
Day-2 host becomes known
We should check whether CBT (Changed Block Tracking) is enabled on the cluster's nodes on the vSphere platform.
1. Perform a full sweep and log each node which has CBT enabled.
2. Create an alert if some VMs have CBT enabled and others don't.
3. The alert should not be emitted if all VMs in the cluster uniformly have CBT enabled.
This will avoid issues like - https://issues.redhat.com/browse/OCPBUGS-12249?filter=12399251
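A minimal sketch of the sweep using govmomi, assuming vCenter credentials are available; the vCenter URL is hypothetical and the alert here is just a printed message, since the real alerting wiring (operator, Prometheus rule, etc.) is not decided above. The CBT flag is read from each VM's config.changeTrackingEnabled property.

package main

import (
	"context"
	"fmt"
	"net/url"

	"github.com/vmware/govmomi"
	"github.com/vmware/govmomi/view"
	"github.com/vmware/govmomi/vim25/mo"
)

func main() {
	ctx := context.Background()

	// Hypothetical vCenter endpoint and credentials.
	u, err := url.Parse("https://user:password@vcenter.example.com/sdk")
	if err != nil {
		panic(err)
	}
	c, err := govmomi.NewClient(ctx, u, true) // insecure, for the sketch only
	if err != nil {
		panic(err)
	}

	m := view.NewManager(c.Client)
	v, err := m.CreateContainerView(ctx, c.Client.ServiceContent.RootFolder, []string{"VirtualMachine"}, true)
	if err != nil {
		panic(err)
	}
	defer v.Destroy(ctx)

	// Full sweep: retrieve the name and CBT flag for every VM.
	var vms []mo.VirtualMachine
	if err := v.Retrieve(ctx, []string{"VirtualMachine"}, []string{"name", "config.changeTrackingEnabled"}, &vms); err != nil {
		panic(err)
	}

	enabled, disabled := 0, 0
	for _, vm := range vms {
		if vm.Config != nil && vm.Config.ChangeTrackingEnabled != nil && *vm.Config.ChangeTrackingEnabled {
			enabled++
			fmt.Println("CBT enabled on:", vm.Name)
		} else {
			disabled++
		}
	}
	// Alert only on a mixed state, per requirements 2 and 3 above.
	if enabled > 0 && disabled > 0 {
		fmt.Printf("ALERT: mixed CBT configuration (%d enabled, %d disabled)\n", enabled, disabled)
	}
}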
The dependencies for the Ironic containers are quite old; we need to upgrade them to the latest available versions to keep up with upstream requirements.
Description of problem:
Please check https://issues.redhat.com/browse/OCPBUGS-18702?focusedId=23021716&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-23021716 for more details, and https://drive.google.com/drive/folders/14aSJs-lO6HC-2xYFlOTJtCZIQg3ekE85?usp=sharing (please check the recording "sc_form_typeerror.mp4").
Issues: 1. The TypeError mentioned above. 2. Default params added by an extension are not getting added to the created StorageClass. 3. Validation for parameters added by an extension is not working correctly either. 4. The Provisioner child details get stuck once the user selects 'openshift-storage.cephfs.csi.ceph.com'.
Version-Release number of selected component (if applicable):
4.14 (OCP)
How reproducible:
Steps to Reproduce:
1. Install the ODF operator. 2. Create a StorageSystem (once the dynamic plugin is loaded). 3. Wait a while for the ODF-related StorageClasses to get created. 4. Once they are created, go to the "Create StorageSystem" form. 5. Switch to the provisioners (rbd.csi.ceph) added by the ODF dynamic plugin.
Actual results:
Page breaks with an error.
Expected results:
The page should not break, and the functionality should behave as it did before the refactoring introduced by PR: https://github.com/openshift/console/pull/13036
Additional info:
Stack trace: Caught error in a child component: TypeError: Cannot read properties of undefined (reading 'parameters') at allRequiredFieldsFilled (storage-class-form.tsx:204:1) at validateForm (storage-class-form.tsx:235:1) at storage-class-form.tsx:262:1 at invokePassiveEffectCreate (react-dom.development.js:23487:1) at HTMLUnknownElement.callCallback (react-dom.development.js:3945:1) at Object.invokeGuardedCallbackDev (react-dom.development.js:3994:1) at invokeGuardedCallback (react-dom.development.js:4056:1) at flushPassiveEffectsImpl (react-dom.development.js:23574:1) at unstable_runWithPriority (scheduler.development.js:646:1) at runWithPriority$1 (react-dom.development.js:11276:1) {componentStack: '\n at StorageClassFormInner (http://localhost:90...c03030668ef271da51f.js:491534:20)\n at Suspense'}
Description of problem:
Incorrect AWS ARN [1] is used for GovCloud and AWS China regions, which will cause the command `ccoctl aws create-all` to fail: Failed to create Identity provider: failed to apply public access policy to the bucket ci-op-bb5dgq54-77753-oidc: MalformedPolicy: Policy has invalid resource status code: 400, request id: VNBZ3NYDH6YXWFZ3, host id: pHF8v7C3vr9YJdD9HWamFmRbMaOPRbHSNIDaXUuUyrgy0gKCO9DDFU/Xy8ZPmY2LCjfLQnUDmtQ= Correct AWS ARN prefix: GovCloud (us-gov-east-1 and us-gov-west-1): arn:aws-us-gov AWS China (cn-north-1 and cn-northwest-1): arn:aws-cn [1] https://github.com/openshift/cloud-credential-operator/pull/526/files#diff-1909afc64595b92551779d9be99de733f8b694cfb6e599e49454b380afc58876R211
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2023-05-11-024616
How reproducible:
Always
Steps to Reproduce:
1. Run command: `aws create-all --name="${infra_name}" --region="${REGION}" --credentials-requests-dir="/tmp/credrequests" --output-dir="/tmp"` on GovCloud regions 2. 3.
Actual results:
Failed to create Identity provider
Expected results:
Create resources successfully.
Additional info:
Related PRs: 4.10: https://github.com/openshift/cloud-credential-operator/pull/531 4.11: https://github.com/openshift/cloud-credential-operator/pull/530 4.12: https://github.com/openshift/cloud-credential-operator/pull/529 4.13: https://github.com/openshift/cloud-credential-operator/pull/528 4.14: https://github.com/openshift/cloud-credential-operator/pull/526
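The root cause is a hard-coded "arn:aws" partition prefix. The following is a hedged, illustrative sketch of partition-aware ARN construction; the real fix lives in the PRs listed above, and the helper names below are not ccoctl's.

package main

import (
	"fmt"
	"strings"
)

// arnPartition maps a region to the ARN partition prefix it belongs to.
func arnPartition(region string) string {
	switch {
	case strings.HasPrefix(region, "us-gov-"):
		return "aws-us-gov"
	case strings.HasPrefix(region, "cn-"):
		return "aws-cn"
	default:
		return "aws"
	}
}

// bucketARN builds an S3 bucket ARN with the correct partition for the region.
func bucketARN(region, bucket string) string {
	return fmt.Sprintf("arn:%s:s3:::%s", arnPartition(region), bucket)
}

func main() {
	fmt.Println(bucketARN("us-gov-west-1", "ci-op-bb5dgq54-77753-oidc")) // arn:aws-us-gov:s3:::...
	fmt.Println(bucketARN("cn-north-1", "example-oidc"))                 // arn:aws-cn:s3:::example-oidc
	fmt.Println(bucketARN("us-east-1", "example-oidc"))                  // arn:aws:s3:::example-oidc
}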
DoD:
Go through all conditions
https://github.com/openshift/hypershift/blob/main/api/v1beta1/nodepool_conditions.go
https://github.com/openshift/hypershift/blob/main/api/v1beta1/hostedcluster_conditions.go
Add an e2e test that validates that all of them match the expected state on cluster creation.
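A minimal sketch of the kind of assertion such an e2e test could make, assuming the conditions are exposed as metav1.Condition values (the helper and the expected-status map below are illustrative, not HyperShift's actual test code).

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// checkExpectedConditions compares observed conditions against the expected status per type.
func checkExpectedConditions(conds []metav1.Condition, expected map[string]metav1.ConditionStatus) []string {
	var failures []string
	for condType, want := range expected {
		got := meta.FindStatusCondition(conds, condType)
		if got == nil {
			failures = append(failures, fmt.Sprintf("condition %q not found", condType))
			continue
		}
		if got.Status != want {
			failures = append(failures, fmt.Sprintf("condition %q is %s, want %s (reason=%s)", condType, got.Status, want, got.Reason))
		}
	}
	return failures
}

func main() {
	observed := []metav1.Condition{
		{Type: "Available", Status: metav1.ConditionTrue, Reason: "AsExpected"},
		{Type: "Degraded", Status: metav1.ConditionTrue, Reason: "SomethingBroke"},
	}
	expected := map[string]metav1.ConditionStatus{
		"Available": metav1.ConditionTrue,
		"Degraded":  metav1.ConditionFalse,
	}
	for _, f := range checkExpectedConditions(observed, expected) {
		fmt.Println("FAIL:", f)
	}
}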
Description of problem:
The size of PVC/datadir-ibm-spectrum-scale-pmcollector-0 is displayed incorrectly in the OpenShift web console. The PVC size is shown as a negative value, -17.6GiB.
Below are the SC, PV, and PVC details.
$ oc get storageclass NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE ibm-spectrum-fusion-mgmt-sc spectrumscale.csi.ibm.com Delete Immediate true 2d ibm-spectrum-fusion (default) spectrumscale.csi.ibm.com Delete Immediate true 2d ibm-spectrum-scale-internal kubernetes.io/no-provisioner Delete WaitForFirstConsumer false 2d ibm-spectrum-scale-sample spectrumscale.csi.ibm.com Delete Immediate false 2d $ oc get pv control-1.ncw-az1-005.caas.bbtnet.com-pmcollector 25Gi RWO Retain Bound ibm-spectrum-scale/datadir-ibm-spectrum-scale-pmcollector-0 ibm-spectrum-scale-internal $ oc get pvc -A NAMESPACE NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE ibm-spectrum-scale datadir-ibm-spectrum-scale-pmcollector-0 Bound control-1.ncw-az1-005.caas.bbtnet.com-pmcollector 25Gi RWO ibm-spectrum-scale-internal 3d $ oc get pvc datadir-ibm-spectrum-scale-pmcollector-0 -n ibm-spectrum-scale kind: PersistentVolumeClaim apiVersion: v1 metadata: annotations: pv.kubernetes.io/bind-completed: 'yes' pv.kubernetes.io/bound-by-controller: 'yes' resourceVersion: '5360546' name: datadir-ibm-spectrum-scale-pmcollector-0 uid: 7a7d0609-0608-409f-91e1-209bb0b3c8d1 creationTimestamp: '2023-05-01T14:13:40Z' managedFields: - manager: kube-controller-manager operation: Update apiVersion: v1 time: '2023-05-01T14:13:40Z' fieldsType: FieldsV1 fieldsV1: 'f:metadata': 'f:annotations': .: {} 'f:pv.kubernetes.io/bind-completed': {} 'f:pv.kubernetes.io/bound-by-controller': {} 'f:labels': .: {} 'f:app.kubernetes.io/instance': {} 'f:app.kubernetes.io/name': {} 'f:spec': 'f:accessModes': {} 'f:resources': 'f:requests': .: {} 'f:storage': {} 'f:storageClassName': {} 'f:volumeMode': {} 'f:volumeName': {} - manager: kube-controller-manager operation: Update apiVersion: v1 time: '2023-05-01T14:13:40Z' fieldsType: FieldsV1 fieldsV1: 'f:status': 'f:accessModes': {} 'f:capacity': .: {} 'f:storage': {} 'f:phase': {} subresource: status namespace: ibm-spectrum-scale finalizers: - kubernetes.io/pvc-protection labels: app.kubernetes.io/instance: ibm-spectrum-scale app.kubernetes.io/name: pmcollector spec: accessModes: - ReadWriteOnce resources: requests: storage: 25Gi volumeName: control-1.ncw-az1-005.caas.bbtnet.com-pmcollector storageClassName: ibm-spectrum-scale-internal volumeMode: Filesystem status: phase: Bound accessModes: - ReadWriteOnce capacity: storage: 25Gi
==> However, when executing from pod ibm-spectrum-scale-pmcollector-0, the mountPath `/opt/IBM/zimon/data` where PVC/datadir-ibm-spectrum-scale-pmcollector-0 is mounted still shows that only 12K is used so far and 11G is the currently available space. [C49904@openshift-eng-bastion-vm ~]$ oc rsh ibm-spectrum-scale-pmcollector-0 Defaulted container "pmcollector" out of: pmcollector, sysmon sh-4.4$ df -Th | grep -iE 'size|zimon' Filesystem Type Size Used Avail Use% Mounted on tmpfs tmpfs 11G 12K 11G 1% /opt/IBM/zimon/config
Version-Release number of selected component (if applicable):
OCP 4.10.21 isf-operator.v2.4.0
How reproducible:
Steps to Reproduce:
1. Install IBM Spectrum Scale 2. 3.
Actual results:
The PVC size displayed in the OpenShift web console shows a negative size value.
Expected results:
The PVC size displayed in the OpenShift web console should not show a negative size value.
Additional info:
Description of problem:
Application groups can not be deleted in topology
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create an application with an application group 2. Go to topology 3. Delete the application group containing the application
Actual results:
Application group persists in topology
Expected results:
The application group should be deleted
Additional info:
Pipeline API is giving 404 even if the pipelines operator is not installed
On https://amd64.ocp.releases.ci.openshift.org/releasestream/4.14.0-0.ci/release/4.14.0-0.ci-2023-06-30-020413, hypershift started permafailing
Description of problem:
CCO's ServiceAccount cannot list ConfigMaps at the cluster scope.
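Whatever the eventual fix (granting the permission or scoping the operator's informer cache to its own namespace), the forbidden-list errors in the logs below map to an RBAC rule like the following hedged sketch; the ClusterRole name is hypothetical and this is not the shipped manifest.

package main

import (
	"fmt"

	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Illustrative rule allowing the cloud-credential-operator ServiceAccount to
	// list/watch ConfigMaps at cluster scope.
	role := rbacv1.ClusterRole{
		ObjectMeta: metav1.ObjectMeta{Name: "cloud-credential-operator-configmaps"}, // hypothetical name
		Rules: []rbacv1.PolicyRule{
			{
				APIGroups: []string{""},
				Resources: []string{"configmaps"},
				Verbs:     []string{"get", "list", "watch"},
			},
		},
	}
	fmt.Printf("%+v\n", role.Rules)
}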
Steps to Reproduce:
1. Install an OCP cluster (4.14.0-0.nightly-2023-07-17-215017, CCO commit id = 0c80cc35f6ee4b45016050b3e5a8710a8ed4dd81) with default configuration (CCO in default mode) 2. Create a dummy CredentialsRequest as follows: apiVersion: cloudcredential.openshift.io/v1 kind: CredentialsRequest metadata: name: test-cr namespace: openshift-cloud-credential-operator spec: providerSpec: apiVersion: cloudcredential.openshift.io/v1 kind: AWSProviderSpec statementEntries: - action: - ec2:CreateTags effect: Allow resource: '*' stsIAMRoleARN: whatever secretRef: name: test-secret namespace: default serviceAccountNames: - default 3. Check CCO Pod logs: time="2023-07-18T10:02:45Z" level=info msg="reconciling clusteroperator status" time="2023-07-18T10:02:45Z" level=info msg="syncing credentials request" controller=credreq cr=openshift-cloud-credential-operator/test-cr time="2023-07-18T10:02:45Z" level=info msg="adding finalizer: cloudcredential.openshift.io/deprovision" controller=credreq cr=openshift-cloud-credential-operator/test-cr secret=default/test-secret time="2023-07-18T10:02:45Z" level=info msg="syncing credentials request" controller=credreq cr=openshift-cloud-credential-operator/test-cr time="2023-07-18T10:02:45Z" level=info msg="stsFeatureGateEnabled: false" actuator=aws cr=openshift-cloud-credential-operator/test-cr time="2023-07-18T10:02:45Z" level=info msg="stsDetected: false" actuator=aws cr=openshift-cloud-credential-operator/test-cr time="2023-07-18T10:02:45Z" level=info msg="clusteroperator status updated" controller=status time="2023-07-18T10:02:45Z" level=info msg="reconciling clusteroperator status" time="2023-07-18T10:02:45Z" level=info msg="reconciling clusteroperator status" time="2023-07-18T10:02:45Z" level=info msg="reconciling clusteroperator status" time="2023-07-18T10:02:45Z" level=info msg="reconciling clusteroperator status" W0718 10:02:45.352434 1 reflector.go:533] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-cloud-credential-operator:cloud-credential-operator" cannot list resource "configmaps" in API group "" at the cluster scope E0718 10:02:45.352460 1 reflector.go:148] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-cloud-credential-operator:cloud-credential-operator" cannot list resource "configmaps" in API group "" at the cluster scope W0718 10:02:46.512738 1 reflector.go:533] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-cloud-credential-operator:cloud-credential-operator" cannot list resource "configmaps" in API group "" at the cluster scope E0718 10:02:46.512763 1 reflector.go:148] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-cloud-credential-operator:cloud-credential-operator" cannot list resource "configmaps" in API group "" at the cluster scope W0718 10:02:48.859931 1 reflector.go:533] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-cloud-credential-operator:cloud-credential-operator" cannot list resource "configmaps" in API group "" at the 
cluster scope E0718 10:02:48.859957 1 reflector.go:148] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-cloud-credential-operator:cloud-credential-operator" cannot list resource "configmaps" in API group "" at the cluster scope W0718 10:02:53.514713 1 reflector.go:533] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-cloud-credential-operator:cloud-credential-operator" cannot list resource "configmaps" in API group "" at the cluster scope E0718 10:02:53.514798 1 reflector.go:148] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-cloud-credential-operator:cloud-credential-operator" cannot list resource "configmaps" in API group "" at the cluster scope W0718 10:03:03.042040 1 reflector.go:533] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-cloud-credential-operator:cloud-credential-operator" cannot list resource "configmaps" in API group "" at the cluster scope E0718 10:03:03.042068 1 reflector.go:148] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-cloud-credential-operator:cloud-credential-operator" cannot list resource "configmaps" in API group "" at the cluster scope W0718 10:03:25.023729 1 reflector.go:533] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-cloud-credential-operator:cloud-credential-operator" cannot list resource "configmaps" in API group "" at the cluster scope E0718 10:03:25.023758 1 reflector.go:148] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-cloud-credential-operator:cloud-credential-operator" cannot list resource "configmaps" in API group "" at the cluster scope time="2023-07-18T10:04:10Z" level=info msg="calculating metrics for all CredentialsRequests" controller=metrics time="2023-07-18T10:04:10Z" level=info msg="reconcile complete" controller=metrics elapsed=4.470475ms W0718 10:04:11.033286 1 reflector.go:533] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-cloud-credential-operator:cloud-credential-operator" cannot list resource "configmaps" in API group "" at the cluster scope E0718 10:04:11.033311 1 reflector.go:148] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-cloud-credential-operator:cloud-credential-operator" cannot list resource "configmaps" in API group "" at the cluster scope W0718 10:04:42.316200 1 reflector.go:533] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233: failed to list *v1.ConfigMap: configmaps is forbidden: User 
"system:serviceaccount:openshift-cloud-credential-operator:cloud-credential-operator" cannot list resource "configmaps" in API group "" at the cluster scope E0718 10:04:42.316223 1 reflector.go:148] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-cloud-credential-operator:cloud-credential-operator" cannot list resource "configmaps" in API group "" at the cluster scope W0718 10:05:40.852983 1 reflector.go:533] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-cloud-credential-operator:cloud-credential-operator" cannot list resource "configmaps" in API group "" at the cluster scope E0718 10:05:40.853008 1 reflector.go:148] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-cloud-credential-operator:cloud-credential-operator" cannot list resource "configmaps" in API group "" at the cluster scope time="2023-07-18T10:06:10Z" level=info msg="reconciling clusteroperator status" time="2023-07-18T10:06:10Z" level=info msg="reconciling clusteroperator status" time="2023-07-18T10:06:10Z" level=info msg="calculating metrics for all CredentialsRequests" controller=metrics time="2023-07-18T10:06:10Z" level=info msg="reconcile complete" controller=metrics elapsed=3.531182ms time="2023-07-18T10:06:10Z" level=info msg="reconciling clusteroperator status" time="2023-07-18T10:06:10Z" level=info msg="reconciling clusteroperator status" time="2023-07-18T10:06:10Z" level=info msg="reconciling clusteroperator status" time="2023-07-18T10:06:10Z" level=info msg="reconciling clusteroperator status" time="2023-07-18T10:06:10Z" level=info msg="reconciling clusteroperator status" ...
Description of problem:
Starting with OpenShift 4.13 we show a copy button next to the OpenShift Route URL in the topology, the route list, and the detail page. But the Knative Route URL doesn't show this copy button, as Vikram mentioned in this code review https://github.com/openshift/console/pull/12853#issuecomment-1594829827
Version-Release number of selected component (if applicable):
4.13+
How reproducible:
Always
Steps to Reproduce:
Actual results:
Copy button is not shown
Expected results:
Copy button should be displayed
Additional info:
Description of problem:
cluster-ingress-operator E2E has an error message: "[controller-runtime] log.SetLogger(...) was never called, logs will not be displayed". It looks like newClient is called from two places, TestMain and TestIngressStatus.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Run E2E tests that call newClient, such as TestIngressStatus 2. Examine logs
Actual results:
[controller-runtime] log.SetLogger(...) was never called, logs will not be displayed: goroutine 9120 [running]: runtime/debug.Stack() /usr/lib/golang/src/runtime/debug/stack.go:24 +0x65 sigs.k8s.io/controller-runtime/pkg/log.eventuallyFulfillRoot() /go/src/github.com/openshift/cluster-ingress-operator/vendor/sigs.k8s.io/controller-runtime/pkg/log/log.go:59 +0xbd sigs.k8s.io/controller-runtime/pkg/log.(*delegatingLogSink).WithName(0xc000113000, {0x1dd106b, 0x14}) /go/src/github.com/openshift/cluster-ingress-operator/vendor/sigs.k8s.io/controller-runtime/pkg/log/deleg.go:147 +0x4c github.com/go-logr/logr.Logger.WithName({{0x21435e0, 0xc000113000}, 0x0}, {0x1dd106b?, 0xe?}) /go/src/github.com/openshift/cluster-ingress-operator/vendor/github.com/go-logr/logr/logr.go:336 +0x46 sigs.k8s.io/controller-runtime/pkg/client.newClient(0xc00086afc0, {0x0, 0xc0001a0fc0, {0x2144930, 0xc00033ac00}, 0x0, {0x0, 0x0}, 0x0}) /go/src/github.com/openshift/cluster-ingress-operator/vendor/sigs.k8s.io/controller-runtime/pkg/client/client.go:115 +0xb4 sigs.k8s.io/controller-runtime/pkg/client.New(0xc00086afc0?, {0x0, 0xc0001a0fc0, {0x2144930, 0xc00033ac00}, 0x0, {0x0, 0x0}, 0x0}) /go/src/github.com/openshift/cluster-ingress-operator/vendor/sigs.k8s.io/controller-runtime/pkg/client/client.go:101 +0x85 github.com/openshift/cluster-ingress-operator/pkg/operator/client.NewClient(0x0?) /go/src/github.com/openshift/cluster-ingress-operator/pkg/operator/client/client.go:83 +0x145 github.com/openshift/cluster-ingress-operator/test/e2e.TestIngressStatus(0xc000503520) /go/src/github.com/openshift/cluster-ingress-operator/test/e2e/dns_ingressdegrade_test.go:33 +0x95 testing.tRunner(0xc000503520, 0x1f015a0) /usr/lib/golang/src/testing/testing.go:1576 +0x10b created by testing.(*T).Run /usr/lib/golang/src/testing/testing.go:1629 +0x3ea
Expected results:
No error message
Additional info:
This is due to the 1.27 rebase.
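The usual remedy, sketched here under the assumption that controller-runtime's zap helper is vendored (as the stack trace suggests), is to register a logger once, for example in TestMain, before any client or manager construction, so the delegating log sink never emits this warning. The package name is illustrative.

package e2e

import (
	"os"
	"testing"

	"sigs.k8s.io/controller-runtime/pkg/log"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

func TestMain(m *testing.M) {
	// Register a real logger before newClient / controller-runtime client construction.
	log.SetLogger(zap.New(zap.UseDevMode(true)))
	os.Exit(m.Run())
}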
Essentially unmerge Christian's previous merge in the MCO that disabled the extension container.
Description of problem:
According to the slack thread attached: Cluster uninstallation is stuck when load balancers are removed before ingress controllers. This can happen when the ingress controller removal fails and the control plane operator moves on to deleting load balancers without waiting.
Version-Release number of selected component (if applicable):
4.12.z 4.13.z
How reproducible:
Whenever the load balancer is deleted before the ingress controller
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Load balancer deletion waits for the ingress controller deletion
Additional info:
Description of problem:
Image registry pruner job fails when cluster was installed without DeploymentConfig capability. Cluster was installed only with the following capapbilities: {\"capabilities\":{\"baselineCapabilitySet\": \"None\", \"additionalEnabledCapabilities\": [ \"marketplace\", \"NodeTuning\" ] }}" image-pruner pods are failing with the following error: state: terminated: containerID: cri-o://69562d80cafb23a07b9f1d020e1943448916558986092d8540b9a0e1fc3731a1 exitCode: 1 finishedAt: "2023-08-21T00:07:37Z" message: | Error from server (NotFound): the server could not find the requested resource (get deploymentconfigs.apps.openshift.io) attempt #1 has failed (exit code 1), going to make another attempt... Error from server (NotFound): the server could not find the requested resource (get deploymentconfigs.apps.openshift.io) attempt #2 has failed (exit code 1), going to make another attempt... Error from server (NotFound): the server could not find the requested resource (get deploymentconfigs.apps.openshift.io) attempt #3 has failed (exit code 1), going to make another attempt... Error from server (NotFound): the server could not find the requested resource (get deploymentconfigs.apps.openshift.io) attempt #4 has failed (exit code 1), going to make another attempt... Error from server (NotFound): the server could not find the requested resource (get deploymentconfigs.apps.openshift.io) attempt #5 has failed (exit code 1), going to make another attempt... Error from server (NotFound): the server could not find the requested resource (get deploymentconfigs.apps.openshift.io) reason: Error startedAt: "2023-08-21T00:00:05Z"
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-08-16-114741
How reproducible:
100%
Steps to Reproduce:
1. Install an SNO cluster without the DeploymentConfig capability 2. Check the image pruner job status
Actual results:
Image pruner jobs do not complete because the deploymentconfigs.apps.openshift.io API is not available.
Expected results:
Image pruner jobs can run without the deploymentconfigs API
Additional info:
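A hedged sketch of a capability-aware guard (illustrative, not the actual pruner code): ask the discovery API whether deploymentconfigs.apps.openshift.io is served, and skip that part of the prune when the DeploymentConfig capability is disabled.

package main

import (
	"fmt"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
)

func deploymentConfigsAvailable(cfg *rest.Config) (bool, error) {
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		return false, err
	}
	resources, err := dc.ServerResourcesForGroupVersion("apps.openshift.io/v1")
	if err != nil {
		// Group/version not served at all, e.g. the capability is disabled.
		return false, nil
	}
	for _, r := range resources.APIResources {
		if r.Name == "deploymentconfigs" {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	ok, err := deploymentConfigsAvailable(cfg)
	if err != nil {
		panic(err)
	}
	if !ok {
		fmt.Println("DeploymentConfig API absent, skipping deploymentconfig image references")
		return
	}
	fmt.Println("pruning images referenced by deploymentconfigs")
}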
Description of problem:
OCP deployments are failing with machine-api-controller pod crashing.
Version-Release number of selected component (if applicable):
OCP 4.14.0-ec.3
How reproducible:
Always
Steps to Reproduce:
1. Deploy a Baremetal cluster 2. After bootstrap is completed, check the pods running in the openshift-machine-api namespace 3. Check machine-api-controllers-* pod status (it goes from Running to Crashing all the time) 4. Deployment eventually times out and stops with only the master nodes getting deployed.
Actual results:
machine-api-controllers-* pod remains in a crashing loop and OCP 4.14.0-ec.3 deployments fail.
Expected results:
machine-api-controllers-* pod remains running and OCP 4.14.0-ec.3 deployments are completed
Additional info:
Jobs with older nightly releases in 4.14 are passing, but since Saturday Jul 10th, our CI jobs are failing
$ oc version Client Version: 4.14.0-ec.3 Kustomize Version: v5.0.1 Kubernetes Version: v1.27.3+e8b13aa $ oc get nodes NAME STATUS ROLES AGE VERSION master-0 Ready control-plane,master 37m v1.27.3+e8b13aa master-1 Ready control-plane,master 37m v1.27.3+e8b13aa master-2 Ready control-plane,master 38m v1.27.3+e8b13aa $ oc -n openshift-machine-api get pods -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES cluster-autoscaler-operator-75b96869d8-gzthq 2/2 Running 0 48m 10.129.0.6 master-0 <none> <none> cluster-baremetal-operator-7c9cb8cd69-6bqcg 2/2 Running 0 48m 10.129.0.7 master-0 <none> <none> control-plane-machine-set-operator-6b65b5b865-w996m 1/1 Running 0 48m 10.129.0.22 master-0 <none> <none> machine-api-controllers-59694ff965-v4kxb 6/7 CrashLoopBackOff 7 (2m31s ago) 46m 10.130.0.12 master-2 <none> <none> machine-api-operator-58b54d7c86-cnx4w 2/2 Running 0 48m 10.129.0.8 master-0 <none> <none> metal3-6ffbb8dcd4-drlq5 6/6 Running 0 45m 192.168.62.22 master-1 <none> <none> metal3-baremetal-operator-bd95b6695-q6k7c 1/1 Running 0 45m 10.130.0.16 master-2 <none> <none> metal3-image-cache-4p7ln 1/1 Running 0 45m 192.168.62.22 master-1 <none> <none> metal3-image-cache-lfmb4 1/1 Running 0 45m 192.168.62.23 master-2 <none> <none> metal3-image-cache-txjg5 1/1 Running 0 45m 192.168.62.21 master-0 <none> <none> metal3-image-customization-65cf987f5c-wgqs7 1/1 Running 0 45m 10.128.0.17 master-1 <none> <none>
$ oc -n openshift-machine-api logs machine-api-controllers-59694ff965-v4kxb -c machine-controller | less ... E0710 15:55:08.230413 1 logr.go:270] controller-runtime/source "msg"="if kind is a CRD, it should be installed before calling Start" "error"="no matches for kind \"Metal3Remediation\" in version \"infrastructure.cluster.x-k8s.io/v1beta1\"" "kind"={"Group":"infrastructure.cluster.x-k8s.io","Kind":"Metal3Remediation"} E0710 15:55:14.019930 1 controller.go:210] "msg"="Could not wait for Cache to sync" "error"="failed to wait for metal3remediation caches to sync: timed out waiting for cache to be synced" "controller"="metal3remediation" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="Metal3Remediation" I0710 15:55:14.020025 1 logr.go:252] "msg"="Stopping and waiting for non leader election runnables" I0710 15:55:14.020054 1 logr.go:252] "msg"="Stopping and waiting for leader election runnables" I0710 15:55:14.020095 1 controller.go:247] "msg"="Shutdown signal received, waiting for all workers to finish" "controller"="machine-drain-controller" I0710 15:55:14.020147 1 controller.go:247] "msg"="Shutdown signal received, waiting for all workers to finish" "controller"="machineset-controller" I0710 15:55:14.020169 1 controller.go:247] "msg"="Shutdown signal received, waiting for all workers to finish" "controller"="machine-controller" I0710 15:55:14.020184 1 controller.go:249] "msg"="All workers finished" "controller"="machineset-controller" I0710 15:55:14.020181 1 controller.go:249] "msg"="All workers finished" "controller"="machine-drain-controller" I0710 15:55:14.020190 1 controller.go:249] "msg"="All workers finished" "controller"="machine-controller" I0710 15:55:14.020209 1 logr.go:252] "msg"="Stopping and waiting for caches" I0710 15:55:14.020323 1 logr.go:252] "msg"="Stopping and waiting for webhooks" I0710 15:55:14.020327 1 reflector.go:225] Stopping reflector *v1alpha1.BareMetalHost (10h53m58.149951981s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262 I0710 15:55:14.020393 1 reflector.go:225] Stopping reflector *v1beta1.Machine (9h40m22.116205595s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262 I0710 15:55:14.020399 1 logr.go:252] controller-runtime/webhook "msg"="shutting down webhook server" I0710 15:55:14.020437 1 reflector.go:225] Stopping reflector *v1.Node (10h3m14.461941979s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262 I0710 15:55:14.020466 1 logr.go:252] "msg"="Wait completed, proceeding to shutdown the manager" I0710 15:55:14.020485 1 reflector.go:225] Stopping reflector *v1beta1.MachineSet (10h7m28.391827596s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262 E0710 15:55:14.020500 1 main.go:218] baremetal-controller-manager/entrypoint "msg"="unable to run manager" "error"="failed to wait for metal3remediation caches to sync: timed out waiting for cache to be synced" E0710 15:55:14.020504 1 logr.go:270] "msg"="error received after stop sequence was engaged" "error"="leader election lost"
Our CI job logs can be seen here (RedHat SSO): https://www.distributed-ci.io/jobs/7da8ee48-8918-4a97-8e3c-f525d19583b8/files
Description of problem:
The AdditionalTrustBundle field in install-config.yaml can be used to add additional certs; however, these certs are only propagated to the final image when the ImageContentSources field is also set for mirroring. If mirroring is not set, the additional certs will be present on the bootstrap node but not in the final image. This can cause a problem when a user has set up a proxy and wants to add additional certs, as described here https://docs.openshift.com/container-platform/4.12/networking/configuring-a-custom-pki.html#installation-configure-proxy_configuring-a-custom-pki
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. In install-config.yaml set additionalTrustBundle and don't set imageContentSources. 2. Do an installation using the install-config.yaml. 3. After the final image is installed and rebooted view the certs in /etc/pki/ca-trust/source/anchors/openshift-config-user-ca-bundle.crt.
Actual results:
The certs defined in additionalTrustBundle are not in /etc/pki/ca-trust/source/anchors/openshift-config-user-ca-bundle.crt.
Expected results:
The certs defined in additionalTrustBundle will be in /etc/pki/ca-trust/source/anchors/openshift-config-user-ca-bundle.crt even when imageContentSources is not defined.
Additional info:
Adding two minor flags to improve our CI tests:
https://github.com/openshift/cluster-etcd-operator/pull/1057
Description of problem:
Pull-through only checks for ICSP, ignoring IDMS/ITMS.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create an IDMS/ITMS rule (TODO: add specifics) example IDMS/ITMS specifics: apiVersion: config.openshift.io/v1 kind: ImageDigestMirrorSet metadata: name: digest-mirror spec: imageDigestMirrors: - mirrors: - registry.access.redhat.com/ubi8/ubi-minimal source: quay.io/podman/hello mirrorSourcePolicy: NeverContactSource apiVersion: config.openshift.io/v1 kind: ImageTagMirrorSet metadata: name: tag-mirror spec: imageTagMirrors: - mirrors: - registry.access.redhat.com/ubi8/ubi-minimal source: quay.io/podman/hello mirrorSourcePolicy: NeverContactSource 2. Create an image stream with `referencePolicy: local`. Example: https://gist.github.com/flavianmissi/0518239edd6f51d54b5633212f2b2ac9 3. Pull the image from the image stream created above. Example `oc new-app test-1:latest`
Actual results:
Expected results:
Additional info:
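A hedged sketch of consulting both the legacy ImageContentSourcePolicy objects and the newer ImageDigestMirrorSet/ImageTagMirrorSet objects with a dynamic client, which is what a pull-through path would need to do; the group/version/resource names match the manifests above and the ICSP API, while the surrounding consumption logic is illustrative.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Legacy and current mirror configuration resources.
	gvrs := []schema.GroupVersionResource{
		{Group: "operator.openshift.io", Version: "v1alpha1", Resource: "imagecontentsourcepolicies"},
		{Group: "config.openshift.io", Version: "v1", Resource: "imagedigestmirrorsets"},
		{Group: "config.openshift.io", Version: "v1", Resource: "imagetagmirrorsets"},
	}
	for _, gvr := range gvrs {
		list, err := client.Resource(gvr).List(context.TODO(), metav1.ListOptions{})
		if err != nil {
			fmt.Printf("%s: %v\n", gvr.Resource, err)
			continue
		}
		fmt.Printf("%s: %d objects\n", gvr.Resource, len(list.Items))
	}
}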
Description of problem:
As a part of Chaos Monkey testing we tried to delete the machine-config-controller pod in SNO+1. The machine-config-controller pod restart results in a restart of the daemonset/sriov-network-config-daemon and linuxptp-daemon pods as well.
1m47s Normal Killing pod/machine-config-controller-7f46c5d49b-w4p9s Stopping container machine-config-controller 1m47s Normal Killing pod/machine-config-controller-7f46c5d49b-w4p9s Stopping container oauth-proxy
openshift-sriov-network-operator 23m Normal Killing pod/sriov-network-config-daemon-pv4tr Stopping container sriov-infiniband-cni openshift-sriov-network-operator 23m Normal SuccessfulDelete daemonset/sriov-network-config-daemon Deleted pod: sriov-network-config-daemon-pv4tr
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Steps to Reproduce:
Restart the machine-config-controller pod in openshift-machine-config-operator namespace. 1. oc get pod -n openshift-machine-config-operator 2. oc delete pod/machine-config-controller-xxx -n openshift-machine-config-operator
Actual results:
It restarts the daemonset/sriov-network-config-daemon and linuxptp-daemon pods
Expected results:
It should not restart these pods
Additional info:
logs : https://drive.google.com/drive/folders/1XxYen8tzENrcIJdde8sortpyY5ZFZCPW?usp=share_link
Description of problem:
CNCC failed to assign an egressIP to the NIC on an Azure Workload Identity cluster. Refer to https://issues.redhat.com/browse/CCO-294
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-08-11-055332
How reproducible:
Always
Steps to Reproduce:
1. Create an Azure Workload Identity cluster via "workflow-launch cucushift-installer-rehearse-azure-ipi-cco-manual-workload-identity-tp 4.14" from cluster-bot 2. Configure egressIP 3.
Actual results:
% oc get egressip NAME EGRESSIPS ASSIGNED NODE ASSIGNED EGRESSIPS egressip-3 10.0.128.100 % oc get cloudprivateipconfig -o yaml apiVersion: v1 items: - apiVersion: cloud.network.openshift.io/v1 kind: CloudPrivateIPConfig metadata: annotations: k8s.ovn.org/egressip-owner-ref: egressip-3 creationTimestamp: "2023-08-14T04:41:05Z" finalizers: - cloudprivateipconfig.cloud.network.openshift.io/finalizer generation: 1 name: 10.0.128.100 resourceVersion: "65159" uid: 2b7b1137-0e2e-46e8-9bca-1176330322a9 spec: node: ci-ln-b4tlp9t-1d09d-2chnb-worker-centralus1-jgqp2 status: conditions: - lastTransitionTime: "2023-08-14T04:41:17Z" message: 'Error processing cloud assignment request, err: network.InterfacesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Code="LinkedAuthorizationFailed" Message="The client ''d367c1b8-9f5d-4257-b5c8-363f61af32c2'' with object id ''d367c1b8-9f5d-4257-b5c8-363f61af32c2'' has permission to perform action ''Microsoft.Network/networkInterfaces/write'' on scope ''/subscriptions/d38f1e38-4bed-438e-b227-833f997adf6a/resourceGroups/ci-ln-b4tlp9t-1d09d/providers/Microsoft.Network/networkInterfaces/ci-ln-b4tlp9t-1d09d-2chnb-worker-centralus1-jgqp2-nic''; however, it does not have permission to perform action ''Microsoft.Network/virtualNetworks/subnets/join/action'' on the linked scope(s) ''/subscriptions/d38f1e38-4bed-438e-b227-833f997adf6a/resourceGroups/ci-ln-b4tlp9t-1d09d/providers/Microsoft.Network/virtualNetworks/ci-ln-b4tlp9t-1d09d-2chnb-vnet/subnets/ci-ln-b4tlp9t-1d09d-2chnb-worker-subnet'' or the linked scope(s) are invalid."' observedGeneration: 1 reason: CloudResponseError status: "False" type: Assigned node: ci-ln-b4tlp9t-1d09d-2chnb-worker-centralus1-jgqp2 kind: List metadata: resourceVersion: ""
Expected results:
EgressIP can be assigned to egress node
Additional info:
Description of problem:
Upgraded from 4.11.17 -> 4.12.0 rc3 and found (after successful upgrade) this repeating in Machine Config Operator logs: 2022-12-13T23:11:51.511167249Z W1213 23:11:51.511120 1 warnings.go:70] unknown field "spec.dns.metadata.creationTimestamp" 2022-12-13T23:11:51.511167249Z W1213 23:11:51.511140 1 warnings.go:70] unknown field "spec.dns.metadata.generation" 2022-12-13T23:11:51.511167249Z W1213 23:11:51.511143 1 warnings.go:70] unknown field "spec.dns.metadata.managedFields" 2022-12-13T23:11:51.511167249Z W1213 23:11:51.511146 1 warnings.go:70] unknown field "spec.dns.metadata.name" 2022-12-13T23:11:51.511167249Z W1213 23:11:51.511148 1 warnings.go:70] unknown field "spec.dns.metadata.resourceVersion" 2022-12-13T23:11:51.511167249Z W1213 23:11:51.511151 1 warnings.go:70] unknown field "spec.dns.metadata.uid" 2022-12-13T23:11:51.511167249Z W1213 23:11:51.511153 1 warnings.go:70] unknown field "spec.infra.metadata.creationTimestamp" 2022-12-13T23:11:51.511167249Z W1213 23:11:51.511155 1 warnings.go:70] unknown field "spec.infra.metadata.generation" 2022-12-13T23:11:51.511167249Z W1213 23:11:51.511157 1 warnings.go:70] unknown field "spec.infra.metadata.managedFields" 2022-12-13T23:11:51.511167249Z W1213 23:11:51.511159 1 warnings.go:70] unknown field "spec.infra.metadata.name" 2022-12-13T23:11:51.511167249Z W1213 23:11:51.511161 1 warnings.go:70] unknown field "spec.infra.metadata.resourceVersion" 2022-12-13T23:11:51.511211644Z W1213 23:11:51.511163 1 warnings.go:70] unknown field "spec.infra.metadata.uid"
Version-Release number of selected component (if applicable):
4.12.0-rc3 Platform agnostic installation
How reproducible:
Just once (working with user outside RH)
Steps to Reproduce:
1. Install 4.11.17 2. Set candidate-4.12 upgrade channel 3. Initiate upgrade (apply admin ack as needed) 4. After upgrade, check Machine Config Operator logs
Actual results:
The upgrade went fine and I don't see any symptoms outside of warnings repeating in MCO log
Expected results:
I don't expect the warnings to be logged repeatedly
Additional info:
Description of problem:
IPI installation to a shared VPC with 'credentialsMode: Manual' failed because no IAM service accounts were created for the control-plane and compute machines.
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2023-04-18-005127
How reproducible:
Always
Steps to Reproduce:
1. "create install-config", and then insert interested settings in install-config.yaml 2. "create manifests" 3. run "ccoctl" to create the required credentials 4. grant the above IAM service accounts the required permissions in the host project (see https://github.com/openshift/openshift-docs/pull/58474) 5. "create cluster"
Actual results:
The installer doesn't create the two IAM service accounts, one for the control-plane machines and another for the compute machines, so no compute machines get created, which leads to installation failure.
Expected results:
The installation should succeed.
Additional info:
FYI https://issues.redhat.com/browse/OCPBUGS-11605 $ gcloud compute instances list --filter='name~jiwei-0418' NAME ZONE MACHINE_TYPE PREEMPTIBLE INTERNAL_IP EXTERNAL_IP STATUS jiwei-0418a-9kvlr-master-0 us-central1-a n2-standard-4 10.0.0.62 RUNNING jiwei-0418a-9kvlr-master-1 us-central1-b n2-standard-4 10.0.0.58 RUNNING jiwei-0418a-9kvlr-master-2 us-central1-c n2-standard-4 10.0.0.29 RUNNING $ gcloud iam service-accounts list --filter='email~jiwei-0418' DISPLAY NAME EMAIL DISABLED jiwei-0418a-14589-openshift-image-registry-gcs jiwei-0418a--openshift-i-zmwwh@openshift-qe.iam.gserviceaccount.com False jiwei-0418a-14589-openshift-machine-api-gcp jiwei-0418a--openshift-m-5cc5l@openshift-qe.iam.gserviceaccount.com False jiwei-0418a-14589-cloud-credential-operator-gcp-ro-creds jiwei-0418a--cloud-crede-p8lpc@openshift-qe.iam.gserviceaccount.com False jiwei-0418a-14589-openshift-gcp-ccm jiwei-0418a--openshift-g-bljz6@openshift-qe.iam.gserviceaccount.com False jiwei-0418a-14589-openshift-ingress-gcp jiwei-0418a--openshift-i-rm4vz@openshift-qe.iam.gserviceaccount.com False jiwei-0418a-14589-openshift-cloud-network-config-controller-gcp jiwei-0418a--openshift-c-6dk7g@openshift-qe.iam.gserviceaccount.com False jiwei-0418a-14589-openshift-gcp-pd-csi-driver-operator jiwei-0418a--openshift-g-pjn24@openshift-qe.iam.gserviceaccount.com False $
Description of problem:
When the user selects the "Use Pipeline from this cluster" option in the Add Pipeline section, the Create button should be enabled, but due to PAC validation the Create button is disabled.
Version-Release number of selected component (if applicable):
4.14.0
How reproducible:
Always
Steps to Reproduce:
1. Go to Import from Git page 2. Add repository https://bitbucket.org/lokanandap/hello-func 3. Select Use Pipeline from this cluster in Add Pipeline section
Actual results:
Create button is disabled
Expected results:
Create button should be enabled to create the workload
Additional info:
Description of problem:
IPV6 interface and IP is missing in all pods created in OCP 4.12 EC-2.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Every time
Steps to Reproduce:
We create network-attachment-definitions.k8s.cni.cncf.io in the OCP cluster at namespace scope so that our software pods get IPv6 IPs.
Actual results:
Pods do not receive IPv6 addresses
Expected results:
Pods receive IPv6 addresses
Additional info:
This has been working flawlessly up to OCP 4.10.21; however, we are trying the same code in OCP 4.12-ec2 and notice all our pods are missing an IPv6 address, and we have to restart the pods a couple of times for them to get an IPv6 address.
This is a clone of issue OCPBUGS-19418. The following is the description of the original issue:
—
Description of problem:
OCP Upgrades fail with message "Upgrade error from 4.13.X: Unable to apply 4.14.0-X: an unknown error has occurred: MultipleErrors"
Version-Release number of selected component (if applicable):
Currently 4.14.0-rc.1, but we observed the same issue with previous 4.14 nightlies too: 4.14.0-0.nightly-2023-09-12-195514 4.14.0-0.nightly-2023-09-02-132842 4.14.0-0.nightly-2023-08-28-154013
How reproducible:
1 out of 2 upgrades
Steps to Reproduce:
1. Deploy OCP 4.13 with the latest GA on a baremetal cluster with IPI and OVN-K 2. Upgrade to the latest 4.14 available 3. Check the cluster version status during the upgrade; at some point the upgrade stops with the message: "Upgrade error from 4.13.X Unable to apply 4.14.0-X: an unknown error has occurred: MultipleErrors" 4. Check the OVN pods with "oc get pods -n openshift-ovn-kubernetes"; there are pods running 7 out of 8 containers (missing ovnkube-node) that constantly restart, and pods running only 5 containers that show errors connecting to the OVN DBs. 5. Check the cluster operators with "oc get co"; mainly dns, network, and machine-config remained in 4.13 and degraded.
Actual results:
Upgrade not completed, and OVN pods remain in a restarting loop with failures.
Expected results:
Upgrade should be completed without issues, and OVN pods should remain in a Running status without restarts.
Additional info:
These are the results from our latest test from 4.13.13 to 4.14.0-rc1
$ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version True True 2h8m Unable to apply 4.14.0-rc.1: an unknown error has occurred: MultipleErrors $ oc get mcp NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE master rendered-master-ebb1da47ad5cb76c396983decb7df1ea True False False 3 3 3 0 3h41m worker rendered-worker-26ccb35941236935a570dddaa0b699db False True True 3 2 2 1 3h41m $ oc get co NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE authentication 4.14.0-rc.1 True False False 2h21m baremetal 4.14.0-rc.1 True False False 3h38m cloud-controller-manager 4.14.0-rc.1 True False False 3h41m cloud-credential 4.14.0-rc.1 True False False 2h23m cluster-autoscaler 4.14.0-rc.1 True False False 2h21m config-operator 4.14.0-rc.1 True False False 3h40m console 4.14.0-rc.1 True False False 2h20m control-plane-machine-set 4.14.0-rc.1 True False False 3h40m csi-snapshot-controller 4.14.0-rc.1 True False False 2h21m dns 4.13.13 True True True 2h9m etcd 4.14.0-rc.1 True False False 2h40m image-registry 4.14.0-rc.1 True False False 2h9m ingress 4.14.0-rc.1 True True True 1h14m insights 4.14.0-rc.1 True False False 3h34m kube-apiserver 4.14.0-rc.1 True False False 2h35m kube-controller-manager 4.14.0-rc.1 True False False 2h30m kube-scheduler 4.14.0-rc.1 True False False 2h29m kube-storage-version-migrator 4.14.0-rc.1 False True False 2h9m machine-api 4.14.0-rc.1 True False False 2h24m machine-approver 4.14.0-rc.1 True False False 3h40m machine-config 4.13.13 True False True 59m marketplace 4.14.0-rc.1 True False False 3h40m monitoring 4.14.0-rc.1 False True True 2h3m network 4.13.13 True True True 2h4m node-tuning 4.14.0-rc.1 True False False 2h9m openshift-apiserver 4.14.0-rc.1 True False False 2h20m openshift-controller-manager 4.14.0-rc.1 True False False 2h20m openshift-samples 4.14.0-rc.1 True False False 2h23m operator-lifecycle-manager 4.14.0-rc.1 True False False 2h23m operator-lifecycle-manager-catalog 4.14.0-rc.1 True False False 2h18m operator-lifecycle-manager-packageserver 4.14.0-rc.1 True False False 2h20m service-ca 4.14.0-rc.1 True False False 2h23m storage 4.14.0-rc.1 True False False 3h40m
Some OVN pods are running 7 out of 8 containers (missing ovnkube-node) and constantly restarting, and pods running only 5 containers show errors connecting to the OVN DBs.
$ oc get pods -n openshift-ovn-kubernetes -o wide NAME READY STATUS RESTARTS AGE IP NODE ovnkube-control-plane-5f5c598768-czkjv 2/2 Running 0 2h16m 192.168.16.32 dciokd-master-1 ovnkube-control-plane-5f5c598768-kg69r 2/2 Running 0 2h16m 192.168.16.31 dciokd-master-0 ovnkube-control-plane-5f5c598768-prfb5 2/2 Running 0 2h16m 192.168.16.33 dciokd-master-2 ovnkube-node-9hjv9 5/5 Running 1 3h43m 192.168.16.32 dciokd-master-1 ovnkube-node-fmswc 7/8 Running 19 2h10m 192.168.16.36 dciokd-worker-2 ovnkube-node-pcjhp 7/8 Running 20 2h15m 192.168.16.35 dciokd-worker-1 ovnkube-node-q7kcj 5/5 Running 1 3h43m 192.168.16.33 dciokd-master-2 ovnkube-node-qsngm 5/5 Running 3 3h27m 192.168.16.34 dciokd-worker-0 ovnkube-node-v2d4h 7/8 Running 20 2h15m 192.168.16.31 dciokd-master-0 $ oc logs ovnkube-node-9hjv9 -c ovnkube-node -n openshift-ovn-kubernetes | less ... 2023-09-19T03:40:23.112699529Z E0919 03:40:23.112660 5883 ovn_db.go:511] Failed to retrieve cluster/status info for database "OVN_Northbound", stderr: 2023-09-19T03:40:23Z|00001|unixctl|WARN|failed to connect to /var/run/ovn/ovnnb_db.ctl 2023-09-19T03:40:23.112699529Z ovn-appctl: cannot connect to "/var/run/ovn/ovnnb_db.ctl" (No such file or directory) 2023-09-19T03:40:23.112699529Z , err: (OVN command '/usr/bin/ovn-appctl -t /var/run/ovn/ovnnb_db.ctl --timeout=5 cluster/status OVN_Northbound' failed: exit status 1) 2023-09-19T03:40:23.112699529Z E0919 03:40:23.112677 5883 ovn_db.go:590] OVN command '/usr/bin/ovn-appctl -t /var/run/ovn/ovnnb_db.ctl --timeout=5 cluster/status OVN_Northbound' failed: exit status 1 2023-09-19T03:40:23.114791313Z E0919 03:40:23.114777 5883 ovn_db.go:283] Failed retrieving memory/show output for "OVN_NORTHBOUND", stderr: 2023-09-19T03:40:23Z|00001|unixctl|WARN|failed to connect to /var/run/ovn/ovnnb_db.ctl 2023-09-19T03:40:23.114791313Z ovn-appctl: cannot connect to "/var/run/ovn/ovnnb_db.ctl" (No such file or directory) 2023-09-19T03:40:23.114791313Z , err: (OVN command '/usr/bin/ovn-appctl -t /var/run/ovn/ovnnb_db.ctl --timeout=5 memory/show' failed: exit status 1) 2023-09-19T03:40:23.116492808Z E0919 03:40:23.116478 5883 ovn_db.go:511] Failed to retrieve cluster/status info for database "OVN_Southbound", stderr: 2023-09-19T03:40:23Z|00001|unixctl|WARN|failed to connect to /var/run/ovn/ovnsb_db.ctl 2023-09-19T03:40:23.116492808Z ovn-appctl: cannot connect to "/var/run/ovn/ovnsb_db.ctl" (No such file or directory) 2023-09-19T03:40:23.116492808Z , err: (OVN command '/usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl --timeout=5 cluster/status OVN_Southbound' failed: exit status 1) 2023-09-19T03:40:23.116492808Z E0919 03:40:23.116488 5883 ovn_db.go:590] OVN command '/usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl --timeout=5 cluster/status OVN_Southbound' failed: exit status 1 2023-09-19T03:40:23.118468064Z E0919 03:40:23.118450 5883 ovn_db.go:283] Failed retrieving memory/show output for "OVN_SOUTHBOUND", stderr: 2023-09-19T03:40:23Z|00001|unixctl|WARN|failed to connect to /var/run/ovn/ovnsb_db.ctl 2023-09-19T03:40:23.118468064Z ovn-appctl: cannot connect to "/var/run/ovn/ovnsb_db.ctl" (No such file or directory) 2023-09-19T03:40:23.118468064Z , err: (OVN command '/usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl --timeout=5 memory/show' failed: exit status 1) 2023-09-19T03:40:25.118085671Z E0919 03:40:25.118056 5883 ovn_northd.go:128] Failed to get ovn-northd status stderr() :(failed to run the command since failed to get ovn-northd's pid: open /var/run/ovn/ovn-northd.pid: no such file or directory)
Description: During an upgrade from non-IC to IC, the CNO status logic looks up a well-known configmap that indicates whether an upgrade to IC is ongoing, in order not to report the new operator version (4.14) until the second and final phase of the IC upgrade is done.
The following corrections are needed:
Remove Hemant and Sparsh from integration-tests reviewers
Please review the following PR: https://github.com/openshift/images/pull/133
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-18003. The following is the description of the original issue:
—
Description of problem:
Found auto case OCP-42340 failing in a CI job whose version is 4.14.0-ec.4, and then reproduced the issue in 4.14.0-0.nightly-2023-08-22-221456.
Version-Release number of selected component (if applicable):
4.14.0-ec.4 4.14.0-0.nightly-2023-08-22-221456
How reproducible:
Always
Steps to Reproduce:
1. Deploy egressrouter on baremetal with { "kind": "List", "apiVersion": "v1", "metadata": {}, "items": [ { "apiVersion": "network.operator.openshift.io/v1", "kind": "EgressRouter", "metadata": { "name": "egressrouter-42430", "namespace": "e2e-test-networking-egressrouter-l4xgx" }, "spec": { "addresses": [ { "gateway": "192.168.111.1", "ip": "192.168.111.55/24" } ], "mode": "Redirect", "networkInterface": { "macvlan": { "mode": "Bridge" } }, "redirect": { "redirectRules": [ { "destinationIP": "142.250.188.206", "port": 80, "protocol": "TCP" }, { "destinationIP": "142.250.188.206", "port": 8080, "protocol": "TCP", "targetPort": 80 }, { "destinationIP": "142.250.188.206", "port": 8888, "protocol": "TCP", "targetPort": 80 } ] } } } ] } % oc get pods -n e2e-test-networking-egressrouter-l4xgx -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES egress-router-cni-deployment-c4bff88cf-skv9j 1/1 Running 0 69m 10.131.0.26 worker-0 <none> <none> 2. Create service which point to egressrouter % oc get svc -n e2e-test-networking-egressrouter-l4xgx -o yaml apiVersion: v1 items: - apiVersion: v1 kind: Service metadata: creationTimestamp: "2023-08-23T05:58:30Z" name: ovn-egressrouter-multidst-svc namespace: e2e-test-networking-egressrouter-l4xgx resourceVersion: "50383" uid: 07341ff1-6df3-40a6-b27e-59102d56e9c1 spec: clusterIP: 172.30.10.103 clusterIPs: - 172.30.10.103 internalTrafficPolicy: Cluster ipFamilies: - IPv4 ipFamilyPolicy: SingleStack ports: - name: con1 port: 80 protocol: TCP targetPort: 80 - name: con2 port: 5000 protocol: TCP targetPort: 8080 - name: con3 port: 6000 protocol: TCP targetPort: 8888 selector: app: egress-router-cni sessionAffinity: None type: ClusterIP status: loadBalancer: {} kind: List metadata: resourceVersion: "" 3. create a test pod to access the service or curl the egressrouter IP:port directly oc rsh -n e2e-test-networking-egressrouter-l4xgx hello-pod1 ~ $ curl 172.30.10.103:80 --connect-timeout 5 curl: (28) Connection timeout after 5001 ms ~ $ curl 10.131.0.26:80 --connect-timeout 5 curl: (28) Connection timeout after 5001 ms $ curl 10.131.0.26:8080 --connect-timeout 5 curl: (28) Connection timeout after 5001 ms
Actual results:
connection failed
Expected results:
connection succeed
Additional info:
Note: the issue didn't exist in 4.13; it passed with the latest 4.13 nightly build, 4.13.0-0.nightly-2023-08-11-101506
08-23 15:26:16.955 passed: (1m3s) 2023-08-23T07:26:07 "[sig-networking] SDN ConnectedOnly-Author:huirwang-High-42340-Egress router redirect mode with multiple destinations."
Description of problem:
node-exporter profiling shows that ~16% of CPU time is spent fetching details about btrfs mounts. The RHEL kernel doesn't have btrfs, so it's safe to disable this collector.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
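A hedged sketch (not the cluster-monitoring-operator code) of what "disable the collector" amounts to in practice: passing node_exporter's --no-collector.btrfs flag on the container that runs it. The image reference and the other flag are illustrative.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	c := corev1.Container{
		Name:  "node-exporter",
		Image: "quay.io/prometheus/node-exporter:latest", // illustrative image reference
		Args: []string{
			"--web.listen-address=127.0.0.1:9100",
			"--no-collector.btrfs", // RHEL/RHCOS kernels ship without btrfs, so skip the probe
		},
	}
	fmt.Println(c.Args)
}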
Please review the following PR: https://github.com/openshift/telemeter/pull/460
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
As endorsed at DNS Flag Day, the DNS Community recommends a bufsize setting of 1232 as a safe default that supports larger payloads, while generally avoiding IP fragmentation on most networks. This is particularly relevant for payloads like those generated by DNSSEC, which tend to be larger.
Previously, CoreDNS always used the EDNS0 extension, which enables UDP-based DNS queries to exceed 512 bytes, when CoreDNS forwarded DNS queries to an upstream name server, and so OpenShift specified a bufsize setting of 512 to maintain compatibility with applications and name servers that did not support the EDNS0 extension.
For clients and name servers that do support EDNS0, a bufsize setting of 512 can result in more DNS truncation and unnecessary TCP retransmissions, resulting in worse DNS performance for most users. This is due to the fact that if a response is larger than the bufsize setting, it gets truncated, prompting clients to initiate a TCP retry. In this situation, two DNS requests are made for a single DNS answer, leading to higher bandwidth usage and longer response times.
Currently, CoreDNS no longer uses EDNS0 when forwarding requests if the original client request did not use EDNS0 (ref: coredns/coredns@a5b9749), and so the reasoning for using a bufsize setting of 512 no longer applies. By increasing the bufsize setting to the recommended value of 1232 bytes, we can enhance DNS performance by decreasing the probability of DNS truncations.
Using a larger bufsize setting of 1232 bytes also would potentially help alleviate bugs like https://issues.redhat.com/browse/OCPBUGS-6829 in which a non-compliant upstream DNS is not respecting a bufsize of 512 bytes and sending larger-than-512-bytes responses. A bufsize setting of 1232 bytes doesn't fix the root cause of this issue; rather, it decreases the likelihood of its occurrence by increasing the acceptable size range for UDP responses.
Note that clients that don’t support EDNS0 or TCP, such as applications built using older versions of Alpine Linux, are still subject to the aforementioned truncation issue. To avoid these issues, ensure that your application is built using a DNS resolver library that supports EDNS0 or TCP-based DNS queries.
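For illustration, one way to check the EDNS buffer size that CoreDNS advertises to clients (a sketch; the cluster DNS service IP 172.30.0.10 and the dnsutils image are assumptions, adjust them to your environment):

# Run a throwaway pod with dig and query the cluster DNS service
oc run dns-check --rm -it --restart=Never \
  --image=registry.k8s.io/e2e-test-images/jessie-dnsutils:1.7 -- \
  dig @172.30.0.10 kubernetes.default.svc.cluster.local +noall +comments
# Look for a line such as:  ; EDNS: version: 0, flags:; udp: 1232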
Brief history of OpenShift's Bufsize changes:
Version-Release number of selected component (if applicable):
4.14, 4.13, 4.12, 4.11
How reproducible:
100%
Steps to Reproduce:
1. oc -n openshift-dns get configmaps/dns-default -o yaml | grep -i bufsize
Actual results:
Bufsize = 512
Expected results:
Bufsize = 1232
Additional info:
This is a clone of issue OCPBUGS-20104. The following is the description of the original issue:
—
Description of problem:
The recently introduced node identity feature introduces pods that run as root. While it is understood there may be situations where that is absolutely required, the goal should be to always run with least privilege / non-root.
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Deploy an IBM Managed OpenShift 4.14.0 cluster. I suspect any OpenShift 4.14.0 cluster will have these pods running as root as well.
Actual results:
network-node-identity pods are running as root
Expected results:
network-node-identity pods should be running as non-root
Additional info:
Due to the introduction of these pods running as root in an IBM Managed OpenShift 4.14.0 cluster, we will have to file for a security exception.
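One way to verify how these pods run (a sketch; the openshift-network-node-identity namespace name is assumed from the component name, and an empty runAsUser typically means the container falls back to the image default, i.e. root here):

# Print each network-node-identity pod together with any explicit runAsUser it sets
oc -n openshift-network-node-identity get pods \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].securityContext.runAsUser}{"\n"}{end}'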
Description of the problem:
Searching cluster events with message=\ (URL-encoded as message=%5C) returns all "writing image to disk" messages.
e.g. "Host: test-infra-cluster-f5e3a8e9-master-1, reached installation stage Writing image to disk: 5%"
How reproducible:
100%
Steps to reproduce:
1. Install cluster
2. List events with message=\ , or message=%5C
curl -s -v --location --request GET 'https://api.stage.openshift.com/api/assisted-install/v2/events?cluster_id=2aa44b94-e533-44fe-9c0f-3b20a3d91b4e&message=%5C' --header "Authorization: Bearer $(ocm token)" | jq '.'
or
curl -s -v --location --request GET 'https://api.stage.openshift.com/api/assisted-install/v2/events?cluster_id=2aa44b94-e533-44fe-9c0f-3b20a3d91b4e&message=\' --header "Authorization: Bearer $(ocm token)" | jq '.'
Actual results:
All "writing image to disk" are returns
Expected results:
Only events containing '\' should be returned
Description of the problem:
CVO 4.14 failed to install when Nutanix platform provider is selected.
{ "cluster_id": "c8359d4e-141b-45ff-9979-d49dd679d56b", "name": "cvo", "operator_type": "builtin", "status": "failed", "status_updated_at": "2023-06-29T07:40:47.855Z", "timeout_seconds": 3600, "version": "4.14.0-0.nightly-2023-06-27-233015" }
How reproducible:
Steps to reproduce:
1.
2.
3.
Actual results:
Expected results:
We need to improve our must-gather so that we can collect the CRs on which the vSphere CSI driver depends.
IMO they contain vital cluster state, and not collecting them makes certain parts of CSI driver debugging much harder than it needs to be.
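Until must-gather collects these automatically, the same data can be gathered manually; a sketch (the exact set of CRs worth inspecting is an assumption and should be taken from the driver's CRD list):

# Grab the CSI driver namespace and the operator-level CR that configures the vSphere driver
oc adm inspect ns/openshift-cluster-csi-drivers --dest-dir=csi-inspect
oc adm inspect clustercsidriver/csi.vsphere.vmware.com --dest-dir=csi-inspect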
Sanitize OWNERS/OWNER_ALIASES:
1) OWNERS must have:
component: "Storage / Kubernetes External Components"
2) OWNER_ALIASES must have all team members of Storage team.
Some unit tests are flaky because we check that timestamps have changed.
When creation and the test happen very quickly, the timestamp may appear not to have changed.
https://redhat-internal.slack.com/archives/C014N2VLTQE/p1681827276489839
We can fix this by simulating the host creation as having happened in the past.
Description of problem:
Bump Kubernetes to 0.27.1 and bump dependencies
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
IPI install on Azure Stack fails when setting platform.azure.osDisk.diskType to StandardSSD_LRS in install-config.yaml.

When setting controlPlane.platform.azure.osDisk.diskType to StandardSSD_LRS, the following error appears in the terraform log and some resources have already been created:

level=error msg=Error: expected storage_os_disk.0.managed_disk_type to be one of [Premium_LRS Standard_LRS], got StandardSSD_LRS
level=error
level=error msg= with azurestack_virtual_machine.bootstrap,
level=error msg= on main.tf line 107, in resource "azurestack_virtual_machine" "bootstrap":
level=error msg= 107: resource "azurestack_virtual_machine" "bootstrap" {
level=error
level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failure applying terraform for "bootstrap" stage: failed to create cluster: failed to apply Terraform: exit status 1
level=error
level=error msg=Error: expected storage_os_disk.0.managed_disk_type to be one of [Premium_LRS Standard_LRS], got StandardSSD_LRS
level=error
level=error msg= with azurestack_virtual_machine.bootstrap,
level=error msg= on main.tf line 107, in resource "azurestack_virtual_machine" "bootstrap":
level=error msg= 107: resource "azurestack_virtual_machine" "bootstrap" {
level=error
level=error

When setting compute.platform.azure.osDisk.diskType to StandardSSD_LRS, compute machines fail to provision:

$ oc get machine -n openshift-machine-api
NAME                                     PHASE     TYPE              REGION   ZONE   AGE
jima414ash03-xkq5x-master-0              Running   Standard_DS4_v2   mtcazs          62m
jima414ash03-xkq5x-master-1              Running   Standard_DS4_v2   mtcazs          62m
jima414ash03-xkq5x-master-2              Running   Standard_DS4_v2   mtcazs          62m
jima414ash03-xkq5x-worker-mtcazs-89mgn   Failed                                      52m
jima414ash03-xkq5x-worker-mtcazs-jl5kk   Failed                                      52m
jima414ash03-xkq5x-worker-mtcazs-p5kvw   Failed                                      52m

$ oc describe machine jima414ash03-xkq5x-worker-mtcazs-jl5kk -n openshift-machine-api
...
Error Message: failed to reconcile machine "jima414ash03-xkq5x-worker-mtcazs-jl5kk": failed to create vm jima414ash03-xkq5x-worker-mtcazs-jl5kk: failure sending request for machine jima414ash03-xkq5x-worker-mtcazs-jl5kk: cannot create vm: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="InvalidParameter" Message="Storage account type 'StandardSSD_LRS' is supported by Microsoft.Compute API version 2018-04-01 and above" Target="osDisk.managedDisk.storageAccountType"
...

Based on the azure-stack doc[1], the supported disk types on ASH are Premium SSD and Standard HDD. It would be better to validate diskType on Azure Stack to avoid the above errors.

[1] https://learn.microsoft.com/en-us/azure-stack/user/azure-stack-managed-disk-considerations?view=azs-2206&tabs=az1%2Caz2#cheat-sheet-managed-disk-differences
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-05-16-085836
How reproducible:
Always
Steps to Reproduce:
1. Prepare install-config.yaml, setting platform.azure.osDisk.diskType to StandardSSD_LRS
2. Install an IPI cluster on Azure Stack
Actual results:
Installation failed
Expected results:
The installer should validate diskType on Azure Stack Cloud and exit with an error message for unsupported disk types
Additional info:
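Until the installer validates this, a lightweight pre-flight check of the install-config can catch it (a sketch; assumes yq is available and that Premium_LRS and Standard_LRS are the only supported values, per the error above):

# Print the configured disk types; anything other than Premium_LRS or Standard_LRS is unsupported on Azure Stack
yq e '.controlPlane.platform.azure.osDisk.diskType, .compute[].platform.azure.osDisk.diskType' install-config.yaml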
The TestBodySizeLimit test is increasingly flaky. We need to investigate and fix it.
https://search.ci.openshift.org/?search=FAIL%3A+TestBodySizeLimit&maxAge=48h&context=2&type=all&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
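To help reproduce the flake locally, the test can be run repeatedly (a sketch; the package path is a placeholder and should be replaced with the package that defines TestBodySizeLimit):

# Run the flaky test many times with the race detector to shake out the failure
go test -race -count=50 -run 'TestBodySizeLimit' -v ./pkg/...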
Seen in build02, currently running 4.12.0-ec.3:
mcd_update_state{node="build0-gstfj-m-0.c.openshift-ci-build-farm.internal"}
returns:
Those are identical, except:
Looking at the backing code, my guess is that we're doing something like this:
Or something like that. I expect we want to drop the zero-valued time-series, but I'm not clear enough on how the MCO pushes values into the export set to have code suggestions.
When displaying my pipeline, it is not rendered correctly: there are overlapping segments between parallel branches. However, if I edit the pipeline, it appears fine. I have attached screenshots showing the issue.
This is a regression from 4.11 where it rendered fine.
Description of problem:
When "Service Binding Operator" is successfully installed in the cluster for the first time, the page will automatically redirect to Operator installation page with the error message "A subscription for this Operator already exists in Namespace "XXX" "
Notice: This issue only happened when the user installed "Service Binding Operator" for the first time. If the user uninstalls and re-installs the operator again, this issue will be gone
Version-Release number of selected components (if applicable):
4.12.0-0.nightly-2022-08-12-053438
How reproducible:
Always
Steps to Reproduce:
Actual results:
The page will redirect to Operator installation page with the error message "A subscription for this Operator already exists in Namespace "XXX" "
Expected results:
The page should stay on the install page, with the message "Installed operator- ready for use"
Additional info:
Please find the attached screenshot for more details
Description of problem:
SCOS times out during provisioning of BM nodes
Version-Release number of selected component (if applicable):
4.13
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
https://github.com/openshift/ironic-image/pull/377
In Helm Charts we define a values.schema.json file - a JSON schema for all the possible values the user can set in a chart. This schema needs to follow the JSON Schema standard. The standard includes something called $ref - a reference to either a local or a remote definition. If we use a schema with remote references in OCP, it causes various troubles. Different OCP versions give different results, and even on the same OCP version you can get different results depending on how locked down the cluster networking is.
Tried in Developer Sandbox, OpenShift Local, Baremetal Public Cluster in Operate First, OCP provisioned through clusterbot. It behaves differently in each instance. Individual cases are described below.
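For reference, a quick way to check whether a chart's values.schema.json contains remote $ref entries before installing it (a sketch; the chart name and repository URL below are taken from the reproducer and should be adjusted for other charts):

# Pull the chart locally and look for remote (http/https) $ref entries in its JSON schema
helm pull backstage --repo https://raw.githubusercontent.com/tumido/helm-backstage/reproducer --untar
grep -R '"\$ref"[[:space:]]*:[[:space:]]*"https\?://' backstage/values.schema.json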
1. Go to the "Helm" tab in Developer Perspective
2. Click "Create" in top right and select "Repository"
3. Use the following ProjectHelmChartRepository resource and click "Create" (this repo contains a single chart, and that chart has a values.schema.json with the content linked below):
apiVersion: helm.openshift.io/v1beta1
kind: ProjectHelmChartRepository
metadata:
name: reproducer
spec:
connectionConfig:
url: https://raw.githubusercontent.com/tumido/helm-backstage/reproducer
4. Go back the "Helm" tab in Developer Perspective
5. Click "Create" in top right and select "Helm Release"
6. In filters section of the catalog in the "Chart repositories" select "Reproducer"
7. Click on the single tile available (Backstage)
8. Click "Install Helm Chart"
9. Either you will be greeted with various error screens or you see the "YAML view" tab (this tab selection is not the default and is remembered during user session only I suppose)
10. Select "Form view"
Various error screens depending on OCP version and network restrictions. I've attached screen captures how it behaves in different settings.
Either render the form view (resolve the remote references) or make it obvious that remote references are not supported. Optionally, fall back to the "YAML view", acknowledging that the user doesn't have the full schema available but the chart is still deployable.
Depends on the environment
Always in OpenShift Local, Developer Sandbox, cluster bot clusters
1. Select any other chart to install, click "Install Helm Chart"
2. Change the view to "YAML view"
3. Go back to the Helm catalog without actually deploying anything
4. Select the faulty chart and click "Install Helm Chart"
5. Proceed with installation
The kubernetes-apiserver and openshift-apiserver need to be rebased to k8s 1.27.x after the o/k rebase is completed.
The new test introduced by https://issues.redhat.com/browse/HOSTEDCP-960 fails for platforms other than AWS because some AWS specific conditions like `ValidAWSIdentityProvider` are always set regardless of the platform.
OCP Version at Install Time: 4.11-fc.3
RHCOS Version at Install Time: 411.86.202206172255-0
Platform: vSphere
Architecture: x86_64
I'm trying to verify that the IPI installer uses UEFI when creating VMs on VMware, following https://github.com/coreos/coreos-assembler/pull/2762 (merged Mar 19).
However, the 4.11.0-fc.3 installer taken from https://mirror.openshift.com/pub/openshift-v4/clients/ocp-dev-preview/4.11.0-fc.3/openshift-install-linux.tar.gz still seems to use BIOS.
Reproducing:
1. Run openshift-install against a VMware vSphere cluster.
2. Wait for an OpenShift VM (bootstrap, control, or worker node) to show up in vCenter.
3. Go to the VM's boot options - the firmware is set to BIOS instead of UEFI, which was supposed to be set by default.
Description of problem:
Bump Kubernetes to 0.27.1 and bump dependencies
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
OCP clusters born on 4.1 fail to scale up nodes due to the older podman version (1.0.2) present in the 4.1 bootimage. This was observed while testing bug https://issues.redhat.com/browse/OCPBUGS-7559?focusedCommentId=21889975&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-21889975
Journal log:
-- Unit machine-config-daemon-update-rpmostree-via-container.service has finished starting up.
--
-- The start-up result is RESULT.
Mar 10 10:41:29 ip-10-0-218-217 podman[18103]: flag provided but not defined: -authfile
Mar 10 10:41:29 ip-10-0-218-217 podman[18103]: See 'podman run --help'.
Mar 10 10:41:29 ip-10-0-218-217 systemd[1]: machine-config-daemon-update-rpmostree-via-container.service: Main process exited, code=exited, status=125/n/a
Mar 10 10:41:29 ip-10-0-218-217 systemd[1]: machine-config-daemon-update-rpmostree-via-container.service: Failed with result 'exit-code'.
Mar 10 10:41:29 ip-10-0-218-217 systemd[1]: machine-config-daemon-update-rpmostree-via-container.service: Consumed 24ms CPU time
Version-Release number of selected component (if applicable):
OCP 4.12 and later
Steps to Reproduce:
1. Upgrade a 4.1-based cluster to 4.12 or a later version
2. Try to scale up a node
3. The node will fail to join
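A quick way to confirm the stale podman in the bootimage on the failing node (a sketch; assumes the node is reachable over SSH with the usual core user, since it never joins the cluster):

# The 4.1 bootimage ships podman 1.0.x, which predates the --authfile flag used by the MCD unit
ssh core@<failing-node-ip> 'podman --version'
ssh core@<failing-node-ip> 'podman run --help | grep authfile || echo "--authfile not supported"'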
This is the downstreaming issue for the upstream operator-registry changes. Upstream olm-docs repo will be downstreamed as part of later docs updates.
https://docs.google.com/document/d/139yXeOqAJbV1ndC7Q4NbaOtzbSdNpcuJan0iemORd3g/
-------------------------------------------
Veneer is viewed as a confusing and counter-intuitive term. PM floated `catalog template` (`template` for short) as a replacement and it's resonated sufficiently with folks that we want to update references/commands to use the new term.
A/C:
Description of problem:
When we delete any CR from the common OCP operator page, it would be good to add an indication that the resource is being deleted, or at least to grey out the dot at the right corner, from the user's perspective.
Version-Release number of selected component (if applicable):
4.13
How reproducible:
Steps to Reproduce:
1. Go to Operators -> Installed Operators -> click any installed operator -> click the CRD name from the header tab -> delete any CR from the list page using the kebab menu.
2. There is no indication of the deletion; the user can perform any action even after the deletion is triggered.
Actual results:
No indication about deletion on kebab menu
Expected results:
Grey out the dot and display a tooltip about the deletion.
Additional info:
https://github.com/openshift/console/pull/11860 is not fixing this issue for operator page.
Description of problem:
This ticket was created to track: https://issues.redhat.com/browse/CNV-31770
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-18464. The following is the description of the original issue:
—
Description of problem:
Hide the Builds NavItem if BuildConfig is not installed in the cluster
This is a clone of issue OCPBUGS-18641. The following is the description of the original issue:
—
Description of problem:
vSphere Dual-stack install fails in bootstrap.
All nodes are node.cloudprovider.kubernetes.io/uninitialized
cloud-controller-manager can't find the nodes?
I0906 15:05:22.922183 1 search.go:49] WhichVCandDCByNodeID called but nodeID is empty
E0906 15:05:22.922187 1 nodemanager.go:197] shakeOutNodeIDLookup failed. Err=nodeID is empty
Version-Release number of selected component (if applicable):
4.14.0-0.ci.test-2023-09-06-141839-ci-ln-98f4iqb-latest
How reproducible:
Always
Steps to Reproduce:
1. Install vSphere IPI with OVN Dual-stack
platform:
  vsphere:
    apiVIPs:
    - 192.168.134.3
    - fd65:a1a8:60ad:271c::200
    ingressVIPs:
    - 192.168.134.4
    - fd65:a1a8:60ad:271c::201
networking:
  networkType: OVNKubernetes
  machineNetwork:
  - cidr: 192.168.0.0/16
  - cidr: fd65:a1a8:60ad:271c::/64
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  - cidr: fd65:10:128::/56
    hostPrefix: 64
  serviceNetwork:
  - 172.30.0.0/16
  - fd65:172:16::/112
Actual results:
Install fails in bootstrap
Expected results:
Install succeeds
Additional info:
I0906 15:03:21.393629 1 search.go:69] WhichVCandDCByNodeID by UUID
I0906 15:03:21.393632 1 search.go:76] WhichVCandDCByNodeID nodeID: 421b78c3-f8bb-970c-781b-76827306e89e
I0906 15:03:21.406797 1 search.go:208] Found node 421b78c3-f8bb-970c-781b-76827306e89e
I0906 15:03:21.406816 1 search.go:210] Hostname: ci-ln-bllxr6t-c1627-5p7mq-master-2, UUID: 421b78c3-f8bb-970c-781b-76827306e89e
I0906 15:03:21.406830 1 nodemanager.go:159] Discovered VM using normal UUID format
I0906 15:03:21.416168 1 nodemanager.go:268] Adding Hostname: ci-ln-bllxr6t-c1627-5p7mq-master-2
I0906 15:03:21.416218 1 nodemanager.go:438] Adding Internal IP: 192.168.134.60
I0906 15:03:21.416229 1 nodemanager.go:443] Adding External IP: 192.168.134.60
I0906 15:03:21.416244 1 nodemanager.go:349] Found node 421b78c3-f8bb-970c-781b-76827306e89e
I0906 15:03:21.416266 1 nodemanager.go:351] Hostname: ci-ln-bllxr6t-c1627-5p7mq-master-2 UUID: 421b78c3-f8bb-970c-781b-76827306e89e
I0906 15:03:21.416278 1 instances.go:77] instances.NodeAddressesByProviderID() FOUND with 421b78c3-f8bb-970c-781b-76827306e89e
E0906 15:03:21.416326 1 node_controller.go:236] error syncing 'ci-ln-bllxr6t-c1627-5p7mq-master-2': failed to get node modifiers from cloud provider: provided node ip for node "ci-ln-bllxr6t-c1627-5p7mq-master-2" is not valid: failed to get node address from cloud provider that matches ip: fd65:a1a8:60ad:271c::70, requeuing
I0906 15:03:21.623573 1 instances.go:102] instances.InstanceID() CACHED with ci-ln-bllxr6t-c1627-5p7mq-master-1
Description of problem:
Upgrading from 4.12 to 4.13 causes cpuset-configure.service to fail, because the `mkdir` for `/sys/fs/cgroup/cpuset/system` and `/sys/fs/cgroup/cpuset/machine.slice` wasn't persistent.
Version-Release number of selected component (if applicable):
How reproducible:
Extremely (probably for every upgrade to the NTO)
Steps to Reproduce:
1. Upgrade from 4.12
2. The service will fail...
Actual results:
Expected results:
Service should start/finish correctly
Additional info:
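A manual check/workaround sketch for an affected node, based on the paths mentioned above (treat this as illustrative only):

# Inspect the failed unit and recreate the cpuset directories it expects, then retry
oc debug node/<node> -- chroot /host systemctl status cpuset-configure.service
oc debug node/<node> -- chroot /host mkdir -p /sys/fs/cgroup/cpuset/system /sys/fs/cgroup/cpuset/machine.slice
oc debug node/<node> -- chroot /host systemctl restart cpuset-configure.service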
Description of problem:
The cluster network operator crashes in an IBM ROKS with the following error:
2023-06-07T12:21:37.402285420-05:00 stderr F I0607 17:21:37.402108 1 log.go:198] Failed to render: failed to render multus admission controller manifests: failed to render file bindata/network/multus-admission-controller/admission-controller.yaml: failed to render manifest bindata/network/multus-admission-controller/admission-controller.yaml: template: bindata/network/multus-admission-controller/admission-controller.yaml:199:12: executing "bindata/network/multus-admission-controller/admission-controller.yaml" at <.HCPNodeSelector>: map has no entry for key "HCPNodeSelector"
Version-Release number of selected component (if applicable):
4.13.1
How reproducible:
Always
Steps to Reproduce:
1. Run a ROKS cluster with OCP 4.13.1 2. 3.
Actual results:
CNO crashes
Expected results:
CNO functions normally
Additional info:
ROKS worked OK with 4.13.0. This change was introduced in 4.13.1: https://github.com/openshift/cluster-network-operator/pull/1802
Description of problem:
Duplicate "using" in a log message.
log.Infof("For node %s selected peer address %s using using OVN annotations.", node.Name, addr)
Version-Release number of selected component (if applicable):
4.14
How reproducible:
always
Steps to Reproduce:
1. code review 2. 3.
Actual results:
log.Infof("For node %s selected peer address %s using using OVN annotations.", node.Name, addr)
Expected results:
log.Infof("For node %s selected peer address %s using OVN annotations.", node.Name, addr)
Additional info:
Description of problem:
Installed and uninstalled some helm charts, and got an issue that the helm details page couldn't be loaded successfully. This issue exists also in old versions and is aligned with OCPBUGS-7517.
If the backend fails to load the frontend never stops loading the helm details page.
Version-Release number of selected component (if applicable):
Details page never stops loading
How reproducible:
Always with the Helm chart secret below.
Steps to Reproduce:
Unable to reproduce this manually again.
But you can apply the Secret at the end to any namespace.
You can create this in any namespace, but because it contains the namespace info "christoph", the Helm list page links to a non-existing URL. You can fix that manually or use the namespace "christoph".
Actual results:
Expected results:
Additional info:
Secret to reproduce this issue:
kind: Secret apiVersion: v1 metadata: name: sh.helm.release.v1.dotnet.v1 labels: name: dotnet owner: helm status: deployed version: '1' data: release: >- H4sIAAAAAAAC/+S9a3ObTNIw/Ff06v74OgkgKxu5aj8YYiEUiUTI4rTZ2mIGDEjD4REgGe2T//7UzAAChGzLcZLr3r2qrorFYejpc/d0z/y7H1qB07/p21EaOmn/qu+HD1H/5t/9B3+bpP+ynRhFuWP3b/ocww3eMdw79vqeG9xcj25Y7v3H4XA0ZJkh9/8z7A3D9K/6yHrNW7aDnJQ8T34kcOvHqR+F/Zu+FCaphVAPRkGMH+pf9ZPUSrMEA11+56ofRqmDL30PjSjb9t7Ld/c9K457ftIDmY9sP3T/v9591Nv5zr6Xeg692kORm1z1tll48z38HkaQXOgB+IHio/fu3UOEULTHd+UodXqpZ6W9HH/iM/l44IRpb+8j1Ns6cbRNe9/7d9utFFiu8y1D6Hu/Z4V273u/usJbcPP14eF7v5eFqY9qsPhJNcn3va8hdLrvXdHP+3hA+mXg9KwsjQIr9aGFUN7bRgg5di/K0vf9H1d96FnbFNM0cFLLtlIL/92m+87ZJhTjzHvmPXtCh9vexEFBj4zVS6MCLjw5SoUK5ciHFn4n6V/1N06+j7Z20r/5R3+L5xs4+HLx0X9e9a3YV6sP77j+Vd8KwygtBrj5N4X9X9kW9W/6XprGyc2HD66fehl4D6PgQxQ7YeL5D+k7z0HBO/J08qH4Z+sgx0qc5IMd7UMUWfaHrWN7VvqOfv8dmWjXtfepe+j/+HHVRxHc9G/CDKGrfuoEMbIIl/2jQl918YP89f5u+T59xLikOO47gySVRBRIwnBpao/I0GU026DDUhsebHGcAIEfPSxiEwzUXBKGXxV14Ro6v5dEdJDEKWtpjxtLG4bSkl+BnOcsTR1IEyUyl7xvaygxBT4BnH2YCXxua9cfBTfeGTm9JonT9WyQ/k0SUWZwj6wprlwpUHa2OES2MMwMjUWSf5tJE3YkCUxqBqMEiKOB4MZfwUBB+DuGvnAdbcRCn78zdT4BA5Sa2pCRJnYMxL0LA3UPBlNGEqZjGE6nQBuHpsqzQNz7kjjOTOHWX2qsZ3LqwtYek0UwXlvMKDB9ybW1IWNpe9cWPVTOlc5b3gGdT0xdQTOf/wYCGbXmHMOcXwOO3QNRZczlvoQxJt9f8gNLe0wkcYokccza4ig1dCU2uHECJhsXknmqG0kcsbZw/YXSSM1MQou//73/46qDuP/yHBQ72+R9GqM2fRVkBigzl7e+KY4YEKjMLBh6QFv5ksBi+v7NyfmNqZmerT0yTV4gz7kzZHpgoiKYD0MgjnxD22cgGKfmasSZ+jS3NEwPdiSE6d9mSx6BYOHOdHYkuHhsxjVFNbC0IZKE6QYMlMzUFxkQx76pPR4k/zZ90JkvlqgmYDk8WMJobYnj3P4cuc4gcWcbOTL0KVPCgp/F1y1tuAYTddOYVygjIKprWxzl98fxsxpssenfZgvO82C4yBY6v1cDNYcc2gGf8LoHJ7eZNVB9U59mmMYwH8YgJ/M8WNoo++rzf3Pys2N8kiZjlvJnEx8Ybiw7syBljUDNMbymPs8s7dMOaOPM0GxkCqzv3BfzRlM8Fw9yq2zFqbkdoLW5PMIIBjwCoRxZmsnMArSbDaYsCJUYaKuPkljI0W2Bf224IfA8QQ/IqYmpyQwYTOeGNkVgMi/54xxOiIxSfPAxCOTExnxQpzHmkaXkzp7GbQxCmTG04drsmPs9GYMv+LSYC4Er+nKev1GK8Xlffl+ntKDPjlkwWbhfntE7X5a3mRqME1tTD+V4hdxgWn6kcFZyUcj2kDE0WPJoIbeEv8/ILTFSMAoffPd9bgWoUzdrhvbIYl4xQjUG4iIztaFnBI+I6oTYgyLSavxZ6KHhDopqBjkvNsMF4TN7fffF4lBmfo7cBRlL+Qy4YWBp8AvQVMbQFM8W7z4K/q1LaFfQo1PWKh1yTeYrCXxS8A15XxJ4Qq/udx89I1ATmBPe+CSJwxgECgLhwpXpnA5QVNdf3cglenDCs/bn6Isk3LrSRNmR6wL5xtbShptSJo/0ovp6FhTvCkPyHJEBAtutK4kE/o/S5KwNojRdNedZ0YXaj8jU7+p8UOHe1pW9rS8ygm/h1lfE0dri1Jzam5Xf5K8TePe2LkcrTl3DQGWOcPON6zU81GTxwnFbOkoS+AO29wa3KunIODqPx55Y+oLSQLTjih7CrWvr0/jst0M1t4j8VrDmpmZQvHfwNgzUoK2vT+cj70CoIGei3Fm6VMCN4WpcP/sNgxtltqhe23dKDP2WbilwDQdKbugKMgebNh7uC/wU/CjvbH26NlWZgcGYMTV7WKNLAINRevyNYUxjECw+SndUp6wGSm5q41QVx2HFm0JT/jr4i/Im+abqAVXxzLzQXQX8T+kPi/o89/ZkigyNXRkazGxdRqsA24DxgfK8eoDiuLA5dfhr987pa13eG5rcgkVNILc60qrGt3DAe5jfztGrC15w/juFrxS5eLyzPBRiP/Dx3tTk3NQXDbjgRE3AWEbGYIrqfP6snHWOyZ/wFsWjmtnr6MtRTxpddEYgNNo4I8/bEz6RBI8BLPKAtj/3bkz0lsBnWP8R3/iz+zRcwq07W5bzvPWVu9HqfqOudFZeLdTpSlX5h9V4+m25UT+rglSDC8sCheE8fmQG+3K2zi9gMPo/2N+QJnsX6urOFldN/dqtGwrdpB4qOa1dK+WMjDFRdpo2ToHQBccUQW7EwkBGMO+0Pye4sSfT2ORsBMPSvt3ibwyhuPoo3ck7EJixyciRoQ1Ds/V+t+2nZj+w4hdYflNU90AcDWfBeA/FRxwtMNjaryZTbOVzW0SY8ggEY59EDxjqYLy3VBPBUI4Bd/1RmhiPQlBqnxJi1oO3cWrqimeKY8a4Jxx5oWeHrXRh0Q+llaaS1/luwzPfuwB7b1gahFtqkYKjF4I9ZiCiNY6QAHedwfp8D5H7IDD17zFQbEjChkaFm6w5zhBz397Up4yFuSLkc+xNw1CJTX1+ziN5AUXtKuPSIqmh83EVJKwjMi2Yj7j5Ii4dmEYAKwQsssXxxtAVjzpBzzorjYCZOHAFmhtC0f1ucnT428pi0eX0dBhkNTO0aWJq2LFWH7uEFwxUBk6w80dY0JU2pwZQ8jelcvJAMNzZImXziq3E0hEf7U1teLBEFBBHLR8xMEChyal5gy2EW1fb14zXJGkqLGFKDLS0jv9WV4DFPeo0rqMySVA3QP7sNmo9f+sXYvFRCi9yKv3mt/nRydhYVLXHzQrjQ8Djy3tTm2MnJoXio2eLqwwOeGTkwzXgcBCCMhwQaAfeLoMXqe6ETI740dvKjo79sVCuhdjMT4xzpZKwMqUq6aiUq2BSKp2n1NCVtXXXUhNP8+WBJCIGyg5unggYShU0URDQ+QQ7bSXPt4MaykOnMLw2WPkqHJ0jgv9D9
OVp9Y0ySydBV0WjplNwer/hPNbU3JdOA6dgWuycJYYRMQvi6I5jgHVPvjkDf3zGQSHOtGdpw7rRazkI/MDUWk5AIaNPmY/Cofva1lk1epDEHSzxOZl6ILBRN//xOxgqh9Mx8P9M05E+BvAtPUC+WZedBf5+6cjY4jg3OZVZiaPcFloOcUUbbEaUncGNkvI9bK5scbQG3L6VFDjiveX4VSary9mpwZqbehF46E3a1Pmkwzk8M37DDD/Ou4KizoCs4rfE0k0EAvUAWT4H3BR1wHzyTIO3X+bcVvADEXEm5s2BjM25by5P+Xt+L70G754pYh6QD9i9MoIOfiF2QInfai4d3+xw3O/yroD9aX2jZrZ/yq+mNuR+Bl78/pflifv2Gr5BIDRFYoNP+aW697OwKuF0B17MH81v2cEosTUW3WsjFoTK4av7NP886WrW/KQuHVTTqx6c8ImlyZ4toh3w2X19nFM9h20dgW9h6Ep0GoBV+G6Oixqub1Yf42JeC80dmKipuWJ3tjZkYN6lJ19Otzbe34rfnImSG6uGbbs0wD7ylu4xMBg37HUHnNfdibYmf8FnfYcLQj//Z3mK2MLA0mzZ0G9P7cul8chr+Eh/PV0qnL7Q5+kO58gKdpJuHSs4F6L/pnioHT9S/2nVsQhUD/Gb4/3VYsrSX5bIIvFoa+v8AnCPsXFMuCaAkz3wOXLtyZR9WVJlG2Wpc0lCJZzubF1Bbc3cxjgMRiyOsp7E+JiO9RfGNAOqSLqEWUYwNOMqnf0KWCH2io/LMx4MbGSPsVe+oONMlP2ZKCXHUkCWzfRpaOrKgi7XX79ASxR0C5WkS/vZ4hF3tqjmQENZkUR6KpKtj8mY+jS1tGGhLY/WzNJwZCqzMFDHpmgjtTH+sOLiF34nBqGMjIGamyt1Y3LqqvFdxO+wN+Esn0ppn+ITTOaZxanD2tLR1tQxTPu00uZkrJeO07Bo3GUp9+5xTVE92GKFt8+LlUw9kYB4T7UIgt+YusxUONnI/IIjli+g1mz1qnk9//23n7PBjT9Ti2sSS15fXjm5d9/MZJHvPs9Pa+O3zKOB//oSXOPbX33+0+y49Ec0eHcERPUrlttZ8Br4O73A0uM4s/yeONudD51nsrX1ZfOqFGPxF0ua17J29gTtO5YOj7gu56CS5Ytq2bfKtq2fh6e7XKTb/tSzORdmseh7HV6cXJVD/fOqv7NQ5pBqPFJPQcryojB1HtPP/rYsj3NCCyDH7t+k28zBP3flHeLnLYmfd2+5J7WHNwNSbYivJbEF8Y2qqq9/1c8SR6F1fPLxiQcLJc6Pq36UpXFGShs3fmj3b2iZ5fFbV/04S7ylA7dOSsH5gS8hVL901d86DxU4MNo67yhIWyeJsi3EU6fPJam1TbP42zZaOzDt3/StOMYgbv3u6sSytNDZOSiKne2HhPPf1T7jPPZ/XBVVrHgSterJb1v8QupTvFfIJRO/6gdRFqbfrNTr3/RryyLlotcHPPHaAP3/+Z/eccCeG/U8Z+vgb9fI5IS78TYKqp+P6dYSojC1/NDZVijwQz89vYr8nRM6SfJtGwEHA5zCeBnBjUNoE0fbtEAQqarEvxtVlOTOVfHcJ+YTQ8BPIxih/k3/XvjWv+qn1tZ10m/VI5gxt45l+43v4pHE4qsFeqqBjwBsHYLnpH/DdlCZ+LgNrFOWrkNQgpwiWqVqCRi3D5h4TjkOPL1kO0nqh4TAwm3HK60v+mHiwGzr3Nmuc+9sg+LVbxHyYd6/6SuO7W8xJ5JK22OhavUk9s5t1yGTLpTxfR5jlAsoS1JnK2HU7iKUBc4c81SFBHotqYTGRRGwUCm8X3fOduvbTnWbyPhRtAtAsLT3iSlIKQjQcwISMpSLRr5yMDgPAe3O/+rf+tZEYeDnaDfj4gPgrlPIyZGpsd4sGOVmPtrAYBzYArOX81H1XrWY01xn9LGpaATW5ULNOnKd/eniElHrS+mjJEx3RhAjY7DoXITCbo0xmMZwQtxdArdSVPzBnGcscVUGkKQyVZrYHlaptvjJLYLTXelaSP7+NFH+3Dy6FsROFt7OzG2c+HCg5KSq2N+7UjAk1br6cv/k+11zLioHd6Z/2ZxnPn9XVsNifIECZ7CsjikSKjNf6oZpwiTn8GHjoL6bvvWFR1JphJ+TQpmB2NTn0rkxC95REOTk3NJ5khwi7yLFMwnsY0aaoJ295AcGqY5WdmVF84wrTe3ZuZxcbyT12nMtEiC/godL90CaTBEQxwwO1TH9GhXaP8mvx8rKC3hWmPqAG2HeyDq//5xsFt+ccUoMORzGruic7kafVwIv0GqeV/CaPo1/H6+pyyWVtYmlK5KtSUX1vRzb4ih/gr/Owk8qAX8X/Bs7tmllot/9bseifMVfHVVNvw3vvGeLLpGD2WY4VgV+X8EgmjHmJTCQXDiZ7qxAXdsCv7H0KXXzw00mud0wPpzVt1OySGrqHqOIKL9knlo+PdiTaQwC6M+wbQjVBAhT+yxe6fs49F/DAO1pGobI2+cSBklUYhjQin9nyQ8sXYks7brsuJhY+oLyIdXjLORWrqHPqXxNpjswWLhmMMqbHRyX82pHheJTeqbgGxJ+ER0AuCml2Sv0x5N6cTLPrXJeoppPB3P3hcUsdRo0Fgqep/lFtv/nZOJCH6Br7s+Of57uddhZyKlV50yjCvZl+DrhCSMY7YCoesD/ORwoolqmMrH/Fz+BC9fS5y6WH2pTG1XAQzDAsjNFlM9UD3KYPliupGfp+/C0/1bot/r3/izfqKfzp37UsSjmUOiM6Wkl9gvwsYie4reL9U+5mHSR3QlGPrUJr7E78t7U5NgMEOapgSWqOcXR0c+2OZQAgffNJe1aKPUS4IbrUtfawrV7r43wuzEIzB0MWJriXUcuibUm84+zfLQBnHxoFf2tAcfsjFqB04wWfwVgME1nh0Um+yOq9ybzssMK25608PfT4wKcxwBt/0t0IO3++LO83JL/CAxgId9FvFl0SxrByoUT9WCJ6uYZO1SHF4FQTQv73iULiSRM7wAnb029uL+c2m+kc5vdLK/Sszyo41kSppmtPSYSYn4K5yuMR5JSpZ0A1BbuXVuXkSnwOxA8DukCMl38/XbPuFNG2RlcimCxEDz9A3qk0fkg/Jm4XW3wJunWdWFB45nPy2Awxfcz7LcBjRZDfPX5yJ4oe3iIdjNO2RmDeWupVt6B5ahW4CcVhbNY5zA7WbjmZlxNv4xHtFBXr+sO+6Hw8w6zgXrAc51pRUGyTuBMSzhhPoxskU1e4V8jEBoX+I48kIJxDoPx8GfxroRTZGoEH6RADMcPpva4eRnOT7taTXG0hvmIMXR5C/NRDGi8n5IlkyXbKkYZbUzNjEGwSvG3LX26AwGLQLhI7WCcWxqJi9OGvs9fFVMeix4vi12O+Yqfi11EGKiI4HHZKOJ0sS0F4iKT7tgdDLAdHRJbVi1bix5jT/jDV//TrqOLltpITt6BQEZwohR/m7GJ50vHqDqNZ3q9A4ZtFI0/gdcLc0FVbqlbj7w0/un0P8plLRyzYzlWNySvRX2yTWM3
AHEUkPyXLrX7STpzT40ek/wpHTjN8XcwDz/93Em+KAahgkxOzV8T792Haopljc6L35mkGLiEg8S55VLfK3IZdb64TP/XaPhmeqghr2+jj9bE/9R5pvg7sDSbhQEdY8axHgiwj8LWZbOhd2D+6TU5oqLMRl2VOuU35YCxPNCxCL9U5T409z2YIoPuZPFWOD2Y+pSzNKJXsH4agAFppAlng+rbO2nAs0bwGEPOIz55tSStz49/L1kCN3yNnhdHuT2ZX5SDMfRpbugb/yd1ejV/LJukMeHt+fZlOObozgjUT7mr4fU1dpOUZ/z+nBb2z3we+1TYT8jMnPeAz39bsLdlSd4vidsuWAc4E1tPd4B7RIZ2/Qx8z+WlFVSWr5kkN2NUuUublMbIERioiUnj7AJP6uac3/kiXdVcr6o1vvwFcjJEphYu4K5dKI42dD0THY744HcWLT5/u7wUKat6LR+8MMdyoTwc7RaRidX9eE5wYmjX7j0jL0p8/Np4aYjsib2DpPwV7qg84tioKGfUFml17UU5llet2Z2Rpaqc9/J330KWL8/JEhpCztvZZKcpTLPjHC6B/0U83ZDxRnnsz+Prcnl/zr5OQaDQkshq7UZBMGBjUPhtRSf+L8XTC8t+/3fgT8S2BmEZdY1AjQzdjMFAomthIsogp65tHfMgjiGHyBiQOR6etyuX5vDqMpsi56K8/y/J5z2Hy5mpb3CsnmB7S9Yhib2RLoD5Rba3seZ6UtJ7Sa7z7fJ1F/gtLXjJTnfeDgaIcZb8ulUC/dvw2PzuX57XPv8hPD1Xbv/TOu51tQAvt6+08V3N7An2iwqf+U7m2+XpNCYmMXKA5wvarQb+ZXh+GU5e8Ny53P2T7z9B5+AxtgO13mB8YY756Me+mM879YKZm5pK8pqSeALTRTlQGhc3d7kzuFFK8p5lc6fP7klNBInNPzGztbFrNegeZlyzsXKmecjQHhlLKHL4Z++znoH92/Go3VgZmdo4sUX3Vfm3dmP5XyJPrY03VqPZHssLjjt/6fptE6/hvEW769QSVQ9MlKiIL9Zn72tjskaun6Xl5TkRskZyGW08GE49Z/mWMkQa/C+hxdbWpggGQ0TjMlTGfztbHB+swRzHzyRvM6Mbp6QtnO7K3UNnmprZAcoBNyR59pluejBAtJZRq8vgK/KlgZrZGzMHHPOXWAujGwi8LZ4x75sBCrGfPdP5nSnU12Gkc/dx3J8a+u1Oqq3vmEvWA+I+tTh1iOkAJip+x7P06WGmoWym3aXleEAb72fa+NCkEYtAaMYwGGUA01VgOUOfxiT+1OevoN9TzeoXrZU8VSf8nA4rW+WKuEa9JrFuME4sPca+B8mXPLnHWI5p/oiey2HCy+b3MzmK19QdP4enO7KOSOJp9UDqEqhP1LJrw2YutagvbeUs6jhGpsDHwOe9xn5zwlvVi9GNJqpNsf54vdjjruR3WmNP4Stw2Yk7W/z0VvWHr8lz/5ZamBfJV7FzKwiVw0ty10/Bc64m5omaCw5wjyzQVBkMSMzwx21Oe/OTX7GGZIujnNQCDKoNN3avqFcpcXf4477Q5LhxyJvXUgxIq6tncqvdcXOR1/g1HRvA/XEfsgOmt+c3skYJxJFnTuY7unYzIpvUNfyQ8FU82LWpzx+R4YWmbGDOU3iWjQ3lDrQ+hdah4PnPaK8GZ2kKtpMezHkPx8RwsCA5C/xMfVNDmP+c34n9P4NT9/ZkvrO5UW5xjztDw7zN7zBNpAHxMRlTK2LmJ/y+0w3nGhv1tPh/Wp4acYYmL10ve9rHs0XPswX+YIkjFogLUs9q4jkG40QSx2sYjA4w5/eGPiUt5IY23JA8uKgWrfS8ZGqPqSTSEydgp09wgZ2l9ezf7EDNYYA2f6wPcmAiGE5jfK/wRfKyT6HY7iWT7lAKxfHansybtcH0vvuF1Ko2T0Gw9LkLRDWwBd4H4jiz8k2B0/ZY9HnM02W9iSmUa/i1+p5zcn+6iViD/8r7+F+yJcBV/8FHzWNwlLvbz/O794F93OVBym+z+426ku48BETGLQ70+LJYNtmgWHItWmiKEoRg4ZbtooRE4p2r5cOviqrYX5opKmaGRVH4FEtiMySSxFEm3ZGUyD1NiVx/Efz5WhrzualPEdTVGHIIi+GXRaAmYACL9ouT+yXrp4a+oeHMZEPKe4o2KNr2xI0P5rL4Rs6TA2+kstT3NjbBoCrZz0ytKKcT5dzUSOrZa5fi0jCBtJXuyVjiOJPG83xZazeSRBLCkzDL1D0Gm1lDIwfzhJYuM6QFbaweytaB8pAfilN5BzSW4NrQ565B2NZEEIdNrfKlSq1OVAx/WXb9UQh41xCnHuTcL4Cb154vS5z5DS1Nl9IaTC7QNykdi61KuGdL9vgsKaVR829L5V5Rp5qiTh9UdTonBxUVBz3MdPVA58uib0tFXTHq8n4zlpXlbTrTilJvn90bunzA6tj8zGxWd+P7FWt/W22a3zM11rO0/Wh6p8qLFZqTed1GXzRtxEqi7IGCxrboxTA/lp0b4caF4vhQtOrgMQ/0erGhemOjezZ1liyh5Uwvrom3FM8ilRGrPBhKvE0kcZQX9F+TMLqWrsBqpmwlMpY8dwzDUWLeR18MkXcdbZyC25jyCtk9ivJRdbiFz5/Ccxu7hnj7haoZIk9DSZwOpQnvwYFMQ3qBSe2yXEJQtMXqUVVU+UHZoLmi3WaCG62lfO4WJfrlZv0Ul0UrBAhUrKIYkJclNSs8D9JuUOOd2PRpW5TNjbH5wM8Xz1B+rFphMA5/nwzUWw9+VgYO1n+4DFRtQCF+3yv5YXQ6f8Wl+/09MQbdCW7UPOxFdaku5SNqV1AGB4oHxEd3JvDYttDWOm6c0eX2MqWH+YHwBU2hTKqSZxIyf3WxLPBDQ2MToWi77zwwpzpwpNle0nmgTClzuhy1ZY7apgKGfSyY2uPOzslhZ540UUjalyyniapncK5LZXBVtrwQWTqWLVV0ood9YduHiC5JbV1miO3VhmEpE4ZYO0xGw7qnOuQIy0NFl0qvDRQPhgoLx3wOuBgZA4XSYDJFZqDmdVccDOysbZtnGLfPHT6i3bpAVw82Lb0j/EDa27QxQ+aPXZ/i0Dz7eEAYCzl1Q3VWZ0k61k9fHgSeurVL/pN01wrfJntXEdXA0NXEFvD98cYUEbmH9cqyGeq6D7fxJyyTsyWT4nfxv7X/3Qfhtvrdcb/1P/9JCPeu0XAFMX/v3ddut9GYD3EZi9blwi873XaD/ySNiRtb8E55oF7pq4waOhLrXY1p7oWpL71f1i5UzDuFxbdgzqaGNiT7dVoTJQXCxjWC8cHUaBljIQfURi4Leol87UAxm+CX7AdaO/TwOZwaBF71gG34g8CviBxRH7f080rbtoOBGmCfkhxmSPy49LSVj2WiN1jSqy3XKsd2Kq5soyq+p8/TqlWlWNqrntWbrVclnme64pkD+eE8/rH+P8F9FZ6QvdpxmKYtShocD9Ipdpc0BCqP7ZLcsqUU82oR3jRae5op8oJPtQWRZWOi5HYhQytOZUrfifpdWN/KyOj2Nco
2D+KbwEDlLOzvc8QnwvqSqUI2n5+b2pij+hQdsO0vxqB7q2q3ZB5n+KoWKj/uHBwLiEOEdT/l0es2vlgYIDIfhUOMNCZbgxTb0FD9a2pDgpciPglr2zR8aizxc4ixJqpfyGNuaoT/U1NUOWx/oKjms8E8t7CfQHamU9ZgMEUzTb2mMlOXreMhl0WbjCdNZI8c5nmy1HHtLoKVCws7DHJ+dwzjCxwe45+TbXUKe1TRp5jnzuZOl5ukyR7re7d+ABrls2pLEvcoU7Tlnshyc5yztHvA/l0dN/fzbL6kOpTwWWGnzS4dfBY/vA+DcWZwbibdYX9fLmz6JxdU6YbKxnfxbnlYaiOWftIm+6XtOX+YFMHb+LglCp1D6QNQvMBQxWOtKA2J30jthV/yxPF5WuZ3G5nLYTjz+TUc4HtRXPhtXmv7kcpelDkB4nNyj3UYqMyTuZV6Gcsj9auOB1Qqh1o728EgKblTmh+3dflz/No4TC842ho4UP2Z1tqepbZNwMlcirYz7ANj3jF1L4YD5UB9Eh77MG09c/YQPMoH57ZnoH5ErfRrY+oyhnNT4sOsxW54ruVZMF/X82yuvZRuRX5pUh3BdSxFe7neKWMaHFMft/MSWc8W5ehYDt38RrEV0JAc3NzOO5U7NE/mJ7S86GBAiodzS8fHHFDHQYcNO6DRGK2S6+D0+crPpHyc1eTi/DYlxziL5AUKXdTM7RFa7et29iI+owcXqoV/drJlSFMfvAgvGJaaT0V0xG3aplMz3lTcwofu3Oblue04sH/ctRUMGEjYr2ht3XAbnclhxBiGrnEehLNLSL+wNYXyEsxHNd+PLejJHg9nJf54y66cOSy1sC0s5NxaDrQ4T4jSyFVPtnc4cxDtZ2YkCXYLt7DNd418GPUJurdFqHTkLyybrNNAL/z1Bn7ZYwxk6yQXl1a+eCOXqpTtQKd+dmhSXq4thRuimtkBOZDVA2HsXTLXs0u7A8WDXIrjjozaT5bs1j/TWj4HVx4u/UR51LJZttXhz1VLHJSG9aXXepwgu9KEHIT90SjtnfDHW51LGtXPKirstkzj/pe3oxNbSu3nUac1zpErW885YlvqJahuC48HWxxnTnBXyd5bLY3W8tjl1hRl/J7WYThZtj6ZT7Plu/BJyI7bZHn3eKh7I8+Mx1hwjywcKAhu0ILyzJuVMZQwHWZaq32vgufTGk5UH4gI2yCsN+vtxtXfMx3PhZwd1po3bXs+4fVa67kxqZ4p6ZgV8S+Sxmm5XpCRNYIxE838UfOasHlL2fjJdurSfpzslP983mDJL+mzzWXr2jIplj0Ghirln7sxWRuAOb8GIqLlUiTfK5N4C/PVPdbTiOpp6uOP15jeMD/nD57IYXmaQO0kD+Xo4xQtuMWaI/ExVrQV+5jnKH1MgWdgMA7MAK2PvhbhWbfU42dwR09oKNuwxSd9zU4cqneP9zNNzYzBdAgnXbj8hbYyUBlaYoHlZ0TnpM93ZVudKbDltZYP0o2HZovmsMvfI3Sg5QMkB0vaSVutnTQm1mVE6UBOG6E6fMCX56xW12cC0bkIBHLu6DxjCHsstx4cyIw1me5szY6MZ+B+WcskOue/Hk9GWdb5qNHynptVjKluyLzGU2SKKK/yAXejfdE+8FESpjwQH3d2sS2Zpct03cLfuxI6vif5z8yLU4eFP5iDgUlLK8QUOcuOcsbSXynyLRZdxyvXpzLpDsvocGcL/NoS1bVVtbSXsSXKMN1MDjGzZ+A6OUGExmiLssyo5BNyeofAe7auRGAwje3Jxm2f6GHUWqhKG/ucvDbHKHKaZV4Vz/sNvvHcqS0ndDnG+LW2Lv7zYsm32+sCslVujefLNnpySLjAdPstJ61VaocP2/VM6euRUzPuTW3laoPqtA6yrqcv3WzWuvblF/v5v7E9qsu3bJ2D2YHLjrM3f70f8do4tIkbmJ/FRWEH5HP3U0MjvPhwrsXplC9p3NT2o01tyL21//wLW4dOeKRqx9G651be+zXxYqPlZg+4RXNLujP3q7abXK6vgaeG9pjMNGw32GKNgt0bGsphzsbAZz2CCzwex3pQYL3mVm3SE1vldfjlLzw/8qyv2pF7rpXxN8suqxzRY+yQvCrKJPHOXRS4uNfG+yfybi88S7XW/kH9wzngqvUJHFOlpJxf6Mw3d7chYJiEUWP7uGPsDHevgrOxlYlcyxW29GZzy5OOLQaesEHts1dP4+eiTH9VnhN7eBO5eL5E/gRX1b2zMKp/DDZLG8b2XT2ul0/sD3lGaJd2vzW8JC5PADfewHy0B5xCc1X64tn8TNeZtKd5p9b5t2+cQ3lheXkJe1keTutqViPO1Kc59lcL/736Dc8cxXL0sevbF/IepgWt1Vm4hmYypi6V22cVW5gX95fXzbX3Yk20a028tb7ZOdbDbfT3/o9//rjq0/OuGkeTtc9Q+nOnj/04e7LYf95BYv+rTwC7/MCv1x/UdeHpXPWztU7O0wqs0H/AP2767969+x7+T29JjjK76VHe+NB9FOP30Ip91dkmfhTe9Hbs9xDz701vSZ/5HgZOatlWat18D3s9TKFyQPwbWcBBCbnV63kOCt4n3gfoWdu0/lSvZ8Xx+00GnG3opE7y3o8+tEfqesYPk9QK4bPPBVZouY79DuQ3vYmDguNzlfjix7ZZmPr1ryaxAwnsaR47N70K0fhS4iAHptH25q0nQNi9GPVdgVDMT/QKvX/TwzxdXSmY/6Z3L3wrLx75sXz4OaofJbqD8FYcJx+O1P9cPfsfzQC9nhWGUUoEsJwEkbDme+nWd11nm9z0/u+7Ev//KP/o9f59/LPX+95/2EbB9/5N4yq+jjH7vX/zvWUZvvev2k9i1JAnC7NEpfx7v/7cj6vWV30H2Vh5kxcxZ78vSf+e/ILVQY3/YP75nsyyPuKP8s9/1uSi1Iw3PbZDJgIrhd6szgAvo/NLKV3CX36uzof4P9T89JP893LYXs6Hl/DiS/mx16uQj/87Eq02zVJr7B1Q5wFC0nKskmdq9+uKpz1UXQGV/1Xf/naikaohuzQTIU3dA2h9s3IGbk6GIx9qw9I0662XCgt/OpSVeje9Dy/7Qv3My2qk4tDLm+cVK/E2qY/UoVnJ3Sbdj3qWxzcF8up/h6WlbnIxjTSqsE3R0dKMxb36BGp8TU9qLYciFlz0C8hd/8gS2danJL/Yjy5H2DoP5fdLv50AkG6t1HHzBgiUSwpRJn8vm4/1ethA1Bj2qam3Jl98+HiHBCE3vQr55V0n3HUojO/9z1/v5bv7fy3vb5X71bd/fVO+Tu+E+6ZlISc844etOKZ3KvtXei2Fv0T4VvCs0HWelxKinhIywQ4p6bC6Rymp4ea/Q0pQFG2ymEYMxWRQBC1008MhxvO4JkFMB5bp9TNYVvDN/xJ/v1Q8rVinLXClv15KeM3nLm1IWqGjFszd9HAsV/iTT4WDN70yGvwe9q/6O0ooEojWsxDQ2/pJGsVe/8f/CwAA//+hqYUMpacAAA== type: helm.sh/release.v1
Decoded json:
{ "name": "dotnet", "info": { "first_deployed": "2023-02-14T23:49:12.655951052+01:00", "last_deployed": "2023-02-14T23:49:12.655951052+01:00", "deleted": "", "description": "Install complete", "status": "deployed", "notes": "\nYour .NET app is building! To view the build logs, run:\n\noc logs bc/dotnet --follow\n\nNote that your Deployment will report \"ErrImagePull\" and \"ImagePullBackOff\" until the build is complete. Once the build is complete, your image will be automatically rolled out." }, "chart": { "metadata": { "name": "dotnet", "version": "0.0.1", "description": "A Helm chart to build and deploy .NET applications", "keywords": [ "runtimes", "dotnet" ], "apiVersion": "v2", "annotations": { "chart_url": "https://github.com/openshift-helm-charts/charts/releases/download/redhat-dotnet-0.0.1/redhat-dotnet-0.0.1.tgz" } }, "lock": null, "templates": [ /* removed */ ], "values": { "build": { "contextDir": null, "enabled": true, "env": null, "imageStreamTag": { "name": "dotnet:3.1", "namespace": "openshift", "useReleaseNamespace": false }, "output": { "kind": "ImageStreamTag", "pushSecret": null }, "pullSecret": null, "ref": "dotnetcore-3.1", "resources": null, "startupProject": "app", "uri": "https://github.com/redhat-developer/s2i-dotnetcore-ex" }, "deploy": { "applicationProperties": { "enabled": false, "mountPath": "/deployments/config/", "properties": "## Properties go here" }, "env": null, "envFrom": null, "extraContainers": null, "initContainers": null, "livenessProbe": { "tcpSocket": { "port": "http" } }, "ports": [ { "name": "http", "port": 8080, "protocol": "TCP", "targetPort": 8080 } ], "readinessProbe": { "httpGet": { "path": "/", "port": "http" } }, "replicas": 1, "resources": null, "route": { "enabled": true, "targetPort": "http", "tls": { "caCertificate": null, "certificate": null, "destinationCACertificate": null, "enabled": true, "insecureEdgeTerminationPolicy": "Redirect", "key": null, "termination": "edge" } }, "serviceType": "ClusterIP", "volumeMounts": null, "volumes": null }, "global": { "nameOverride": null }, "image": { "name": null, "tag": "latest" } }, "schema": "removed", "files": [ { "name": "README.md", "data": "removed" } ] }, "config": { "build": { "enabled": true, "imageStreamTag": { "name": "dotnet:3.1", "namespace": "openshift", "useReleaseNamespace": false }, "output": { "kind": "ImageStreamTag" }, "ref": "dotnetcore-3.1", "startupProject": "app", "uri": "https://github.com/redhat-developer/s2i-dotnetcore-ex" }, "deploy": { "applicationProperties": { "enabled": false, "mountPath": "/deployments/config/", "properties": "## Properties go here" }, "livenessProbe": { "tcpSocket": { "port": "http" } }, "ports": [ { "name": "http", "port": 8080, "protocol": "TCP", "targetPort": 8080 } ], "readinessProbe": { "httpGet": { "path": "/", "port": "http" } }, "replicas": 1, "route": { "enabled": true, "targetPort": "http", "tls": { "enabled": true, "insecureEdgeTerminationPolicy": "Redirect", "termination": "edge" } }, "serviceType": "ClusterIP" }, "image": { "tag": "latest" } }, "manifest": "---\n# Source: dotnet/templates/service.yaml\napiVersion: v1\nkind: Service\nmetadata:\n name: dotnet\n labels:\n helm.sh/chart: dotnet\n app.kubernetes.io/name: dotnet\n app.kubernetes.io/instance: dotnet\n app.kubernetes.io/managed-by: Helm\n app.openshift.io/runtime: dotnet\nspec:\n type: ClusterIP\n selector:\n app.kubernetes.io/name: dotnet\n app.kubernetes.io/instance: dotnet\n ports:\n - name: http\n port: 8080\n protocol: TCP\n targetPort: 8080\n---\n# Source: 
dotnet/templates/deployment.yaml\napiVersion: apps/v1\nkind: Deployment\nmetadata:\n name: dotnet\n labels:\n helm.sh/chart: dotnet\n app.kubernetes.io/name: dotnet\n app.kubernetes.io/instance: dotnet\n app.kubernetes.io/managed-by: Helm\n app.openshift.io/runtime: dotnet\n annotations:\n image.openshift.io/triggers: |-\n [\n {\n \"from\":{\n \"kind\":\"ImageStreamTag\",\n \"name\":\"dotnet:latest\"\n },\n \"fieldPath\":\"spec.template.spec.containers[0].image\"\n }\n ]\nspec:\n replicas: 1\n selector:\n matchLabels:\n app.kubernetes.io/name: dotnet\n app.kubernetes.io/instance: dotnet\n template:\n metadata:\n labels:\n helm.sh/chart: dotnet\n app.kubernetes.io/name: dotnet\n app.kubernetes.io/instance: dotnet\n app.kubernetes.io/managed-by: Helm\n app.openshift.io/runtime: dotnet\n spec:\n containers:\n - name: web\n image: dotnet:latest\n ports:\n - name: http\n containerPort: 8080\n protocol: TCP\n livenessProbe:\n tcpSocket:\n port: http\n readinessProbe:\n httpGet:\n path: /\n port: http\n volumeMounts:\n volumes:\n---\n# Source: dotnet/templates/buildconfig.yaml\napiVersion: build.openshift.io/v1\nkind: BuildConfig\nmetadata:\n name: dotnet\n labels:\n helm.sh/chart: dotnet\n app.kubernetes.io/name: dotnet\n app.kubernetes.io/instance: dotnet\n app.kubernetes.io/managed-by: Helm\n app.openshift.io/runtime: dotnet\nspec:\n output:\n to:\n kind: ImageStreamTag\n name: dotnet:latest\n source:\n type: Git\n git:\n uri: https://github.com/redhat-developer/s2i-dotnetcore-ex\n ref: dotnetcore-3.1\n strategy:\n type: Source\n sourceStrategy:\n from:\n kind: ImageStreamTag\n name: dotnet:3.1\n namespace: openshift\n env:\n - name: \"DOTNET_STARTUP_PROJECT\"\n value: \"app\"\n triggers:\n - type: ConfigChange\n---\n# Source: dotnet/templates/imagestream.yaml\napiVersion: image.openshift.io/v1\nkind: ImageStream\nmetadata:\n name: dotnet\n labels:\n helm.sh/chart: dotnet\n app.kubernetes.io/name: dotnet\n app.kubernetes.io/instance: dotnet\n app.kubernetes.io/managed-by: Helm\n app.openshift.io/runtime: dotnet\nspec:\n lookupPolicy:\n local: true\n---\n# Source: dotnet/templates/route.yaml\napiVersion: route.openshift.io/v1\nkind: Route\nmetadata:\n name: dotnet\n labels:\n helm.sh/chart: dotnet\n app.kubernetes.io/name: dotnet\n app.kubernetes.io/instance: dotnet\n app.kubernetes.io/managed-by: Helm\n app.openshift.io/runtime: dotnet\nspec:\n to:\n kind: Service\n name: dotnet\n port:\n targetPort: http\n tls:\n termination: edge\n insecureEdgeTerminationPolicy: Redirect\n", "version": 1 }
Clone of OCPBUGS-7906, but for all the other CSI drivers and operators besides shared resource. All Pods / containers that are part of the OCP platform should run on dedicated "management" CPUs (if configured), i.e. they should have the annotation 'target.workload.openshift.io/management: {"effect": "PreferredDuringScheduling"}'.
So far nobody has run our cloud CSI drivers with CPU pinning enabled, so this bug is low priority. I checked LSO; it already has correct CPU pinning in all Pods, e.g. here.
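A quick way to see which CSI driver pods already carry the management-workload annotation (a sketch; assumes jq is available and that the cloud CSI drivers live in the usual openshift-cluster-csi-drivers namespace):

# List each pod with its management-workload annotation, printing "missing" where it is absent
oc -n openshift-cluster-csi-drivers get pods -o json \
  | jq -r '.items[] | [.metadata.name, (.metadata.annotations["target.workload.openshift.io/management"] // "missing")] | @tsv'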
Description of problem:
The UI should add an alert about the deprecation of DeploymentConfig in 4.14
Version-Release number of selected component (if applicable):
pre-merge
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
The alert is missing
Expected results:
The alert should exist
Additional info:
Description of problem:
This is to track the SDN-specific issue in https://issues.redhat.com/browse/OCPBUGS-18389: the 4.14 nightly has a higher pod-ready latency compared to 4.14 ec4 and 4.13.z in the node-density (lite) test.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-09-11-201102
How reproducible:
Everytime
Steps to Reproduce:
1. Install an SDN cluster and scale up to 24 worker nodes; install 3 infra nodes and move the monitoring, ingress, and registry components to the infra nodes.
2. Run the node-density (lite) test with 245 pods per node.
3. Compare the pod-ready latency to 4.13.z and 4.14 ec4.
Actual results:
4.14 nightly has a higher pod ready latency compared to 4.14 ec4 and 4.13.10
Expected results:
4.14 should have similar pod ready latency compared to previous release
Additional info:
OCP Version | Flexy Id | Scale Ci Job | Grafana URL | Cloud | Arch Type | Network Type | Worker Count | PODS_PER_NODE | Avg Pod Ready (ms) | P99 Pod Ready (ms) | Must-gather |
4.14.0-ec.4 | 231559 | 292 | 087eb40c-6600-4db3-a9fd-3b959f4a434a | aws | amd64 | SDN | 24 | 245 | 2186 | 3256 | https://drive.google.com/file/d/1NInCiai7WWIIVT8uL-5KKeQl9CtQN_Ck/view?usp=drive_link |
4.14.0-0.nightly-2023-09-02-132842 | 231558 | 291 | 62404e34-672e-4168-b4cc-0bd575768aad | aws | amd64 | SDN | 24 | 245 | 58725 | 294279 | https://drive.google.com/file/d/1BbVeNrWzVdogFhYihNfv-99_q8oj6eCN/view?usp=drive_link |
With the new multus image provided by Dan Williams in https://issues.redhat.com/browse/OCPBUGS-18389, the 24-node SDN latency is similar to that without the fix.
% oc -n openshift-network-operator get deployment.apps/network-operator -o yaml | grep MULTUS_IMAGE -A 1
        - name: MULTUS_IMAGE
          value: quay.io/dcbw/multus-cni:informer
% oc get pod -n openshift-multus -o yaml | grep image: | grep multus
      image: quay.io/dcbw/multus-cni:informer
....
OCP Version | Flexy Id | Scale Ci Job | Grafana URL | Cloud | Arch Type | Network Type | Worker Count | PODS_PER_NODE | Avg Pod Ready (ms) | P99 Pod Ready (ms) | Must-gather |
4.14.0-0.nightly-2023-09-11-201102 quay.io/dcbw/multus-cni:informer | 232389 | 314 | f2c290c1-73ea-4f10-a797-3ab9d45e94b3 | aws | amd64 | SDN | 24 | 245 | 61234 | 311776 | https://drive.google.com/file/d/1o7JXJAd_V3Fzw81pTaLXQn1ms44lX6v5/view?usp=drive_link |
4.14.0-ec.4 | 231559 | 292 | 087eb40c-6600-4db3-a9fd-3b959f4a434a | aws | amd64 | SDN | 24 | 245 | 2186 | 3256 | https://drive.google.com/file/d/1NInCiai7WWIIVT8uL-5KKeQl9CtQN_Ck/view?usp=drive_link |
4.14.0-0.nightly-2023-09-02-132842 | 231558 | 291 | 62404e34-672e-4168-b4cc-0bd575768aad | aws | amd64 | SDN | 24 | 245 | 58725 | 294279 | https://drive.google.com/file/d/1BbVeNrWzVdogFhYihNfv-99_q8oj6eCN/view?usp=drive_link |
Zenghui Shi and Peng Liu requested modifying the multus-daemon-config ConfigMap by removing the readinessindicatorfile flag.
Steps:
Now the readinessindicatorfile flag is removed and all multus pods are restarted.
% oc get cm multus-daemon-config -n openshift-multus -o yaml | grep readinessindicatorfile -c
0
Test result: p99 is better than without the fix (removing readinessindicatorfile) but is still worse than ec4; avg is still bad.
OCP Version | Flexy Id | Scale Ci Job | Grafana URL | Cloud | Arch Type | Network Type | Worker Count | PODS_PER_NODE | Avg Pod Ready (ms) | P99 Pod Ready (ms) | Must-gather |
4.14.0-0.nightly-2023-09-11-201102 quay.io/dcbw/multus-cni:informer and remove readinessindicatorfile flag | 232389 | 316 | d7a754aa-4f52-49eb-80cf-907bee38a81b | aws | amd64 | SDN | 24 | 245 | 51775 | 105296 | https://drive.google.com/file/d/1h-3JeZXQRO-zsgWzen6aNDQfSDqoKAs2/view?usp=drive_link |
Zenghui Shi and Peng Liu requested setting logLevel to debug in addition to removing the readinessindicatorfile flag.
Edit the ConfigMap to change "logLevel" from "verbose" to "debug" and restart all multus pods.
Now the logLevel is debug and all multus pods are restarted.
% oc get cm multus-daemon-config -n openshift-multus -o yaml | grep logLevel
  "logLevel": "debug",
% oc get cm multus-daemon-config -n openshift-multus -o yaml | grep readinessindicatorfile -c
0
OCP Version | Flexy Id | Scale Ci Job | Grafana URL | Cloud | Arch Type | Network Type | Worker Count | PODS_PER_NODE | Avg Pod Ready (ms) | P99 Pod Ready (ms) | Must-gather |
4.14.0-0.nightly-2023-09-11-201102 quay.io/dcbw/multus-cni:informer and remove readinessindicatorfile flag and logLevel=debug | 232389 | 320 | 5d1d3e6a-bfa1-4a4b-bbfc-daedc5605f7d | aws | amd64 | SDN | 24 | 245 | 49586 | 105314 | https://drive.google.com/file/d/1p1PDbnqm0NlWND-komc9jbQ1PyQMeWcV/view?usp=drive_link |
Description of problem:
The bootstrapExternalStaticGateway IP is used as the DNS server for the bootstrap node
Version-Release number of selected component (if applicable):
4.11
How reproducible:
100%
Steps to Reproduce:
1. Deploy baremetal IPI using a static bootstrap IP.
2. It consumes bootstrapExternalStaticGateway as the DNS for the bootstrap node.
Actual results:
Sometimes bootstrapExternalStaticGateway cannot act as DNS
Expected results:
DNS resolution should work on bootstrap if it uses static IP
Additional info:
Description of problem: While running scale tests of OpenShift on OpenStack, we're seeing it perform significantly worse than on the AWS platform for the same number of nodes. More specifically, we're seeing high traffic to the API server and high load on the haproxy pod.
Version-Release number of selected component (if applicable):
All supported versions
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Slack thread at https://coreos.slack.com/archives/CBZHF4DHC/p1669910986729359 provides more info.
Description of the problem:
When starting an installation where the nodes have multiple disks on 4.13, after reboot the installation might get stuck on "pending user action" with the following error:
Expected the host to boot from disk, but it booted the installation image - please reboot and fix the boot order to boot from disk QEMU_HARDDISK 05abcd32e95a61a3 (sda, /dev/disk/by-id/wwn-0x05abcd32e95a61a3).
When running the live ISO with RHEL, /dev/sda might actually be vdb.
Since the boot order configuration is usually HD first, the machine usually tries vda before it moves on to other boot options (that are not HD).
When installing on /dev/sda (vdb), the machine might not try to boot from the installation disk.
Solution suggestion:
A better way to find vda is by the HCTL address (0:0:0:0 should be /dev/vda).
Action item: in the case of libvirt (why not all platforms?), we should update the way we choose the default installation disk and choose the disk with HCTL 0:0:0:0 (when it's available...).
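For illustration, the HCTL-based selection can be checked from a shell on the live host with lsblk (a sketch, not the assisted-installer implementation):

# Show each disk with its HCTL address and pick the one at 0:0:0:0
lsblk -d -o NAME,HCTL,SIZE,TYPE
lsblk -d -n -o NAME,HCTL | awk '$2 == "0:0:0:0" {print "/dev/" $1}'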
How reproducible:
Create nodes with 2 disks and start installation.
Steps to reproduce:
1. Register new cluster
2. Add 6 nodes (3 master + 3 workers) with multiple disks each - might be even reproducible with only 3 masters
3. Start the installation
Note that it might take a few attempts to reproduce this issue
Actual results:
Pending for input
Expected results:
Installation success
Slack thread https://redhat-internal.slack.com/archives/CUPJTHQ5P/p1684317064257809
Description of problem:
If secure boot is currently disabled and the user attempts to enable it via ZTP, the install will not begin the first time ZTP is triggered.
When secure boot is enabled via ZTP, the boot options are configured before the virtual CD is attached, so the first boot goes into the existing HD with secure boot on. The install then gets stuck because boot from CD was never triggered.
Version-Release number of selected component (if applicable):
4.10
How reproducible:
Always
Steps to Reproduce:
1. Secure boot is currently disabled in bios
2. Attempt to deploy a cluster with secure boot enabled via ZTP
3.
Actual results:
Expected results:
Additional info:
Secure boot config used in ZTP siteconfig:
http://registry.kni-qe-0.lab.eng.rdu2.redhat.com:3000/kni-qe/ztp-site-configs/src/ff814164cdcd355ed980f1edf269dbc2afbe09aa/siteconfig/master-2.yaml#L40
Description of problem:
The option to Enable/Disable a console plugin on the Operator details page is not shown any more; it looks like a regression (the option is shown in 4.13)
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-04-19-125337
How reproducible:
Always
Steps to Reproduce:
1. Subscribe to the 'OpenShift Data Foundation' Operator from OperatorHub.
2. On the Operator installation page, choose 'Disable' for the plugin.
3. Once the operator is successfully installed, go to the Installed Operators list page /k8s/all-namespaces/operators.coreos.com~v1alpha1~ClusterServiceVersion.
4. The console will show a 'Plugin available' button for the 'OpenShift Data Foundation' Operator; click the button and hit 'View operator details', and the user will be taken to the Operator details page.
Actual results:
4. In OCP <= 4.13, we show a 'Console plugin' item where the user can Enable/Disable the console plugin the operator has brought in; however, this option is not shown in 4.14.
Expected results:
4. The Enable/Disable console plugin option should be shown on the Operator details page.
Additional info:
screen recording https://drive.google.com/drive/folders/1fNlodAg6yUeUqf07BG9scvwHlzAwS-Ao?usp=share_link
Description of the problem:
https://redhat-internal.slack.com/archives/C01QX5JEDP0/p1682946068422739?thread_ts=1682945335.566899&cid=C01QX5JEDP0
Post-installation, the downloaded collected logs and agent logs are empty.
Attaching logs.
Description of problem:
Due to a CI configuration issue (lack of nmstatectl in the image), the current CI unit-test job silently skips those unit tests requiring nmstatectl.
Version-Release number of selected component (if applicable):
How reproducible:
hack/go-test.sh
Steps to Reproduce:
1. 2. 3.
Actual results:
Unit tests are failing
Expected results:
No failure
Additional info:
The following install-config fields are new in 4.13:
These fields are ignored by the agent-based installation method. Until such time as they are implemented, we should print a warning if they are set to non-default values, as we do for other fields that are ignored.
Description of problem:
After a component is ready, if we edit the component YAML from the console, it shows a stream of errors. The YAML does get updated, but the error goes away only after multiple reloads.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Deploy a pod/deployment.
2. After it is ready, update the YAML from the console.
3. An error is seen.
Actual results:
Expected results:
No error
Additional info:
Please review the following PR: https://github.com/openshift/machine-api-operator/pull/1127
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-machine-approver/pull/180
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Extracting the CLI for Darwin from a multi-arch payload leads to "filtered all images from manifest list"
Version-Release number of selected component (if applicable):
Tested with oc4.11
How reproducible:
Always on Darwin machines
Steps to Reproduce:
1. oc adm release extract --command=oc quay.io/openshift-release-dev/ocp-release:4.11.4-multi -v5
Actual results:
I0909 18:37:28.591323 37669 config.go:127] looking for config.json at /Users/lwan/.docker/config.json
I0909 18:37:28.591601 37669 config.go:135] found valid config.json at /Users/lwan/.docker/config.json
Warning: the default reading order of registry auth file will be changed from "${HOME}/.docker/config.json" to podman registry config locations in the future version of oc. "${HOME}/.docker/config.json" is deprecated, but can still be used for storing credentials as a fallback. See https://github.com/containers/image/blob/main/docs/containers-auth.json.5.md for the order of podman registry config locations.
I0909 18:37:30.391895 37669 client_mirrored.go:174] Attempting to connect to quay.io/openshift-release-dev/ocp-release
I0909 18:37:30.696483 37669 client_mirrored.go:412] get manifest for sha256:53679d92dc0aea8ff6ea4b6f0351fa09ecc14ee9eda1b560deeb0923ca2290a1 served from registryclient.retryManifest{ManifestService:registryclient.manifestServiceVerifier{ManifestService:(*client.manifests)(0x14000a36330)}, repo:(*registryclient.retryRepository)(0x14000f46e80)}: <nil>
I0909 18:37:30.696738 37669 manifest.go:405] Skipping image sha256:fcf4d95df9a189527453d8961a22a3906514f5ecbb05afbcd0b2cdd212aab1a2 for manifestlist.PlatformSpec{Architecture:"amd64", OS:"linux", OSVersion:"", OSFeatures:[]string(nil), Variant:"", Features:[]string(nil)} from quay.io/openshift-release-dev/ocp-release:4.11.4-multi
I0909 18:37:30.696843 37669 manifest.go:405] Skipping image sha256:1992a4713410b7363ae18b0557a7587eb9e0d734c5f0f21fb1879196f40233a3 for manifestlist.PlatformSpec{Architecture:"ppc64le", OS:"linux", OSVersion:"", OSFeatures:[]string(nil), Variant:"", Features:[]string(nil)} from quay.io/openshift-release-dev/ocp-release:4.11.4-multi
I0909 18:37:30.696869 37669 manifest.go:405] Skipping image sha256:3698082cd66e90d2b79b62d659b4e7399bfe0b86c05840a4c31d3197cdac4bfa for manifestlist.PlatformSpec{Architecture:"s390x", OS:"linux", OSVersion:"", OSFeatures:[]string(nil), Variant:"", Features:[]string(nil)} from quay.io/openshift-release-dev/ocp-release:4.11.4-multi
I0909 18:37:30.697106 37669 manifest.go:405] Skipping image sha256:15fc18c81f053cad15786e7a52dc8bff29e647ea642b3e1fabf2621953f727eb for manifestlist.PlatformSpec{Architecture:"arm64", OS:"linux", OSVersion:"", OSFeatures:[]string(nil), Variant:"", Features:[]string(nil)} from quay.io/openshift-release-dev/ocp-release:4.11.4-multi
I0909 18:37:30.697570 37669 workqueue.go:143] about to send work queue error: unable to read image quay.io/openshift-release-dev/ocp-release:4.11.4-multi: filtered all images from manifest list
error: unable to read image quay.io/openshift-release-dev/ocp-release:4.11.4-multi: filtered all images from manifest list
Expected results:
The darwin/$(uname -m) cli is extracted
Additional info:
Are we re-using some function from the `oc mirror` feature to select the manifest to use? It looks like it is looking for a darwin/$(uname -m) image and filtering out all the available Linux manifests.
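As a hedged workaround sketch for Darwin users (the single-arch tag is real, but the availability of --filter-by-os on this subcommand is an assumption, and neither is a fix for the filtering logic itself):
~~~
# Pull the client from the architecture-specific release tag instead of the
# multi-arch manifest list, which avoids the darwin/$(uname -m) platform filter.
oc adm release extract --command=oc \
  quay.io/openshift-release-dev/ocp-release:4.11.4-x86_64 --to=./clients

# If the oc build in use supports it, an explicit OS filter can also be tried
# on the multi-arch tag (flag availability here is an assumption).
oc adm release extract --command=oc --filter-by-os=linux/amd64 \
  quay.io/openshift-release-dev/ocp-release:4.11.4-multi --to=./clients
~~~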
This is a clone of issue OCPBUGS-19037. The following is the description of the original issue:
—
The agent-interactive-console service is required by both sshd and systemd-logind, so if it exits with an error code there is no way to connect or log in to the box to debug.
Platform:
IPI on Baremetal
What happened?
In cases where no hostname is provided, hosts are automatically assigned the name "localhost" or "localhost.localdomain".
[kni@provisionhost-0-0 ~]$ oc get nodes
NAME STATUS ROLES AGE VERSION
localhost.localdomain Ready master 31m v1.22.1+6859754
master-0-1 Ready master 39m v1.22.1+6859754
master-0-2 Ready master 39m v1.22.1+6859754
worker-0-0 Ready worker 12m v1.22.1+6859754
worker-0-1 Ready worker 12m v1.22.1+6859754
What did you expect to happen?
Having all hosts come up as localhost is the worst possible user experience, because they'll fail to form a cluster but you won't know why.
However, since we know the BMH name in the image-customization-controller, it would be possible to configure the ignition to set a default hostname if we don't get one from DHCP/DNS.
If not, we should at least fail the installation with an error message specific to this situation.
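A hedged sketch of what such a default could look like, as a generic Ignition 3.2.0 file entry; the file name and the "master-0-0" value are placeholders standing in for the BMH name:
~~~
# Hypothetical Ignition fragment that writes a default hostname derived from
# the BMH name; names and values here are illustrative only.
cat <<'EOF' > default-hostname.ign
{
  "ignition": { "version": "3.2.0" },
  "storage": {
    "files": [
      {
        "path": "/etc/hostname",
        "mode": 420,
        "overwrite": true,
        "contents": { "source": "data:,master-0-0" }
      }
    ]
  }
}
EOF
~~~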
----------
30/01/22 - adding how to reproduce
----------
How to Reproduce:
1) Prepare an installation with day-1 static IP.
Add to the install-config under one of the nodes:
networkConfig:
routes:
config:
2) Ensure a DNS PTR record for the address IS NOT configured.
3) Create manifests and the cluster from install-config.yaml.
The installation should either:
1) fail as early as possible and provide some feedback that no hostname was provided, or
2) derive the hostname from the BMH or the ignition files.
Please review the following PR: https://github.com/openshift/images/pull/131
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/images/pull/132
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Nodes are taking more than 5m0s to stage OSUpdate: https://sippy.dptools.openshift.org/sippy-ng/tests/4.13/analysis?test=%5Bbz-Machine%20Config%20Operator%5D%20Nodes%20should%20reach%20OSUpdateStaged%20in%20a%20timely%20fashion
Test started failing back on 2/16/2023.
First occurrence of the failure: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.13-e2e-aws-sdn-upgrade/1626326464246845440
Most recent occurrences across multiple platforms: https://search.ci.openshift.org/?search=Nodes+should+reach+OSUpdateStaged+in+a+timely+fashion&maxAge=48h&context=1&type=junit&name=4.13&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
6 nodes took over 5m0s to stage OSUpdate:
node/ip-10-0-216-81.ec2.internal OSUpdateStarted at 2023-02-16T22:24:56Z, did not make it to OSUpdateStaged
node/ip-10-0-174-123.ec2.internal OSUpdateStarted at 2023-02-16T22:13:07Z, did not make it to OSUpdateStaged
node/ip-10-0-144-29.ec2.internal OSUpdateStarted at 2023-02-16T22:12:50Z, did not make it to OSUpdateStaged
node/ip-10-0-179-251.ec2.internal OSUpdateStarted at 2023-02-16T22:15:48Z, did not make it to OSUpdateStaged
node/ip-10-0-180-197.ec2.internal OSUpdateStarted at 2023-02-16T22:19:07Z, did not make it to OSUpdateStaged
node/ip-10-0-213-155.ec2.internal OSUpdateStarted at 2023-02-16T22:19:21Z, did not make it to OSUpdateStaged
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/cluster-capi-operator/pull/112
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Tracker issue for bootimage bump in 4.14. This issue should block issues which need a bootimage bump to fix.
The previous bump was OCPBUGS-13253.
Follow-up fixes after the bump of Kubernetes to 1.27: openshift/api#1424
This is a clone of issue OCPBUGS-18103. The following is the description of the original issue:
—
Description:
Now that the huge number of e2e test case failures in CI jobs is resolved, an "Undiagnosed panic detected in pod" issue was observed in the recent jobs.
Error:
{ pods/openshift-image-registry_cluster-image-registry-operator-7f7bd7c9b4-k8fmh_cluster-image-registry-operator_previous.log.gz:E0825 02:44:06.686400 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
pods/openshift-image-registry_cluster-image-registry-operator-7f7bd7c9b4-k8fmh_cluster-image-registry-operator_previous.log.gz:E0825 02:44:06.686630 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)}
Some Observations:
1) While starting the ImageConfigController, it failed to watch *v1.Route: "the server could not find the requested resource".
2) This eventually led to a sync problem: "E0825 01:26:52.428694 1 clusteroperator.go:104] unable to sync ClusterOperatorStatusController: config.imageregistry.operator.openshift.io "cluster" not found, requeuing".
3) Then, while creating the deployment resource for "cluster-image-registry-operator", it caused a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference).
Description of the problem:
When installing a cluster with multiple networks, we cannot change the machine network from the UI (it is not changed to the new machine network), but the chosen network is shown during installation.
From the customer's view:
They choose a machine network; it is in the list but never shown as chosen, yet it actually appears when installing.
How reproducible:
Always
Steps to reproduce:
Install a cluster with multiple networks.
Try to change the machine network -> it does not work.
Actual results:
Expected results:
Please review the following PR: https://github.com/openshift/cloud-provider-openstack/pull/187
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
On the Add Storage page, if the user chooses to use an existing PVC but leaves the PVC name empty, then after the other fields are filled and "Save" is clicked, there is no warning about the PVC name field. The loading dot icons are shown under the "Save" button.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-07-12-124310
How reproducible:
Always
Steps to Reproduce:
1. Create a deployment.
2. Click the "Add Storage" item in the action list of the deployment.
3. Choose "Use existing claim", but leave it empty.
4. Set the mount dir and click "Save".
Actual results:
4. There is no warning about the empty PVC name.
Expected results:
4. It should show info for the field: "Please fill out this field"
Additional info:
Description of the problem:
When creating/updating an InfraEnv, the size of the compressed ignition should be validated.
I.e. the service should generate the entire ignition for each request, compress it (as done in ignition Archive), and ensure its size is at most 256 KiB.
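A minimal sketch of the intended check, done by hand; gzip is used here as a stand-in for the service's archive format and the file name is illustrative:
~~~
# Compress the generated ignition and compare it against the 256 KiB limit.
gzip -c infra-env-ignition.json > infra-env-ignition.json.gz
size=$(stat -c%s infra-env-ignition.json.gz)
limit=$((256 * 1024))
if [ "$size" -gt "$limit" ]; then
  echo "ignition archive is ${size} bytes, which exceeds the ${limit}-byte limit" >&2
fi
~~~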
Notes:
How reproducible:
100%
Steps to reproduce:
1. Register an InfraEnv that would result in an ignition archive larger than 256 KiB.
E.g. invoke 'POST /v2/infra-envs' with large values in the body (infra-env-create-params).
Actual results:
The register request succeeds, but downloading the ISO fails.
Expected results:
The request should fail with an error message explaining the generated ignition archive is too large.
Description of problem:
oc-mirror fails to complete a heads-only mirror, complaining about devworkspace-operator.
Version-Release number of selected component (if applicable):
# oc-mirror version
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.12.0-202302280915.p0.g3d51740.assembly.stream-3d51740", GitCommit:"3d517407dcbc46ededd7323c7e8f6d6a45efc649", GitTreeState:"clean", BuildDate:"2023-03-01T00:20:53Z", GoVersion:"go1.19.4", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
Attempt a heads-only mirroring of registry.redhat.io/redhat/redhat-operator-index:v4.10
Steps to Reproduce:
1. Imageset currently:
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
storageConfig:
  registry:
    imageURL: myregistry.mydomain:5000/redhat-operators
    skipTLS: false
mirror:
  operators:
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.10
2.$ oc mirror --config=./imageset-config.yml docker://otherregistry.mydomain:5000/redhat-operators Checking push permissions for otherregistry.mydomain:5000 Found: oc-mirror-workspace/src/publish Found: oc-mirror-workspace/src/v2 Found: oc-mirror-workspace/src/charts Found: oc-mirror-workspace/src/release-signatures WARN[0026] DEPRECATION NOTICE: Sqlite-based catalogs and their related subcommands are deprecated. Support for them will be removed in a future release. Please migrate your catalog workflows to the new file-based catalog format. The rendered catalog is invalid. Run "oc-mirror list operators --catalog CATALOG-NAME --package PACKAGE-NAME" for more information. error: error generating diff: channel fast: head "devworkspace-operator.v0.19.1-0.1679521112.p" not reachable from bundle "devworkspace-operator.v0.19.1"
Actual results:
error: error generating diff: channel fast: head "devworkspace-operator.v0.19.1-0.1679521112.p" not reachable from bundle "devworkspace-operator.v0.19.1"
Expected results:
For the catalog to be mirrored.
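As a hedged mitigation sketch (not a fix for the broken catalog entry itself), the ImageSetConfiguration can be scoped to specific packages and channels to isolate or skip the entry that breaks the diff; the package selection below is illustrative:
~~~
cat <<'EOF' > imageset-config.yml
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
storageConfig:
  registry:
    imageURL: myregistry.mydomain:5000/redhat-operators
    skipTLS: false
mirror:
  operators:
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.10
    packages:
    - name: devworkspace-operator   # illustrative: scope to the problematic package to debug
      channels:
      - name: fast
EOF
~~~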
Description of problem:
When deploying with external platform, the reported state of the machine config pool is degraded, and we can observe a drift in the configuration: $ diff /etc/mcs-machine-config-content.json ~/rendered-master-1b6aab788192600896f36c5388d48374 < "contents": "[Unit]\nDescription=Kubernetes Kubelet\nWants=rpc-statd.service network-online.target\nRequires=crio.service kubelet-auto-node-size.service\nAfter=network-online.target crio.service kubelet-auto-node-size.service\nAfter=ostree-finalize-staged.service\n\n[Service]\nType=notify\nExecStartPre=/bin/mkdir --parents /etc/kubernetes/manifests\nExecStartPre=/bin/rm -f /var/lib/kubelet/cpu_manager_state\nExecStartPre=/bin/rm -f /var/lib/kubelet/memory_manager_state\nEnvironmentFile=/etc/os-release\nEnvironmentFile=-/etc/kubernetes/kubelet-workaround\nEnvironmentFile=-/etc/kubernetes/kubelet-env\nEnvironmentFile=/etc/node-sizing.env\n\nExecStart=/usr/local/bin/kubenswrapper \\\n /usr/bin/kubelet \\\n --config=/etc/kubernetes/kubelet.conf \\\n --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig \\\n --kubeconfig=/var/lib/kubelet/kubeconfig \\\n --container-runtime-endpoint=/var/run/crio/crio.sock \\\n --runtime-cgroups=/system.slice/crio.service \\\n --node-labels=node-role.kubernetes.io/control-plane,node-role.kubernetes.io/master,node.openshift.io/os_id=${ID} \\\n --node-ip=${KUBELET_NODE_IP} \\\n --minimum-container-ttl-duration=6m0s \\\n --cloud-provider=external \\\n --volume-plugin-dir=/etc/kubernetes/kubelet-plugins/volume/exec \\\n \\\n --hostname-override=${KUBELET_NODE_NAME} \\\n --provider-id=${KUBELET_PROVIDERID} \\\n --register-with-taints=node-role.kubernetes.io/master=:NoSchedule \\\n --pod-infra-container-image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bde9fb486f1e8369b465a8c0aff7152c2a1f5a326385ee492140592b506638d6 \\\n --system-reserved=cpu=${SYSTEM_RESERVED_CPU},memory=${SYSTEM_RESERVED_MEMORY},ephemeral-storage=${SYSTEM_RESERVED_ES} \\\n --v=${KUBELET_LOG_LEVEL}\n\nRestart=always\nRestartSec=10\n\n[Install]\nWantedBy=multi-user.target\n", --- > "contents": "[Unit]\nDescription=Kubernetes Kubelet\nWants=rpc-statd.service network-online.target\nRequires=crio.service kubelet-auto-node-size.service\nAfter=network-online.target crio.service kubelet-auto-node-size.service\nAfter=ostree-finalize-staged.service\n\n[Service]\nType=notify\nExecStartPre=/bin/mkdir --parents /etc/kubernetes/manifests\nExecStartPre=/bin/rm -f /var/lib/kubelet/cpu_manager_state\nExecStartPre=/bin/rm -f /var/lib/kubelet/memory_manager_state\nEnvironmentFile=/etc/os-release\nEnvironmentFile=-/etc/kubernetes/kubelet-workaround\nEnvironmentFile=-/etc/kubernetes/kubelet-env\nEnvironmentFile=/etc/node-sizing.env\n\nExecStart=/usr/local/bin/kubenswrapper \\\n /usr/bin/kubelet \\\n --config=/etc/kubernetes/kubelet.conf \\\n --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig \\\n --kubeconfig=/var/lib/kubelet/kubeconfig \\\n --container-runtime-endpoint=/var/run/crio/crio.sock \\\n --runtime-cgroups=/system.slice/crio.service \\\n --node-labels=node-role.kubernetes.io/control-plane,node-role.kubernetes.io/master,node.openshift.io/os_id=${ID} \\\n --node-ip=${KUBELET_NODE_IP} \\\n --minimum-container-ttl-duration=6m0s \\\n --cloud-provider= \\\n --volume-plugin-dir=/etc/kubernetes/kubelet-plugins/volume/exec \\\n \\\n --hostname-override=${KUBELET_NODE_NAME} \\\n --provider-id=${KUBELET_PROVIDERID} \\\n --register-with-taints=node-role.kubernetes.io/master=:NoSchedule \\\n 
--pod-infra-container-image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bde9fb486f1e8369b465a8c0aff7152c2a1f5a326385ee492140592b506638d6 \\\n --system-reserved=cpu=${SYSTEM_RESERVED_CPU},memory=${SYSTEM_RESERVED_MEMORY},ephemeral-storage=${SYSTEM_RESERVED_ES} \\\n --v=${KUBELET_LOG_LEVEL}\n\nRestart=always\nRestartSec=10\n\n[Install]\nWantedBy=multi-user.target\n", the difference is --cloud-provider=external /--cloud-provider= is the flags passed to the kubelet. We also observe the following log in the MCC: W0629 09:57:44.583046 1 warnings.go:70] unknown field "spec.infra.status.platformStatus.external.cloudControllerManager" "spec.infra.status.platformStatus.external.cloudControllerManager" is basically the flag in the Infrastructure object that enables the external platform.
Version-Release number of selected component (if applicable):
4.14 nightly
How reproducible:
Always when platform is external
Steps to Reproduce:
1. Deploy a cluster with the external platform enabled. The featureSet TechPreviewNoUpgrade should be set and the Infrastructure object should look like:
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  creationTimestamp: "2023-06-28T10:37:12Z"
  generation: 1
  name: cluster
  resourceVersion: "538"
  uid: 57e09773-0eca-4767-95ce-8ec7d0f2cdae
spec:
  cloudConfig:
    name: ""
  platformSpec:
    external:
      platformName: oci
    type: External
status:
  apiServerInternalURI: https://api-int.test-infra-cluster-3cd17632.assisted-ci.oci-rhelcert.edge-sro.rhecoeng.com:6443
  apiServerURL: https://api.test-infra-cluster-3cd17632.assisted-ci.oci-rhelcert.edge-sro.rhecoeng.com:6443
  controlPlaneTopology: HighlyAvailable
  cpuPartitioning: None
  etcdDiscoveryDomain: ""
  infrastructureName: test-infra-cluster-3c-pqqqm
  infrastructureTopology: HighlyAvailable
  platform: External
  platformStatus:
    external:
      cloudControllerManager:
        state: External
    type: External
2. Observe the drift with: oc get mcp
Actual results:
$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master                                                      False     True       True       3              0                   0                     3                      138m
worker   rendered-worker-d48036fe2b657e6c71d5d1275675fefc   True      False      False      3              3                   3                     0                      138m
Expected results:
$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-2ff4e25f807ef3b20b7c6e0c6526f05d   True      False      False      3              3                   3                     0                      33m
worker   rendered-worker-48b7f39d78e3b1d94a0aba1ef4358d01   True      False      False      3              3                   3                     0                      33m
Additional info:
https://redhat-internal.slack.com/archives/C02CZNQHGN8/p1688035248716119
The TestMetrics e2e test is not correctly cleaning up the MachineConfigs and MachineConfigPools it creates. This means that other e2e tests which run after this e2e test can falsely fail or become flaky.
What's happening is this:
The cleanup flow should look like this:
Description of problem:
A cluster update request with empty strings for api_vip and ingress_vip will not remove the cluster VIPs.
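A hedged reproduction sketch against the assisted-service REST API; the URL prefix, token handling, and the exact field names are assumptions (newer releases use the api_vips/ingress_vips list fields instead):
~~~
# Attempt to clear the VIPs by sending empty strings in the update request.
curl -s -X PATCH \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"api_vip": "", "ingress_vip": ""}' \
  "${ASSISTED_SERVICE_URL}/api/assisted-install/v2/clusters/${CLUSTER_ID}"
# Expected: the cluster VIPs are cleared; actual: the previous VIPs remain set.
~~~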
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. See the following test: https://gist.github.com/nmagnezi/4a3dad01ee197d3984fa7a0604b62cc0 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
https://issues.redhat.com//browse/OCPBUGS-5287 disabled the test due to https://issues.redhat.com/browse/THREESCALE-9015. Once https://issues.redhat.com/browse/THREESCALE-9015 is resolved, need to re-enable the test.
Description of problem:
After an upgrade from 4.9 to 4.10 collect+ process causing CPU bursts of 5-6 seconds every 15 minutes regularly. During each burst collect+ consume 100% CPU. Top Command Dump Sample: top - 07:00:04 up 10:10, 0 users, load average: 0.20, 0.24, 0.27 Tasks: 247 total, 1 running, 246 sleeping, 0 stopped, 0 zombie %Cpu(s): 6.3 us, 4.5 sy, 0.0 ni, 80.8 id, 7.4 wa, 0.8 hi, 0.3 si, 0.0 st MiB Mem : 32151.9 total, 22601.4 free, 2182.1 used, 7368.4 buff/cache MiB Swap: 0.0 total, 0.0 free, 0.0 used. 29420.7 avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 2009 root 20 0 3741252 172136 71396 S 12.9 0.5 36:42.79 kubelet 1954 root 20 0 2663680 130928 46156 S 7.9 0.4 6:50.44 crio 9440 root 20 0 1633728 546036 60836 S 7.9 1.7 21:06.80 fluentd 1 root 20 0 238416 15412 8968 S 5.9 0.0 1:56.73 systemd 1353 800 10 -10 796808 165380 40916 S 5.0 0.5 2:32.11 ovs-vsw+ 5454 root 20 0 1729112 73680 37404 S 2.0 0.2 3:52.21 coredns 1061248 1000360+ 20 0 1113524 24304 17776 S 2.0 0.1 0:00.03 collect+ 306 root 0 -20 0 0 0 I 1.0 0.0 0:00.37 kworker+ 957 root 20 0 264076 126280 119596 S 1.0 0.4 0:06.80 systemd+ 1114 dbus 20 0 83188 6224 5140 S 1.0 0.0 0:04.30 dbus-da+ 5710 root 20 0 406004 31384 15068 S 1.0 0.1 0:04.11 tuned 6198 nobody 20 0 1632272 46588 20516 S 1.0 0.1 0:17.60 network+ 1061291 1000650+ 20 0 11896 2748 2496 S 1.0 0.0 0:00.01 bash 1061355 1000650+ 20 0 11896 2868 2616 S 1.0 0.0 0:00.01 bashtop - 07:00:05 up 10:10, 0 users, load average: 0.20, 0.24, 0.27 Tasks: 248 total, 2 running, 245 sleeping, 0 stopped, 1 zombie %Cpu(s): 11.4 us, 2.0 sy, 0.0 ni, 81.5 id, 4.2 wa, 0.6 hi, 0.2 si, 0.0 st MiB Mem : 32151.9 total, 22601.4 free, 2182.1 used, 7368.4 buff/cache MiB Swap: 0.0 total, 0.0 free, 0.0 used. 29420.7 avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 1061248 1000360+ 20 0 1484936 36464 21300 S 74.3 0.1 0:00.78 collect+ 9440 root 20 0 1633728 545412 60900 S 11.9 1.7 21:06.92 fluentd 2009 root 20 0 3741252 172396 71396 S 4.0 0.5 36:42.83 kubelet 1 root 20 0 238416 15412 8968 S 1.0 0.0 1:56.74 systemd 300 root 0 -20 0 0 0 I 1.0 0.0 0:00.46 kworker+ 1427 root 20 0 19656 2204 2064 S 1.0 0.0 0:01.55 agetty 2419 root 20 0 1714748 38812 22884 S 1.0 0.1 0:24.42 coredns+ 2528 root 20 0 1634680 36464 20628 S 1.0 0.1 0:22.01 dynkeep+ 1009372 root 20 0 0 0 0 I 1.0 0.0 0:00.42 kworker+ 1053353 root 20 0 50200 4012 3292 R 1.0 0.0 0:01.56 toptop - 07:00:06 up 10:10, 0 users, load average: 0.20, 0.24, 0.27 Tasks: 247 total, 1 running, 246 sleeping, 0 stopped, 0 zombie %Cpu(s): 15.3 us, 1.5 sy, 0.0 ni, 82.7 id, 0.1 wa, 0.2 hi, 0.1 si, 0.0 st MiB Mem : 32151.9 total, 22595.9 free, 2185.7 used, 7370.2 buff/cache MiB Swap: 0.0 total, 0.0 free, 0.0 used. 29416.7 avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 1061248 1000360+ 20 0 1484936 35740 21428 S 99.0 0.1 0:01.78 collect+ 2009 root 20 0 3741252 172396 71396 S 3.0 0.5 36:42.86 kubelet 9440 root 20 0 1633728 545076 60900 S 2.0 1.7 21:06.94 fluentd 1353 800 10 -10 796808 165380 40916 S 1.0 0.5 2:32.12 ovs-vsw+ 1954 root 20 0 2663680 131452 46156 S 1.0 0.4 6:50.45 crio top - 07:00:07 up 10:10, 0 users, load average: 0.20, 0.24, 0.27 Tasks: 247 total, 1 running, 246 sleeping, 0 stopped, 0 zombie %Cpu(s): 14.7 us, 1.1 sy, 0.0 ni, 83.6 id, 0.1 wa, 0.4 hi, 0.1 si, 0.0 st MiB Mem : 32151.9 total, 22595.9 free, 2185.7 used, 7370.2 buff/cache MiB Swap: 0.0 total, 0.0 free, 0.0 used. 
29416.7 avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 1061248 1000360+ 20 0 1484936 35236 21492 S 102.0 0.1 0:02.80 collect+ 2009 root 20 0 3741252 172660 71396 S 7.0 0.5 36:42.93 kubelet 3288 nobody 20 0 718964 30648 11680 S 3.0 0.1 3:36.84 node_ex+ 1 root 20 0 238416 15412 8968 S 1.0 0.0 1:56.75 systemd 1353 800 10 -10 796808 165380 40916 S 1.0 0.5 2:32.13 ovs-vsw+ 1954 root 20 0 2663680 131452 46156 S 1.0 0.4 6:50.46 crio 5454 root 20 0 1729112 73680 37404 S 1.0 0.2 3:52.22 coredns 9440 root 20 0 1633728 545080 60900 S 1.0 1.7 21:06.95 fluentd 1053353 root 20 0 50200 4012 3292 R 1.0 0.0 0:01.57 toptop - 07:00:08 up 10:10, 0 users, load average: 0.20, 0.24, 0.27 Tasks: 247 total, 2 running, 245 sleeping, 0 stopped, 0 zombie %Cpu(s): 14.2 us, 0.9 sy, 0.0 ni, 84.5 id, 0.0 wa, 0.2 hi, 0.1 si, 0.0 st MiB Mem : 32151.9 total, 22595.9 free, 2185.7 used, 7370.2 buff/cache MiB Swap: 0.0 total, 0.0 free, 0.0 used. 29416.7 avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 1061248 1000360+ 20 0 1484936 35164 21492 S 100.0 0.1 0:03.81 collect+ 2009 root 20 0 3741252 172660 71396 S 3.0 0.5 36:42.96 kubelet 1061543 1000650+ 20 0 34564 9804 5772 R 3.0 0.0 0:00.03 python 9440 root 20 0 1633728 543952 60900 S 2.0 1.7 21:06.97 fluentd 1053353 root 20 0 50200 4012 3292 R 2.0 0.0 0:01.59 top 2330 root 20 0 1654612 61260 34720 S 1.0 0.2 0:55.81 coredns 8023 root 20 0 12056 3044 2580 S 1.0 0.0 0:24.59 install+top - 07:00:09 up 10:10, 0 users, load average: 0.34, 0.27, 0.28 Tasks: 235 total, 2 running, 233 sleeping, 0 stopped, 0 zombie %Cpu(s): 8.9 us, 3.2 sy, 0.0 ni, 85.6 id, 1.5 wa, 0.5 hi, 0.2 si, 0.0 st MiB Mem : 32151.9 total, 22621.0 free, 2160.5 used, 7370.4 buff/cache MiB Swap: 0.0 total, 0.0 free, 0.0 used. 29441.9 avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 2009 root 20 0 3741252 172660 71396 S 5.0 0.5 36:43.01 kubelet 9440 root 20 0 1633728 542684 60900 S 4.0 1.6 21:07.01 fluentd 1353 800 10 -10 796808 165380 40916 S 2.0 0.5 2:32.15 ovs-vsw+ 1 root 20 0 238416 15412 8968 S 1.0 0.0 1:56.76 systemd 1954 root 20 0 2663680 131452 46156 S 1.0 0.4 6:50.47 crio 5454 root 20 0 1729112 73680 37404 S 1.0 0.2 3:52.23 coredns 6198 nobody 20 0 1632272 45936 20516 S 1.0 0.1 0:17.61 network+ 7016 root 20 0 12052 3204 2736 S 1.0 0.0 0:24.19 install+
Version-Release number of selected component (if applicable):
How reproducible:
The lab environment does not present the same behavior.
Steps to Reproduce:
1. 2. 3.
Actual results:
Regular high CPU spikes
Expected results:
No CPU spikes
Additional info:
Provided logs: 1-) top command dump uploaded to SF case 03317387 2-) must-gather uploaded to SF case 03317387
When we update a Secret referenced in the BareMetalHost, an immediate reconcile of the corresponding BMH is not triggered. In most states we requeue each CR after a timeout, so we should eventually see the changes.
In the case of BMC Secrets, this has been broken since the fix for OCPBUGS-1080 in 4.12.
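A hedged workaround sketch: a change to the Secret alone does not requeue the BMH immediately, but touching the BMH object itself (e.g. with a throwaway annotation) triggers a reconcile without waiting for the periodic requeue; the namespace and annotation key are illustrative:
~~~
# Nudge the BMH so its controller reconciles it and re-reads the updated Secret.
oc -n openshift-machine-api annotate bmh <bmh-name> \
  example.com/force-reconcile="$(date +%s)" --overwrite
~~~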
Description of problem:
The PipelineRun list has a Duration column, but the TaskRun list inside it doesn't.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Have OpenShift Pipeline with 2+ tasks configured and invoked
Steps to Reproduce:
1. Once a PipelineRun is invoked, navigate to the invoked TaskRuns. 2. You will see columns like Status and Started there, but no Duration.
Actual results:
Expected results:
Additional info:
I'll add screenshots for PipelineRuns and TaskRuns
Please review the following PR: https://github.com/openshift/machine-api-provider-nutanix/pull/47
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
After all cluster operators have reconciled after the password rotation, we can still see authentication failures in keystone (attached screenshot of splunk query)
Version-Release number of selected component (if applicable):
Environment:
- OpenShift 4.12.10 on OpenStack 16
- The cluster is managed via RHACM, but password rotation shall be done via "regular" OpenShift means.
How reproducible:
Rotated the OpenStack credentials according to the documentation [1]
[1] https://docs.openshift.com/container-platform/4.12/authentication/managing_cloud_provider_credentials/cco-mode-passthrough.html#manually-rotating-cloud-creds_cco-mode-passthrough
Additional info:
- We can't trace back where these authentication failures come from.
- They do disappear after a cluster upgrade (i.e. when nodes are rebooted and all pods are restarted), which indicates that there's still a component using the old credentials.
- The relevant technical integration points _seem_ to be working though (LBaaS, CSI, Machine API, Swift).
What is the business impact? Please also provide timeframe information.
- We cannot rely on Splunk monitoring for authentication issues since it's currently constantly showing authentication errors.
- We cannot be entirely sure that everything works as expected since we don't know the component that doesn't seem to use the new credentials.
Description of problem:
E2E test suite is getting failed with below error - Falling back to built-in suite, failed reading external test suites: unable to extract k8s-tests binary: failed extracting "/usr/bin/k8s-tests" from "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f98d9998691052cb8049f806f8c1dc9a6bac189c10c33af9addd631eedfb5528": exit status 1 No manifest filename passed
Version-Release number of selected component (if applicable):
4.14
How reproducible:
So far with 4.14 clusters on Power
Steps to Reproduce:
1. Deploy a 4.14 cluster on Power. 2. Run the e2e test suite from https://github.com/openshift/origin. 3. Monitor the e2e run.
Actual results:
E2E test failed
Expected results:
E2E should pass
Additional info:
./openshift-tests run -f ./test-suite.txt -o /tmp/conformance-parallel-out.txt warning: KUBE_TEST_REPO_LIST may not be set when using openshift-tests and will be ignored openshift-tests version: v4.1.0-6960-gd9cf51f Aug 9 00:48:21.959: INFO: Enabling in-tree volume drivers Attempting to pull tests from external binary... Falling back to built-in suite, failed reading external test suites: unable to extract k8s-tests binary: failed extracting "/usr/bin/k8s-tests" from "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f98d9998691052cb8049f806f8c1dc9a6bac189c10c33af9addd631eedfb5528": exit status 1 creating a TCP service service-test with type=LoadBalancer in namespace e2e-service-lb-test-bvmbl Aug 9 00:48:35.424: INFO: Waiting up to 15m0s for service "service-test" to have a LoadBalancer Aug 9 00:48:36.272: INFO: ns/openshift-authentication route/oauth-openshift disruption/ingress-to-oauth-server connection/new started responding to GET requests over new connections Aug 9 00:48:36.272: INFO: ns/openshift-authentication route/oauth-openshift disruption/ingress-to-oauth-server connection/reused started responding to GET requests over reused connections Aug 9 00:48:36.310: INFO: ns/openshift-console route/console disruption/ingress-to-console connection/new started responding to GET requests over new connections Aug 9 00:48:36.310: INFO: ns/openshift-console route/console disruption/ingress-to-console connection/reused started responding to GET requests over reused connections Aug 9 01:04:07.507: INFO: disruption/ci-cluster-network-liveness connection/reused started responding to GET requests over reused connections Aug 9 01:04:07.507: INFO: disruption/ci-cluster-network-liveness connection/new started responding to GET requests over new connections Starting SimultaneousPodIPController I0809 01:04:37.551879 134117 shared_informer.go:311] Waiting for caches to sync for SimultaneousPodIPController Aug 9 01:04:37.558: INFO: ns/openshift-image-registry route/test-disruption-reused disruption/image-registry connection/reused started responding to GET requests over reused connections Aug 9 01:04:37.624: INFO: disruption/cache-kube-api connection/new started responding to GET requests over new connections E0809 01:04:37.719406 134117 shared_informer.go:314] unable to sync caches for SimultaneousPodIPControllerSuite run returned error: error waiting for load balancer: timed out waiting for service "service-test" to have a load balancer: timed out waiting for the condition disruption/kube-api connection/new producer sampler context is done disruption/cache-kube-api connection/reused producer sampler context is done disruption/oauth-api connection/new producer sampler context is done disruption/oauth-api connection/reused producer sampler context is done ERRO[0975] disruption sample failed: context canceled auditID=464fb276-71b0-48bf-8fb4-3099ae37cedf backend=oauth-api type=reused disruption/cache-kube-api connection/new producer sampler context is done disruption/openshift-api connection/reused producer sampler context is done disruption/cache-openshift-api connection/reused producer sampler context is done ns/openshift-authentication route/oauth-openshift disruption/ingress-to-oauth-server connection/new producer sampler context is done ns/openshift-authentication route/oauth-openshift disruption/ingress-to-oauth-server connection/reused producer sampler context is done ns/openshift-console route/console disruption/ingress-to-console connection/new producer sampler context is done 
disruption/ci-cluster-network-liveness connection/reused producer sampler context is done disruption/ci-cluster-network-liveness connection/new producer sampler context is done ns/openshift-image-registry route/test-disruption-new disruption/image-registry connection/new producer sampler context is done ns/openshift-image-registry route/test-disruption-reused disruption/image-registry connection/reused producer sampler context is done ns/openshift-console route/console disruption/ingress-to-console connection/reused producer sampler context is done disruption/kube-api connection/reused producer sampler context is done disruption/openshift-api connection/new producer sampler context is done disruption/cache-openshift-api connection/new producer sampler context is done disruption/cache-oauth-api connection/reused producer sampler context is done disruption/cache-oauth-api connection/new producer sampler context is done Shutting down SimultaneousPodIPController SimultaneousPodIPController shut down No manifest filename passed error running options: error waiting for load balancer: timed out waiting for service "service-test" to have a load balancer: timed out waiting for the conditionerror: error waiting for load balancer: timed out waiting for service "service-test" to have a load balancer: timed out waiting for the condition
This is a clone of issue OCPBUGS-11286. The following is the description of the original issue:
—
Description of problem:
Version-Release number of selected component (if applicable):
OCP 4.13.0-0.nightly-2023-03-23-204038 ODF 4.13.0-121.stable
How reproducible:
Steps to Reproduce:
1. Installed ODF over OCP; everything was fine on the Installed Operators page. 2. Later, when the Installed Operators page was checked, it crashed with an "Oh no! Something went wrong" error. 3.
Actual results:
Installed Operators page crashes with "Oh no! Something went wrong." error
Expected results:
The Installed Operators page shouldn't crash. Component and stack trace logs from the console page: http://pastebin.test.redhat.com/1096522
Additional info:
Description of problem:
Customer has noticed that object count quotas ("count/*") do not work for certain objects in ClusterResourceQuotas. For example, the following ResourceQuota works as expected:
~~~
apiVersion: v1
kind: ResourceQuota
metadata: [..]
spec:
  hard:
    count/routes.route.openshift.io: "900"
    count/servicemonitors.monitoring.coreos.com: "100"
    pods: "100"
status:
  hard:
    count/routes.route.openshift.io: "900"
    count/servicemonitors.monitoring.coreos.com: "100"
    pods: "100"
  used:
    count/routes.route.openshift.io: "0"
    count/servicemonitors.monitoring.coreos.com: "1"
    pods: "4"
~~~
However, when using "count/servicemonitors.monitoring.coreos.com" in ClusterResourceQuotas, this does not work (note the missing "used"):
~~~
apiVersion: quota.openshift.io/v1
kind: ClusterResourceQuota
metadata: [..]
spec:
  quota:
    hard:
      count/routes.route.openshift.io: "900"
      count/servicemonitors.monitoring.coreos.com: "100"
      count/simon.krenger.ch: "100"
      pods: "100"
  selector:
    annotations:
      openshift.io/requester: kube:admin
status:
  namespaces: [..]
  total:
    hard:
      count/routes.route.openshift.io: "900"
      count/servicemonitors.monitoring.coreos.com: "100"
      count/simon.krenger.ch: "100"
      pods: "100"
    used:
      count/routes.route.openshift.io: "0"
      pods: "4"
~~~
This behaviour does not only apply to "servicemonitors.monitoring.coreos.com" objects, but also to other objects, such as:
- count/kafkas.kafka.strimzi.io: '0'
- count/prometheusrules.monitoring.coreos.com: '100'
- count/servicemonitors.monitoring.coreos.com: '100'
The debug output for kube-controller-manager shows the following entries, which may or may not be related:
~~~
$ oc logs kube-controller-manager-ip-10-0-132-228.eu-west-1.compute.internal | grep "servicemonitor"
I0511 15:07:17.297620 1 patch_informers_openshift.go:90] Couldn't find informer for monitoring.coreos.com/v1, Resource=servicemonitors
I0511 15:07:17.297630 1 resource_quota_monitor.go:181] QuotaMonitor using a shared informer for resource "monitoring.coreos.com/v1, Resource=servicemonitors"
I0511 15:07:17.297642 1 resource_quota_monitor.go:233] QuotaMonitor created object count evaluator for servicemonitors.monitoring.coreos.com
[..]
I0511 15:07:17.486279 1 patch_informers_openshift.go:90] Couldn't find informer for monitoring.coreos.com/v1, Resource=servicemonitors
I0511 15:07:17.486297 1 graph_builder.go:176] using a shared informer for resource "monitoring.coreos.com/v1, Resource=servicemonitors", kind "monitoring.coreos.com/v1, Kind=ServiceMonitor"
~~~
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.12.15
How reproducible:
Always
Steps to Reproduce:
1. On an OCP 4.12 cluster, create the following ClusterResourceQuota:
~~~
apiVersion: quota.openshift.io/v1
kind: ClusterResourceQuota
metadata:
  name: case-03509174
spec:
  quota:
    hard:
      count/servicemonitors.monitoring.coreos.com: "100"
      pods: "100"
  selector:
    annotations:
      openshift.io/requester: "kube:admin"
~~~
2. As "kubeadmin", create a new project and deploy one new ServiceMonitor, for example (see the check below):
~~~
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: simon-servicemon-2
  namespace: simon-1
spec:
  endpoints:
  - path: /metrics
    port: http
    scheme: http
  jobLabel: component
  selector:
    matchLabels:
      deployment: echoenv-1
~~~
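A quick way to observe the missing usage after step 2 (commands only; per this bug the servicemonitors entry stays absent from the used totals):
~~~
# The Used column should list the servicemonitors count but stays empty for the
# affected resources.
oc describe clusterresourcequota case-03509174
oc get clusterresourcequota case-03509174 -o jsonpath='{.status.total.used}{"\n"}'
~~~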
Actual results:
The "used" field for ServiceMonitors is not populated in the ClusterResourceQuota for certain objects. It is unclear if these quotas are enforced or not
Expected results:
ClusterResourceQuota for ServiceMonitors is updated and enforced
Additional info:
* Must-gather for a cluster showing this behaviour (added debug for kube-controller-manager) is available here: https://drive.google.com/file/d/1ioEEHZQVHG46vIzDdNm6pwiTjkL9QQRE/view?usp=share_link
* Slack discussion: https://redhat-internal.slack.com/archives/CKJR6200N/p1683876047243989
Description of problem:
oc adm inspect generated files sometimes have the leading "---" and sometimes do not. This depends on the order of objects collected. This by itself is not an issue. However, it becomes an issue when combined with multiple invocations of oc adm inspect collecting data to the same directory, as must-gather does. If an object is collected multiple times, the second time oc might overwrite the original file improperly and leave 4 bytes of the original content behind. This happens when the "---\n" is not written in the second invocation, which makes the content 4 bytes shorter and leaves the original trailing 4 bytes in the file intact. This garbage confuses YAML parsers.
Version-Release number of selected component (if applicable):
4.14 nighly as of Jul 25 and before
How reproducible:
Always
Steps to Reproduce:
Run oc adm inspect twice with a different order of objects:
[msivak@x openshift-must-gather]$ oc adm inspect performanceprofile,machineconfigs,nodes --dest-dir=inspect.dual --all-namespaces
[msivak@x openshift-must-gather]$ oc adm inspect nodes --dest-dir=inspect.dual --all-namespaces
Then check the alphabetically first node yaml file - it will have garbage at the end of the file.
Actual results:
Garbage at the end of the file.
Expected results:
No garbage.
Additional info:
I believe this is caused by the lack of the Truncate mode here: https://github.com/openshift/oc/blob/master/pkg/cli/admin/inspect/writer.go#L54
Collecting data multiple times cannot be easily avoided when multiple collect scripts are combined with relatedObjects requested by operators.
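A tiny shell demonstration of the overwrite-without-truncate effect described above (file names are illustrative):
~~~
# First write includes the leading "---" (15 bytes), second write omits it
# (11 bytes). Overwriting without truncation, as dd conv=notrunc does here,
# leaves the last 4 bytes of the old content behind -> invalid YAML.
printf -- '---\nkind: Node\n' > node.yaml
printf 'kind: Node\n' | dd of=node.yaml conv=notrunc 2>/dev/null
cat node.yaml
# kind: Node
# ode
~~~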
Description of problem:
CVO is observing a panic and throwing the following error: interface conversion: cache.DeletedFinalStateUnknown is not v1.Object: missing method GetAnnotations
Linking the job: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-e2e-aws-sdn-serial/1687876857824808960
Observed on other jobs: https://search.ci.openshift.org/?search=cache.DeletedFinalStateUnknown+is+not+v1.Object&maxAge=48h&context=1&type=bug%2Bissue%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Currently the external-dns image is hardcoded
https://github.com/openshift/hypershift/blob/3b73a1a243122b9cb78ebc9848b7af158142d2d2/cmd/install/install.go#L513
hypershift install should have some method of overriding this
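A purely hypothetical sketch of what such an override could look like; the --external-dns-image flag is not confirmed to exist and only illustrates the kind of option this issue asks for:
~~~
# Hypothetical flag only -- illustrating the requested override.
hypershift install --external-dns-image=quay.io/example/external-dns:v0.13.1
~~~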
Please review the following PR: https://github.com/openshift/cluster-openshift-apiserver-operator/pull/531
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
The ovnver and ovsver args should also be used to infer the short versions of the RPMs to install in the SDN container images.
Sanitize OWNERS/OWNER_ALIASES:
1) OWNERS must have:
component: "Storage / Kubernetes External Components"
2) OWNER_ALIASES must have all team members of the Storage team.
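A hedged sketch of the expected OWNERS shape; the alias names are placeholders, and only the component value comes from this issue:
~~~
cat <<'EOF' > OWNERS
# Placeholder aliases; the real entries must list the Storage team (see OWNER_ALIASES).
approvers:
  - storage-approvers
reviewers:
  - storage-reviewers
component: "Storage / Kubernetes External Components"
EOF
~~~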
Description of the problem:
We are turning on the feature-usage flag for custom manifests whenever we are creating a new custom cluster manifest. When we delete that manifest, the flag stays on.
Expected results:
We need to turn off the flag when deleting the custom manifest.
Description of problem:
The current openshift_sdn_pod_operations_latency metric is broken: it does not calculate the actual duration of setup/teardown for the latency metric. We also need additional metrics to measure the pod latency end to end, so that they give an overall summary of the total processing time spent by the CNI server.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-18841. The following is the description of the original issue:
—
Description of problem:
Failed to run the automated case OCP-57089 on a 4.14 Azure platform; when checked manually, the created load-balancer service couldn't get an external IP address.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-09-09-164123
How reproducible:
100% on the cluster
Steps to Reproduce:
1. Add a wait in the auto script, then run the case g.By("check if the lb services have obtained the EXTERNAL-IPs") regExp := "([0-9]+.[0-9]+.[0-9]+.[0-9]+)" time.Sleep(3600 * time.Second) % ./bin/extended-platform-tests run all --dry-run | grep 57089 | ./bin/extended-platform-tests run -f - 2. % oc get ns | grep e2e-test-router e2e-test-router-ingressclass-n2z2c Active 2m51s 3. It was pending in EXTERNAL-IP column for internal-lb-57089 service % oc -n e2e-test-router-ingressclass-n2z2c get svc NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE external-lb-57089 LoadBalancer 172.30.198.7 20.42.34.61 28443:30193/TCP 3m6s internal-lb-57089 LoadBalancer 172.30.214.30 <pending> 29443:31507/TCP 3m6s service-secure ClusterIP 172.30.47.70 <none> 27443/TCP 3m13s service-unsecure ClusterIP 172.30.175.59 <none> 27017/TCP 3m13s % 4. % oc -n e2e-test-router-ingressclass-n2z2c get svc internal-lb-57089 -oyaml apiVersion: v1 kind: Service metadata: annotations: service.beta.kubernetes.io/azure-load-balancer-internal: "true" creationTimestamp: "2023-09-12T07:56:42Z" finalizers: - service.kubernetes.io/load-balancer-cleanup name: internal-lb-57089 namespace: e2e-test-router-ingressclass-n2z2c resourceVersion: "209376" uid: b163bc03-b1c6-4e7b-b4e1-c996e9d135f4 spec: allocateLoadBalancerNodePorts: true clusterIP: 172.30.214.30 clusterIPs: - 172.30.214.30 externalTrafficPolicy: Cluster internalTrafficPolicy: Cluster ipFamilies: - IPv4 ipFamilyPolicy: SingleStack ports: - name: https nodePort: 31507 port: 29443 protocol: TCP targetPort: 8443 selector: name: web-server-rc sessionAffinity: None type: LoadBalancer status: loadBalancer: {} %
Actual results:
internal-lb-57089 service couldn't get an external-IP address
Expected results:
internal-lb-57089 service can get an external-IP address
Additional info:
Please review the following PR: https://github.com/openshift/router/pull/453
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-19918. The following is the description of the original issue:
—
Description of problem:
Issue was found when analyzing bug https://issues.redhat.com/browse/OCPBUGS-19817
Version-Release number of selected component (if applicable):
4.15.0-0.ci-2023-09-25-165744
How reproducible:
everytime
Steps to Reproduce:
The cluster is an IPsec cluster with the NS extension and the ipsec service enabled.
1. Enable E-W IPsec and wait for the cluster to settle.
2. Disable IPsec and wait for the cluster to settle.
You'll observe that the IPsec pods are deleted.
Actual results:
no pods
Expected results:
The pods should stay; see https://github.com/openshift/cluster-network-operator/blob/master/pkg/network/ovn_kubernetes.go#L314:
// If IPsec is enabled for the first time, we start the daemonset. If it is
// disabled after that, we do not stop the daemonset but only stop IPsec.
//
// TODO: We need to do this as, by default, we maintain IPsec state on the
// node in order to maintain encrypted connectivity in the case of upgrades.
// If we only unrender the IPsec daemonset, we will be unable to cleanup
// the IPsec state on the node and the traffic will continue to be
// encrypted.
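A hedged check to run after step 2; the DaemonSet name can vary by release, so the grep below avoids pinning it:
~~~
# After disabling IPsec, the IPsec DaemonSet and its pods should still be
# present (only the IPsec configuration itself is stopped).
oc -n openshift-ovn-kubernetes get daemonset | grep -i ipsec
oc -n openshift-ovn-kubernetes get pods -o wide | grep -i ipsec
~~~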
Additional info:
Description of problem:
agent-gather script does not collect agent-tui logs
Version-Release number of selected component (if applicable):
How reproducible:
Log in to a node (before bootstrap is completed) and run the agent-gather script.
Steps to Reproduce:
1. SSH into one of the nodes. 2. Run agent-gather. 3. Check the content of the produced tar artifacts.
Actual results:
The agent-gather-*.tar.xz does not contain agent-tui logs
Expected results:
The agent-gather-*.tar.xz must contain /var/log/agent/agent-tui.log
Additional info:
agent-tui logs are essential for troubleshooting any issue that could happen during the bootstrap and affect the agent-tui console.
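Until agent-gather collects it, a hedged manual workaround is to grab the log directly from the node (the path is taken from the expected results above):
~~~
# Collect the agent-tui log by hand from the node.
tar -czf agent-tui-logs.tar.gz /var/log/agent/agent-tui.log
~~~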
Description of problem:
When deploying 4.12 spoke clusters (using rhcos-412.86.202306132230-0-live.x86_64.iso) or 4.10 spoke clusters from a 4.14.0-ec.4 hub, the BMH gets stuck in the provisioning state due to: Failed to update hostname: Command '['chroot', '/mnt/coreos', 'hostnamectl', 'hostname']' returned non-zero exit status 1. Running `hostnamectl hostname` returns `Unknown operation hostname`. It looks like older versions of hostnamectl do not support the hostname verb.
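A hedged sketch of a version-compatible invocation; the hostname value is a placeholder:
~~~
# "hostnamectl set-hostname" works on both old and new hostnamectl builds,
# whereas the bare "hostname" verb used in the failing command only exists in
# newer versions.
chroot /mnt/coreos hostnamectl set-hostname master-0-0
chroot /mnt/coreos hostnamectl status
~~~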
Version-Release number of selected component (if applicable):
4.14.0-ec.4
How reproducible:
100%
Steps to Reproduce:
1. From a 4.14.0-ec.4 hub cluster, deploy a 4.12 spoke cluster using rhcos-412.86.202306132230-0-live.x86_64.iso via the ZTP procedure.
Actual results:
BMH stuck in provisioning state
Expected results:
BMH gets provisioned
Additional info:
I also tried using a 4.14 ISO image to deploy the 4.12 payload, but then kubelet would fail with err="failed to parse kubelet flag: unknown flag: --container-runtime".
MGMT-7549 added a change to use openshift-install instead of openshift-baremetal-install for platform:none clusters. This was to work around a problem where the baremetal binary was not available for an ARM target cluster, and at the time only none platform was supported on ARM. This problem was resolved by MGMT-9206, so we no longer need the workaround.
Please review the following PR: https://github.com/openshift/prometheus-operator/pull/230
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
oc login --token=$token --server=https://api.dalh-dev-hs-2.05zb.p3.openshiftapps.com:443 --certificate-authority=ca.crt
The server uses a certificate signed by an unknown authority.
You can bypass the certificate check, but any data you send to the server could be intercepted by others.
The referenced "ca.crt" comes from the Secret created when a Service Account is created.
Version-Release number of selected component (if applicable): 4.12.12
How reproducible: Always
Description of problem:
etcd pods running in a HyperShift control plane use an exec probe to check cluster health and have a very small timeout (1s). We should use the same as standalone etcd, with a 30s timeout.
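A minimal sketch of the desired probe shape; the port, path, and the surrounding container spec are assumptions rather than HyperShift's actual values:
~~~
cat <<'EOF' > etcd-probe-fragment.yaml
# Illustrative fragment for the etcd container spec in the hosted control plane.
livenessProbe:
  httpGet:
    path: /health      # etcd serves /health on its metrics listener (port assumed)
    port: 2381
  timeoutSeconds: 30   # match standalone etcd instead of the 1s exec probe
  periodSeconds: 5
  failureThreshold: 6
EOF
~~~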
Version-Release number of selected component (if applicable):
All
How reproducible:
Always
Steps to Reproduce:
1. Create a HyperShift hosted cluster. 2. Examine the etcd pod(s) YAML.
Actual results:
Probe is of type exec and has a timeout of 1s
Expected results:
Probe is of type http and has a timeout of 30s
Additional info:
Description of problem:
The customer wanted to restrict access to the vCenter API, and the originating traffic needs to use a configured EgressIP. This works fine for the Machine API, but the vSphere CSI driver controller uses the host network, and hence the configured EgressIP isn't used.
Is it possible to disable this (use of the host network) for the CSI controller?
slack thread: https://redhat-internal.slack.com/archives/CBQHQFU0N/p1683135077822559
Description of problem:
APIServer endpoint isn't healthy after a PublicAndPrivate cluster is created. PROGRESS of the cluster is Completed and PROCESS is false, Nodes are ready, cluster operators on the guest cluster are Available, only issue is condition Type Available is False due to APIServer endpoint is not healthy. jiezhao-mac:hypershift jiezhao$ oc get hostedcluster -n clusters NAME VERSION KUBECONFIG PROGRESS AVAILABLE PROGRESSING MESSAGE jz-test 4.14.0-0.nightly-2023-04-30-235516 jz-test-admin-kubeconfig Completed False False APIServer endpoint a23663b1e738a4d6783f6256da73fe76-2649b36a23f49ed7.elb.us-east-2.amazonaws.com is not healthy jiezhao-mac:hypershift jiezhao$ oc get hostedcluster/jz-test -n clusters -ojsonpath='{.spec.platform.aws.endpointAccess}{"\n"}' PublicAndPrivate jiezhao-mac:hypershift jiezhao$ oc get pods -n clusters-jz-test NAME READY STATUS RESTARTS AGE aws-cloud-controller-manager-666559d4f-rdsw4 2/2 Running 0 149m aws-ebs-csi-driver-controller-79fdfb6c76-vb7wr 7/7 Running 0 148m aws-ebs-csi-driver-operator-7dbd789984-mb9rp 1/1 Running 0 148m capi-provider-5b7847db9-nlrvz 2/2 Running 0 151m catalog-operator-7ccb468d86-7c5j6 2/2 Running 0 149m certified-operators-catalog-895787778-5rjb6 1/1 Running 0 149m cloud-network-config-controller-86698fd7dd-kgzhv 3/3 Running 0 148m cluster-api-6fd4f86878-hjw59 1/1 Running 0 151m cluster-autoscaler-bdd688949-f9xmk 1/1 Running 0 150m cluster-image-registry-operator-6f5cb67d88-8svd6 3/3 Running 0 149m cluster-network-operator-7bc69f75f4-npjfs 1/1 Running 0 149m cluster-node-tuning-operator-5855b6576b-rckhh 1/1 Running 0 149m cluster-policy-controller-56d4d6b57c-glx4w 1/1 Running 0 149m cluster-storage-operator-7cc56c68bb-jd4d2 1/1 Running 0 149m cluster-version-operator-bd969b677-bh4w4 1/1 Running 0 149m community-operators-catalog-5c545484d7-hbzb4 1/1 Running 0 149m control-plane-operator-fc49dcbb4-5ncvf 2/2 Running 0 151m csi-snapshot-controller-85f7cc9945-n5vgq 1/1 Running 0 149m csi-snapshot-controller-operator-6597b45897-hqf5p 1/1 Running 0 149m csi-snapshot-webhook-644d765546-lk9hj 1/1 Running 0 149m dns-operator-5b5577d6c7-8dh8d 1/1 Running 0 149m etcd-0 2/2 Running 0 150m hosted-cluster-config-operator-5b75ccf55d-6rzch 1/1 Running 0 149m ignition-server-596fc9d9fb-sb94h 1/1 Running 0 150m ingress-operator-6497d476bc-whssz 3/3 Running 0 149m konnectivity-agent-6656d8dfd6-h5tcs 1/1 Running 0 150m konnectivity-server-5ff9d4b47-stb2m 1/1 Running 0 150m kube-apiserver-596fc4bb8b-7kfd8 3/3 Running 0 150m kube-controller-manager-6f86bb7fbd-4wtxk 1/1 Running 0 138m kube-scheduler-bf5876b4b-flk96 1/1 Running 0 149m machine-approver-574585d8dd-h5ffh 1/1 Running 0 150m multus-admission-controller-67b6f85fbf-bfg4x 2/2 Running 0 148m oauth-openshift-6b6bfd55fb-8sdq7 2/2 Running 0 148m olm-operator-5d97fb977c-sbf6w 2/2 Running 0 149m openshift-apiserver-5bb9f99974-2lfp4 3/3 Running 0 138m openshift-controller-manager-65666bdf79-g8cf5 1/1 Running 0 149m openshift-oauth-apiserver-56c8565bb6-6b5cv 2/2 Running 0 149m openshift-route-controller-manager-775f844dfc-jj2ft 1/1 Running 0 149m ovnkube-master-0 7/7 Running 0 148m packageserver-6587d9674b-6jwpv 2/2 Running 0 149m redhat-marketplace-catalog-5f6d45b457-hdn77 1/1 Running 0 149m redhat-operators-catalog-7958c4449b-l4hbx 1/1 Running 0 12m router-5b7899cc97-chs6t 1/1 Running 0 150m jiezhao-mac:hypershift jiezhao$ oc get node --kubeconfig=hostedcluster.kubeconfig NAME STATUS ROLES AGE VERSION ip-10-0-137-99.us-east-2.compute.internal Ready worker 131m v1.26.2+d2e245f ip-10-0-140-85.us-east-2.compute.internal 
Ready worker 132m v1.26.2+d2e245f ip-10-0-141-46.us-east-2.compute.internal Ready worker 131m v1.26.2+d2e245f jiezhao-mac:hypershift jiezhao$ jiezhao-mac:hypershift jiezhao$ oc get co --kubeconfig=hostedcluster.kubeconfig NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE console 4.14.0-0.nightly-2023-04-30-235516 True False False 126m csi-snapshot-controller 4.14.0-0.nightly-2023-04-30-235516 True False False 140m dns 4.14.0-0.nightly-2023-04-30-235516 True False False 129m image-registry 4.14.0-0.nightly-2023-04-30-235516 True False False 128m ingress 4.14.0-0.nightly-2023-04-30-235516 True False False 129m insights 4.14.0-0.nightly-2023-04-30-235516 True False False 130m kube-apiserver 4.14.0-0.nightly-2023-04-30-235516 True False False 140m kube-controller-manager 4.14.0-0.nightly-2023-04-30-235516 True False False 140m kube-scheduler 4.14.0-0.nightly-2023-04-30-235516 True False False 140m kube-storage-version-migrator 4.14.0-0.nightly-2023-04-30-235516 True False False 129m monitoring 4.14.0-0.nightly-2023-04-30-235516 True False False 129m network 4.14.0-0.nightly-2023-04-30-235516 True False False 140m node-tuning 4.14.0-0.nightly-2023-04-30-235516 True False False 131m openshift-apiserver 4.14.0-0.nightly-2023-04-30-235516 True False False 140m openshift-controller-manager 4.14.0-0.nightly-2023-04-30-235516 True False False 140m openshift-samples 4.14.0-0.nightly-2023-04-30-235516 True False False 129m operator-lifecycle-manager 4.14.0-0.nightly-2023-04-30-235516 True False False 140m operator-lifecycle-manager-catalog 4.14.0-0.nightly-2023-04-30-235516 True False False 140m operator-lifecycle-manager-packageserver 4.14.0-0.nightly-2023-04-30-235516 True False False 140m service-ca 4.14.0-0.nightly-2023-04-30-235516 True False False 130m storage 4.14.0-0.nightly-2023-04-30-235516 True False False 131m jiezhao-mac:hypershift jiezhao$ HC conditions: ============== status: conditions: - lastTransitionTime: "2023-05-01T19:45:49Z" message: All is well observedGeneration: 3 reason: AsExpected status: "True" type: ValidAWSIdentityProvider - lastTransitionTime: "2023-05-01T20:00:18Z" message: Cluster version is 4.14.0-0.nightly-2023-04-30-235516 observedGeneration: 3 reason: FromClusterVersion status: "False" type: ClusterVersionProgressing - lastTransitionTime: "2023-05-01T19:46:22Z" message: Payload loaded version="4.14.0-0.nightly-2023-04-30-235516" image="registry.ci.openshift.org/ocp/release:4.14.0-0.nightly-2023-04-30-235516" architecture="amd64" observedGeneration: 3 reason: PayloadLoaded status: "True" type: ClusterVersionReleaseAccepted - lastTransitionTime: "2023-05-01T20:03:14Z" message: Condition not found in the CVO. 
observedGeneration: 3 reason: StatusUnknown status: Unknown type: ClusterVersionUpgradeable - lastTransitionTime: "2023-05-01T20:00:18Z" message: Done applying 4.14.0-0.nightly-2023-04-30-235516 observedGeneration: 3 reason: FromClusterVersion status: "True" type: ClusterVersionAvailable - lastTransitionTime: "2023-05-01T20:00:18Z" message: "" observedGeneration: 3 reason: FromClusterVersion status: "True" type: ClusterVersionSucceeding - lastTransitionTime: "2023-05-01T19:47:51Z" message: The hosted cluster is not degraded observedGeneration: 3 reason: AsExpected status: "False" type: Degraded - lastTransitionTime: "2023-05-01T19:45:01Z" message: "" observedGeneration: 3 reason: QuorumAvailable status: "True" type: EtcdAvailable - lastTransitionTime: "2023-05-01T19:45:38Z" message: Kube APIServer deployment is available observedGeneration: 3 reason: AsExpected status: "True" type: KubeAPIServerAvailable - lastTransitionTime: "2023-05-01T19:44:27Z" message: All is well observedGeneration: 3 reason: AsExpected status: "True" type: InfrastructureReady - lastTransitionTime: "2023-05-01T19:44:11Z" message: External DNS is not configured observedGeneration: 3 reason: StatusUnknown status: Unknown type: ExternalDNSReachable - lastTransitionTime: "2023-05-01T19:44:19Z" message: Configuration passes validation observedGeneration: 3 reason: AsExpected status: "True" type: ValidHostedControlPlaneConfiguration - lastTransitionTime: "2023-05-01T19:44:11Z" message: AWS KMS is not configured observedGeneration: 3 reason: StatusUnknown status: Unknown type: ValidAWSKMSConfig - lastTransitionTime: "2023-05-01T19:44:37Z" message: All is well observedGeneration: 3 reason: AsExpected status: "True" type: ValidReleaseInfo - lastTransitionTime: "2023-05-01T19:44:11Z" message: APIServer endpoint a23663b1e738a4d6783f6256da73fe76-2649b36a23f49ed7.elb.us-east-2.amazonaws.com is not healthy observedGeneration: 3 reason: waitingForAvailable status: "False" type: Available - lastTransitionTime: "2023-05-01T19:47:18Z" message: All is well reason: AWSSuccess status: "True" type: AWSEndpointAvailable - lastTransitionTime: "2023-05-01T19:47:18Z" message: All is well reason: AWSSuccess status: "True" type: AWSEndpointServiceAvailable - lastTransitionTime: "2023-05-01T19:44:11Z" message: Configuration passes validation observedGeneration: 3 reason: AsExpected status: "True" type: ValidConfiguration - lastTransitionTime: "2023-05-01T19:44:11Z" message: HostedCluster is supported by operator configuration observedGeneration: 3 reason: AsExpected status: "True" type: SupportedHostedCluster - lastTransitionTime: "2023-05-01T19:45:39Z" message: Ignition server deployment is available observedGeneration: 3 reason: AsExpected status: "True" type: IgnitionEndpointAvailable - lastTransitionTime: "2023-05-01T19:44:11Z" message: Reconciliation active on resource observedGeneration: 3 reason: AsExpected status: "True" type: ReconciliationActive - lastTransitionTime: "2023-05-01T19:44:12Z" message: Release image is valid observedGeneration: 3 reason: AsExpected status: "True" type: ValidReleaseImage - lastTransitionTime: "2023-05-01T19:44:12Z" message: HostedCluster is at expected version observedGeneration: 3 reason: AsExpected status: "False" type: Progressing - lastTransitionTime: "2023-05-01T19:44:13Z" message: OIDC configuration is valid observedGeneration: 3 reason: AsExpected status: "True" type: ValidOIDCConfiguration - lastTransitionTime: "2023-05-01T19:44:13Z" message: Reconciliation completed succesfully observedGeneration: 
3 reason: ReconciliatonSucceeded status: "True" type: ReconciliationSucceeded - lastTransitionTime: "2023-05-01T19:45:52Z" message: All is well observedGeneration: 3 reason: AsExpected status: "True" type: AWSDefaultSecurityGroupCreated kube-apiserver log: ================== E0501 19:45:07.024278 7 memcache.go:238] couldn't get current server API group list: Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused unable to recognize "/work/0000_03_authorization-openshift_01_rolebindingrestriction.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused unable to recognize "/work/0000_03_config-operator_01_proxy.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused unable to recognize "/work/0000_03_quota-openshift_01_clusterresourcequota.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused unable to recognize "/work/0000_03_security-openshift_01_scc.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused unable to recognize "/work/0000_03_securityinternal-openshift_02_rangeallocation.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused unable to recognize "/work/0000_10_config-operator_01_apiserver-Default.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused unable to recognize "/work/0000_10_config-operator_01_authentication.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused unable to recognize "/work/0000_10_config-operator_01_build.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused unable to recognize "/work/0000_10_config-operator_01_console.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused unable to recognize "/work/0000_10_config-operator_01_dns.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused unable to recognize "/work/0000_10_config-operator_01_featuregate.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused unable to recognize "/work/0000_10_config-operator_01_image.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused unable to recognize "/work/0000_10_config-operator_01_imagecontentpolicy.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused unable to recognize "/work/0000_10_config-operator_01_imagecontentsourcepolicy.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused unable to recognize "/work/0000_10_config-operator_01_imagedigestmirrorset.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused unable to recognize "/work/0000_10_config-operator_01_imagetagmirrorset.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused unable to recognize "/work/0000_10_config-operator_01_infrastructure-Default.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused unable to recognize "/work/0000_10_config-operator_01_ingress.crd.yaml": Get 
"https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused unable to recognize "/work/0000_10_config-operator_01_network.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused unable to recognize "/work/0000_10_config-operator_01_node.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused unable to recognize "/work/0000_10_config-operator_01_oauth.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused unable to recognize "/work/0000_10_config-operator_01_project.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused unable to recognize "/work/0000_10_config-operator_01_scheduler.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Create a PublicAndPrivate cluster
Actual results:
APIServer endpoint is not healthy, and HC condition Type 'Available' is False
Expected results:
APIServer endpoint should be healthy, and Type 'Available' should be True
Additional info:
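A quick way to check the reported condition (the HostedCluster name and namespace are placeholders):
$ oc get hostedcluster <hc-name> -n <hc-namespace> -o jsonpath='{.status.conditions[?(@.type=="Available")].message}'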
Description of problem:
console will have panic error when duplicate entry is set in spec.plugins
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2022-12-19-122634
How reproducible:
Always
Steps to Reproduce:
1. Create the console-demo-plugin manifests
$ oc apply -f dynamic-demo-plugin/oc-manifest.yaml
namespace/console-demo-plugin created
deployment.apps/console-demo-plugin created
service/console-demo-plugin created
consoleplugin.console.openshift.io/console-demo-plugin created
2. Enable console-demo-plugin
$ oc patch consoles.operator.openshift.io cluster --patch '{ "spec": { "plugins": ["console-demo-plugin"] } }' --type=merge
console.operator.openshift.io/cluster patched
3. Add a duplicate entry in spec.plugins in consoles.operator/cluster
$ oc patch consoles.operator.openshift.io cluster --patch '{ "spec": { "plugins": ["console-demo-plugin", "console-demo-plugin"] } }' --type=merge
console.operator.openshift.io/cluster patched
$ oc get consoles.operator cluster -o json | jq .spec.plugins
[
  "console-demo-plugin",
  "console-demo-plugin"
]
4. Check the console pods status
$ oc get pods -n openshift-console
NAME READY STATUS RESTARTS AGE
console-6bcc87c7b4-6g2cf 0/1 CrashLoopBackOff 1 (21s ago) 50s
console-6bcc87c7b4-9g6kk 0/1 CrashLoopBackOff 3 (3s ago) 50s
console-7dc78ffd78-sxvcv 1/1 Running 0 2m58s
downloads-758fc74758-9k426 1/1 Running 0 3h18m
downloads-758fc74758-k4q72 1/1 Running 0 3h21m
Actual results:
3. The console pods will be in CrashLoopBackOff status
$ oc logs console-6bcc87c7b4-9g6kk -n openshift-console
W1220 06:48:37.279871 1 main.go:228] Flag inactivity-timeout is set to less then 300 seconds and will be ignored!
I1220 06:48:37.279889 1 main.go:238] The following console plugins are enabled:
I1220 06:48:37.279898 1 main.go:240] - console-demo-plugin
I1220 06:48:37.279911 1 main.go:354] cookies are secure!
I1220 06:48:37.331802 1 server.go:607] The following console endpoints are now proxied to these services:
I1220 06:48:37.331843 1 server.go:610] - /api/proxy/plugin/console-demo-plugin/thanos-querier/ -> https://thanos-querier.openshift-monitoring.svc.cluster.local:9091
I1220 06:48:37.331884 1 server.go:610] - /api/proxy/plugin/console-demo-plugin/thanos-querier/ -> https://thanos-querier.openshift-monitoring.svc.cluster.local:9091
panic: http: multiple registrations for /api/proxy/plugin/console-demo-plugin/thanos-querier/
goroutine 1 [running]:
net/http.(*ServeMux).Handle(0xc0005b6600, {0xc0005d9a40, 0x35}, {0x35aaf60?, 0xc000735260})
  /usr/lib/golang/src/net/http/server.go:2503 +0x239
github.com/openshift/console/pkg/server.(*Server).HTTPHandler.func1({0xc0005d9940?, 0x35?}, {0x35aaf60, 0xc000735260})
  /go/src/github.com/openshift/console/pkg/server/server.go:245 +0x149
github.com/openshift/console/pkg/server.(*Server).HTTPHandler(0xc000056c00)
  /go/src/github.com/openshift/console/pkg/server/server.go:621 +0x330b
main.main()
  /go/src/github.com/openshift/console/cmd/bridge/main.go:785 +0x5ff5
Expected results:
3. console pods should be running well
Additional info:
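A possible way to recover the console pods (a sketch that essentially re-applies step 2 with a de-duplicated list):
$ oc patch consoles.operator.openshift.io cluster --patch '{ "spec": { "plugins": ["console-demo-plugin"] } }' --type=merge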
Node healthz server was added in 4.13 with https://github.com/openshift/ovn-kubernetes/commit/c8489e3ff9c321e77f265dc9d484ed2549df4a6b and https://github.com/openshift/ovn-kubernetes/commit/9a836e3a547f3464d433ce8b9eef336624d51858. We need to configure it by default on 0.0.0.0:10256 on CNO for ovnk, just like we do for sdn.
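As a rough check (assuming the endpoint behaves like the kube-proxy/sdn healthz endpoint, and <node-ip> is a placeholder), the healthz server can be probed once CNO configures it:
$ curl -s -o /dev/null -w '%{http_code}\n' http://<node-ip>:10256/healthz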
Please review the following PR: https://github.com/openshift/prometheus-operator/pull/221
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
If both of the annotations mentioned below are used on an operator CSV, the uninstall instructions don't show up in the UI.
- console.openshift.io/disable-operand-delete: "true"
- operator.openshift.io/uninstall-message: "some message"
Version-Release number of selected component (if applicable):
➜ $> oc version
Client Version: 4.12.0
Kustomize Version: v4.5.7
Server Version: 4.13.0-rc.5
Kubernetes Version: v1.26.3+379cd9f
➜ $> oc get co | grep console
console 4.13.0-rc.5 True False False 4h49m
How reproducible:
Always
Steps to Reproduce:
1. Add both of the mentioned annotations to an operator CSV.
2. Make sure "console.openshift.io/disable-operand-delete" is set to "true".
3. Upon clicking "Uninstall operator", the result can be observed in the pop-up.
Actual results:
The uninstall pop-up doesn't have the "Message from Operator developer" section.
Expected results:
The uninstall instructions should show up under "Message from Operator developer".
Additional info:
The two annotations seem to be linked here: https://github.com/openshift/console/blob/3e0bb0928ce09030bc3340c9639b2a1df9e0a007/frontend/packages/operator-lifecycle-manager/src/components/modals/uninstall-operator-modal.tsx#LL395C10-L395C26
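For illustration, both annotations can be applied to a CSV with oc annotate; the CSV name and namespace are placeholders:
$ oc annotate csv <csv-name> -n <namespace> console.openshift.io/disable-operand-delete='true' operator.openshift.io/uninstall-message='some message'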
The version tracker needs an update.
When the ingress operator creates or updates a router deployment that specifies spec.template.spec.hostNetwork: true, the operator does not set spec.template.spec.containers[*].ports[*].hostPort. As a result, the API sets each port's hostPort field to the port's containerPort field value. The operator detects this as an external update and attempts to revert it. The operator should not update the deployment in response to API defaulting.
I observed this in CI for OCP 4.14 and was able to reproduce the issue on OCP 4.11.37. The problematic code was added in https://github.com/openshift/cluster-ingress-operator/pull/694/commits/af653f9fa7368cf124e11b7ea4666bc40e601165 in OCP 4.11 to implement NE-674.
Easily.
1. Create an IngressController that specifies the "HostNetwork" endpoint publishing strategy type:
oc create -f - <<EOF
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: example-hostnetwork
  namespace: openshift-ingress-operator
spec:
  domain: example.xyz
  endpointPublishingStrategy:
    type: HostNetwork
EOF
2. Check the ingress operator's logs:
oc -n openshift-ingress-operator logs -c ingress-operator deployments/ingress-operator
The ingress operator logs "updated router deployment" multiple times for the "example-hostnetwork" IngressController, such as the following:
2023-06-15T02:11:47.229Z INFO operator.ingress_controller ingress/deployment.go:131 updated router deployment {"namespace": "openshift-ingress", "name": "router-example-hostnetwork", "diff": " &v1.Deployment{\n \tTypeMeta: {},\n \tObjectMeta: {Name: \"router-example-hostnetwork\", Namespace: \"openshift-ingress\", UID: \"d7c51022-460e-4962-8521-e00255f649c3\", ResourceVersion: \"3356177\", ...},\n \tSpec: v1.DeploymentSpec{\n \t\tReplicas: &2,\n \t\tSelector: &{MatchLabels: {\"ingresscontroller.operator.openshift.io/deployment-ingresscontroller\": \"example-hostnetwork\"}},\n \t\tTemplate: v1.PodTemplateSpec{\n \t\t\tObjectMeta: {Labels: {\"ingresscontroller.operator.openshift.io/deployment-ingresscontroller\": \"example-hostnetwork\", \"ingresscontroller.operator.openshift.io/hash\": \"b7c697fd\"}, Annotations: {\"target.workload.openshift.io/management\": `{\"effect\": \"PreferredDuringScheduling\"}`, \"unsupported.do-not-use.openshift.io/override-liveness-grace-period-seconds\": \"10\"}},\n \t\t\tSpec: v1.PodSpec{\n \t\t\t\tVolumes: []v1.Volume{\n \t\t\t\t\t{Name: \"default-certificate\", VolumeSource: {Secret: &{SecretName: \"router-certs-example-hostnetwork\", DefaultMode: &420}}},\n \t\t\t\t\t{\n \t\t\t\t\t\tName: \"service-ca-bundle\",\n \t\t\t\t\t\tVolumeSource: v1.VolumeSource{\n \t\t\t\t\t\t\t... // 16 identical fields\n \t\t\t\t\t\t\tFC: nil,\n \t\t\t\t\t\t\tAzureFile: nil,\n \t\t\t\t\t\t\tConfigMap: &v1.ConfigMapVolumeSource{\n \t\t\t\t\t\t\t\tLocalObjectReference: {Name: \"service-ca-bundle\"},\n \t\t\t\t\t\t\t\tItems: {{Key: \"service-ca.crt\", Path: \"service-ca.crt\"}},\n- \t\t\t\t\t\t\t\tDefaultMode: &420,\n+ \t\t\t\t\t\t\t\tDefaultMode: nil,\n \t\t\t\t\t\t\t\tOptional: &false,\n \t\t\t\t\t\t\t},\n \t\t\t\t\t\t\tVsphereVolume: nil,\n \t\t\t\t\t\t\tQuobyte: nil,\n \t\t\t\t\t\t\t... // 8 identical fields\n \t\t\t\t\t\t},\n \t\t\t\t\t},\n \t\t\t\t\t{\n \t\t\t\t\t\tName: \"stats-auth\",\n \t\t\t\t\t\tVolumeSource: v1.VolumeSource{\n \t\t\t\t\t\t\t... // 3 identical fields\n \t\t\t\t\t\t\tAWSElasticBlockStore: nil,\n \t\t\t\t\t\t\tGitRepo: nil,\n \t\t\t\t\t\t\tSecret: &v1.SecretVolumeSource{\n \t\t\t\t\t\t\t\tSecretName: \"router-stats-example-hostnetwork\",\n \t\t\t\t\t\t\t\tItems: nil,\n- \t\t\t\t\t\t\t\tDefaultMode: &420,\n+ \t\t\t\t\t\t\t\tDefaultMode: nil,\n \t\t\t\t\t\t\t\tOptional: nil,\n \t\t\t\t\t\t\t},\n \t\t\t\t\t\t\tNFS: nil,\n \t\t\t\t\t\t\tISCSI: nil,\n \t\t\t\t\t\t\t... // 21 identical fields\n \t\t\t\t\t\t},\n \t\t\t\t\t},\n \t\t\t\t\t{\n \t\t\t\t\t\tName: \"metrics-certs\",\n \t\t\t\t\t\tVolumeSource: v1.VolumeSource{\n \t\t\t\t\t\t\t... // 3 identical fields\n \t\t\t\t\t\t\tAWSElasticBlockStore: nil,\n \t\t\t\t\t\t\tGitRepo: nil,\n \t\t\t\t\t\t\tSecret: &v1.SecretVolumeSource{\n \t\t\t\t\t\t\t\tSecretName: \"router-metrics-certs-example-hostnetwork\",\n \t\t\t\t\t\t\t\tItems: nil,\n- \t\t\t\t\t\t\t\tDefaultMode: &420,\n+ \t\t\t\t\t\t\t\tDefaultMode: nil,\n \t\t\t\t\t\t\t\tOptional: nil,\n \t\t\t\t\t\t\t},\n \t\t\t\t\t\t\tNFS: nil,\n \t\t\t\t\t\t\tISCSI: nil,\n \t\t\t\t\t\t\t... // 21 identical fields\n \t\t\t\t\t\t},\n \t\t\t\t\t},\n \t\t\t\t},\n \t\t\t\tInitContainers: nil,\n \t\t\t\tContainers: []v1.Container{\n \t\t\t\t\t{\n \t\t\t\t\t\t... 
// 3 identical fields\n \t\t\t\t\t\tArgs: nil,\n \t\t\t\t\t\tWorkingDir: \"\",\n \t\t\t\t\t\tPorts: []v1.ContainerPort{\n \t\t\t\t\t\t\t{\n \t\t\t\t\t\t\t\tName: \"http\",\n- \t\t\t\t\t\t\t\tHostPort: 80,\n+ \t\t\t\t\t\t\t\tHostPort: 0,\n \t\t\t\t\t\t\t\tContainerPort: 80,\n \t\t\t\t\t\t\t\tProtocol: \"TCP\",\n \t\t\t\t\t\t\t\tHostIP: \"\",\n \t\t\t\t\t\t\t},\n \t\t\t\t\t\t\t{\n \t\t\t\t\t\t\t\tName: \"https\",\n- \t\t\t\t\t\t\t\tHostPort: 443,\n+ \t\t\t\t\t\t\t\tHostPort: 0,\n \t\t\t\t\t\t\t\tContainerPort: 443,\n \t\t\t\t\t\t\t\tProtocol: \"TCP\",\n \t\t\t\t\t\t\t\tHostIP: \"\",\n \t\t\t\t\t\t\t},\n \t\t\t\t\t\t\t{\n \t\t\t\t\t\t\t\tName: \"metrics\",\n- \t\t\t\t\t\t\t\tHostPort: 1936,\n+ \t\t\t\t\t\t\t\tHostPort: 0,\n \t\t\t\t\t\t\t\tContainerPort: 1936,\n \t\t\t\t\t\t\t\tProtocol: \"TCP\",\n \t\t\t\t\t\t\t\tHostIP: \"\",\n \t\t\t\t\t\t\t},\n \t\t\t\t\t\t},\n \t\t\t\t\t\tEnvFrom: nil,\n \t\t\t\t\t\tEnv: {{Name: \"DEFAULT_CERTIFICATE_DIR\", Value: \"/etc/pki/tls/private\"}, {Name: \"DEFAULT_DESTINATION_CA_PATH\", Value: \"/var/run/configmaps/service-ca/service-ca.crt\"}, {Name: \"RELOAD_INTERVAL\", Value: \"5s\"}, {Name: \"ROUTER_ALLOW_WILDCARD_ROUTES\", Value: \"false\"}, ...},\n \t\t\t\t\t\tResources: {Requests: {s\"cpu\": {i: {...}, s: \"100m\", Format: \"DecimalSI\"}, s\"memory\": {i: {...}, Format: \"BinarySI\"}}},\n \t\t\t\t\t\tVolumeMounts: {{Name: \"default-certificate\", ReadOnly: true, MountPath: \"/etc/pki/tls/private\"}, {Name: \"service-ca-bundle\", ReadOnly: true, MountPath: \"/var/run/configmaps/service-ca\"}, {Name: \"stats-auth\", ReadOnly: true, MountPath: \"/var/lib/haproxy/conf/metrics-auth\"}, {Name: \"metrics-certs\", ReadOnly: true, MountPath: \"/etc/pki/tls/metrics-certs\"}},\n \t\t\t\t\t\tVolumeDevices: nil,\n \t\t\t\t\t\tLivenessProbe: &v1.Probe{\n \t\t\t\t\t\t\tProbeHandler: v1.ProbeHandler{\n \t\t\t\t\t\t\t\tExec: nil,\n \t\t\t\t\t\t\t\tHTTPGet: &v1.HTTPGetAction{\n \t\t\t\t\t\t\t\t\tPath: \"/healthz\",\n \t\t\t\t\t\t\t\t\tPort: {IntVal: 1936},\n \t\t\t\t\t\t\t\t\tHost: \"localhost\",\n- \t\t\t\t\t\t\t\t\tScheme: \"HTTP\",\n+ \t\t\t\t\t\t\t\t\tScheme: \"\",\n \t\t\t\t\t\t\t\t\tHTTPHeaders: nil,\n \t\t\t\t\t\t\t\t},\n \t\t\t\t\t\t\t\tTCPSocket: nil,\n \t\t\t\t\t\t\t\tGRPC: nil,\n \t\t\t\t\t\t\t},\n \t\t\t\t\t\t\tInitialDelaySeconds: 0,\n \t\t\t\t\t\t\tTimeoutSeconds: 1,\n- \t\t\t\t\t\t\tPeriodSeconds: 10,\n+ \t\t\t\t\t\t\tPeriodSeconds: 0,\n- \t\t\t\t\t\t\tSuccessThreshold: 1,\n+ \t\t\t\t\t\t\tSuccessThreshold: 0,\n- \t\t\t\t\t\t\tFailureThreshold: 3,\n+ \t\t\t\t\t\t\tFailureThreshold: 0,\n \t\t\t\t\t\t\tTerminationGracePeriodSeconds: nil,\n \t\t\t\t\t\t},\n \t\t\t\t\t\tReadinessProbe: &v1.Probe{\n \t\t\t\t\t\t\tProbeHandler: v1.ProbeHandler{\n \t\t\t\t\t\t\t\tExec: nil,\n \t\t\t\t\t\t\t\tHTTPGet: &v1.HTTPGetAction{\n \t\t\t\t\t\t\t\t\tPath: \"/healthz/ready\",\n \t\t\t\t\t\t\t\t\tPort: {IntVal: 1936},\n \t\t\t\t\t\t\t\t\tHost: \"localhost\",\n- \t\t\t\t\t\t\t\t\tScheme: \"HTTP\",\n+ \t\t\t\t\t\t\t\t\tScheme: \"\",\n \t\t\t\t\t\t\t\t\tHTTPHeaders: nil,\n \t\t\t\t\t\t\t\t},\n \t\t\t\t\t\t\t\tTCPSocket: nil,\n \t\t\t\t\t\t\t\tGRPC: nil,\n \t\t\t\t\t\t\t},\n \t\t\t\t\t\t\tInitialDelaySeconds: 0,\n \t\t\t\t\t\t\tTimeoutSeconds: 1,\n- \t\t\t\t\t\t\tPeriodSeconds: 10,\n+ \t\t\t\t\t\t\tPeriodSeconds: 0,\n- \t\t\t\t\t\t\tSuccessThreshold: 1,\n+ \t\t\t\t\t\t\tSuccessThreshold: 0,\n- \t\t\t\t\t\t\tFailureThreshold: 3,\n+ \t\t\t\t\t\t\tFailureThreshold: 0,\n \t\t\t\t\t\t\tTerminationGracePeriodSeconds: nil,\n \t\t\t\t\t\t},\n \t\t\t\t\t\tStartupProbe: &v1.Probe{\n 
\t\t\t\t\t\t\tProbeHandler: v1.ProbeHandler{\n \t\t\t\t\t\t\t\tExec: nil,\n \t\t\t\t\t\t\t\tHTTPGet: &v1.HTTPGetAction{\n \t\t\t\t\t\t\t\t\tPath: \"/healthz/ready\",\n \t\t\t\t\t\t\t\t\tPort: {IntVal: 1936},\n \t\t\t\t\t\t\t\t\tHost: \"localhost\",\n- \t\t\t\t\t\t\t\t\tScheme: \"HTTP\",\n+ \t\t\t\t\t\t\t\t\tScheme: \"\",\n \t\t\t\t\t\t\t\t\tHTTPHeaders: nil,\n \t\t\t\t\t\t\t\t},\n \t\t\t\t\t\t\t\tTCPSocket: nil,\n \t\t\t\t\t\t\t\tGRPC: nil,\n \t\t\t\t\t\t\t},\n \t\t\t\t\t\t\tInitialDelaySeconds: 0,\n \t\t\t\t\t\t\tTimeoutSeconds: 1,\n \t\t\t\t\t\t\tPeriodSeconds: 1,\n- \t\t\t\t\t\t\tSuccessThreshold: 1,\n+ \t\t\t\t\t\t\tSuccessThreshold: 0,\n \t\t\t\t\t\t\tFailureThreshold: 120,\n \t\t\t\t\t\t\tTerminationGracePeriodSeconds: nil,\n \t\t\t\t\t\t},\n \t\t\t\t\t\tLifecycle: nil,\n \t\t\t\t\t\tTerminationMessagePath: \"/dev/termination-log\",\n \t\t\t\t\t\t... // 6 identical fields\n \t\t\t\t\t},\n \t\t\t\t},\n \t\t\t\tEphemeralContainers: nil,\n \t\t\t\tRestartPolicy: \"Always\",\n \t\t\t\t... // 31 identical fields\n \t\t\t},\n \t\t},\n \t\tStrategy: {Type: \"RollingUpdate\", RollingUpdate: &{MaxUnavailable: &{Type: 1, StrVal: \"25%\"}, MaxSurge: &{}}},\n \t\tMinReadySeconds: 30,\n \t\t... // 3 identical fields\n \t},\n \tStatus: {ObservedGeneration: 1, Replicas: 2, UpdatedReplicas: 2, UnavailableReplicas: 2, ...},\n }\n"}
Note the following in the diff:
Ports: []v1.ContainerPort{
    {
        Name:          "http",
-       HostPort:      80,
+       HostPort:      0,
        ContainerPort: 80,
        Protocol:      "TCP",
        HostIP:        "",
    },
    {
        Name:          "https",
-       HostPort:      443,
+       HostPort:      0,
        ContainerPort: 443,
        Protocol:      "TCP",
        HostIP:        "",
    },
    {
        Name:          "metrics",
-       HostPort:      1936,
+       HostPort:      0,
        ContainerPort: 1936,
        Protocol:      "TCP",
        HostIP:        "",
    },
},
The operator should ignore updates by the API that only set default values. The operator should not perform these unnecessary updates to the router deployment.
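To see the defaulting that triggers these updates, the live router deployment's container ports can be inspected (assuming the router container is the first container in the pod template):
$ oc -n openshift-ingress get deployment/router-example-hostnetwork -o jsonpath='{.spec.template.spec.containers[0].ports}'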
Description of problem:
oc-mirror fails on the arm64 platform with the following error:
Rendering catalog image "ec2-18-224-73-36.us-east-2.compute.amazonaws.com:5000/arm/home/ec2-user/ocmtest/oci-multi-index:1fb06f" with file-based catalog
Rendering catalog image "ec2-18-224-73-36.us-east-2.compute.amazonaws.com:5000/arm/redhat/community-operator-index:v4.13" with file-based catalog
error: error rebuilding catalog images from file-based catalogs: error regenerating the cache for ec2-18-224-73-36.us-east-2.compute.amazonaws.com:5000/arm/redhat/community-operator-index:v4.13: fork/exec /home/ec2-user/ocmtest/oc-mirror-workspace/src/catalogs/registry.redhat.io/redhat/community-operator-index/v4.13/bin/usr/bin/registry/opm: exec format error
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Clone the repo to the arm64 cluster and build oc-mirror;
2. Copy the catalog index to localhost:
`skopeo copy --all --format oci docker://registry.redhat.io/redhat/redhat-operator-index:v4.13 oci:///home/ec2-user/ocmtest/oci-multi-index --remove-signatures`
3. Run the oc-mirror command with the following ImageSetConfiguration:
apiVersion: mirror.openshift.io/v1alpha2
kind: ImageSetConfiguration
archiveSize: 16
mirror:
  operators:
  - catalog: oci:///home/ec2-user/ocmtest/oci-multi-index
    full: false # only mirror the latest versions
    packages:
    - name: cluster-logging
  - catalog: registry.redhat.io/redhat/community-operator-index:v4.13
    full: false # only mirror the latest versions
    packages:
    - name: namespace-configuration-operator
`oc-mirror --config config-413.yaml docker://xxxx:5000/arm --dest-skip-tls`
Expected results:
No errors; the mirroring should succeed.
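A quick way to confirm the root cause (the path comes from the error above) is to check the architecture of the extracted opm binary, which is presumably an x86-64 build being executed on an arm64 host:
$ file /home/ec2-user/ocmtest/oc-mirror-workspace/src/catalogs/registry.redhat.io/redhat/community-operator-index/v4.13/bin/usr/bin/registry/opm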
After installation with the assisted installer, the cluster contains BareMetalHost CRs (in the 'unmanaged' state) generated by assisted. These CRs include HardwareDetails data captured from the assisted-installer-agent.
Likely due to misleading documentation in Metal³ (since fixed by https://github.com/metal3-io/baremetal-operator/pull/657), the name field of storage devices is set to a name like sda instead of what Metal³'s own inspection would set it to, which is /dev/sda. This field is meant to be round-trippable to the rootDeviceHints, and as things stand it is not.
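A sketch for comparing the two fields on an affected host (the host name is a placeholder, the jsonpath assumes the usual metal3 layout with hardware details under status.hardware, and assisted-generated BareMetalHosts may live in a different namespace):
$ oc -n openshift-machine-api get bmh <host-name> -o jsonpath='{.status.hardware.storage[*].name}'
$ oc -n openshift-machine-api get bmh <host-name> -o jsonpath='{.spec.rootDeviceHints.deviceName}'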
Description of problem:
Due to https://github.com/openshift/cluster-monitoring-operator/pull/1986, the prometheus-operator was instructed to inject the app.kubernetes.io/part-of: openshift-monitoring label (via its --labels option) into the resources it creates. The label is also added to the generated StatefulSets, which causes them to be deleted and recreated during the upgrade.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
upgrade to a 4.14 version with the commit https://github.com/openshift/cluster-monitoring-operator/pull/1986
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
We should avoid recreating the statefulset as this leads to downtime (for Prometheus, both Pods are recreated)
Additional info:
When we set the k8s.ovn.org/node-primary-ifaddr annotation on the node, we simply take the first valid IP address we find on the node gateway. We exclude link-local addresses and those in internally reserved subnets (https://github.com/openshift/ovn-kubernetes/pull/1386).
Now, we might have more than one "valid" IP address on the gateway, as observed in:
https://bugzilla.redhat.com/show_bug.cgi?id=2081390#c11 , https://bugzilla.redhat.com/show_bug.cgi?id=2081390#c14
For instance, taken from a different cluster than in the linked BZ:
sh-4.4# ip a show br-ex
7: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether 00:52:12:af:f3:53 brd ff:ff:ff:ff:ff:ff
inet6 fd69::2/125 scope global dadfailed tentative <---- masquerade IP, excluded
valid_lft forever preferred_lft forever
inet6 fd2e:6f44:5dd8:c956::4/128 scope global nodad deprecated <--- real node IP, included
valid_lft forever preferred_lft 0sec
inet6 fd2e:6f44:5dd8:c956::17/128 scope global dynamic noprefixroute <---added by keepalive, INCLUDED!!
valid_lft 3017sec preferred_lft 3017sec
inet6 fe80::252:12ff:feaf:f353/64 scope link noprefixroute <--- link local, excluded
valid_lft forever preferred_lft forever
Above, fd2e:6f44:5dd8:c956::17/128 is the ingress LB VIP added by keepalived, yet it is still treated as a candidate node IP.
We don't currently distinguish in the code between the node IP as in node.spec.IP and other IPs that might be added to br-ex by other components.
Would it be a good idea to just set the node primary address annotation to match node.spec.IP?
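A rough way to compare the annotation with what the node itself reports (the node name is a placeholder):
$ oc get node <node-name> -o json | jq -r '.metadata.annotations["k8s.ovn.org/node-primary-ifaddr"]'
$ oc get node <node-name> -o json | jq -r '.status.addresses'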
Description of problem:
If you check the Ironic API logs from a bootstrap VM, you'll see that terraform is making several GET requests per second. This is far too frequent; bare metal machine states do not change that fast, not even on virtual emulation.
2023-03-01 12:37:38.234 1 INFO eventlet.wsgi.server [None req-c5628ecb-c94c-4b7c-95b3-2ee933ba850b - - - - - -] fd2e:6f44:5dd8:c956::1 "GET /v1/nodes/a7364b73-eefb-4f0a-8d63-753d30b9d090 HTTP/1.1" status: 200 len: 3659 time: 0.0060174[00m
2023-03-01 12:37:38.240 1 INFO eventlet.wsgi.server [None req-275e077e-8ec7-43a9-8948-e1d39b46b331 - - - - - -] fd2e:6f44:5dd8:c956::1 "GET /v1/nodes/a7364b73-eefb-4f0a-8d63-753d30b9d090 HTTP/1.1" status: 200 len: 3659 time: 0.0056679[00m
2023-03-01 12:37:38.246 1 INFO eventlet.wsgi.server [None req-0d867822-fcff-4ba0-8773-37415b3f532f - - - - - -] fd2e:6f44:5dd8:c956::1 "GET /v1/nodes/a7364b73-eefb-4f0a-8d63-753d30b9d090 HTTP/1.1" status: 200 len: 3659 time: 0.0056052[00m
2023-03-01 12:37:38.252 1 INFO eventlet.wsgi.server [None req-7e64cb21-869e-4a98-ad18-54adb6e5dec5 - - - - - -] fd2e:6f44:5dd8:c956::1 "GET /v1/nodes/a7364b73-eefb-4f0a-8d63-753d30b9d090 HTTP/1.1" status: 200 len: 3659 time: 0.0055907[00m
2023-03-01 12:37:38.258 1 INFO eventlet.wsgi.server [None req-de9995a8-9201-47b0-aa40-505e39b48279 - - - - - -] fd2e:6f44:5dd8:c956::1 "GET /v1/nodes/a7364b73-eefb-4f0a-8d63-753d30b9d090 HTTP/1.1" status: 200 len: 3659 time: 0.0055318[00m
2023-03-01 12:37:38.265 1 INFO eventlet.wsgi.server [None req-9e969582-0388-4e47-ad5b-966e1fd2a6da - - - - - -] fd2e:6f44:5dd8:c956::1 "GET /v1/nodes/a7364b73-eefb-4f0a-8d63-753d30b9d090 HTTP/1.1" status: 200 len: 3659 time: 0.0059781[00m
2023-03-01 12:37:38.354 1 INFO eventlet.wsgi.server [None req-84fad0b8-2a28-476e-90c9-ebb6a9cda833 - - - - - -] fd2e:6f44:5dd8:c956::1 "GET /v1/nodes/a7364b73-eefb-4f0a-8d63-753d30b9d090 HTTP/1.1" status: 200 len: 3659 time: 0.0884116[00m
Description of problem:
Currently the Knative Route's details page does not show the URL of the Route.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Install Knative Serving (Serverless Operator)
2. Create a Serverless function from the Add page
3. Navigate to the Knative Route's details page
Actual results:
No URL is shown
Expected results:
URL should be shown
Additional info:
Images: https://drive.google.com/drive/folders/13Ya0mFhDrgFIrVcq6DaLyOxZbatz82Al?usp=share_link
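For reference, the URL the details page should surface is already present in the Route status and can be read with a jsonpath query (namespace and route name are placeholders):
$ oc get routes.serving.knative.dev <route-name> -n <namespace> -o jsonpath='{.status.url}'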
Description of problem:
When using the agent-based installer to provision OCP, validation failed with the following message:
"id": "sufficient-installation-disk-speed"
"status": "failure"
"message": "While preparing the previous installation the installation disk speed measurement failed or was found to be insufficient"
Version-Release number of selected component (if applicable):
4.13.0 { "versions": { "assisted-installer": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3a8b33263729ab42c0ff29b9d5e8b767b7b1a9b31240c592fa8d173463fb04d1", "assisted-installer-controller": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ce3e2e4aac617077ac98b82d9849659595d85cd31f17b3213da37bc5802b78e1", "assisted-installer-service": "Unknown", "discovery-agent": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:70397ac41dffaa5f3333c00ac0c431eff7debad9177457a038b6e8c77dc4501a" } }
How reproducible:
100%
Steps to Reproduce:
1. Using the agent-based installer, provision the DELL 16G server 2. 3.
Actual results:
Validation failed with "sufficient-installation-disk-speed"
Expected results:
Validation pass
Additional info:
[root@c2-esx02 bin]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
loop0 7:0 0 125.7G 0 loop /var/lib/containers/storage/overlay /var /etc /run/ephemeral
loop1 7:1 0 934M 0 loop /usr /boot / /sysroot
nvme1n1 259:0 0 1.5T 0 disk
nvme0n1 259:2 0 894.2G 0 disk
├─nvme0n1p1 259:6 0 2M 0 part
├─nvme0n1p2 259:7 0 20M 0 part
├─nvme0n1p3 259:8 0 93.1G 0 part
├─nvme0n1p4 259:9 0 701.9G 0 part
└─nvme0n1p5 259:10 0 99.2G 0 part
nvme2n1 259:3 0 1.5T 0 disk
nvme4n1 259:4 0 1.5T 0 disk
nvme3n1 259:5 0 1.5T 0 disk
[root@c2-esx02 bin]# ls -lh /dev |grep nvme
crw-------. 1 root root 239, 0 Jun 12 06:01 nvme0
-rw-r--r--. 1 root root 4.0M Jun 12 06:04 nvme0c0n1
brw-rw----. 1 root disk 259, 2 Jun 12 06:01 nvme0n1
brw-rw----. 1 root disk 259, 6 Jun 12 06:01 nvme0n1p1
brw-rw----. 1 root disk 259, 7 Jun 12 06:01 nvme0n1p2
brw-rw----. 1 root disk 259, 8 Jun 12 06:01 nvme0n1p3
brw-rw----. 1 root disk 259, 9 Jun 12 06:01 nvme0n1p4
brw-rw----. 1 root disk 259, 10 Jun 12 06:01 nvme0n1p5
crw-------. 1 root root 239, 1 Jun 12 06:01 nvme1
brw-rw----. 1 root disk 259, 0 Jun 12 06:01 nvme1n1
crw-------. 1 root root 239, 2 Jun 12 06:01 nvme2
brw-rw----. 1 root disk 259, 3 Jun 12 06:01 nvme2n1
crw-------. 1 root root 239, 3 Jun 12 06:01 nvme3
brw-rw----. 1 root disk 259, 5 Jun 12 06:01 nvme3n1
crw-------. 1 root root 239, 4 Jun 12 06:01 nvme4
brw-rw----. 1 root disk 259, 4 Jun 12 06:01 nvme4n1
[root@c2-esx02 bin]# lsblk -f nvme0c0n1
lsblk: nvme0c0n1: not a block device
[root@c2-esx02 bin]# ls -l /dev/disk/by-id/
total 0
lrwxrwxrwx. 1 root root 13 Jun 12 06:01 google-CN0WW56VFCP0033900HU -> ../../nvme0n1
lrwxrwxrwx. 1 root root 15 Jun 12 06:01 google-CN0WW56VFCP0033900HU-part1 -> ../../nvme0n1p1
lrwxrwxrwx. 1 root root 15 Jun 12 06:01 google-CN0WW56VFCP0033900HU-part2 -> ../../nvme0n1p2
lrwxrwxrwx. 1 root root 15 Jun 12 06:01 google-CN0WW56VFCP0033900HU-part3 -> ../../nvme0n1p3
lrwxrwxrwx. 1 root root 15 Jun 12 06:01 google-CN0WW56VFCP0033900HU-part4 -> ../../nvme0n1p4
lrwxrwxrwx. 1 root root 15 Jun 12 06:01 google-CN0WW56VFCP0033900HU-part5 -> ../../nvme0n1p5
lrwxrwxrwx. 1 root root 13 Jun 12 06:01 google-PHAB112600291P9SGN -> ../../nvme3n1
lrwxrwxrwx. 1 root root 13 Jun 12 06:01 google-PHAB115400P81P9SGN -> ../../nvme2n1
lrwxrwxrwx. 1 root root 13 Jun 12 06:01 google-PHAB120401CP1P9SGN -> ../../nvme1n1
lrwxrwxrwx. 1 root root 13 Jun 12 06:01 google-PHAB124501MF1P9SGN -> ../../nvme4n1
lrwxrwxrwx. 1 root root 13 Jun 12 06:01 nvme-Dell_BOSS-N1_CN0WW56VFCP0033900HU -> ../../nvme0n1
lrwxrwxrwx. 1 root root 15 Jun 12 06:01 nvme-Dell_BOSS-N1_CN0WW56VFCP0033900HU-part1 -> ../../nvme0n1p1
lrwxrwxrwx. 1 root root 15 Jun 12 06:01 nvme-Dell_BOSS-N1_CN0WW56VFCP0033900HU-part2 -> ../../nvme0n1p2
lrwxrwxrwx. 1 root root 15 Jun 12 06:01 nvme-Dell_BOSS-N1_CN0WW56VFCP0033900HU-part3 -> ../../nvme0n1p3
lrwxrwxrwx. 1 root root 15 Jun 12 06:01 nvme-Dell_BOSS-N1_CN0WW56VFCP0033900HU-part4 -> ../../nvme0n1p4
lrwxrwxrwx. 1 root root 15 Jun 12 06:01 nvme-Dell_BOSS-N1_CN0WW56VFCP0033900HU-part5 -> ../../nvme0n1p5
lrwxrwxrwx. 1 root root 13 Jun 12 06:01 nvme-Dell_Ent_NVMe_P5600_MU_U.2_1.6TB_PHAB112600291P9SGN -> ../../nvme3n1
lrwxrwxrwx. 1 root root 13 Jun 12 06:01 nvme-Dell_Ent_NVMe_P5600_MU_U.2_1.6TB_PHAB115400P81P9SGN -> ../../nvme2n1
lrwxrwxrwx. 1 root root 13 Jun 12 06:01 nvme-Dell_Ent_NVMe_P5600_MU_U.2_1.6TB_PHAB120401CP1P9SGN -> ../../nvme1n1
lrwxrwxrwx. 1 root root 13 Jun 12 06:01 nvme-Dell_Ent_NVMe_P5600_MU_U.2_1.6TB_PHAB124501MF1P9SGN -> ../../nvme4n1
lrwxrwxrwx. 1 root root 13 Jun 12 06:01 nvme-eui.0050434209000001 -> ../../nvme0n1
lrwxrwxrwx. 1 root root 15 Jun 12 06:01 nvme-eui.0050434209000001-part1 -> ../../nvme0n1p1
lrwxrwxrwx. 1 root root 15 Jun 12 06:01 nvme-eui.0050434209000001-part2 -> ../../nvme0n1p2
lrwxrwxrwx. 1 root root 15 Jun 12 06:01 nvme-eui.0050434209000001-part3 -> ../../nvme0n1p3
lrwxrwxrwx. 1 root root 15 Jun 12 06:01 nvme-eui.0050434209000001-part4 -> ../../nvme0n1p4
lrwxrwxrwx. 1 root root 15 Jun 12 06:01 nvme-eui.0050434209000001-part5 -> ../../nvme0n1p5
lrwxrwxrwx. 1 root root 13 Jun 12 06:01 nvme-eui.01000000000000005cd2e44e7a445351 -> ../../nvme2n1
lrwxrwxrwx. 1 root root 13 Jun 12 06:01 nvme-eui.01000000000000005cd2e48f14515351 -> ../../nvme1n1
lrwxrwxrwx. 1 root root 13 Jun 12 06:01 nvme-eui.01000000000000005cd2e49d3e605351 -> ../../nvme4n1
lrwxrwxrwx. 1 root root 13 Jun 12 06:01 nvme-eui.01000000000000005cd2e4fd973e5351 -> ../../nvme3n1
[root@c2-esx02 bin]# ls -l /dev/disk/by-path
total 0
lrwxrwxrwx. 1 root root 13 Jun 12 06:01 pci-0000:01:00.0-nvme-1 -> ../../nvme0n1
lrwxrwxrwx. 1 root root 15 Jun 12 06:01 pci-0000:01:00.0-nvme-1-part1 -> ../../nvme0n1p1
lrwxrwxrwx. 1 root root 15 Jun 12 06:01 pci-0000:01:00.0-nvme-1-part2 -> ../../nvme0n1p2
lrwxrwxrwx. 1 root root 15 Jun 12 06:01 pci-0000:01:00.0-nvme-1-part3 -> ../../nvme0n1p3
lrwxrwxrwx. 1 root root 15 Jun 12 06:01 pci-0000:01:00.0-nvme-1-part4 -> ../../nvme0n1p4
lrwxrwxrwx. 1 root root 15 Jun 12 06:01 pci-0000:01:00.0-nvme-1-part5 -> ../../nvme0n1p5
lrwxrwxrwx. 1 root root 13 Jun 12 06:01 pci-0000:c3:00.0-nvme-1 -> ../../nvme1n1
lrwxrwxrwx. 1 root root 13 Jun 12 06:01 pci-0000:c4:00.0-nvme-1 -> ../../nvme2n1
lrwxrwxrwx. 1 root root 13 Jun 12 06:01 pci-0000:c5:00.0-nvme-1 -> ../../nvme3n1
lrwxrwxrwx. 1 root root 13 Jun 12 06:01 pci-0000:c6:00.0-nvme-1 -> ../../nvme4n1
Description of problem:
The timeout for calls to the CSI driver from both the external csi-provisioner and csi-attacher is 15 seconds by default. However, hotplugging a volume into the virtual machine can take up to a minute (sometimes more). This causes the context timeout to expire, in some cases corrupts the bookkeeping of which volumes are attached, and detaching the volumes doesn't always get handled properly afterwards.
Version-Release number of selected component (if applicable):
How reproducible:
Run the standard CSI conformance tests against the CSI driver. In most runs this issue appears as one or two random failed tests. The tests fail because the deletion of the persistent volume never happens. Because of this we cannot get a good signal on the state of the CSI driver.
Steps to Reproduce:
1. 2. 3.
Actual results:
Random failed tests of the csi conformance suite.
Expected results:
csi conformance suite passes
Additional info:
Fixed upstream by increasing the timeouts to 3 minutes instead of 15 seconds.
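As a rough way to verify the configured value (the namespace, deployment, and sidecar container names below are placeholders that depend on how the driver is deployed), the sidecar's --timeout argument can be inspected:
$ oc -n <driver-namespace> get deployment <csi-controller-deployment> -o jsonpath='{.spec.template.spec.containers[?(@.name=="csi-provisioner")].args}'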
Description of problem:
After adding a FailureDomain topology as a day-2 operation, provisioning fails with: error generating accessibility requirements: no topology key found on CSINode ocp-storage-fxsc6-worker-0-fb977
Version-Release number of selected component (if applicable):
pre-merge payload with opt-in CSIMigration PRs
How reproducible:
2/2
Steps to Reproduce:
1. Install the cluster without specifying failureDomains (so the installer generates one).
2. Add a new failureDomain to test topology, and make sure all related resources (datacenter and ClusterComputeResource) are tagged in vSphere.
3. Create a PVC; provisioning fails:
Warning ProvisioningFailed 80m (x14 over 103m) csi.vsphere.vmware.com_ocp-storage-fxsc6-master-0_a18e2651-6455-42b2-abc2-b3b3d197da56 failed to provision volume with StorageClass "thin-csi": error generating accessibility requirements: no topology key found on CSINode ocp-storage-fxsc6-worker-0-fb977
4. Here is the node label and csinode info:
$ oc get node ocp-storage-fxsc6-worker-0-b246w --show-labels
NAME STATUS ROLES AGE VERSION LABELS
ocp-storage-fxsc6-worker-0-b246w Ready worker 8h v1.26.3+2727aff beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=ocp-storage-fxsc6-worker-0-b246w,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos
$ oc get csinode ocp-storage-fxsc6-worker-0-b246w -ojson | jq .spec.drivers[].topologyKeys
null
5. Other logs: I only find something in csi-driver-controller-8597f567f8-4f8z6:
{"level":"info","time":"2023-04-17T10:30:13.352999527Z","caller":"k8sorchestrator/topology.go:326","msg":"failed to retrieve tags for category \"cns.vmware.topology-preferred-datastores\". Reason: GET https://ocp-storage.vmc.qe.devcluster.openshift.com:443/rest/com/vmware/cis/tagging/category/id:cns.vmware.topology-preferred-datastores: 404 Not Found","TraceId":"573c3fc8-e6cf-4594-8154-07bd514fcb46"}
In the vpd (vsphere-problem-detector) pod, the tag check passed:
I0417 11:05:02.711093 1 util.go:110] Looking for CC: workloads-02
I0417 11:05:02.766516 1 zones.go:168] ClusterComputeResource: ClusterComputeResource:domain-c5265 @ /OCP-DC/host/workloads-02
I0417 11:05:02.766622 1 zones.go:64] Validating tags for ClusterComputeResource:domain-c5265.
I0417 11:05:02.813568 1 zones.go:81] Processing attached tags
I0417 11:05:02.813678 1 zones.go:90] Found Region: region-A
I0417 11:05:02.813721 1 zones.go:96] Found Zone: zone-B
I0417 11:05:02.834718 1 util.go:110] Looking for CC: qe-cluster/workloads-03
I0417 11:05:02.844475 1 reflector.go:559] k8s.io/client-go@v0.26.1/tools/cache/reflector.go:169: Watch close - *v1.ConfigMap total 7 items received
I0417 11:05:02.890279 1 zones.go:168] ClusterComputeResource: ClusterComputeResource:domain-c9002 @ /OCP-DC/host/qe-cluster/workloads-03
I0417 11:05:02.890406 1 zones.go:64] Validating tags for ClusterComputeResource:domain-c9002.
I0417 11:05:02.946720 1 zones.go:81] Processing attached tags
I0417 11:05:02.946871 1 zones.go:96] Found Zone: zone-C
I0417 11:05:02.946917 1 zones.go:90] Found Region: region-A
I0417 11:05:02.946965 1 vsphere_check.go:242] CheckZoneTags passed
Actual results:
Provisioning failed.
Expected results:
Provisioning should succeed.
Additional info:
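For reference, topology keys can be listed for all nodes at once, which makes it easy to spot the nodes the driver has not labelled:
$ oc get csinode -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.drivers[*].topologyKeys}{"\n"}{end}'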
Please review the following PR: https://github.com/openshift/bond-cni/pull/52
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
If a custom API server certificate is added as per documentation[1], but the secret name is wrong and points to a non-existing secret, the following happens:
- The kube-apiserver config is rendered with some of the namedCertificates pointing to /etc/kubernetes/static-pod-certs/secrets/user-serving-cert-000/
- As the secret in the apiserver/cluster object is wrong, no user-serving-cert-000 secret is generated, so /etc/kubernetes/static-pod-certs/secrets/user-serving-cert-000/ does not exist (and may be automatically removed if manually created).
- The combination of the 2 points above causes the kube-apiserver to start crash-looping because its config points to non-existent certificates.
This is a cluster-kube-apiserver-operator bug, because the operator should validate that the specified secret exists and, if it doesn't, degrade and do nothing rather than render an inconsistent configuration.
Version-Release number of selected component (if applicable):
First found in 4.11.13, but also reproduced in the latest nightly build.
How reproducible:
Always
Steps to Reproduce:
1. Set up a named certificate pointing to a secret that doesn't exist, for example with the patch shown below. 2. 3.
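A minimal sketch of such a misconfiguration (the hostname and secret name are placeholders):
$ oc patch apiserver cluster --type=merge -p '{"spec":{"servingCerts":{"namedCertificates":[{"names":["api.example.xyz"],"servingCertificate":{"name":"secret-that-does-not-exist"}}]}}}'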
Actual results:
Inconsistent configuration that points to non-existing secret. Kube API server pod crash-loop.
Expected results:
The Cluster Kube API Server Operator should detect that the secret is wrong, do nothing, and only report itself as degraded with a meaningful message so the user can fix it. No Kube API server pod crash-looping.
Additional info:
Once the kube-apiserver is broken, even if the apiserver/cluster object is fixed, it is usually needed to apply a manual workaround in the crash-looping master. An example of workaround that works is[2], even though that KB article was written for another bug with different root cause. References: [1] - https://docs.openshift.com/container-platform/4.11/security/certificates/api-server.html#api-server-certificates [2] - https://access.redhat.com/solutions/4893641
The ability to schedule workloads on master nodes is currently exposed via the REST API as a boolean Cluster property "schedulable_masters". For the Kubernetes API, we should align with other OpenShift APIs and have a boolean property in the ACM Spec called mastersSchedulable.
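For comparison, the in-cluster OpenShift API this aligns with already exposes the setting as a boolean on the Scheduler config, e.g.:
$ oc get schedulers.config.openshift.io cluster -o jsonpath='{.spec.mastersSchedulable}'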
Description of problem:
The following tests fail often in 4.13 and 4.14 upstream CI jobs:
- [performance] Checking IRQBalance settings Verify GloballyDisableIrqLoadBalancing Spec field [test_id:36150] Verify that IRQ load balancing is enabled/disabled correctly
- [rfe_id:27368][performance] Pre boot tuning adjusted by tuned [test_id:35363][crit:high][vendor:cnf-qe@redhat.com][level:acceptance] stalld daemon is running on the host
- [rfe_id:27363][performance] CPU Management Verification of cpu manager functionality Verify CPU usage by stress PODs [test_id:27492] Guaranteed POD should work on isolated cpu
Example run: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.13-e2e-telco5g-cnftests/1669344976506458112/artifacts/e2e-telco5g-cnftests/telco5g-cnf-tests/artifacts/test_results.html
Version-Release number of selected component (if applicable):
4.14 4.13
How reproducible:
CI job
Steps to Reproduce:
Ci job
Actual results:
failures
Expected results:
pass
Additional info:
https://snapshots.raintank.io/dashboard/snapshot/6sZ1uBR5P1O1gknyxebPQPtEo7RVEu0C history and pass/fail ratio
Description of problem:
Update the VScode extension link to https://marketplace.visualstudio.com/items?itemName=redhat.vscode-openshift-connector
And change the description to
The OpenShift Serverless Functions support in the VSCode IDE extension enables developers to effortlessly create, build, run, invoke and deploy serverless functions on OpenShift, providing a seamless development experience within the familiar VSCode environment.
This is a clone of issue OCPBUGS-19019. The following is the description of the original issue:
—
Using metal-ipi with okd-scos, Ironic fails to provision nodes.
Description of problem:
I completed an OCP installation with 3 masters and 2 workers, but I was not able to find the mastersSchedulable parameter in any of the files in the manifest directory after running the command below.
$ openshift-install agent create cluster-manifests --log-level debug --dir kni
I used this installer: https://github.com/openshift/installer/releases/tag/agent-installer-v4.11.0-dev-preview-2
Version-Release number of selected component (if applicable):
How reproducible:
On every execution of the installer.
Steps to Reproduce:
1. download the installer 2. openshift-install agent create cluster-manifests --log-level debug --dir kni
Actual results:
There is no mastersSchedulable parameter
Expected results:
Some file (like cluster-scheduler-02-config.yml) should contain the mastersSchedulable parameter, for example as shown below.
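For illustration, its presence can be checked with a simple grep over the generated manifests directory, and a cluster-scheduler-02-config.yml produced by the standard (non-agent) installer looks roughly like this (abbreviated):
$ grep -r mastersSchedulable kni/
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  name: cluster
spec:
  mastersSchedulable: false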
Additional info:
Description of the problem:
In BE 2.16.0, try to install a new cluster with ignored validations enabled ({"host-validation-ids": "[\"all\"]", "cluster-validation-ids": "[\"all\"]"}) and one host with too little disk space (18GB). The installation starts, but after 20 minutes of waiting the cluster is back in draft status without any event.
How reproducible:
100%
Steps to reproduce:
1. Create a new multi-node cluster and configure one of the hosts to have an 18GB disk (the minimum requirement is 20GB)
2. Enable ignore-validations by:
curl -X 'PUT' \
  'http://api.openshift.com/api/assisted-install/v2/clusters/eaffbd37-2a0b-42b2-a706-ad5b23ff17a3/ignored-validations' \
  --header "Authorization: Bearer $(ocm token)" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{ "ignored_host_validations": "[\"all\"]", "ignored_cluster_validations": "[\"all\"]" }'
3. Start the installation. The cluster is stuck in prepare-for-installation for 20 minutes and then moves to draft with no event about the reason.
Actual results:
Expected results:
This issue is valid for UI and API.
For UI
If a new cluster is being created and s390x is selected as the architecture, an error message pops up when the Next button is pressed (all other necessary values are filled in correctly):
"cannot use Minimal ISO because it's not compatible with the s390x architecture on version 4.13.0-rc.3-multi of OpenShift"
There is no workaround in the UI because the matching selection (full-iso or iPXE) can only be set in the addHosts dialog.
For API
The infra env object cannot be created if the image type is not set. The error message:
"cannot use Minimal ISO because it's not compatible with the s390x architecture on version 4.13.0-rc.3-multi of OpenShift"
is returned.
The workaround is to set image_type to "full-iso" during infra env creation (see the example below).
For the s390x architecture the default should always be full-iso.
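A sketch of the workaround via the API, in the same style as the curl call above (the cluster id, pull secret, and any other required fields are placeholders and abbreviated):
curl -X 'POST' \
  'http://api.openshift.com/api/assisted-install/v2/infra-envs' \
  --header "Authorization: Bearer $(ocm token)" \
  -H 'Content-Type: application/json' \
  -d '{ "name": "<infra-env-name>", "cluster_id": "<cluster-id>", "pull_secret": "<pull-secret>", "cpu_architecture": "s390x", "image_type": "full-iso" }'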
Please review the following PR: https://github.com/openshift/cloud-provider-nutanix/pull/12
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
There is an error when creating the image: FATAL failed to fetch Agent Installer ISO: failed to generate asset "Agent Installer ISO": stat /home/core/.cache/agent/files_cache/libnmstate.so.1.3.3: no such file or directory
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2023-04-06-060829
How reproducible:
always
Steps to Reproduce:
1. Prepare the agent-config.yaml and install-config.yaml files
2. Run 'bin/openshift-install agent create image --log-level debug'
3. There is the following output with errors:
DEBUG extracting /usr/bin/agent-tui to /home/core/.cache/agent/files_cache, oc image extract --path /usr/bin/agent-tui:/home/core/.cache/agent/files_cache --confirm quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c11d31d47db4afb03e4a4c8c40e7933981a2e3a7ef9805a1413c441f492b869b
DEBUG Fetching image from OCP release (oc adm release info --image-for=agent-installer-node-agent --insecure=true registry.ci.openshift.org/ocp/release@sha256:83caa0a8f2633f6f724c4feb517576181d3f76b8b76438ff752204e8c7152bac)
DEBUG extracting /usr/lib64/libnmstate.so.1.3.3 to /home/core/.cache/agent/files_cache, oc image extract --path /usr/lib64/libnmstate.so.1.3.3:/home/core/.cache/agent/files_cache --confirm quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c11d31d47db4afb03e4a4c8c40e7933981a2e3a7ef9805a1413c441f492b869b
DEBUG File /usr/lib64/libnmstate.so.1.3.3 was not found, err stat /home/core/.cache/agent/files_cache/libnmstate.so.1.3.3: no such file or directory
ERROR failed to write asset (Agent Installer ISO) to disk: cannot generate ISO image due to configuration errors
FATAL failed to fetch Agent Installer ISO: failed to generate asset "Agent Installer ISO": stat /home/core/.cache/agent/files_cache/libnmstate.so.1.3.3: no such file or directory
Actual results:
The image generation fails.
Expected results:
The image should be generated successfully.
Additional info:
Description of problem:
When typing into the filter input field on the Quick Starts page, the console will crash.
Version-Release number of selected component (if applicable):
4.13.0-rc.7
How reproducible:
Always
Steps to Reproduce:
1. Go to the Quick Starts page 2. Type something into the filter input field 3.
Actual results:
Console will crash: TypeError Description: t.toLowerCase is not a functionComponent trace: at Sn (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendors~main-chunk-141f889230d63da0ba53.min.js:36:168364) at t.default (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:874032) at t.default (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/quick-start-chunk-274c58e3845ea0aa718b.min.js:1:202) at s (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:241397) at s (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:241397) at t (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendors~main-chunk-141f889230d63da0ba53.min.js:21:67583) at T at t (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendors~main-chunk-141f889230d63da0ba53.min.js:21:69628) at Suspense at i (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:450974) at section at m (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendor-patternfly-core-chunk-67ceb971158ed93c9c79.min.js:1:720272) at div at div at t.a (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:1528877) at div at div at c (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendor-patternfly-core-chunk-67ceb971158ed93c9c79.min.js:1:545409) at d (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendor-patternfly-core-chunk-67ceb971158ed93c9c79.min.js:1:774923) at div at d (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendor-patternfly-core-chunk-67ceb971158ed93c9c79.min.js:1:458124) at l (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:1170951) at https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:457833 at S (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendors~main-chunk-141f889230d63da0ba53.min.js:98:86864) at main at div at v (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendor-patternfly-core-chunk-67ceb971158ed93c9c79.min.js:1:264066) at div at div at c (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendor-patternfly-core-chunk-67ceb971158ed93c9c79.min.js:1:62024) at div at div at c (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendor-patternfly-core-chunk-67ceb971158ed93c9c79.min.js:1:545409) at d (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendor-patternfly-core-chunk-67ceb971158ed93c9c79.min.js:1:774923) at div at d (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendor-patternfly-core-chunk-67ceb971158ed93c9c79.min.js:1:458124) at Un 
(https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendors~main-chunk-141f889230d63da0ba53.min.js:36:183620) at t.default (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:874032) at t.default (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/quick-start-chunk-274c58e3845ea0aa718b.min.js:1:1261) at s (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:241397) at t.a (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:1605535) at ee (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:1623254) at _t (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendors~main-chunk-141f889230d63da0ba53.min.js:36:142374) at ee (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:1623254) at ee (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:1623254) at ee (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:1623254) at i (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:829516) at t.a (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:1599727) at t.a (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:1599916) at t.a (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:1597332) at te (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:1623385) at https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:1626517 at r (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendors~main-chunk-141f889230d63da0ba53.min.js:36:121910) at t (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendors~main-chunk-141f889230d63da0ba53.min.js:21:67583) at t (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendors~main-chunk-141f889230d63da0ba53.min.js:21:69628) at t (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendors~main-chunk-141f889230d63da0ba53.min.js:21:64188) at re (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:1626828) at t.a (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:803496) at t.a (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:1074899) at s 
(https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/main-chunk-4a1d080acbda22020fbd.min.js:1:652518) at t.a (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendors~main-chunk-141f889230d63da0ba53.min.js:150:190871) at Suspense Stack trace: TypeError: t.toLowerCase is not a function at pt (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendors~main-chunk-141f889230d63da0ba53.min.js:36:136019) at Sn (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendors~main-chunk-141f889230d63da0ba53.min.js:36:168723) at na (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendors~main-chunk-141f889230d63da0ba53.min.js:263:58879) at za (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendors~main-chunk-141f889230d63da0ba53.min.js:263:68397) at Hs (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendors~main-chunk-141f889230d63da0ba53.min.js:263:112289) at xl (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendors~main-chunk-141f889230d63da0ba53.min.js:263:98327) at Cl (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendors~main-chunk-141f889230d63da0ba53.min.js:263:98255) at _l (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendors~main-chunk-141f889230d63da0ba53.min.js:263:98118) at pl (https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendors~main-chunk-141f889230d63da0ba53.min.js:263:95105) at https://console-openshift-console.apps.viraj-10-05-2023-2.devcluster.openshift.com/static/vendors~main-chunk-141f889230d63da0ba53.min.js:263:44774
Expected results:
Console should work
Additional info:
Description of problem:
The console-operator's config file gets updated every couple of seconds, where only the `resourceVersion` field gets changed.
Version-Release number of selected component (if applicable):
4.14-ec-2
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
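For illustration only (this is not console-operator code, and the root cause here is not yet known), a minimal Go sketch of the usual fix pattern for this kind of churn: compare the meaningful payload and skip the update call when nothing but server-managed metadata such as resourceVersion differs.
~~~
// resync sketch: hypothetical helper, not taken from console-operator.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/equality"
)

// needsUpdate compares only the fields the operator actually manages (Data and
// BinaryData), so server-side churn in resourceVersion or managedFields never
// triggers a write by itself.
func needsUpdate(current, desired *corev1.ConfigMap) bool {
	return !equality.Semantic.DeepEqual(current.Data, desired.Data) ||
		!equality.Semantic.DeepEqual(current.BinaryData, desired.BinaryData)
}

func main() {
	current := &corev1.ConfigMap{Data: map[string]string{"console-config.yaml": "a: 1"}}
	desired := &corev1.ConfigMap{Data: map[string]string{"console-config.yaml": "a: 1"}}
	fmt.Println(needsUpdate(current, desired)) // false: no update call, no resourceVersion bump
}
~~~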
Description of problem:
Kubernetes and other associated dependencies need to be updated to protect against potential vulnerabilities.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Getting below error while creating cluster in mon01 zone Joblink: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.14-ocp-e2e-ovn-ppc64le-powervs/1680759459892170752 Error: level=info msg=Cluster operator insights SCAAvailable is False with Forbidden: Failed to pull SCA certs from https://api.openshift.com/api/accounts_mgmt/v1/certificates: OCM API https://api.openshift.com/api/accounts_mgmt/v1/certificates returned HTTP 403: {"code":"ACCT-MGMT-11","href":"/api/accounts_mgmt/v1/errors/11","id":"11","kind":"Error","operation_id":"c3773b1e-8818-4bfc-9605-dbd9dbc0c03f","reason":"Account with ID 2DUeKzzTD9ngfsQ6YgkzdJn1jA4 denied access to perform create on Certificate with HTTP call POST /api/accounts_mgmt/v1/certificates"} level=info msg=Cluster operator network ManagementStateDegraded is False with : level=error msg=Cluster operator storage Degraded is True with PowerVSBlockCSIDriverOperatorCR_PowerVSBlockCSIDriverStaticResourcesController_SyncError: PowerVSBlockCSIDriverOperatorCRDegraded: PowerVSBlockCSIDriverStaticResourcesControllerDegraded: "rbac/main_attacher_binding.yaml" (string): clusterroles.rbac.authorization.k8s.io "openshift-csi-main-attacher-role" not found level=error msg=PowerVSBlockCSIDriverOperatorCRDegraded: PowerVSBlockCSIDriverStaticResourcesControllerDegraded: "rbac/main_provisioner_binding.yaml" (string): clusterroles.rbac.authorization.k8s.io "openshift-csi-main-provisioner-role" not found level=error msg=PowerVSBlockCSIDriverOperatorCRDegraded: PowerVSBlockCSIDriverStaticResourcesControllerDegraded: "rbac/volumesnapshot_reader_provisioner_binding.yaml" (string): clusterroles.rbac.authorization.k8s.io "openshift-csi-provisioner-volumesnapshot-reader-role" not found level=error msg=PowerVSBlockCSIDriverOperatorCRDegraded: PowerVSBlockCSIDriverStaticResourcesControllerDegraded: "rbac/main_resizer_binding.yaml" (string): clusterroles.rbac.authorization.k8s.io "openshift-csi-main-resizer-role" not found level=error msg=PowerVSBlockCSIDriverOperatorCRDegraded: PowerVSBlockCSIDriverStaticResourcesControllerDegraded: "rbac/storageclass_reader_resizer_binding.yaml" (string): clusterroles.rbac.authorization.k8s.io "openshift-csi-resizer-storageclass-reader-role" not found level=error msg=PowerVSBlockCSIDriverOperatorCRDegraded: PowerVSBlockCSIDriverStaticResourcesControllerDegraded:
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Steps to Reproduce:
1. 2. 3.
Expected results:
cluster creation should be successful
Additional info:
The cluster-kube-apiserver-operator CI has been constantly failing for the past week and more specifically the e2e-gcp-operator job because the test cluster ends in a state where a lot of requests start failing with "Unauthorized" errors.
This caused multiple operators to become degraded and tests to fail.
Looking at the failures and a must-gather we were able to capture inside of a test cluster, it turned out that the service account issuer could be the culprit here. Because of that we opened https://issues.redhat.com/browse/API-1549.
However, it turned out that disabling TestServiceAccountIssuer didn't resolve the issue and the cluster was still too unstable for the tests to pass.
In a separate attempt we also tried disabling TestBoundTokenSignerController and this time the tests were passing. However, the cluster was still very unstable during the e2e run and the kube-apiserver-operator went degraded a couple of times: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-kube-apiserver-operator/1455/pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-gcp-operator/1632871645171421184/artifacts/e2e-gcp-operator/gather-extra/artifacts/pods/openshift-kube-apiserver-operator_kube-apiserver-operator-5cf9d4569-m2spq_kube-apiserver-operator.log.
On top of that instead of seeing Unauthorized errors, we are now seeing a lot of connection refused.
Description of problem:
The description for the BuildAdapter SDK extension is wrong.
Actual results:
BuildAdapter contributes an adapter to adapt element to data that can be used by Pod component
Expected results:
BuildAdapter contributes an adapter to adapt element to data that can be used by Build component
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
All versions?
At least on 4.12+
How reproducible:
Always
Steps to Reproduce:
This JSON works fine:
{ "apiVersion": "v1", "kind": "ConfigMap", "metadata": { "generateName": "a-configmap-" } }
But an array cannot be used to import multiple resources:
[ { "apiVersion": "v1", "kind": "ConfigMap", "metadata": { "generateName": "a-configmap-" } }, { "apiVersion": "v1", "kind": "ConfigMap", "metadata": { "generateName": "a-configmap-" } } ]
Fails with error: No "apiVersion" field found in YAML.
Nor can a Kubernetes List "resource" be used:
{ "apiVersion": "v1", "kind": "List", "items": [ { "apiVersion": "v1", "kind": "ConfigMap", "metadata": { "generateName": "a-configmap-" } }, { "apiVersion": "v1", "kind": "ConfigMap", "metadata": { "generateName": "a-configmap-" } } ] }
Fails with error: The server doesn't have a resource type "kind: List, apiVersion: v1".
Actual results:
Neither JSON structure can be imported.
Expected results:
Both JSON structures should work and create multiple resources.
If the JSON array contains just one item, the resource detail page should be opened; otherwise, an import result page should be shown, similar to when the user imports a YAML file with multiple resources.
Additional info:
Found this JSON structure for example in issue OCPBUGS-4646
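For illustration, a minimal Go sketch (assuming nothing about the Console's actual import code) of accepting all three JSON shapes from this report: a single object, a JSON array of objects, and a v1 List with .items.
~~~
package main

import (
	"encoding/json"
	"fmt"
)

// splitResources normalizes raw JSON into individual resource documents.
func splitResources(raw []byte) ([]json.RawMessage, error) {
	// Case 1: a top-level JSON array of resources.
	var arr []json.RawMessage
	if err := json.Unmarshal(raw, &arr); err == nil {
		return arr, nil
	}

	// Case 2: a v1 List object wrapping resources in .items.
	var list struct {
		Kind  string            `json:"kind"`
		Items []json.RawMessage `json:"items"`
	}
	if err := json.Unmarshal(raw, &list); err != nil {
		return nil, err
	}
	if list.Kind == "List" {
		return list.Items, nil
	}

	// Case 3: a single resource object, kept as-is.
	return []json.RawMessage{json.RawMessage(raw)}, nil
}

func main() {
	raw := []byte(`[{"apiVersion":"v1","kind":"ConfigMap"},{"apiVersion":"v1","kind":"ConfigMap"}]`)
	items, err := splitResources(raw)
	fmt.Println(len(items), err) // 2 <nil>
}
~~~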
Description of problem:
DNS Local endpoint preference is not working for TCP DNS requests for Openshift SDN. Reference code: https://github.com/openshift/sdn/blob/b58a257b896d774e0a092612be250fb9414af5ca/vendor/k8s.io/kubernetes/pkg/proxy/iptables/proxier.go#L999-L1012 This is where the DNS request is short-circuited to the local DNS endpoint if it exists. This is important because DNS local preference protects against another outstanding bug, in which daemonset pods go stale for a few second upon node shutdown (see https://issues.redhat.com/browse/OCPNODE-549 for fix for graceful node shutdown). This appears to be contributing to DNS issues in our internal CI clusters. https://lookerstudio.google.com/reporting/3a9d4e62-620a-47b9-a724-a5ebefc06658/page/MQwFD?s=kPTlddLa2AQ shows large amounts of "dns_tcp_lookup" failures, which I attribute to this bug. UDP DNS local preference is working fine in Openshift SDN. Both UDP and TCP local preference work fine in OVN. It's just TCP DNS Local preference that is not working Openshift SDN.
Version-Release number of selected component (if applicable):
4.13, 4.12, 4.11
How reproducible:
100%
Steps to Reproduce:
1. oc debug -n openshift-dns 2. dig +short +tcp +vc +noall +answer CH TXT hostname.bind # Retry multiple times, and you should always get the same local DNS pod.
Actual results:
[gspence@gspence origin]$ oc debug -n openshift-dns Starting pod/image-debug ... Pod IP: 10.128.2.10 If you don't see a command prompt, try pressing enter. sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind "dns-default-glgr8" sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind "dns-default-gzlhm" sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind "dns-default-dnbsp" sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind "dns-default-gzlhm"
Expected results:
[gspence@gspence origin]$ oc debug -n openshift-dns Starting pod/image-debug ... Pod IP: 10.128.2.10 If you don't see a command prompt, try pressing enter. sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind "dns-default-glgr8" sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind "dns-default-glgr8" sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind "dns-default-glgr8" sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind "dns-default-glgr8"
Additional info:
https://issues.redhat.com/browse/OCPBUGS-488 is the previous bug I opened for UDP DNS local preference not working. iptables-save from a 4.13 vanilla cluster bot AWS,SDN: https://drive.google.com/file/d/1jY8_f64nDWi5SYT45lFMthE0vhioYIfe/view?usp=sharing
As a user of the HyperShift CLI, I would like to be able to set the NodePool UpgradeType through a flag when either creating a new cluster or creating a new NodePool.
DoD:
There are a few cases that this validation doesn't cover
The validation will be enabled with MGMT-15112
Description of problem:
We need to update the govc version to support PR https://github.com/openshift/release/pull/42334. The command "govc vm.network.change -dc xxx -vm -net xxxxx" is only supported from govc version v0.30.4 onwards; without it, the VM cannot fetch an IP correctly.
Version-Release number of selected component (if applicable):
ocp 4.14
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
"govc: path 'ci-segment-151'" resolves to multiple networks if specific the -net with network path, will got "govc: network '/IBMCloud/host/vcs-mdcnc-workload-1/ci-segment-151' not found"
Expected results:
After the govc version update, govc vm.network.change can be used to resolve the unique network.
Additional info:
OCP 4.14.0-rc.0
advanced-cluster-management.v2.9.0-130
multicluster-engine.v2.4.0-154
After encountering https://issues.redhat.com/browse/OCPBUGS-18959
Attempted to forcefully delete the BMH by removing the finalizer.
Then deleted all the metal3 pods.
Attempted to re-create the bmh.
Result:
the bmh is stuck in
oc get bmh
NAME STATE CONSUMER ONLINE ERROR AGE
hp-e910-01.kni-qe-65.lab.eng.rdu2.redhat.com registering true 15m
seeing this entry in the BMO log:
{"level":"info","ts":"2023-09-13T16:15:57Z","logger":"controllers.BareMetalHost","msg":"start","baremetalhost":{"name":"hp-e910-01.kni-qe-65.lab.eng.rdu2.redhat.com","namespace":"kni-qe-65"}}
{"level":"info","ts":"2023-09-13T16:15:57Z","logger":"controllers.BareMetalHost","msg":"hardwareData is ready to be deleted","baremetalhost":{"name":"hp-e910-01.kni-qe-65.lab.eng.rdu2.redhat.com","namespace":"kni-qe-65"}}
{"level":"info","ts":"2023-09-13T16:15:57Z","logger":"controllers.BareMetalHost","msg":"host ready to be powered off","baremetalhost":
,"provisioningState":"powering off before delete"}
{"level":"info","ts":"2023-09-13T16:15:57Z","logger":"provisioner.ironic","msg":"ensuring host is powered off (mode: hard)","host":"kni-qe-65~hp-e910-01.kni-qe-65.lab.eng.rdu2.redhat.com"}{"level":"error","ts":"2023-09-13T16:15:57Z","msg":"Reconciler error","controller":"baremetalhost","controllerGroup":"metal3.io","controllerKind":"BareMetalHost","BareMetalHost":
{"name":"hp-e910-01.kni-qe-65.lab.eng.rdu2.redhat.com","namespace":"kni-qe-65"},"namespace":"kni-qe-65","name":"hp-e910-01.kni-qe-65.lab.eng.rdu2.redhat.com","reconcileID":"167061cc-7ab4-4c4a-ae45-8c19dfc3ac22","error":"action \"powering off before delete\" failed: failed to power off before deleting node: Host not registered","errorVerbose":"Host not registered\nfailed to power off before deleting node\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).actionPowerOffBeforeDeleting\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:493\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).handlePoweringOffBeforeDelete\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:585\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:202\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:225\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:118\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:314\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:226\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1598\naction \"powering off before delete\" 
failed\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:229\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:118\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:314\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:226\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1598","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:226"}
Description of problem:
ovn-ipsec pods Crashes when IPSec NS extension/svc is enabled on any $ROLE nodes IPSec ext and svc were enabled for 2 WORKERS only and their corresponding ovn-ipsec pods are in CLBO [root@dell-per740-36 ipsec]# oc get pods NAME READY STATUS RESTARTS AGE dell-per740-14rhtsengpek2redhatcom-debug 1/1 Running 0 3m37s ovn-ipsec-bptr6 0/1 CrashLoopBackOff 26 (3m58s ago) 130m ovn-ipsec-bv88z 1/1 Running 0 3h5m ovn-ipsec-pre414-6pb25 1/1 Running 0 3h5m ovn-ipsec-pre414-b6vzh 1/1 Running 0 3h5m ovn-ipsec-pre414-jzwcm 1/1 Running 0 3h5m ovn-ipsec-pre414-vgwqx 1/1 Running 3 132m ovn-ipsec-pre414-xl4hb 1/1 Running 3 130m ovn-ipsec-qb2bj 1/1 Running 0 3h5m ovn-ipsec-r4dfw 1/1 Running 0 3h5m ovn-ipsec-xhdpw 0/1 CrashLoopBackOff 28 (116s ago) 132m ovnkube-control-plane-698c9845b8-4v58f 2/2 Running 0 3h5m ovnkube-control-plane-698c9845b8-nlgs8 2/2 Running 0 3h5m ovnkube-control-plane-698c9845b8-wfkd4 2/2 Running 0 3h5m ovnkube-node-l6sr5 8/8 Running 27 (66m ago) 130m ovnkube-node-mj8bs 8/8 Running 27 (75m ago) 132m ovnkube-node-p24x8 8/8 Running 0 178m ovnkube-node-rlpbh 8/8 Running 0 178m ovnkube-node-wdxbg 8/8 Running 0 178m [root@dell-per740-36 ipsec]#
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-09-12-024050
How reproducible:
Always
Steps to Reproduce:
1. Install OVN IPSec cluster (East-West)
2. Enable IPSec OS extension for North-South
3. Enable IPSec service for North-South
Actual results:
ovn-ipsec pods in CLBO state
Expected results:
All pods under ovn-kubernetes ns should be Running fine
Additional info:
One of the ovn-ipsec CLBO pods logs # oc logs ovn-ipsec-bptr6 Defaulted container "ovn-ipsec" out of: ovn-ipsec, ovn-keys (init) + rpm --dbpath=/usr/share/rpm -q libreswan libreswan-4.9-4.el9_2.x86_64 + counter=0 + '[' -f /etc/cni/net.d/10-ovn-kubernetes.conf ']' + echo 'ovnkube-node has configured node.' ovnkube-node has configured node. + ip x s flush + ip x p flush + ulimit -n 1024 + /usr/libexec/ipsec/addconn --config /etc/ipsec.conf --checkconfig + /usr/libexec/ipsec/_stackmanager start + /usr/sbin/ipsec --checknss + /usr/libexec/ipsec/pluto --leak-detective --config /etc/ipsec.conf --logfile /var/log/openvswitch/libreswan.log FATAL ERROR: /usr/libexec/ipsec/pluto: lock file "/run/pluto/pluto.pid" already exists leak: string logger, item size: 48 leak: string logger prefix, item size: 27 leak detective found 2 leaks, total size 75 journalctl -u ipsec here: https://privatebin.corp.redhat.com/?216142833d016b3c#2Es8ACSyM3VWvwi85vTaYtSx8X3952ahxCvSHeY61UtT
The issue:
An interesting issue came up on #forum-ui-extensibility. There was an attempt to use extensions to nest a details page under a details page that contained a horizontal nav. This caused an issue with rendering the page content when a sub link was clicked – which caused confusion.
The why:
The reason this happened was the resource details page had a tab that contained a resource list page. This resource list page showed a number of items of CRs that when clicked would try to append their name onto the URL. This confused the navigation, thinking that this path must be another tab, so no tabs were selected and no content was visible. The goal was to reuse this longer path name as a details page of its own with its own horizontal nav. This issue is a conceptual misunderstanding of the way our list & details pages work in OpenShift Console.
List Pages are sometimes found via direct navigation links. List pages are almost all shown on the Search page, allowing a user to navigate to both existing nav items and other non-primary resources.
Details Pages are individual items found in the List Pages (a row). These are stand alone pages that show details of a singular CR and optionally can have tabs that list other resources – but they always transition to a fresh Details page instead of compounding on the currently visible one.
The ask:
If we document this in a fashion that helps Plugin developers share the same UX as the rest of the Console, we will have a more unified approach to UX within the Console and across any installed Plugins.
==> Description of problem:
"Import from git" functionality with a local Bitbucket instance does not work, due to repository validation that requires to repository to be hosted on Bitbucket Cloud. [1][2]
==> Version-Release number of selected component (if applicable):
Tested in OCP 4.10
==> How reproducible: 100%
==> Steps to Reproduce:
1. Go to: Developer View > Add+ > From Git
2. Fill the "Git Repo URL" field with the BitBucket repo URL (i.e. http://<bitbucket_url>/scm/<project>/<repository>.git)
3. Select BitBucket from the "Git type" dropdowns button
==> Actual results:
"URL is valid but cannot be reached. If this is a private repository, enter a source Secret in advanced Git options"
==> Expected results:
This functionality should also work with self-hosted Bitbucket (Bitbucket Server)
==> Additional info:
To retrieve slug information from hosted BitBucket we can query: http://<bitbucket_url>/rest/api/1.0/projects/<project>/repos/<repository>
An example:
~~~
curl -ks http://bitbucket-server-bitbucket.apps.gmeghnag.lab.cluster/rest/api/1.0/projects/test/repos/test-repo | jq
{
"slug": "test-repo",
"id": 1,
"name": "test-repo",
"hierarchyId": "28fc5c8782050b43e223",
"scmId": "git",
"state": "AVAILABLE",
"statusMessage": "Available",
"forkable": true,
"project": {
"key": "TEST",
"id": 1,
"name": "test",
"public": false,
"type": "NORMAL",
"links": {
"self": [
]
}
},
"public": true,
"archived": false,
"links": {
"clone": [
,
{ "href": "ssh://git@bitbucket-server-bitbucket.apps.gmeghnag.lab.cluster:7999/test/test-repo.git", "name": "ssh" } ],
"self": [
]
}
}
~~~
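For illustration, a minimal Go sketch (not the Console's actual validation code; the base URL, project, and repository are placeholders) that queries the Bitbucket Server REST endpoint shown above to confirm the repository exists and to read its slug.
~~~
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type bitbucketRepo struct {
	Slug  string `json:"slug"`
	Name  string `json:"name"`
	State string `json:"state"`
}

// fetchRepo calls the Bitbucket Server (self-hosted) 1.0 REST API.
func fetchRepo(baseURL, project, repo string) (*bitbucketRepo, error) {
	url := fmt.Sprintf("%s/rest/api/1.0/projects/%s/repos/%s", baseURL, project, repo)
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %d from %s", resp.StatusCode, url)
	}
	var r bitbucketRepo
	if err := json.NewDecoder(resp.Body).Decode(&r); err != nil {
		return nil, err
	}
	return &r, nil
}

func main() {
	r, err := fetchRepo("http://bitbucket.example.local", "test", "test-repo")
	if err != nil {
		fmt.Println("repository validation failed:", err)
		return
	}
	fmt.Println("repository reachable, slug:", r.Slug)
}
~~~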
Description of problem:
The must gather should contain additional debug information such as the current configuration and firmware settings of any Bluefields / Mellanox device when using SRIOV
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
When the management cluster has ICSP resources, the pull reference of the Kube APIServer is replaced with a pull ref from the management cluster ICSPs resulting in a pull failure.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Create a cluster with release registry.ci.openshift.org/ocp/release:4.14.0-0.nightly-2023-08-28-154013 on a management cluster that has ICSPs 2. Watch the kube-apiserver pods.
Actual results:
kube-apiserver pods are initially deployed with a pull ref from the release payload and they start, but then the deployment is updated with a pull ref from an ICSP mapping and the deployment fails to roll out.
Expected results:
kube-apiserver pods roll out successfully.
Additional info:
Description of problem:
The network-tools image stream is missing in the cluster samples. It is needed for CI tests.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
When creating a deployment with `oc new-app` and using `--import-mode=PreserveOriginal`, container ports that are present in the Dockerfile do not get propagated to the deployment `spec.containers[i].ports[i].containerPort`.
On further inspection, this is because the config object that gets passed from the image to the deployment does not contain these details. The image reference in this case is a manifest-listed image, which does not contain the Docker metadata; instead, these details need to be derived from the child manifest.
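As an illustration of that last point, here is a hedged Go sketch using the OCI image-spec types rather than oc's own internals; fetchConfig stands in for whatever registry client actually retrieves the child image config.
~~~
// Package imageports: sketch of reading exposed ports from the child manifest
// of a manifest-listed image; not taken from oc.
package imageports

import (
	"fmt"
	"runtime"

	v1 "github.com/opencontainers/image-spec/specs-go/v1"
)

// PortsForPlatform picks the child manifest matching the local OS/arch and
// returns the exposed ports recorded in that child's image config, e.g. "8080/tcp".
func PortsForPlatform(index v1.Index, fetchConfig func(digest string) (v1.Image, error)) ([]string, error) {
	for _, desc := range index.Manifests {
		if desc.Platform == nil {
			continue
		}
		if desc.Platform.OS == runtime.GOOS && desc.Platform.Architecture == runtime.GOARCH {
			cfg, err := fetchConfig(desc.Digest.String())
			if err != nil {
				return nil, err
			}
			ports := make([]string, 0, len(cfg.Config.ExposedPorts))
			for p := range cfg.Config.ExposedPorts {
				ports = append(ports, p)
			}
			return ports, nil
		}
	}
	return nil, fmt.Errorf("no child manifest for %s/%s", runtime.GOOS, runtime.GOARCH)
}
~~~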
test=[sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Timeout:30m][apigroup:machine.openshift.io] [Suite:openshift/conformance/serial]
Appears to be perma-failing on gcp serial jobs.
We're at the edge of our visible data, but it looks like this may have happened around July 7
Description of problem:
revert "force cert rotation every couple days for development" in 4.13 Below is the steps to verify this bug: # oc adm release info --commits registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-06-25-081133|grep -i cluster-kube-apiserver-operator cluster-kube-apiserver-operator https://github.com/openshift/cluster-kube-apiserver-operator 7764681777edfa3126981a0a1d390a6060a840a3 # git log --date local --pretty="%h %an %cd - %s" 776468 |grep -i "#1307" 08973b820 openshift-ci[bot] Thu Jun 23 22:40:08 2022 - Merge pull request #1307 from tkashem/revert-cert-rotation # oc get clusterversions.config.openshift.io NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.11.0-0.nightly-2022-06-25-081133 True False 64m Cluster version is 4.11.0-0.nightly-2022-06-25-081133 $ cat scripts/check_secret_expiry.sh FILE="$1" if [ ! -f "$1" ]; then echo "must provide \$1" && exit 0 fi export IFS=$'\n' for i in `cat "$FILE"` do if `echo "$i" | grep "^#" > /dev/null`; then continue fi NS=`echo $i | cut -d ' ' -f 1` SECRET=`echo $i | cut -d ' ' -f 2` rm -f tls.crt; oc extract secret/$SECRET -n $NS --confirm > /dev/null echo "Check cert dates of $SECRET in project $NS:" openssl x509 -noout --dates -in tls.crt; echo done $ cat certs.txt openshift-kube-controller-manager-operator csr-signer-signer openshift-kube-controller-manager-operator csr-signer openshift-kube-controller-manager kube-controller-manager-client-cert-key openshift-kube-apiserver-operator aggregator-client-signer openshift-kube-apiserver aggregator-client openshift-kube-apiserver external-loadbalancer-serving-certkey openshift-kube-apiserver internal-loadbalancer-serving-certkey openshift-kube-apiserver service-network-serving-certkey openshift-config-managed kube-controller-manager-client-cert-key openshift-config-managed kube-scheduler-client-cert-key openshift-kube-scheduler kube-scheduler-client-cert-key Checking the Certs, they are with one day expiry times, this is as expected. 
# ./check_secret_expiry.sh certs.txt Check cert dates of csr-signer-signer in project openshift-kube-controller-manager-operator: notBefore=Jun 27 04:41:38 2022 GMT notAfter=Jun 28 04:41:38 2022 GMT Check cert dates of csr-signer in project openshift-kube-controller-manager-operator: notBefore=Jun 27 04:52:21 2022 GMT notAfter=Jun 28 04:41:38 2022 GMT Check cert dates of kube-controller-manager-client-cert-key in project openshift-kube-controller-manager: notBefore=Jun 27 04:52:26 2022 GMT notAfter=Jul 27 04:52:27 2022 GMT Check cert dates of aggregator-client-signer in project openshift-kube-apiserver-operator: notBefore=Jun 27 04:41:37 2022 GMT notAfter=Jun 28 04:41:37 2022 GMT Check cert dates of aggregator-client in project openshift-kube-apiserver: notBefore=Jun 27 04:52:26 2022 GMT notAfter=Jun 28 04:41:37 2022 GMT Check cert dates of external-loadbalancer-serving-certkey in project openshift-kube-apiserver: notBefore=Jun 27 04:52:26 2022 GMT notAfter=Jul 27 04:52:27 2022 GMT Check cert dates of internal-loadbalancer-serving-certkey in project openshift-kube-apiserver: notBefore=Jun 27 04:52:49 2022 GMT notAfter=Jul 27 04:52:50 2022 GMT Check cert dates of service-network-serving-certkey in project openshift-kube-apiserver: notBefore=Jun 27 04:52:28 2022 GMT notAfter=Jul 27 04:52:29 2022 GMT Check cert dates of kube-controller-manager-client-cert-key in project openshift-config-managed: notBefore=Jun 27 04:52:26 2022 GMT notAfter=Jul 27 04:52:27 2022 GMT Check cert dates of kube-scheduler-client-cert-key in project openshift-config-managed: notBefore=Jun 27 04:52:47 2022 GMT notAfter=Jul 27 04:52:48 2022 GMT Check cert dates of kube-scheduler-client-cert-key in project openshift-kube-scheduler: notBefore=Jun 27 04:52:47 2022 GMT notAfter=Jul 27 04:52:48 2022 GMT # # cat check_secret_expiry_within.sh #!/usr/bin/env bash # usage: ./check_secret_expiry_within.sh 1day # or 15min, 2days, 2day, 2month, 1year WITHIN=${1:-24hours} echo "Checking validity within $WITHIN ..." oc get secret --insecure-skip-tls-verify -A -o json | jq -r '.items[] | select(.metadata.annotations."auth.openshift.io/certificate-not-after" | . != null and fromdateiso8601<='$( date --date="+$WITHIN" +%s )') | "\(.metadata.annotations."auth.openshift.io/certificate-not-before") \(.metadata.annotations."auth.openshift.io/certificate-not-after") \(.metadata.namespace)\t\(.metadata.name)"' # ./check_secret_expiry_within.sh 1day Checking validity within 1day ... 2022-06-27T04:41:37Z 2022-06-28T04:41:37Z openshift-kube-apiserver-operator aggregator-client-signer 2022-06-27T04:52:26Z 2022-06-28T04:41:37Z openshift-kube-apiserver aggregator-client 2022-06-27T04:52:21Z 2022-06-28T04:41:38Z openshift-kube-controller-manager-operator csr-signer 2022-06-27T04:41:38Z 2022-06-28T04:41:38Z openshift-kube-controller-manager-operator csr-signer-signer
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of the problem:
In RHEL 8, the arping command (from iputils-s20180629) only returns 1 when used for duplicate address detection. In all other modes it returns 0 on success; 2 or -1 on error.
In RHEL 9, the arping command (from iputils 20210202) also returns 1 in other modes, essentially at random. (There is some kind of theory behind it, but even after multiple fixes to the logic it does not remotely work in any consistent way.)
How reproducible:
60-100% for individual arping commands
100% installation failure
Steps to reproduce:
Actual results:
arping returns 1
journal on the discovery ISO shows:
Jul 19 04:35:38 master-0 next_step_runne[3624]: time="19-07-2023 04:35:38" level=error msg="Error while processing 'arping' command" file="ipv4_arping_checker.go:28" error="exit status 1"
all hosts are marked invalid and install fails.
Expected results:
ideally arping returns 0
failing that, we should treat both 0 and 1 as success as previous versions of arping effectively did.
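A minimal Go sketch of the fallback behaviour suggested above (not the agent's actual ipv4_arping_checker; the interface and address in main are placeholders): run arping and treat exit statuses 0 and 1 as success.
~~~
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// runArping treats exit status 1 as success, since iputils 20210202 can return
// 1 even when the command worked, matching how older arping versions behaved.
func runArping(args ...string) ([]byte, error) {
	out, err := exec.Command("arping", args...).CombinedOutput()
	if err == nil {
		return out, nil
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) && exitErr.ExitCode() == 1 {
		return out, nil
	}
	return out, err
}

func main() {
	out, err := runArping("-c", "2", "-I", "eth0", "192.0.2.1")
	if err != nil {
		fmt.Println("arping failed:", err)
		return
	}
	fmt.Printf("%s", out)
}
~~~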
Sanitize OWNERS/OWNER_ALIASES:
1) OWNERS must have:
component: "Storage / Kubernetes External Components"
2) OWNER_ALIASES must have all team members of Storage team.
Refer to the CIS RedHat OpenShift Container Platform Benchmark PDF: https://drive.google.com/file/d/12o6O-M2lqz__BgmtBrfeJu1GA2SJ352c/view
1.1.7 Ensure that the etcd pod specification file permissions are set to 600 or more restrictive (Manual)
======================================================================================================
As per CIS v1.3 PDF permissions should be 600 with the following statement:
"The pod specification file is created on control plane nodes at /etc/kubernetes/manifests/etcd-member.yaml with permissions 644. Verify that the permissions are 600 or more restrictive."
But when I ran the following command it was showing 644 permissions
for i in $(oc get pods -n openshift-etcd -l app=etcd -o name | grep etcd); do
  echo "check pod $i"
  oc rsh -n openshift-etcd $i \
    stat -c %a /etc/kubernetes/manifests/etcd-pod.yaml
done
Context:
We currently convey cloud creds issues in ValidOIDCConfiguration and ValidAWSIdentityProvider conditions.
The HO relies on those https://github.com/openshift/hypershift/blob/9e4127055dd7be9cfe4fc8427c39cee27a86efcd/hypershift-operator/controllers/hostedcluster/internal/platform/aws/aws.go#L293
to decide whether forceful deletion should be applied, potentially intentionally leaving resources behind in the cloud (e.g. use case: OIDC creds were broken out of band).
The CPO relies on those to wait for deletion of guest cluster resources https://github.com/openshift/hypershift/blob/8596f7f131169a19c6a67dc6ce078c50467de648/control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go#L284-L299
DoD:
When any of the cases above results in the "move kube deletion forward skipping cloud resource deletion" path we should send a metric so consumers / SREs have a sense and can use it to notify customers in conjunction with https://issues.redhat.com/browse/SDA-8613
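A sketch of one possible shape for that metric, with a made-up name and labels (not HyperShift's actual code): a counter incremented from the reconcile path that decides to skip cloud resource cleanup.
~~~
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical metric name and labels, for illustration only.
var skippedCloudCleanup = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "hypershift_cluster_cloud_cleanup_skipped_total",
		Help: "Deletions that moved forward without cleaning up cloud resources.",
	},
	[]string{"hosted_cluster", "reason"},
)

func init() {
	prometheus.MustRegister(skippedCloudCleanup)
}

// recordSkippedCleanup would be called where the HO/CPO decide to force deletion,
// e.g. when ValidOIDCConfiguration or ValidAWSIdentityProvider is False.
func recordSkippedCleanup(hostedCluster, reason string) {
	skippedCloudCleanup.WithLabelValues(hostedCluster, reason).Inc()
}

func main() {
	recordSkippedCleanup("example-hc", "ValidAWSIdentityProviderFalse")
	fmt.Println("metric incremented")
}
~~~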
Description of the problem:
No limitation for Additional certificates UI field
How reproducible:
100%
Steps to reproduce:
1. create a cluster
2. On add host select 'Configure cluster-wide trusted certificates'
3. On Additional certificates, paste a big string
4. Generate Discovery ISO
Actual results:
The UI sends it to the BE
Expected results:
There should be a size limit on the certificate field
This fix contains the following changes coming from updated version of kubernetes up to v1.27.4:
Changelog:
v1.27.4: https://github.com/kubernetes/kubernetes/blob/release-1.27/CHANGELOG/CHANGELOG-1.27.md#changelog-since-v1273
Description of problem:
I created a cluster with _workerLatencyProfile: LowUpdateSlowReaction_, then edited the latency profile to MediumUpdateAverageReaction using the linked documentation and the test case document below. Once I switched, I waited for KubeControllerManager and KubeAPIServer to stop progressing/complete and noticed that nodeStatusUpdateFrequency under /etc/kubernetes/kubelet.conf does not change as expected
https://docs.google.com/document/d/19dPIE4WFxVc3ldu-hNoXiOkjBCQrHC6I7wfyaUyTDqw/edit#heading=h.kf4qxogy9r6
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-07-31-181848
How reproducible:
100%
Steps to Reproduce:
1. Create cluster with LowUpdateSlowReaction manifest: Example: https://docs.google.com/document/d/19dPIE4WFxVc3ldu-hNoXiOkjBCQrHC6I7wfyaUyTDqw/edit#heading=h.22najgyaj9lh 2. Validate values of low update profile components $ oc debug node/<worker-node-name> $ chroot /host $ sh-4.4# cat /etc/kubernetes/kubelet.conf | grep nodeStatusUpdateFrequency "nodeStatusUpdateFrequency": "1m0s", $ oc get KubeControllerManager -o yaml | grep -A 1 node-monitor node-monitor-grace-period: - 5m0s $ oc get KubeAPIServer -o yaml | grep -A 1 default- default-not-ready-toleration-seconds: - "60" Default-unreachable-toleration-seconds: - "60" 3. *oc edit nodes.config/cluster* spec: workerLatencyProfile: MediumUpdateAverageReaction 4. Wait for components to complete using oc get KubeControllerManager -o yaml | grep -i workerlatency -A 5 -B 5 and oc get KubeAPIServer -o yaml | grep -i workerlatency -A 5 -B 5 5. Validate medium component values, hitting error here
Actual results:
% oc get KubeControllerManager -o yaml | grep -A 1 node-monitor node-monitor-grace-period: - 2m0s prubenda@prubenda1-mac lrc % oc get KubeAPIServer -o yaml | grep -A 1 default- default-not-ready-toleration-seconds: - "60" default-unreachable-toleration-seconds: - "60" sh-5.1# cat /etc/kubernetes/kubelet.conf | grep nodeStatusUpdateFrequency "nodeStatusUpdateFrequency": "1m0s",
Expected results:
$ oc debug node/<worker-node-name> $ chroot /host $ sh-4.4# cat /etc/kubernetes/kubelet.conf | grep nodeStatusUpdateFrequency "nodeStatusUpdateFrequency": "20s", $ oc get KubeControllerManager -o yaml | grep -A 1 node-monitor node-monitor-grace-period: - 2m0s $ oc get KubeAPIServer -o yaml | grep -A 1 default- default-not-ready-toleration-seconds: - "60" default-unreachable-toleration-seconds: - "60"
Additional info:
The documentation states that workers will go disabled while the change is being applied, but I never saw that occur
Description of problem:
Due to rpm-ostree regression (OKD-63) MCO was copying /var/lib/kubelet/config.json into /run/ostree/auth.json on FCOS and SCOS. This breaks Assisted Installer flow, which starts with Live ISO and doesn't have /var/lib/kubelet/config.json
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Context:
As a SRE / cluster service / dev I'd like to have the ability to identify trends on the duration of granular components that belong to HC/NodePools and that might affect our SLOs, e.g etcd, infra, ignition, nodes.
DoD:
Add metrics to visualise components duration of transitions.
Start with a few and agree on the approach.
Follow up.
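One possible starting point, sketched in Go with illustrative (not real) metric names: a histogram observed once per transition, at the moment the controller sees a condition flip, rather than recomputed on every resync.
~~~
// Package transitionmetrics: sketch only, not HyperShift's real metrics package.
package transitionmetrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var conditionTransitionSeconds = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "hypershift_condition_transition_seconds", // hypothetical name
		Help:    "Seconds from resource creation until a condition first became True.",
		Buckets: prometheus.ExponentialBuckets(30, 2, 10), // 30s up to roughly 4h
	},
	[]string{"resource", "condition"},
)

func init() {
	prometheus.MustRegister(conditionTransitionSeconds)
}

// ObserveTransition is called by a controller when it detects that a condition
// has just flipped to True, so each transition is recorded exactly once.
func ObserveTransition(resource, condition string, createdAt, transitionedAt time.Time) {
	conditionTransitionSeconds.
		WithLabelValues(resource, condition).
		Observe(transitionedAt.Sub(createdAt).Seconds())
}
~~~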
Add a page to our documentation to describe what information needs to be gathered in the case of a failure/bug.
Document how to use the `hypershift dump cluster` command.
We are investigating issues with storage usage in production. Reverting until we have a root cause
Description of problem:
In an install where users bring their own networks, they also bring their own NSGs. However, the installer still creates an NSG. In Azure environments using rule [1] below, users are prohibited from installing a cluster, because the apiserver_in rule is set to 0.0.0.0 [2]. Allowing users to define this before install would let them set up this connectivity without the open inbound access. [1] - Rule: Network Security Groups shall not allow rule with 0.0.0.0/Any Source/Destination IP Addresses - Custom Deny [2] - https://github.com/openshift/installer/blob/master/data/data/azure/vnet/nsg.tf#L31
E2E tests fail because the OpenShift Pipelines operator could not be found.
Description of problem:
Pipelines as Code has been GA for some time, so we should remove the Tech Preview badge from the PAC pages.
Version-Release number of selected component (if applicable):
4.13
Description of problem:
No timezone info in installer logs
Version-Release number of selected component (if applicable):
4.x
How reproducible:
100%
Steps to Reproduce:
1. openshift-install wait-for install-complete --dir=./foo 2. 3.
Actual results:
INFO Waiting up to 1h0m0s (until 4:52PM) for the cluster at https://api.ocp.example.local:6443 to initialize...
Expected results:
INFO Waiting up to 1h0m0s (until 4:52PM UTC) for the cluster at https://api.ocp.example.local:6443 to initialize...
Additional info:
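For illustration only (not the installer's actual logging code), a small Go snippet showing the difference between the current and requested formats:
~~~
package main

import (
	"fmt"
	"time"
)

func main() {
	timeout := time.Hour
	deadline := time.Now().Add(timeout)

	// Current style: "until 4:52PM" with no timezone.
	fmt.Printf("Waiting up to %s (until %s) for the cluster to initialize...\n",
		timeout, deadline.Format(time.Kitchen))

	// Requested style: "until 4:52PM UTC" with the timezone abbreviation included.
	fmt.Printf("Waiting up to %s (until %s) for the cluster to initialize...\n",
		timeout, deadline.Format("3:04PM MST"))
}
~~~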
Description of problem:
We should disable the netlink mode of the netclass collector in Node Exporter. The netlink mode of the netclass collector was introduced into Node Exporter in 4.13. When using netlink mode, several metrics become unavailable, so disabling it avoids confusing users who upgrade the OCP cluster to a new version and find several NIC metrics missing.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Using default config of CMO, Node Exporter's netclass collector is running in netlink mode. The argument `--collector.netclass.netlink` is present in the `node-exporter` container in `node-exporter` daemonset.
Expected results:
Using default config of CMO, Node Exporter's netclass collector is running in classic mode. The argument `--collector.netclass.netlink` is absent in the `node-exporter` container in `node-exporter` daemonset.
Additional info:
This is a new test being added in 1.26; we'll be getting it after https://github.com/openshift/origin/pull/27694 merges
The OLM descriptors README references an "action" descriptor that was never implemented. This needs to be removed to eliminate confusion.
Description of problem:
I have to create this OCPBUG in order to backport a test to the 4.14 branch.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/prometheus-operator/pull/222
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The following test is permafailing in Prow CI: [tuningcni] sysctl allowlist update [It] should start a pod with custom sysctl only after adding sysctl to allowlist https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-kni-cnf-features-deploy-master-e2e-gcp-ovn-periodic/1640987392103944192 [tuningcni] 9915/go/src/github.com/openshift-kni/cnf-features-deploy/cnf-tests/testsuites/e2esuite/security/tuning.go:26 9916 sysctl allowlist update 9917 /go/src/github.com/openshift-kni/cnf-features-deploy/cnf-tests/testsuites/e2esuite/security/tuning.go:141 9918 should start a pod with custom sysctl only after adding sysctl to allowlist 9919 /go/src/github.com/openshift-kni/cnf-features-deploy/cnf-tests/testsuites/e2esuite/security/tuning.go:156 9920 > Enter [BeforeEach] [tuningcni] - /go/src/github.com/openshift-kni/cnf-features-deploy/cnf-tests/testsuites/pkg/execute/ginkgo.go:9 @ 03/29/23 10:08:49.855 9921 < Exit [BeforeEach] [tuningcni] - /go/src/github.com/openshift-kni/cnf-features-deploy/cnf-tests/testsuites/pkg/execute/ginkgo.go:9 @ 03/29/23 10:08:49.855 (0s) 9922 > Enter [BeforeEach] sysctl allowlist update - /go/src/github.com/openshift-kni/cnf-features-deploy/cnf-tests/testsuites/e2esuite/security/tuning.go:144 @ 03/29/23 10:08:49.855 9923 < Exit [BeforeEach] sysctl allowlist update - /go/src/github.com/openshift-kni/cnf-features-deploy/cnf-tests/testsuites/e2esuite/security/tuning.go:144 @ 03/29/23 10:08:49.896 (41ms) 9924 > Enter [It] should start a pod with custom sysctl only after adding sysctl to allowlist - /go/src/github.com/openshift-kni/cnf-features-deploy/cnf-tests/testsuites/e2esuite/security/tuning.go:156 @ 03/29/23 10:08:49.896 9925 [FAILED] Unexpected error: 9926 <*errors.errorString | 0xc00044eec0>: { 9927 s: "timed out waiting for the condition", 9928 } 9929 timed out waiting for the condition 9930 occurred9931 In [It] at: /go/src/github.com/openshift-kni/cnf-features-deploy/cnf-tests/testsuites/e2esuite/security/tuning.go:186 @ 03/29/23 10:09:53.377
Version-Release number of selected component (if applicable):
master (4.14)
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Test fails
Expected results:
Test passes
Additional info:
PR https://github.com/openshift-kni/cnf-features-deploy/pull/1445 adds some useful information to the reported archive.
The installer offers a graph command to output its internal dependency graph. It could be useful to have a similar command, i.e. agent graph, to output the agent-specific dependency graph
Description of problem:
When importing a Serverless Service from a git repository, the topology shows an Open URL decorator even when the "Add Route" checkbox was unselected (it is selected by default).
The created kn Route makes the Service available within the cluster and the created URL looks like this: http://nodeinfo-private.serverless-test.svc.cluster.local
So the Service is NOT accidentally exposed. It's "just" that we link an internal route that will not be accessible to the user.
This might happen also for Serverless functions import flow and the import container image import flow.
Version-Release number of selected component (if applicable):
Tested older versions and could see this at least on 4.10+
How reproducible:
Always
Steps to Reproduce:
Actual results:
The topology shows the new kn Service with a Open URL decorator on the top right corner.
The button is clickable, but the target page cannot be opened (as expected).
Expected results:
The topology should not show an Open URL decorator for "private" kn Routes.
The topology sidebar shows similar information; we should maybe replace the link there as well with a text + copy button?
A fix should be tested as well with Serverless functions as container images!
Additional info:
When the user unselects the "Add route" option an additional label is added to the kn Service. This label could also be added and removed later. When this label is specified the Open URL decorator should not be shown:
metadata: labels: networking.knative.dev/visibility: cluster-local
See also:
Description of problem:
SSH keys not configured on the worker nodes
Version-Release number of selected component (if applicable):
4.14.0-0.ci-2023-07-14-014011
How reproducible:
so far 100%
Steps to Reproduce:
1. Deploy baremetal cluster using IPI flow 2. 3.
Actual results:
Deployment succeeds but SSH keys not configured on the worker nodes
Expected results:
SSH keys configured on the worker nodes
Additional info:
SSH keys configured on the control-plane nodes
ssh core@master-0-0 'cat .ssh/authorized_keys.d/ignition' ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDm9hb6iTZJypEmzg4IZ767ze60UGhBWnjPXhovWVB7uKputdLzZhmlo36ifkXr/DTk8NGm47r6kXmz9NAF0pDHa5jX6yJFnhS4z5NY/mzsUX41gwiqBKYHgdp/KE1ylE8mbNon5ZpaaGvb876myjjPjPwWsD8hvXZirA5Q8TfDb/Pvgy1dhVH/uN05Ip1vVsp+bFGMPUJVWVUy/Eby5xW6OJv+FBOQq4nu6tslDZlHYXX2TSGrlW4x0i/oQMpKu/Y8ygAdjWqmAy6UBcho1nNWy15cp0jI5Fhjze171vSWZLAqJY+eFcL2kt/09RnY+MXyY/tIf+qNMyBE2Qltigah
apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig metadata: creationTimestamp: "2023-07-14T12:13:00Z" generation: 1 labels: machineconfiguration.openshift.io/role: worker name: 99-worker-ssh resourceVersion: "2242" uid: 0ef02005-509e-4fc9-91ee-fc0afe27d5e6 spec: config: ignition: version: 3.2.0 passwd: users: - name: core sshAuthorizedKeys: - | ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDm9hb6iTZJypEmzg4IZ767ze60UGhBWnjPXhovWVB7uKputdLzZhmlo36ifkXr/DTk8NGm47r6kXmz9NAF0pDHa5jX6yJFnhS4z5NY/mzsUX41gwiqBKYHgdp/KE1ylE8mbNon5ZpaaGvb876myjjPjPwWsD8hvXZirA5Q8TfDb/Pvgy1dhVH/uN05Ip1vVsp+bFGMPUJVWVUy/Eby5xW6OJv+FBOQq4nu6tslDZlHYXX2TSGrlW4x0i/oQMpKu/Y8ygAdjWqmAy6UBcho1nNWy15cp0jI5Fhjze171vSWZLAqJY+eFcL2kt/09RnY+MXyY/tIf+qNMyBE2Qltigah extensions: null fips: false kernelArguments: null kernelType: "" osImageURL: ""
Description of problem:
After further discussion about https://issues.redhat.com/browse/RFE-3383 we have concluded that it needs to be addressed in 4.12, since OVNK will be the default there. I'm opening this so we can backport the fix. The fix is simply to alter the logic around enabling nodeip-configuration to handle the vSphere-unique case where the platform type is "vsphere" and the VIP field is not populated (see the sketch at the end of this item).
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
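The sketch referenced in the description, in Go, with simplified inputs and platform names (this is not the actual nodeip-configuration enablement code):
~~~
package main

import "fmt"

// enableNodeIPConfiguration is a simplified stand-in for the enablement logic.
// The apiVIP argument is kept to show the case this bug is about: on vSphere the
// service must now be enabled even when no VIP is populated.
func enableNodeIPConfiguration(platformType, apiVIP string) bool {
	switch platformType {
	case "BareMetal", "OpenStack", "Ovirt", "Nutanix":
		// On-prem platforms with VIPs already run nodeip-configuration.
		return true
	case "VSphere":
		// Previously only the VIP case was covered; with the fix the service is
		// enabled whether or not apiVIP is set.
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println(enableNodeIPConfiguration("VSphere", "")) // true after the fix
}
~~~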
Description of the problem:
4.14 jobs relying on LSO are failing because we should use the version N-1 for LSO.
Something similar to https://github.com/openshift/assisted-service/pull/4753 should be merged.
Actual results:
Job fail with:
++ make deploy_assisted_operator test_kube_api_parallel Error from server (NotFound): namespaces "assisted-spoke-cluster" not found error: the server doesn't have a resource type "clusterimageset" namespace "assisted-installer" deleted error: the server doesn't have a resource type "agentserviceconfigs" error: the server doesn't have a resource type "localvolume" Error from server (NotFound): catalogsources.operators.coreos.com "assisted-service-catalog" not found
Expected results:
Job should be a success
Description of problem:
Changes to platform fields e.g. aws instance type doesn't trigger a rolling upgrade
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create a hostedCluster with nodepool on AWS 2. Change the instance type field on the nodepool spec.platfrom.aws
Actual results:
Machines are not restarted and the instance type didn't change
Expected results:
Machines are recreated with the new instance type
Additional info:
This is a result of the recent changes to CAPI, which introduced in-place propagation of labels and annotations. Solution: the MachineTemplate name should not be constant and should change with each spec change, so that spec.infraRef in the MachineDeployment is updated and a rolling upgrade is triggered.
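A minimal Go sketch of that solution (field names are illustrative, not HyperShift's API): derive the MachineTemplate name from a hash of the NodePool platform spec, so any spec change yields a new template name, a new spec.infraRef, and therefore a rolling upgrade.
~~~
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// awsNodePoolPlatform is a stand-in for the real platform spec type.
type awsNodePoolPlatform struct {
	InstanceType string `json:"instanceType"`
	RootVolumeGB int    `json:"rootVolumeGB"`
}

// machineTemplateName is stable for an unchanged spec and changes whenever the
// spec changes, which is what triggers the MachineDeployment rollout.
func machineTemplateName(nodePool string, platform awsNodePoolPlatform) string {
	raw, _ := json.Marshal(platform)
	sum := sha256.Sum256(raw)
	return fmt.Sprintf("%s-%x", nodePool, sum[:4])
}

func main() {
	fmt.Println(machineTemplateName("workers", awsNodePoolPlatform{InstanceType: "m5.large", RootVolumeGB: 120}))
	fmt.Println(machineTemplateName("workers", awsNodePoolPlatform{InstanceType: "m5.xlarge", RootVolumeGB: 120}))
}
~~~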
In order to avoid possible issues with SDN during migration from SDN to OVNK, do not use port 9106 for ovnkube-control-plane metrics, since it's already used by SDN. Use a port that is not used by SDN, such as 9108.
Description of the problem:
Creating a cluster with ingress VIPs and user managed network will return an error
{ "lastProbeTime": "2023-03-01T18:50:41Z", "lastTransitionTime": "2023-03-01T18:50:41Z", "message": "The Spec could not be synced due to an input error: API VIP cannot be set with User Managed Networking", "reason": "InputError", "status": "False", "type": "SpecSynced" }
but setting ingress VIPs with user managed networking set to false, and then editing only user managed networking to true, will not result in any error. Will the cluster be using user managed networking in this case?
How reproducible:
Steps to reproduce:
1. apply
apiVersion: extensions.hive.openshift.io/v1beta1
kind: AgentClusterInstall
metadata:
name: acimulinode
namespace: mfilanov
spec:
apiVIP: 1.2.3.8
apiVIPs:
- 1.2.3.8
clusterDeploymentRef:
name: multinode
imageSetRef:
name: img4.12.5-x86-64-appsub
ingressVIP: 1.2.3.10
platformType: BareMetal
networking:
clusterNetwork:
- cidr: 10.128.0.0/14
hostPrefix: 23
serviceNetwork:
- 172.30.0.0/16
userManagedNetworking: false
provisionRequirements:
controlPlaneAgents: 3
compute:
- hyperthreading: Enabled
name: worker
controlPlane:
hyperthreading: Enabled
name: master
2. check conditions
kubectl get aci -n mfilanov -o json | jq .items[].status.conditions[] { "lastProbeTime": "2023-03-01T18:52:08Z", "lastTransitionTime": "2023-03-01T18:52:08Z", "message": "SyncOK", "reason": "SyncOK", "status": "True", "type": "SpecSynced" }
3. edit user managed network and apply again
apiVersion: extensions.hive.openshift.io/v1beta1
kind: AgentClusterInstall
metadata:
name: acimulinode
namespace: mfilanov
spec:
apiVIP: 1.2.3.8
apiVIPs:
- 1.2.3.8
clusterDeploymentRef:
name: multinode
imageSetRef:
name: img4.12.5-x86-64-appsub
ingressVIP: 1.2.3.10
platformType: BareMetal
networking:
clusterNetwork:
- cidr: 10.128.0.0/14
hostPrefix: 23
serviceNetwork:
- 172.30.0.0/16
userManagedNetworking: true
provisionRequirements:
controlPlaneAgents: 3
compute:
- hyperthreading: Enabled
name: worker
controlPlane:
hyperthreading: Enabled
name: master
Actual results:
kubectl get aci -n mfilanov -o json | jq .items[].status.conditions[] { "lastProbeTime": "2023-03-01T18:52:08Z", "lastTransitionTime": "2023-03-01T18:52:08Z", "message": "SyncOK", "reason": "SyncOK", "status": "True", "type": "SpecSynced" }
Expected results:
We should probably get an error because the ingress VIPs are already set.
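A minimal Go sketch (not assisted-service's real validation) of the check this report expects to run on updates as well as on creation: user-managed networking is incompatible with API/ingress VIPs regardless of which field was edited last.
~~~
package main

import (
	"errors"
	"fmt"
)

type networkingSpec struct {
	UserManagedNetworking bool
	APIVIPs               []string
	IngressVIPs           []string
}

func validateNetworking(spec networkingSpec) error {
	if spec.UserManagedNetworking && (len(spec.APIVIPs) > 0 || len(spec.IngressVIPs) > 0) {
		return errors.New("API and ingress VIPs cannot be set with user-managed networking")
	}
	return nil
}

func main() {
	// Mirrors the reproduction: VIPs already set, then userManagedNetworking edited to true.
	spec := networkingSpec{
		UserManagedNetworking: true,
		APIVIPs:               []string{"1.2.3.8"},
		IngressVIPs:           []string{"1.2.3.10"},
	}
	fmt.Println(validateNetworking(spec)) // expected: an error, not SyncOK
}
~~~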
Description of problem:
While trying to update build01 from 4.13.rc2->4.13.rc3, the MCO degraded upon trying to upgrade the first master node. The error being: E0414 15:42:29.597388 2323546 writer.go:200] Marking Degraded due to: exit status 1 Which I mapped to this line: https://github.com/openshift/machine-config-operator/blob/release-4.13/pkg/daemon/update.go#L1551 I think this error can be improved since it is a bit confusing, but that's not the main problem. We noticed that the actual issue was that there is an existing "/home/core/.ssh" directory, that seemed to have been created by 4.13.rc2 (but could have been earlier), that belonged to the root user, as such when we attempted to create the folder via runuser core by hand, it failed with permission denied (and since we return the exec status, I think it just returned status 1 and not this error message). I am currently not sure if we introduced something that caused this issue. There was an ssh (only on master pool) in that build01 cluster for 600 days already, so it must have worked in the past? Workaround is to delete the .ssh folder and let the MCD recreate it
Version-Release number of selected component (if applicable):
4.13.rc3
How reproducible:
Uncertain, but it shouldn't be very high, otherwise we would have run into this in CI much more, I think.
Steps to Reproduce:
1. create some 4.12 cluster with sshkey 2. upgrade to 4.13.rc2 3. upgrade to 4.13.rc3
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/oc/pull/1408
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
We are sending logs to a path like `/api/assisted-install/v2/clusters/05811ea0-33ff-461d-8898-7aed48224218/logs?logs_type=node-boot&host_id=f6baac5b-65a4-5838-bba7-6a240f4ea9d3` indefinitely, as long as a host reboots.
When rebooting, it doesn't matter whether logs were already sent in the past; the host will still send logs.
Example:
The above cluster installed successfully March 3rd 2023 @ 22:20:06
Please review the following PR: https://github.com/openshift/cluster-cloud-controller-manager-operator/pull/235
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
After kube was bumped in cluster-kube-apiserver-operator, the alert needs to use the next Kubernetes version in its PromQL query
Description of problem:
When the user's pull secret contains a JSON null in the "auth" or "email" keys, assisted service crashes when we attempt to create the cluster: May 31 21:06:27 example.dev.local service[3389]: time="2023-05-31T09:06:27Z" level=error msg="Failed to registered cluster example with id 3648b06e-4745-4542-9421-78ae2e249c0d" func="github. com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).RegisterClusterInternal.func1" file="/src/internal/bminventory/inventory.go:448" cluster_id=3648b06e-4745-4542-9421- 78ae2e249c0d go-id=162 pkg=Inventory request_id=1252f666-cf5c-4aae-9be7-7b7a579b5bf6 May 31 21:06:27 example.dev.local service[3389]: 2023/05/31 09:06:27 http: panic serving 10.116.24.118:46262: interface conversion: interface {} is nil, not string May 31 21:06:27 example.dev.local service[3389]: goroutine 162 [running]: May 31 21:06:27 example.dev.local service[3389]: net/http.(*conn).serve.func1() May 31 21:06:27 example.dev.local service[3389]: /usr/lib/golang/src/net/http/server.go:1850 +0xbf May 31 21:06:27 example.dev.local service[3389]: panic({0x25d0000, 0xc00148d7d0}) May 31 21:06:27 example.dev.local service[3389]: /usr/lib/golang/src/runtime/panic.go:890 +0x262 May 31 21:06:27 example.dev.local service[3389]: github.com/openshift/assisted-service/internal/cluster/validations.ParsePullSecret({0xc001ed0780, 0x1c6}) May 31 21:06:27 example.dev.local service[3389]: /src/internal/cluster/validations/validations.go:106 +0x718 May 31 21:06:27 example.dev.local service[3389]: github.com/openshift/assisted-service/internal/cluster/validations.(*registryPullSecretValidator).ValidatePullSecret(0xc0005880c0, {0xc001ed0780?, 0x7?}, {0x29916da, 0x5}) May 31 21:06:27 example.dev.local service[3389]: /src/internal/cluster/validations/validations.go:160 +0x54 May 31 21:06:27 example.dev.local service[3389]: github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).ValidatePullSecret(...) 
May 31 21:06:27 example.dev.local service[3389]: /src/internal/bminventory/inventory.go:279 May 31 21:06:27 example.dev.local service[3389]: github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).RegisterClusterInternal(0xc00112f880, {0x2fd3e20, 0xc00148cd50}, 0x0, {0xc0007c0400, 0xc0008d69a0}) May 31 21:06:27 example.dev.local service[3389]: /src/internal/bminventory/inventory.go:564 +0x16d0 May 31 21:06:27 example.dev.local service[3389]: github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).V2RegisterCluster(0x2fd3e20?, {0x2fd3e20?, 0xc00148cd50?}, {0xc0007c0400?, 0xc0008d69a0?}) May 31 21:06:27 example.dev.local service[3389]: /src/internal/bminventory/inventory_v2_handlers.go:42 +0x39 May 31 21:06:27 example.dev.local service[3389]: github.com/openshift/assisted-service/restapi.HandlerAPI.func59({0xc0007c0400?, 0xc0008d69a0?}, {0x2390b20?, 0xc0014e0240?}) May 31 21:06:27 example.dev.local service[3389]: /src/restapi/configure_assisted_install.go:639 +0xaf May 31 21:06:27 example.dev.local service[3389]: github.com/openshift/assisted-service/restapi/operations/installer.V2RegisterClusterHandlerFunc.Handle(0xc000a9d068?, {0xc0007c0400?, 0xc0008d69a0?}, {0x2390b20?, 0xc0014e0240?}) May 31 21:06:27 example.dev.local service[3389]: /src/restapi/operations/installer/v2_register_cluster.go:19 +0x3d May 31 21:06:27 example.dev.local service[3389]: github.com/openshift/assisted-service/restapi/operations/installer.(*V2RegisterCluster).ServeHTTP(0xc000571470, {0x2fc7140, 0xc00034c040}, 0xc0007c0400) May 31 21:06:27 example.dev.local service[3389]: /src/restapi/operations/installer/v2_register_cluster.go:66 +0x298 May 31 21:06:27 example.dev.local service[3389]: github.com/go-openapi/runtime/middleware.NewOperationExecutor.func1({0x2fc7140, 0xc00034c040}, 0xc0007c0400) May 31 21:06:27 example.dev.local service[3389]: /src/vendor/github.com/go-openapi/runtime/middleware/operation.go:28 +0x59
Version-Release number of selected component (if applicable):
4.12.17
How reproducible:
Probably 100%
Steps to Reproduce:
1. Add to the pull secret in install-config.yaml an auth like: "example.com": { "auth": null, "email": null } 2. Generate the agent ISO as usual using "openshift-install agent create image" 3. Boot the ISO on the cluster hosts.
Actual results:
The create-cluster-and-infraenv.service fails to complete. In its log it reports: Failed to register cluster with assisted-service: Post \"http://10.1.1.2:8090/api/assisted-install/v2/clusters\": EOF
Expected results:
Cluster is installed.
Additional info:
This is particularly difficult to debug because users don't generally give us their pull secrets. The pull secret file in the agent-gather bundle has individual fields redacted, so it is a better guide than the install-config where the whole thing may be redacted.
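For context, here is a minimal Go sketch of the kind of defensive decoding that would avoid the panic, assuming a generic map-based decode similar to what the stack trace points at in validations.ParsePullSecret; the function and field names below are illustrative, not the assisted-service code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parseAuths decodes a pull secret and tolerates JSON null (or non-string)
// values in the "auth" field instead of panicking on a failed type
// assertion. Names are illustrative.
func parseAuths(pullSecret string) (map[string]string, error) {
	var doc struct {
		Auths map[string]map[string]interface{} `json:"auths"`
	}
	if err := json.Unmarshal([]byte(pullSecret), &doc); err != nil {
		return nil, fmt.Errorf("invalid pull secret JSON: %w", err)
	}
	auths := map[string]string{}
	for registry, entry := range doc.Auths {
		raw, present := entry["auth"]
		auth, ok := raw.(string) // a comma-ok assertion never panics on nil
		if !present || !ok || auth == "" {
			return nil, fmt.Errorf("pull secret for %q has a missing or null auth value", registry)
		}
		auths[registry] = auth
	}
	return auths, nil
}

func main() {
	_, err := parseAuths(`{"auths":{"example.com":{"auth":null,"email":null}}}`)
	fmt.Println(err) // surfaces as a validation error rather than a panic
}
```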
DoD:
Let the HO export a metric with its own version so that, as an SRE, I can easily understand which version is running where by looking at a Grafana dashboard.
Context:
As we start receiving metrics consistently in OCM environments and we are creating SLO dashboards that can consume data from any data source (prod/stage/CI), we also want to revisit how we are sending metrics and make sure we are doing it in the most effective way. We currently have some wonky data coming through in prod.
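A minimal sketch of what such a version metric could look like, using an "info"-style gauge from prometheus/client_golang; the metric name hypershift_operator_info and the version string are assumptions for illustration, not the operator's actual code.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// hypershiftOperatorInfo is an "info"-style gauge: the value is always 1 and
// the interesting data lives in the labels, so a Grafana panel can simply
// group by the version label. The metric name is hypothetical.
var hypershiftOperatorInfo = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "hypershift_operator_info",
		Help: "Constant metric carrying the operator version as a label.",
	},
	[]string{"version"},
)

func main() {
	prometheus.MustRegister(hypershiftOperatorInfo)
	// The version string would normally be injected at build time via ldflags.
	hypershiftOperatorInfo.WithLabelValues("v0.1.0-example").Set(1)

	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```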
DoD:
At the moment we have a high-frequency reconciliation loop in which we constantly review the overall state of the world by looping over all clusters.
We should review this approach and, where possible for each specific metric, record metrics/events directly in the controllers/reconcile loop as the change happens, only once rather than repeatedly in a loop.
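As a sketch of the suggested approach, and assuming the controllers use prometheus/client_golang, a metric could be set directly from the reconcile path instead of from a periodic loop; all names below are illustrative.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// clusterAvailable is updated from the reconcile path itself, once per state
// transition, instead of being recomputed for every cluster in a periodic
// "state of the world" loop. Metric and label names are illustrative.
var clusterAvailable = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "hypershift_cluster_available",
		Help: "1 when the hosted cluster is available, 0 otherwise.",
	},
	[]string{"name", "namespace"},
)

// recordAvailability would be called at the end of the controller's
// Reconcile function for the cluster it just processed, so the metric is
// updated exactly when the controller already holds the fresh state.
func recordAvailability(name, namespace string, available bool) {
	v := 0.0
	if available {
		v = 1.0
	}
	clusterAvailable.WithLabelValues(name, namespace).Set(v)
}

func main() {
	prometheus.MustRegister(clusterAvailable)
	recordAvailability("example", "clusters", true)
	fmt.Println("recorded availability for example/clusters")
}
```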
Description of problem:
While mirroring to the filesystem, if a 429 error is received from the registry, the layer is incorrectly flagged as having been mirrored and is therefore not picked up by subsequent mirror re-run requests. This gives the impression that the second mirror-to-filesystem attempt is successful, but it causes problems when mirroring from the filesystem to the target registry (due to missing files).
Version-Release number of selected component (if applicable):
oc version Client Version: 4.8.42 Server Version: 4.8.14 Kubernetes Version: v1.21.1+a620f50
How reproducible:
Whenever a 429 occurs while mirroring to the filesystem.
Steps to Reproduce:
1. Run the mirror-to-filesystem command:
   oc image mirror -f mirror-to-filesystem.txt --filter-by-os '.*' -a $REGISTRY_AUTH_FILE --insecure --skip-multiple-scopes --max-per-registry=1 --continue-on-error=true --dir "$LOCAL_DIR_PATH"
   Output:
   info: Mirroring completed in 2h19m24.14s (25.75MB/s)
   error: one or more errors occurred
   E.g. error: unable to push <registry>/namespace/<image-name>: failed to retrieve blob <image-digest>: error parsing HTTP 429 response body: unexpected end of JSON input: ""
2. Re-run the mirror-to-filesystem command:
   oc image mirror -f mirror-to-filesystem.txt --filter-by-os '.*' -a $REGISTRY_AUTH_FILE --insecure --skip-multiple-scopes --max-per-registry=1 --continue-on-error=true --dir "$LOCAL_DIR_PATH"
   Output:
   info: Mirroring completed in 480ms (0B/s)
3. Run the mirror-from-filesystem command:
   oc image mirror -f mirror-from-filesystem.txt -a $REGISTRY_AUTH_FILE --from-dir "$LOCAL_DIR_PATH" --filter-by-os '.*' --insecure --skip-multiple-scopes --max-per-registry=1 --continue-on-error=true
   Output:
   info: Mirroring completed in 53m5.21s (67.61MB/s)
   error: one or more errors occurred
   E.g. error: unable to push file://local/namespace/<image-name>: failed to retrieve blob <image-digest>: open /root/local/namespace/<image-name>/blobs/<image-digest>: no such file or directory
Actual results:
1) Mirror to filesystem, first attempt:
   info: Mirroring completed in 2h19m24.14s (25.75MB/s)
   error: one or more errors occurred
   E.g. error: unable to push <registry>/namespace/<image-name>: failed to retrieve blob <image-digest>: error parsing HTTP 429 response body: unexpected end of JSON input: ""
2) Mirror to filesystem, second attempt:
   info: Mirroring completed in 480ms (0B/s)
3) Mirror from filesystem to target registry:
   info: Mirroring completed in 53m5.21s (67.61MB/s)
   error: one or more errors occurred
   E.g. error: unable to push file://local/namespace/<image-name>: failed to retrieve blob <image-digest>: open /root/local/namespace/<image-name>/blobs/<image-digest>: no such file or directory
Expected results:
Both the mirror from the source images to the filesystem and the mirror from the filesystem to the target registry should complete successfully.
Additional info:
Description of the problem:
Currently the `pre-network-manager-config.service` that we use to create static network configurations from the non minimal discovery ISO may run after NetworkManager, and therefore the configurations that it generates may be ignored.
How reproducible:
Not always reproducible, it is time sensitive. Has been observed when there is a large number of static network configurations. See OCPBUGS-16219 for details and steps to reproduce.
Please review the following PR: https://github.com/openshift/console-operator/pull/737
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
4.14 indexes have been bootstrapped and published on the registry. I was told they have to be added to https://github.com/operator-framework/operator-marketplace/blob/master/defaults/03_community_operators.yaml until they can be used in OCP clusters.
Version-Release number of selected component (if applicable):
OCP 4.14
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
4.14 indexes were bootstrapped in CLOUDDST-17591
Description of problem:
Observation from the CIS v1.4 PDF, 1.1.9: Ensure that the Container Network Interface file permissions are set to 600 or more restrictive. "Container Network Interface provides various networking options for overlay networking. You should consult their documentation and restrict their respective file permissions to maintain the integrity of those files. Those files should be writable by only the administrators on the system."
To conform with the CIS benchmarks, the /var/run/multus/cni/net.d/*.conf files on nodes should be updated to 600.
$ for i in $(oc get pods -n openshift-multus -l app=multus -oname); do oc exec -n openshift-multus $i -- /bin/bash -c "stat -c \"%a %n\" /host/var/run/multus/cni/net.d/*.conf"; done
644 /host/var/run/multus/cni/net.d/80-openshift-network.conf
644 /host/var/run/multus/cni/net.d/80-openshift-network.conf
644 /host/var/run/multus/cni/net.d/80-openshift-network.conf
644 /host/var/run/multus/cni/net.d/80-openshift-network.conf
644 /host/var/run/multus/cni/net.d/80-openshift-network.conf
644 /host/var/run/multus/cni/net.d/80-openshift-network.conf
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-07-20-215234
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
The file permissions of /var/run/multus/cni/net.d/*.conf on nodes is 644.
Expected results:
The file permissions of /var/run/multus/cni/net.d/*.conf on nodes should be updated to 600
Additional info:
Description of problem:
OCM-o does not support obtaining verbosity through OpenShiftControllerManager.operatorLogLevel object
Version-Release number of selected component (if applicable):
How reproducible:
Modify OpenShiftControllerManager.operatorLogLevel, and the OCM-o operator will not display the corresponding logs.
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/kube-state-metrics/pull/91
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
As a cluster-admin, users can see the Pipelines section while using the `import from git` feature in the Developer perspective of the web console. However, if a user logs in as a normal user or a project admin, they are not able to see the Pipelines section.
Version-Release number of selected component (if applicable):
Tested in OCP v4.12.18 and v4.12.20
How reproducible:
Always
Steps to Reproduce:
Prerequisite: Install the Red Hat OpenShift Pipelines operator
1. Log in as a kube-admin user from the web console
2. Go to the Developer view
3. Click on +Add
4. Under Git Repository, open the Import from Git page
5. Enter a Git repo URL (example: https://github.com/spring-projects/spring-petclinic)
6. Check if there are 3 sections: General, Pipelines, Advanced options
7. Then log in as a project admin user
8. Perform steps 2 to 6 again
Actual results:
The Pipelines section is not visible when logged in as a project admin; only the General and Advanced options sections are visible in Import from Git. However, the Pipelines section is visible as a cluster-admin.
Expected results:
The Pipelines section should be visible when logged in as a project admin, along with the General and Advanced options sections in Import from Git.
Additional info:
I checked by creating separate rolebindings and clusterrolebindings to assign access to pipeline resources like below:
~~~
$ oc create clusterrole pipelinerole1 --verb=create,get,list,patch,delete --resource=tektonpipelines,openshiftpipelinesascodes
$ oc create clusterrole pipelinerole2 --verb=create,get,list,patch,delete --resource=repositories,pipelineruns,pipelines
$ oc adm policy add-cluster-role-to-user pipelinerole1 user1
$ oc adm policy add-role-to-user pipelinerole2 user1
~~~
However, even after assigning these rolebindings/clusterrolebindings to the users, the users are still not able to see the Pipelines section.
Description of problem:
oc explain tests have to be enabled to ensure openapi/v3 is working properly. The tests have been temporarily disabled in order to unblock the oc kube bump (https://github.com/openshift/oc/pull/1420). The following efforts need to be done/merged to make openapi/v3 work:
- [DONE] oauth-apiserver kube bump: https://github.com/openshift/oauth-apiserver/pull/89
- [DONE] merge kubectl fix backport https://github.com/kubernetes/kubernetes/pull/118930 and bump kube dependency in oc to include this fix (https://github.com/openshift/oc/pull/1515)
- [DONE] merge https://github.com/kubernetes/kubernetes/pull/118881 and carry this PR in our kube-apiserver to stop oc explain being flaky (https://github.com/openshift/kubernetes/pull/1629)
- [DONE] merge https://github.com/kubernetes/kubernetes/pull/118879 and carry this PR in our kube-apiserver to enable apiservices (https://github.com/openshift/kubernetes/pull/1630)
- [DONE] make openapi/v3 work for our special groups https://github.com/openshift/kubernetes/pull/1654 (https://github.com/openshift/kubernetes/pull/1617#issuecomment-1609864043, slack discussion: https://redhat-internal.slack.com/archives/CC3CZCQHM/p1687882255536949?thread_ts=1687822265.954799&cid=CC3CZCQHM)
- [DONE] enable back oc explain tests: https://github.com/openshift/origin/pull/28155 and bring in new tests: https://github.com/openshift/origin/pull/28129
- [OPTIONAL] bring in additional upstream kubectl/oc explain tests: https://github.com/kubernetes/kubernetes/pull/118885
- [OPTIONAL] backport https://github.com/kubernetes/kubernetes/pull/119839 and https://github.com/kubernetes/kubernetes/pull/119841 (backport of https://github.com/kubernetes/kubernetes/pull/118881 and https://github.com/kubernetes/kubernetes/pull/118879)
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The OCP upgrade blocks because the cluster operator csi-snapshot-controller fails to start its deployment with a fatal "read-only filesystem" message.
Version-Release number of selected component (if applicable):
Red Hat OpenShift 4.11 rhacs-operator.v3.72.1
How reproducible:
At least once in user's cluster while upgrading
Steps to Reproduce:
1. Have OCP 4.11 installed
2. Install ACS on top of the OCP cluster
3. Upgrade OCP to the next z-stream version
Actual results:
Upgrade gets blocked: waiting on csi-snapshot-controller
Expected results:
Upgrade should succeed
Additional info:
The stackrox SCCs (stackrox-admission-control, stackrox-collector and stackrox-sensor) have `readOnlyRootFilesystem` set to `true`. If an SCC is not explicitly defined/requested, other Pods might receive one of these SCCs, which makes their deployment fail with a `read-only filesystem` message.
Description of problem:
CCPMSO uses a copy of the manifests from openshift/api. However, these appear out-of-sync with respect to the vendored version of openshift/api
Description of problem:
The cluster-api pod can't create events due to RBAC. We may miss some useful events because of this.
E0503 07:20:44.925786 1 event.go:267] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"ad1-workers-f5f568855-vnzmn.175b911e43aa3f41", GenerateName:"", Namespace:"ocm-integration-23frm3gtnh3cf212daoe1a13su7buqk4-ad1", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Machine", Namespace:"ocm-integration-23frm3gtnh3cf212daoe1a13su7buqk4-ad1", Name:"ad1-workers-f5f568855-vnzmn", UID:"2b40a694-d36d-4b13-9afc-0b5daeecc509", APIVersion:"cluster.x-k8s.io/v1beta1", ResourceVersion:"144260357", FieldPath:""}, Reason:"DetectedUnhealthy", Message:"Machine ocm-integration-23frm3gtnh3cf212daoe1a13su7buqk4-ad1/ad1-workers/ad1-workers-f5f568855-vnzmn/ has unhealthy node ", Source:v1.EventSource{Component:"machinehealthcheck-controller", Host:""}, FirstTimestamp:time.Date(2023, time.May, 3, 7, 20, 44, 923289409, time.Local), LastTimestamp:time.Date(2023, time.May, 3, 7, 20, 44, 923289409, time.Local), Count:1, Type:"Normal", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'events is forbidden: User "system:serviceaccount:ocm-integration-23frm3gtnh3cf212daoe1a13su7buqk4-ad1:cluster-api" cannot create resource "events" in API group "" in the namespace "ocm-integration-23frm3gtnh3cf212daoe1a13su7buqk4-ad1"' (will not retry!)
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Always
Steps to Reproduce:
1. Create a hosted cluster
2. Check the cluster-api pod for some kind of error (e.g. slow node startup)
Actual results:
Error
Expected results:
Event generated
Additional info:
ClusterRole hypershift-cluster-api is created here https://github.com/openshift/hypershift/blob/e7eb32f259b2a01e5bbdddf2fe963b82b331180f/hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go#L2720
We should add create/patch/update permissions for events there.
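A sketch of the kind of rule that would be added, built with the rbacv1 types; the ClusterRole name matches the one mentioned above, but the rest is illustrative rather than the actual HyperShift code.

```go
package main

import (
	"fmt"

	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

// capiClusterRole sketches the ClusterRole with an extra rule that lets the
// cluster-api service account emit Events in its namespace.
func capiClusterRole() *rbacv1.ClusterRole {
	return &rbacv1.ClusterRole{
		TypeMeta:   metav1.TypeMeta{APIVersion: "rbac.authorization.k8s.io/v1", Kind: "ClusterRole"},
		ObjectMeta: metav1.ObjectMeta{Name: "hypershift-cluster-api"},
		Rules: []rbacv1.PolicyRule{
			{
				APIGroups: []string{""}, // core API group
				Resources: []string{"events"},
				Verbs:     []string{"create", "patch", "update"},
			},
			// ...the existing rules for machines, machinesets, etc. would remain.
		},
	}
}

func main() {
	out, _ := yaml.Marshal(capiClusterRole())
	fmt.Println(string(out))
}
```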
Description of problem:
IPI installation failed in AWS; CreateVpcEndpoint is not supported in the C2S region.
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
IPI installation in AWS:
1. terraform apply
2. When using an aws_vpc_endpoint resource with the AWS terraform provider >= 2.53.0 in the C2S regions (us-iso*), an error is thrown stating UnsupportedOperation.
Actual results:
Unable to install OCP 4.X in AWS C2S(top-secret) region
Expected results:
IPI installation succeeds in the AWS C2S region.
Additional info:
Upstream bug: https://github.com/hashicorp/terraform-provider-aws/issues/27048 ("[Bug]: C2S CreateVpcEndpoint UnsupportedOperation: The operation is not supported in this region!")
Description of problem:
After adding additional CPU and memory to the OpenShift Container Platform 4 control-plane nodes, it was noticed that a new MachineConfig was rolled out, causing all nodes to reboot unexpectedly. Interestingly, no new MachineConfig was rendered; instead a slightly older MachineConfig was picked and applied to all nodes after the change on the control-plane nodes was performed. The only visible change found in the MachineConfig was that nodeStatusUpdateFrequency was updated from 10s to 0s, even though nodeStatusUpdateFrequency is not specified or configured in any MachineConfig or KubeletConfig. https://issues.redhat.com/browse/OCPBUGS-6723 was found, but given that the affected cluster is running 4.11.35 it's difficult to understand what happened, as that problem was/is generally suspected to be solved.
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.11.35
How reproducible:
Unknown
Steps to Reproduce:
1. OpenShift Container Platform 4 on AWS 2. Updating OpenShift Container Platform 4 - Control-Plane Node(s) to add more CPU and Memory 3. Check whether a potential MachineConfig update is being applied
Actual results:
A MachineConfig update is rolled out to all nodes after adding CPU and memory to the control-plane nodes, because nodeStatusUpdateFrequency is updated, which is unexpected and it is not clear why it happens.
Expected results:
Either no new MachineConfig should be rolled out after such a change, or a newly rendered MachineConfig should be rolled out with information about what changed and why the change was applied.
Additional info:
This is a clone of issue OCPBUGS-18832. The following is the description of the original issue:
—
Description of problem:
The console does not allow customizing the abbreviation that appears on the resource icon badge. This causes an issue for the FAR operator with the CRD FenceAgentRemediationTemplate: the badge icon shows FART. The CRD includes a custom short name, but the console ignores it.
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Create the CRD (included link to github)
2. Navigate to Home -> Search
3. Enter "far" into the Resources filter
Actual results:
The badge FART shows in the dropdown
Expected results:
The badge should show fartemplate - the content of the short name
Additional info:
Description of problem:
After installing a disconnected private cluster, SSH to the master/bootstrap nodes from the bastion on the VPC failed.
Version-Release number of selected component (if applicable):
Pre-merge build https://github.com/openshift/installer/pull/6836 registry.build05.ci.openshift.org/ci-ln-5g4sj02/release:latest Tag: 4.13.0-0.ci.test-2023-02-27-033047-ci-ln-5g4sj02-latest
How reproducible:
always
Steps to Reproduce:
1. Create the bastion instance maxu-ibmj-p1-int-svc
2. Create a VPC on the bastion host
3. Install a private disconnected cluster on the bastion host with a mirror registry
4. SSH to the bastion
5. SSH to the master/bootstrap nodes from the bastion
Actual results:
[core@maxu-ibmj-p1-int-svc ~]$ ssh -i ~/openshift-qe.pem core@10.241.0.5 -v
OpenSSH_8.8p1, OpenSSL 3.0.5 5 Jul 2022
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Reading configuration data /etc/ssh/ssh_config.d/50-redhat.conf
debug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config
debug1: configuration requests final Match pass
debug1: re-parsing configuration
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Reading configuration data /etc/ssh/ssh_config.d/50-redhat.conf
debug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config
debug1: Connecting to 10.241.0.5 [10.241.0.5] port 22.
debug1: connect to address 10.241.0.5 port 22: Connection timed out
ssh: connect to host 10.241.0.5 port 22: Connection timed out
Expected results:
SSH succeeds.
Additional info:
$ ibmcloud is sg-rules r014-5a6c16f4-8a4c-4c02-ab2d-626c14f72a77 --vpc maxu-ibmj-p1-vpc
Listing rules of security group r014-5a6c16f4-8a4c-4c02-ab2d-626c14f72a77 under account OpenShift-QE as user ServiceId-dff277a9-b608-410a-ad24-c544e59e3778...
ID                                          Direction   IP version   Protocol                      Remote
r014-6739d68f-6827-41f4-b51a-5da742c353b2   outbound    ipv4         all                           0.0.0.0/0
r014-06d44c15-d3fd-4a14-96c4-13e96aa6769c   inbound     ipv4         all                           shakiness-perfectly-rundown-take
r014-25b86956-5370-4925-adaf-89dfca9fb44b   inbound     ipv4         tcp Ports:Min=22,Max=22       0.0.0.0/0
r014-e18f0f5e-c4e5-44a5-b180-7a84aa59fa97   inbound     ipv4         tcp Ports:Min=3128,Max=3129   0.0.0.0/0
r014-7e79c4b7-d0bb-4fab-9f5d-d03f6b427d89   inbound     ipv4         icmp Type=8,Code=0            0.0.0.0/0
r014-03f23b04-c67a-463d-9754-895b8e474e75   inbound     ipv4         tcp Ports:Min=5000,Max=5000   0.0.0.0/0
r014-8febe8c8-c937-42b6-b352-8ae471749321   inbound     ipv4         tcp Ports:Min=6001,Max=6002   0.0.0.0/0
We should also garbage collect failed-to-register events, and possibly other orphaned events.
Due to enabling the upstream node-logs viewer feature we have to temporarily disable this test, since the plan to switch to the upstream version requires the following steps in order:
1. Modify current patches to match upstream change (being done as part of 1.27 bump)
2. Modify oc to work with both old and new API (being done in parallel with 1.27 bump, will be linked below).
3. Land k8s 1.27.
4. Modify machine-config-operator to enable enableSystemLogQuery config option (can land only after k8s 1.27, will be linked below).
5. Bring the test back.
Our telemetry test using remote write is increasingly flaky. The recurring error is:
TestTelemeterRemoteWrite telemeter_test.go:103: timed out waiting for the condition: error validating response body "{\"status\":\"success\",\"data\":{\"resultType\":\"vector\",\"result\":[{\"metric\":{\"container\":\"kube-rbac-proxy\",\"endpoint\":\"metrics\",\"job\":\"prometheus-k8s\",\"namespace\":\"openshift-monitoring\",\"remote_name\":\"2bdd72\",\"service\":\"prometheus-k8s\",\"url\":\"https://infogw.api.openshift.com/metrics/v1/receive\"},\"value\":[1684889572.197,\"20.125925925925927\"]}]}}" for query "max without(pod,instance) (rate(prometheus_remote_storage_samples_failed_total{job=\"prometheus-k8s\",url=~\"https://infogw.api.openshift.com.+\"}[5m]))": expecting Prometheus remote write to see no failed samples but got 20.125926
Any failed samples will cause this test to fail. This is perhaps too strict a requirement. We could consider it good enough if some samples are sent successfully. The current version tests telemeter behavior on top of CMO behavior.
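As a sketch of how the assertion could be relaxed (an assumption about a possible fix, not the test's actual code), the check could compare failed samples against total samples and only fail above a small ratio.

```go
package main

import "fmt"

// tolerateTransientFailures illustrates a looser success criterion for the
// remote-write test: instead of requiring zero failed samples, accept the
// run as long as the failure ratio stays under a small threshold. The
// threshold is made up for illustration.
func tolerateTransientFailures(failed, total float64) error {
	const maxFailureRatio = 0.01 // allow up to 1% of samples to fail
	if total == 0 {
		return fmt.Errorf("no samples sent at all")
	}
	if ratio := failed / total; ratio > maxFailureRatio {
		return fmt.Errorf("remote write failure ratio %.4f exceeds %.2f", ratio, maxFailureRatio)
	}
	return nil
}

func main() {
	fmt.Println(tolerateTransientFailures(20, 100000)) // <nil>: tolerated
	fmt.Println(tolerateTransientFailures(20, 100))    // error: too many failures
}
```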
Description of problem:
When running the installer on OSP with:
[...]
controlPlane:
  name: master
  platform: {}
  replicas: 3
[...]
in the install-config.yaml, it panics:
DEBUG OpenShift Installer 4.14.0-0.nightly-2023-07-20-215234
DEBUG Built from commit 1e9209ac80ed2cb4ba5663f519e51161a1d8858a
DEBUG Fetching Metadata...
DEBUG Loading Metadata...
DEBUG Loading Cluster ID...
DEBUG Loading Install Config...
DEBUG Loading SSH Key...
DEBUG Loading Base Domain...
DEBUG Loading Platform...
DEBUG Loading Cluster Name...
DEBUG Loading Base Domain...
DEBUG Loading Platform...
DEBUG Loading Networking...
DEBUG Loading Platform...
DEBUG Loading Pull Secret...
DEBUG Loading Platform...
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x3956f6d]

goroutine 1 [running]:
github.com/openshift/installer/pkg/types/conversion.convertOpenStack(0xc000464dc0)
  /go/src/github.com/openshift/installer/pkg/types/conversion/installconfig.go:172 +0x1cd
github.com/openshift/installer/pkg/types/conversion.ConvertInstallConfig(0xc000464dc0)
  /go/src/github.com/openshift/installer/pkg/types/conversion/installconfig.go:47 +0x2af
github.com/openshift/installer/pkg/asset/installconfig.(*AssetBase).LoadFromFile(0xc000a18180, {0x20f8c650?, 0xc000696b40?})
  /go/src/github.com/openshift/installer/pkg/asset/installconfig/installconfigbase.go:64 +0x32b
github.com/openshift/installer/pkg/asset/installconfig.(*InstallConfig).Load(0xc000a18180, {0x20f8c650?, 0xc000696b40?})
  /go/src/github.com/openshift/installer/pkg/asset/installconfig/installconfig.go:118 +0x2e
github.com/openshift/installer/pkg/asset/store.(*storeImpl).load(0xc0008f3f20, {0x20f95950, 0xc0002f9a40}, {0xc000af060c, 0x4})
  /go/src/github.com/openshift/installer/pkg/asset/store/store.go:263 +0x35f
github.com/openshift/installer/pkg/asset/store.(*storeImpl).load(0xc0008f3f20, {0x20f95920, 0xc00040cf60}, {0x819d89a, 0x2})
  /go/src/github.com/openshift/installer/pkg/asset/store/store.go:246 +0x256
github.com/openshift/installer/pkg/asset/store.(*storeImpl).load(0xc0008f3f20, {0x7fed58b9ec98, 0x25ba8530}, {0x0, 0x0})
  /go/src/github.com/openshift/installer/pkg/asset/store/store.go:246 +0x256
github.com/openshift/installer/pkg/asset/store.(*storeImpl).fetch(0xc0008f3f20, {0x7fed58b9ec98, 0x25ba8530}, {0x0, 0x0})
  /go/src/github.com/openshift/installer/pkg/asset/store/store.go:200 +0x1a9
github.com/openshift/installer/pkg/asset/store.(*storeImpl).Fetch(0x7ffd6b4992ff?, {0x7fed58b9ec98, 0x25ba8530}, {0x25b8ea80, 0x8, 0x8})
  /go/src/github.com/openshift/installer/pkg/asset/store/store.go:76 +0x48
main.runTargetCmd.func1({0x7ffd6b4992ff, 0x6})
  /go/src/github.com/openshift/installer/cmd/openshift-install/create.go:260 +0x126
main.runTargetCmd.func2(0x25b96920?, {0xc0002f8100?, 0x4?, 0x4?})
  /go/src/github.com/openshift/installer/cmd/openshift-install/create.go:290 +0xe7
github.com/spf13/cobra.(*Command).execute(0x25b96920, {0xc0002f80c0, 0x4, 0x4})
  /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:920 +0x847
github.com/spf13/cobra.(*Command).ExecuteC(0xc000a0c000)
  /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:1040 +0x3bd
github.com/spf13/cobra.(*Command).Execute(...)
  /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:968
main.installerMain()
  /go/src/github.com/openshift/installer/cmd/openshift-install/main.go:61 +0x2b0
main.main()
  /go/src/github.com/openshift/installer/cmd/openshift-install/main.go:38 +0xff
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-07-20-215234
How reproducible:
Always
Steps to Reproduce:
1. Create the install-config.yaml with an empty controlPlane.platform 2. Run the installer
Actual results:
Panic
Expected results:
A controlled error message if the platform section is strictly necessary; otherwise, a successful installation.
Additional info:
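A minimal sketch of the kind of nil guard that would avoid the panic in convertOpenStack when controlPlane.platform is empty; the types below are simplified stand-ins, not the installer's actual structs.

```go
package main

import "fmt"

// Simplified stand-ins for the installer's install-config types.
type OpenStackMachinePool struct{ Zones []string }

type MachinePoolPlatform struct{ OpenStack *OpenStackMachinePool }

type MachinePool struct {
	Name     string
	Platform MachinePoolPlatform
}

// convertControlPlane shows the guard: when the platform section is empty
// ("platform: {}" in install-config.yaml), skip the OpenStack-specific
// conversion instead of dereferencing a nil pointer.
func convertControlPlane(cp *MachinePool) error {
	if cp == nil || cp.Platform.OpenStack == nil {
		// Nothing OpenStack-specific to convert; not an error.
		return nil
	}
	fmt.Println("converting zones:", cp.Platform.OpenStack.Zones)
	return nil
}

func main() {
	cp := &MachinePool{Name: "master"}    // equivalent of platform: {}
	fmt.Println(convertControlPlane(cp)) // <nil>, no panic
}
```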
Description of problem:
When using the command `oc-mirror --config config-oci-target.yaml docker://localhost:5000 --use-oci-feature --dest-use-http --dest-skip-tls`, the command exits with code 0 but prints a log message like: unable to parse reference oci://mno/redhat-operator-index:v4.12: lstat /mno: no such file or directory.
Version-Release number of selected component (if applicable):
oc-mirror version Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.13.0-202303011628.p0.g2e3885b.assembly.stream-2e3885b", GitCommit:"2e3885b469ee7d895f25833b04fd609955a2a9f6", GitTreeState:"clean", BuildDate:"2023-03-01T16:49:12Z", GoVersion:"go1.19.2", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
always
Steps to Reproduce:
1. With an imagesetconfig like:
cat config-oci-target.yaml
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
storageConfig:
  local:
    path: /home/ocmirrortest/0302/60597
mirror:
  operators:
  - catalog: oci:///home/ocmirrortest/noo/redhat-operator-index
    targetCatalog: mno/redhat-operator-index
    targetTag: v4.12
    packages:
    - name: aws-load-balancer-operator
2. Run `oc-mirror --config config-oci-target.yaml docker://localhost:5000 --use-oci-feature --dest-use-http --dest-skip-tls`
Actual results:
The command exits with code 0 but prints logs like:
sha256:95c45fae0ca9e9bee0fa2c13652634e726d8133e4e3009b363fcae6814b3461d localhost:5000/albo/aws-load-balancer-rhel8-operator:95c45f
sha256:ab38b37c14f7f0897e09a18eca4a232a6c102b76e9283e401baed832852290b5 localhost:5000/albo/aws-load-balancer-rhel8-operator:ab38b3
info: Mirroring completed in 43.87s (28.5MB/s)
Rendering catalog image "localhost:5000/mno/redhat-operator-index:v4.12" with file-based catalog
Writing image mapping to oc-mirror-workspace/results-1677743154/mapping.txt
Writing CatalogSource manifests to oc-mirror-workspace/results-1677743154
Writing ICSP manifests to oc-mirror-workspace/results-1677743154
unable to parse reference oci://mno/redhat-operator-index:v4.12: lstat /mno: no such file or directory
Expected results:
No such log message should be printed.
Description of problem:
While troubleshooting a problem, oc incorrectly recommended the deprecated command "oc adm registry" in its output text.
Version-Release number of selected component (if applicable):
$ oc version Client Version: 4.12.0-202302280915.p0.gb05f7d4.assembly.stream-b05f7d4 Kustomize Version: v4.5.7 Server Version: 4.12.6 Kubernetes Version: v1.25.4+18eadca Though this is likely broken in all previous version of openshift4
How reproducible:
Only during error conditions where this error message is printed.
Steps to Reproduce:
1. Have a cluster without proper storage configured for the registry
2. Try to build something
3. "oc status --suggest" prints a message with the deprecated "oc adm registry" command
Actual results:
$ oc status --suggest
In project pvctest on server https://api.pelauter-bm01.lab.home.arpa:6443
https://my-test-pvctest.apps.pelauter-bm01.lab.home.arpa (redirects) to pod port 8080-tcp (svc/my-test)
  deployment/my-test deploys istag/my-test:latest <-
    bc/my-test source builds https://github.com/sclorg/django-ex.git on openshift/python:3.9-ubi8
    build #1 new for 3 hours (can't push to image)
  deployment #1 running for 3 hours - 0/1 pods

Errors:
* bc/my-test is pushing to istag/my-test:latest, but the administrator has not configured the integrated container image registry.
  try: oc adm registry -h

^ "oc adm registry" is deprecated in OpenShift 4; this should guide the user to the registry operator instead.
Expected results:
A pointer to the proper feature to manage the registry, like the openshift registry operator.
Additional info:
I know my cluster is not set up correctly, but oc should still not give me incorrect information. If this version of oc is expected to also work against ocp3 clusters, the fix should take this into account, where that command is still valid.
Description of problem:
CCO watches too many things.
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Run CCO in a cluster with a large amount of data in ConfigMaps or Secrets or Namespaces. 2. Watch memory usage scale linearly with the size of both. 3.
Actual results:
Memory usage scales linearly with the size of all ConfigMaps, Secrets and Namespaces on the cluster.
Expected results:
Memory usage scales linearly with the data CCO actually needs to function.
Additional info:
Description of problem:
The external link icon in the `resource added` toast notification is not linked and cannot be clicked to open the app URL.
Version-Release number of selected component (if applicable):
4.13
How reproducible:
Steps to Reproduce:
1. Use the +Add page and import from git
2. After creating the app, a toast notification will appear
3. Click the external link icon
Actual results:
External link icon is not part of the link but has a pointer cursor and a hover effect. Clicking this icon does nothing.
Expected results:
External link icon should be part of the link and clickable.
Additional info:
We set image links in CMO's jsonnet code, as these can sometimes be used to populate labels and it is generally considered good documentation practice.
In a cluster these links are replaced by CVO.
prometheus-adapter is now a k8s project and has moved locations accordingly from directxman12/k8s-prometheus-adapter to kubernetes-sigs/prometheus-adapter. This should be reflected in our image links, set at https://github.com/openshift/cluster-monitoring-operator/blob/35a063722c7e3b68d57aed18dc81f0dbdfbfc004/jsonnet/main.jsonnet#L66.
Description of the problem:
In Staging, with BE 2.20.1, trying to turn the "Integrate with platform" switch on results in:
Failed to update the cluster
only x86-64 CPU architecture is supported on Nutanix clusters
How reproducible:
100%
Steps to reproduce:
1. Create new cluster with OCP multi version
2. Discover NTNX hosts and turn integrate with platform on
3.
Actual results:
Expected results:
Description of problem:
Reported in https://github.com/openshift/cluster-ingress-operator/issues/911
When you open a new issue, it still directs you to Bugzilla, and then doesn't work.
It can be changed here: https://github.com/openshift/cluster-ingress-operator/blob/master/.github/ISSUE_TEMPLATE/config.yml
, but to what?
The correct Jira link is
https://issues.redhat.com/secure/CreateIssueDetails!init.jspa?pid=12332330&issuetype=1&components=12367900&priority=10300&customfield_12316142=26752
But can the public use this mechanism? Yes - https://redhat-internal.slack.com/archives/CB90SDCAK/p1682527645965899
Version-Release number of selected component (if applicable):
n/a
How reproducible:
May be in other repos too.
Steps to Reproduce:
1. Open an Issue in the repo - click on New Issue
2. Follow the directions and click on the link to open Bugzilla
3. Get a message that this doesn't work anymore
Actual results:
You get instructions that don't work to open a bug from an Issue.
Expected results:
You get instructions to just open an Issue, or get correct instructions on how to open a bug using Jira.
Additional info:
We need to enable the vSphere CSI driver to use the UseCSINodeID feature, so that it is at feature parity with upstream.
Description of problem:
Create a private Shared VPC cluster on AWS; the Ingress operator is degraded due to the following error:
2023-06-14T09:55:50.240Z INFO operator.dns_controller controller/controller.go:118 reconciling {"request": {"name":"default-wildcard","namespace":"openshift-ingress-operator"}}
2023-06-14T09:55:50.363Z ERROR operator.dns_controller dns/controller.go:354 failed to publish DNS record to zone {"record": {"dnsName":"*.apps.ci-op-2x6lics3-849ce.qe.devcluster.openshift.com.","targets":["internal-ac656ce4d29f64da289152053f50c908-1642793317.us-east-1.elb.amazonaws.com"],"recordType":"CNAME","recordTTL":30,"dnsManagementPolicy":"Managed"}, "dnszone": {"id":"Z0698684SM2RRJSYHP43"}, "error": "failed to get hosted zone for load balancer target \"internal-ac656ce4d29f64da289152053f50c908-1642793317.us-east-1.elb.amazonaws.com\": couldn't find hosted zone ID of ELB internal-ac656ce4d29f64da289152053f50c908-1642793317.us-east-1.elb.amazonaws.com"}
ingress operator:
ingress   False   True   True   37m   The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: DNSReady=False (FailedZones: The record failed to provision in some zones: [{Z0698684SM2RRJSYHP43 map[]}])
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-06-13-223353
How reproducible:
always
Steps to Reproduce:
1. Create a private Shared VPC cluster on AWS using STS
Actual results:
ingress operator degraded
Expected results:
cluster is healthy
Additional info:
A public cluster has no such issue.
Description of problem:
Older images are pulled even when using minVersion in ImageSetConfiguration.
Version-Release number of selected component (if applicable):
oc mirror version
Client Version: version.Info
How reproducible:
Always
Steps to Reproduce:
1. get attached ImageSetConfiguration
2. run 'oc mirror --config=./image-set.yaml docker://<yourRegistry> --continue-on-error'
Actual results:
The output contains a lot of 'unable to retrieve source image' errors for images that are older than what is defined in minVersion. Those images are known to be missing; the goal was to use minVersion to filter out those older images and get rid of the errors, but it is not working.
Expected results:
Those older images should not be included
Additional info:
image-set.yaml is attached
Full output of 'oc mirror' attached
There are more images failing but as an example:
error: unable to retrieve source image registry.redhat.io/openshift-service-mesh/pilot-rhel8 manifest sha256:f7c468b5a35bfce54e53b4d8d00438f33a0861549697d14445eae52d8ead9a68: for image pulls. Use the equivalent V2 schema 2 manifest digest instead. For more information see https://access.redhat.com/articles/6138332
This image is from version 1.0.11 but minVersion: '2.2.1-0' so it should not be included.
Here is how I checked that image:
podman inspect registry-proxy.engineering.redhat.com/rh-osbs/openshift-service-mesh-pilot-rhel8@sha256:f7c468b5a35bfce54e53b4d8d00438f33a0861549697d14445eae52d8ead9a68 | grep version
    "istio_version": "1.1.17",
    "version": "1.0.11"
    "istio_version": "1.1.17",
    "version": "1.0.11"
This is a clone of issue OCPBUGS-8512. The following is the description of the original issue:
—
Description of problem:
WebhookConfiguration caBundle injection is incorrect when some webhooks are already configured with a caBundle. The behavior seems to be that only the first n webhooks in the `.webhooks` array have the caBundle injected, where n is the number of webhooks that do not have a caBundle set.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Create a validatingwebhookconfigurations or mutatingwebhookconfigurations with the `service.beta.openshift.io/inject-cabundle: "true"` annotation.
2. oc edit validatingwebhookconfigurations (or oc edit mutatingwebhookconfigurations)
3. Add a new webhook to the end of the list `.webhooks`. It will not have caBundle set manually, as service-ca should inject it.
4. Observe the new webhook does not get the caBundle injected.
Note: it is important in step 3 that the new webhook is added to the end of the list.
Actual results:
Only the first n webhooks have caBundle injected where n is the number of webhooks without caBundle set.
Expected results:
All webhooks have caBundle injected when they do not have it set.
Additional info:
Open PR here: https://github.com/openshift/service-ca-operator/pull/207. The issue seems to be a mistake with Go's for-range syntax, where the loop position "i" is used instead of the stored index of the webhook that should be updated. tl;dr: the code should use the value held in the array of indices, not the position within that array.
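A small self-contained Go example of the pitfall described above, with the webhook types simplified for illustration (the real service-ca-operator types differ); it reproduces the symptom where only the first n webhooks get a caBundle.

```go
package main

import "fmt"

type webhook struct {
	Name     string
	CABundle []byte
}

func main() {
	hooks := []webhook{
		{Name: "a", CABundle: []byte("preset")},
		{Name: "b"},
		{Name: "c"},
	}
	ca := []byte("injected")

	// Indices of the webhooks that still need a caBundle (here: 1 and 2).
	var toInject []int
	for i := range hooks {
		if len(hooks[i].CABundle) == 0 {
			toInject = append(toInject, i)
		}
	}

	// Buggy pattern: using the loop position i instead of the stored index,
	// so only the first len(toInject) webhooks in the slice get updated.
	for i := range toInject {
		hooks[i].CABundle = ca // wrong element when the indices don't start at 0
	}

	// Correct pattern: use the value held in the slice of indices.
	for _, idx := range toInject {
		hooks[idx].CABundle = ca
	}

	for _, wh := range hooks {
		fmt.Printf("%s: %q\n", wh.Name, wh.CABundle)
	}
}
```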
Description of problem:
monitoringPlugin tolerations not working
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
apply monitoringPlugin tolerations to cm `cluster-monitoring-config`
example:
...
monitoringPlugin:
  tolerations:
  - key: "key1"
    operator: "Equal"
    value: "value1"
    effect: "NoSchedule"
Actual results:
The cm is applied, but the tolerations do not take effect on the deployment.
Expected results:
The tolerations should be applied to the deployment/pod.
Additional info:
The same applies to nodeSelector and topologySpreadConstraints.
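A sketch of what honoring these settings could look like on the operator side: copying the user-supplied tolerations and nodeSelector onto the plugin Deployment's pod template. The config type and function below are assumptions for illustration, not CMO's actual code.

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// MonitoringPluginConfig mirrors the shape of the cluster-monitoring-config
// stanza shown above (illustrative, not the operator's real type).
type MonitoringPluginConfig struct {
	Tolerations  []corev1.Toleration
	NodeSelector map[string]string
}

// applyPluginConfig copies the user-supplied scheduling settings onto the
// plugin Deployment's pod template, which is the step the bug report says is
// currently missing.
func applyPluginConfig(d *appsv1.Deployment, cfg MonitoringPluginConfig) {
	if len(cfg.Tolerations) > 0 {
		d.Spec.Template.Spec.Tolerations = cfg.Tolerations
	}
	if len(cfg.NodeSelector) > 0 {
		d.Spec.Template.Spec.NodeSelector = cfg.NodeSelector
	}
}

func main() {
	d := &appsv1.Deployment{}
	applyPluginConfig(d, MonitoringPluginConfig{
		Tolerations: []corev1.Toleration{{
			Key:      "key1",
			Operator: corev1.TolerationOpEqual,
			Value:    "value1",
			Effect:   corev1.TaintEffectNoSchedule,
		}},
	})
	fmt.Println(d.Spec.Template.Spec.Tolerations)
}
```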
Description of problem:
The prometheus-operator pod has the "app.kubernetes.io/version: 0.63.0" annotation while it's based on 0.65.1.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Check app.kubernetes.io/version annotations for prometheus-operator pod. 2. 3.
Actual results:
0.63.0
Expected results:
0.65.1
Additional info:
This is a clone of issue OCPBUGS-19715. The following is the description of the original issue:
—
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Due to the EOL of RHV in OCP, we'll need to disable oVirt as an installation option in the installer.
Note: The first step is disabling it. Removing all related code from the installer will be done in a later release.
Please review the following PR: https://github.com/openshift/cluster-ingress-operator/pull/898
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
The dev workflow for OCP operators wanting to use feature gates is
1) change openshift/api
2) bump openshift/api in cluster-config-operator (CCO)
3) bump openshift/api in your operator and add logic for the feature gate
Currently, hypershift requires its own bump of openshift/api in order to set the proper feature gates, and this is not preferred. It is preferred that the single place where an API bump is required is cluster-config-operator.
Hypershift should use CCO `render` command to generate the FeatureGate CR
Description of problem:
If we add a ConfigMap to a BuildConfig as a build input, the ConfigMap data is not present at the destinationDir on the build pod.
Version-Release number of selected component (if applicable):
How reproducible:
Follow the steps below to reproduce.
Steps to Reproduce:
1. Create a ConfigMap to pass as a build input:
apiVersion: v1
data:
  settings.xml: |+
    xxx
    yyy
kind: ConfigMap
metadata:
  name: build-test
  namespace: test
2. Create a BuildConfig like below:
apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  labels:
    app: custom-build
  name: custom-build
spec:
  source:
    configMaps:
    - configMap:
        name: build-test
      destinationDir: /tmp
    type: None
  output:
    to:
      kind: ImageStreamTag
      name: custom-build:latest
  postCommit: {}
  runPolicy: Serial
  strategy:
    customStrategy:
      from:
        kind: "DockerImage"
        name: "registry.redhat.io/rhel8/s2i-base"
3. Start a new build: oc start-build custom-build
4. As per the documentation[a], the ConfigMap data should be present on the build pod at "/var/run/secrets/openshift.io/build" if we don't explicitly set "destinationDir". In the example above, "destinationDir" is set to "/tmp", so the "settings.xml" file from the ConfigMap should be present in the "/tmp" directory of the build pod.
[a] https://docs.openshift.com/container-platform/4.12/cicd/builds/creating-build-inputs.html#builds-custom-strategy_creating-build-inputs
Actual results:
The ConfigMap data is not present at the destinationDir or in the default location "/var/run/secrets/openshift.io/build".
Expected results:
Configmap data should be present on the destinationDir of the builder pod.
Additional info:
Description of problem:
As a user, when I select the All Projects option from the Project dropdown on the Dev perspective Pipelines pages, the selected option shows as undefined.
Version-Release number of selected component (if applicable):
4.13
How reproducible:
Steps to Reproduce:
1. Navigate to Pipelines page in the Dev perspective 2. Select the All projects option from the Projects dropdown
Actual results:
The selected option shows as undefined and the all-Projects list is not shown.
Expected results:
The selected option should be All Projects, and it should open the all-Projects list page.
Additional info:
This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were not completed when this image was assembled
Follow up for https://issues.redhat.com/browse/HOSTEDCP-975
This is to allow multiple tables in a single view with filtering
Description of problem:
The IBM VPC Block CSI Driver fails to provision volumes in a proxy cluster. If I understand correctly, it seems the proxy is not injected: in our definition (https://github.com/openshift/ibm-vpc-block-csi-driver-operator/blob/master/assets/controller.yaml) we inject the proxy into a container named csi-driver (config.openshift.io/inject-proxy: csi-driver, config.openshift.io/inject-proxy-cabundle: csi-driver), but the container name is actually iks-vpc-block-driver in https://github.com/openshift/ibm-vpc-block-csi-driver-operator/blob/master/assets/controller.yaml#L153. I checked that the proxy is not defined in the controller pod or in the driver container ENV.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-08-11-055332
How reproducible:
Always
Steps to Reproduce:
1. Create an IBM cluster with proxy settings
2. Create a PVC/pod with the IBM VPC CSI Driver
Actual results:
It fails to provision the volume.
Expected results:
Volume provisioning works on the proxy cluster.
Additional info:
Description of problem:
When using the command `oc-mirror list operators --catalog=registry.redhat.io/redhat/certified-operator-index:v4.12 -v 9`, the response code is initially 200 OK; the command then hangs for a while and eventually gets a 401 response code.
Version-Release number of selected component (if applicable):
How reproducible:
sometimes
Steps to Reproduce:
Using the advanced cluster management package as an example. 1. oc-mirror list operators --catalog=registry.redhat.io/redhat/certified-operator-index:v4.12 -v 9
Actual results: After hanging for a while, a 401 code is returned; it seems that when the request times out and oc-mirror retries, it forgets to read the credentials again.
level=debug msg=fetch response received digest=sha256:a67257cfe913ad09242bf98c44f2330ec7e8261ca3a8db3431cb88158c3d4837 mediatype=application/vnd.docker.image.rootfs.diff.tar.gzip response.header.accept-ranges=bytes response.header.age=714959 response.header.connection=keep-alive response.header.content-length=80847073 response.header.content-type=binary/octet-stream response.header.date=Mon, 06 Feb 2023 06:52:06 GMT response.header.etag="a428fafd37ee58f4bdeae1a7ff7235b5-1" response.header.last-modified=Fri, 16 Sep 2022 17:54:09 GMT response.header.server=AmazonS3 response.header.via=1.1 010c0731b9775a983eceaec0f5fa6a2e.cloudfront.net (CloudFront) response.header.x-amz-cf-id=rEfKWnJdasWIKnjWhYyqFn9eHY8v_3Y9WwSRnnkMTkPayHlBxWX1EQ== response.header.x-amz-cf-pop=HIO50-C1 response.header.x-amz-replication-status=COMPLETED response.header.x-amz-server-side-encryption=AES256 response.header.x-amz-storage-class=INTELLIGENT_TIERING response.header.x-amz-version-id=GfqTTjWbdqB0sreyjv3fyo1k6LQ9kZKC response.header.x-cache=Hit from cloudfront response.status=200 OK size=80847073 url=https://registry.redhat.io/v2/redhat/certified-operator-index/blobs/sha256:a67257cfe913ad09242bf98c44f2330ec7e8261ca3a8db3431cb88158c3d4837 level=debug msg=fetch response received digest=sha256:d242c7b4380d3c9db3ac75680c35f5c23639a388ad9313f263d13af39a9c8b8b mediatype=application/vnd.docker.image.rootfs.diff.tar.gzip response.header.accept-ranges=bytes response.header.age=595868 response.header.connection=keep-alive response.header.content-length=98028196 response.header.content-type=binary/octet-stream response.header.date=Tue, 07 Feb 2023 15:56:56 GMT response.header.etag="f702c84459b479088565e4048a890617-1" response.header.last-modified=Wed, 18 Jan 2023 06:55:12 GMT response.header.server=AmazonS3 response.header.via=1.1 7f5e0d3b9ea85d0d75063a66c0ebc840.cloudfront.net (CloudFront) response.header.x-amz-cf-id=Tw9cjJjYCy8idBiQ1PvljDkhAoEDEzuDCNnX6xJub4hGeh8V0CIP_A== response.header.x-amz-cf-pop=HIO50-C1 response.header.x-amz-replication-status=COMPLETED response.header.x-amz-server-side-encryption=AES256 response.header.x-amz-storage-class=INTELLIGENT_TIERING response.header.x-amz-version-id=nt7yY.YmjWF0pfAhzh_fH2xI_563GnPz response.header.x-cache=Hit from cloudfront response.status=200 OK size=98028196 url=https://registry.redhat.io/v2/redhat/certified-operator-index/blobs/sha256:d242c7b4380d3c9db3ac75680c35f5c23639a388ad9313f263d13af39a9c8b8b level=debug msg=fetch response received digest=sha256:664a8226a152ea0f1078a417f2ec72d3a8f9971e8a374859b486b60049af9f18 mediatype=application/vnd.docker.container.image.v1+json response.header.accept-ranges=bytes response.header.age=17430 response.header.connection=keep-alive response.header.content-length=24828 response.header.content-type=binary/octet-stream response.header.date=Tue, 14 Feb 2023 08:37:35 GMT response.header.etag="57eb6fdca8ce82a837bdc2cebadc3c7b-1" response.header.last-modified=Mon, 13 Feb 2023 16:11:57 GMT response.header.server=AmazonS3 response.header.via=1.1 0c96ded7ff282d2dbcf47c918b6bb500.cloudfront.net (CloudFront) response.header.x-amz-cf-id=w9zLDWvPJ__xbTpI8ba5r9DRsFXbvZ9rSx5iksG7lFAjWIthuokOsA== response.header.x-amz-cf-pop=HIO50-C1 response.header.x-amz-replication-status=COMPLETED response.header.x-amz-server-side-encryption=AES256 response.header.x-amz-version-id=Enw8mLebn4.ShSajtLqdo4riTDHnVEFZ response.header.x-cache=Hit from cloudfront response.status=200 OK size=24828 
url=https://registry.redhat.io/v2/redhat/certified-operator-index/blobs/sha256:664a8226a152ea0f1078a417f2ec72d3a8f9971e8a374859b486b60049af9f18 level=debug msg=fetch response received digest=sha256:130c9d0ca92e54f59b68c4debc5b463674ff9555be1f319f81ca2f23e22de16f mediatype=application/vnd.docker.image.rootfs.diff.tar.gzip response.header.accept-ranges=bytes response.header.age=829779 response.header.connection=keep-alive response.header.content-length=26039246 response.header.content-type=binary/octet-stream response.header.date=Sat, 04 Feb 2023 22:58:25 GMT response.header.etag="a08688b701b31515c6861c69e4d87ebd-1" response.header.last-modified=Tue, 06 Dec 2022 20:50:51 GMT response.header.server=AmazonS3 response.header.via=1.1 000f4a2f631bace380a0afa747a82482.cloudfront.net (CloudFront) response.header.x-amz-cf-id=S-h31zheAEOhOs6uH52Rpq0ZnoRRdd5VfaqVbZWXzAX-Zym-0XtuKA== response.header.x-amz-cf-pop=HIO50-C1 response.header.x-amz-replication-status=COMPLETED response.header.x-amz-server-side-encryption=AES256 response.header.x-amz-storage-class=INTELLIGENT_TIERING response.header.x-amz-version-id=BQOjon.COXTTON_j20wZbWWoDEmGy1__ response.header.x-cache=Hit from cloudfront response.status=200 OK size=26039246 url=https://registry.redhat.io/v2/redhat/certified-operator-index/blobs/sha256:130c9d0ca92e54f59b68c4debc5b463674ff9555be1f319f81ca2f23e22de16f level=debug msg=do request digest=sha256:db8e9d2f583af66157f383f9ec3628b05fa0adb0d837269bc9f89332c65939b9 mediatype=application/vnd.docker.image.rootfs.diff.tar.gzip request.header.accept=application/vnd.docker.image.rootfs.diff.tar.gzip, */* request.header.range=bytes=13417268- request.header.user-agent=opm/alpha request.method=GET size=91700480 url=https://registry.redhat.io/v2/redhat/certified-operator-index/blobs/sha256:db8e9d2f583af66157f383f9ec3628b05fa0adb0d837269bc9f89332c65939b9 level=debug msg=fetch response received digest=sha256:db8e9d2f583af66157f383f9ec3628b05fa0adb0d837269bc9f89332c65939b9 mediatype=application/vnd.docker.image.rootfs.diff.tar.gzip response.header.cache-control=max-age=0, no-cache, no-store response.header.connection=keep-alive response.header.content-length=99 response.header.content-type=application/json response.header.date=Tue, 14 Feb 2023 13:34:06 GMT response.header.docker-distribution-api-version=registry/2.0 response.header.expires=Tue, 14 Feb 2023 13:34:06 GMT response.header.pragma=no-cache response.header.registry-proxy-request-id=0d7ea55f-e96d-4311-885a-125b32c8e965 response.header.www-authenticate=Bearer realm="https://registry.redhat.io/auth/realms/rhcc/protocol/redhat-docker-v2/auth",service="docker-registry",scope="repository:redhat/certified-operator-index:pull" response.status=401 Unauthorized size=91700480 url=https://registry.redhat.io/v2/redhat/certified-operator-index/blobs/sha256:db8e9d2f583af66157f383f9ec3628b05fa0adb0d837269bc9f89332c65939b9.
Expected results:
oc-mirror should always read the credentials when retrying the request.
Description of problem:
Using openshift-install v4.13.0, no issue messages are displayed on the console. Looking at /etc/issue.d/, the issue files are written, they are just not displayed by agetty.
# cat /etc/issue.d/70_agent-services.issue \e{cyan}Waiting for services:\e{reset} [\e{cyan}start\e{reset}] Service that starts cluster installation
Version-Release number of selected component (if applicable):
4.13
How reproducible:
100%
Steps to Reproduce:
1. Build the agent image using openshift-install v4.13.0
2. Mount the ISO and boot a machine
3. Wait for a while until issues are created in /etc/issue.d/
Actual results:
No messages are displayed to console
Expected results:
All messages should be displayed
Additional info:
https://redhat-internal.slack.com/archives/C02SPBZ4GPR/p1686646256441329
When changing platform fields (e.g. the AWS instance type) we trigger a rolling upgrade; however, nothing is signalled in the NodePool state, which results in bad UX.
NodePools should signal a rolling upgrade caused by platform changes, as sketched below.
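A sketch of one way to surface this, using apimachinery's condition helpers to set a NodePool status condition when a platform change drives a rollout; the condition type and reason strings are illustrative assumptions, not HyperShift's actual API.

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setRollingUpgradeCondition records that a platform change (for example a
// new AWS instance type) triggered a rolling upgrade, so the NodePool status
// reflects what is happening.
func setRollingUpgradeCondition(conditions *[]metav1.Condition, generation int64, rolling bool) {
	status := metav1.ConditionFalse
	reason := "AsExpected"
	msg := "No platform-driven rolling upgrade in progress"
	if rolling {
		status = metav1.ConditionTrue
		reason = "PlatformChange"
		msg = "Platform fields changed; nodes are being replaced"
	}
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:               "UpdatingPlatform",
		Status:             status,
		Reason:             reason,
		Message:            msg,
		ObservedGeneration: generation,
	})
}

func main() {
	var conds []metav1.Condition
	setRollingUpgradeCondition(&conds, 3, true)
	fmt.Printf("%+v\n", conds[0])
}
```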
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Description of problem:
The agent installer integration test fails because of a change in the base ISO's kargs.json, which uses fedora-coreos instead of rhcos. As the integration test uses strict checks via the `cmp` function, it fails because "coreos.liveiso=fedora-coreos-38.20230609.3.0" is absent from the expected result of the integration test.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Get latest code from master branch 2. Run ./hack/go-integration-test.sh
Actual results:
INFO[2023-09-01T02:23:01Z] --- FAIL: TestAgentIntegration (369.83s)
    --- FAIL: TestAgentIntegration/agent_pxe_configurations (0.00s)
        --- FAIL: TestAgentIntegration/agent_pxe_configurations/sno (49.93s)
        testscript.go:520: # Verify a default configuration for the SNO topology (49.805s)
            > exec openshift-install agent create pxe-files --dir $WORK
            [stderr]
            level=warning msg=CPUPartitioning: is ignored
            level=info msg=Configuration has 1 master replicas and 0 worker replicas
            level=info msg=The rendezvous host IP (node0 IP) is 192.168.111.20
            level=info msg=Extracting base ISO from release payload
            level=info msg=Verifying cached file
            level=info msg=Using cached Base ISO /.cache/agent/image_cache/coreos-x86_64.iso
            level=info msg=Consuming Install Config from target directory
            level=info msg=Consuming Agent Config from target directory
            level=info msg=Created iPXE script agent.x86_64.ipxe in $WORK/pxe directory
            level=info msg=PXE-files created in: $WORK/pxe
            level=info msg=Kernel parameters for PXE boot: coreos.liveiso=fedora-coreos-38.20230609.3.0 ignition.firstboot ignition.platform.id=metal
            > stderr 'Created iPXE script agent.x86_64.ipxe'
            > exists $WORK/pxe/agent.x86_64-initrd.img
            > exists $WORK/pxe/agent.x86_64-rootfs.img
            > exists $WORK/pxe/agent.x86_64-vmlinuz
            > exists $WORK/auth/kubeconfig
            > exists $WORK/auth/kubeadmin-password
            > cmp $WORK/pxe/agent.x86_64.ipxe $WORK/expected/agent.x86_64.ipxe
            diff $WORK/pxe/agent.x86_64.ipxe $WORK/expected/agent.x86_64.ipxe
            --- $WORK/pxe/agent.x86_64.ipxe
            +++ $WORK/expected/agent.x86_64.ipxe
            @@ -1,4 +1,4 @@
            #!ipxe
            initrd --name initrd http://user-specified-pxe-infra.com/agent.x86_64-initrd.img
            -kernel http://user-specified-pxe-infra.com/agent.x86_64-vmlinuz initrd=initrd coreos.live.rootfs_url=http://user-specified-pxe-infra.com/agent.x86_64-rootfs.img coreos.liveiso=fedora-coreos-38.20230609.3.0 ignition.firstboot ignition.platform.id=metal
            +kernel http://user-specified-pxe-infra.com/agent.x86_64-vmlinuz initrd=initrd coreos.live.rootfs_url=http://user-specified-pxe-infra.com/agent.x86_64-rootfs.img ignition.firstboot ignition.platform.id=metal
            boot

FAIL: testdata/agent/pxe/configurations/sno.txt:13: $WORK/pxe/agent.x86_64.ipxe and $WORK/expected/agent.x86_64.ipxe differ
Expected results:
Test should always pass
Additional info:
Description of problem:
The accessTokenInactivityTimeout configured under tokenConfig in HostedCluster doesn't have any effect:
1. The value is not updated in the oauth-openshift configmap.
2. The hosted cluster allows the user to set an accessTokenInactivityTimeout value < 300s, whereas in a master cluster the value should be > 300s.
Version-Release number of selected component (if applicable):
4.13
How reproducible:
Always
Steps to Reproduce:
1. Install a fresh 4.13 hypershift cluster
2. Configure accessTokenInactivityTimeout as below:
$ oc edit hc -n clusters
...
spec:
  configuration:
    oauth:
      identityProviders:
      ...
      tokenConfig:
        accessTokenInactivityTimeout: 100s
...
3. Check the hcp:
$ oc get hcp -oyaml
...
tokenConfig:
  accessTokenInactivityTimeout: 1m40s
...
4. Login to the guest cluster with testuser-1 and get the token:
$ oc login https://a8890bba21c9b48d4a05096eee8d4edd-738276775c71fb8f.elb.us-east-2.amazonaws.com:6443 -u testuser-1 -p xxxxxxx
$ TOKEN=`oc whoami -t`
$ oc login --token="$TOKEN"
WARNING: Using insecure TLS client config. Setting this option is not supported!
Logged into "https://a8890bba21c9b48d4a05096eee8d4edd-738276775c71fb8f.elb.us-east-2.amazonaws.com:6443" as "testuser-1" using the token provided.
You don't have any projects. You can try to create a new project, by running oc new-project <projectname>
Actual results:
1. The hostedcluster allows the user to set a value < 300s for accessTokenInactivityTimeout, which is not possible on a master cluster.
2. The value is not updated in the oauth-openshift configmap:
   $ oc get cm oauth-openshift -oyaml -n clusters-hypershift-ci-25785
   ...
   tokenConfig:
     accessTokenMaxAgeSeconds: 86400
     authorizeTokenMaxAgeSeconds: 300
   ...
3. Login doesn't fail even if the user is not active for more than the set accessTokenInactivityTimeout seconds.
Expected results:
Login should fail if the user has been inactive for longer than the configured accessTokenInactivityTimeout.
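For reference, a compliant configuration would keep the timeout at or above the 300s floor; a minimal sketch of the same HostedCluster stanza shown in the reproduction steps, with an illustrative value:

  spec:
    configuration:
      oauth:
        tokenConfig:
          accessTokenInactivityTimeout: 600s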
Kube 1.26 introduced the warning level TopologyAwareHintsDisabled event. TopologyAwareHintsDisabled is fired by the EndpointSliceController whenever reconciling a service that has activated topology aware hints via the service.kubernetes.io/topology-aware-hints annotation, but there is not enough information in the existing cluster resources (typically nodes) to apply the topology aware hints.
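For illustration, opting a Service in to topology aware hints looks roughly like the sketch below. The annotation name and value are the standard Kube 1.26 ones; the selector and ports are placeholders, not the real dns-default definition:

  apiVersion: v1
  kind: Service
  metadata:
    name: dns-default
    namespace: openshift-dns
    annotations:
      service.kubernetes.io/topology-aware-hints: auto
  spec:
    selector:
      example: placeholder      # placeholder selector, not the shipped one
    ports:
    - port: 53
      protocol: UDP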
When re-basing OpenShift onto Kube 1.26, our CI builds are failing (except on AWS), because these events are firing "pathologically", for example:
: [sig-arch] events should not repeat pathologically
events happened too frequently event happened 83 times, something is wrong: ns/openshift-dns service/dns-default - reason/TopologyAwareHintsDisabled Insufficient Node information: allocatable CPU or zone not specified on one or more nodes, addressType: IPv4 result=reject
AWS nodes seem to have the proper values in the nodes. GCP has the values also, but they are not "right" for the purposes of the EndpointSliceController:
event happened 38 times, something is wrong: ns/openshift-dns service/dns-default - reason/TopologyAwareHintsDisabled Unable to allocate minimum required endpoints to each zone without exceeding overload threshold (5 endpoints, 3 zones), addressType: IPv4 result=reject
https://github.com/openshift/origin/pull/27666 will mask this problem (make it stop erroring in CI) but changes still need to be made in the product so end users are not subjected to these events.
Now links to:
test=[sig-arch] events should not repeat pathologically for namespace openshift-dns
Description of problem:
The DNS egress router must run as privileged. Given that it is just an haproxy, this doesn't make much sense. If I am not wrong, the biggest reason it needs privileged is the `chroot` option inherited from the default file (https://github.com/openshift/images/blob/master/egress/dns-proxy/egress-dns-proxy.sh#L44). That option doesn't make much sense when we are already inside a container (which is why ingress controllers don't use it, for example). So it may be worth exploring whether this option can be removed and the DNS egress router can run without requiring privileged mode, perhaps with just CAP_NET_BIND_SERVICE.
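A hedged sketch of where this could land if the chroot option were dropped: the container would no longer be privileged and would only add the capability needed to bind low ports. The container name and image reference below are illustrative, not the shipped egress-dns-proxy template:

  containers:
  - name: egress-dns-proxy
    image: <egress-dns-proxy-image>      # illustrative placeholder
    securityContext:
      privileged: false
      capabilities:
        drop: ["ALL"]
        add: ["NET_BIND_SERVICE"]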
Version-Release number of selected component (if applicable):
4.12.0
How reproducible:
Always
Steps to Reproduce:
1. Forget to set privileged mode in the container 2. 3.
Actual results:
Pod cannot start due to chroot setting. I need to run the container as privileged, which lowers security too much.
Expected results:
Run the container without being privileged, maybe adding CAP_NET_BIND_SERVICE.
Additional info:
Description of problem:
migrator pod in `openshift-kube-storage-version-migrator` project stuck in Pending state
Version-Release number of selected component (if applicable):
4.12
How reproducible:
100%
Steps to Reproduce:
1. Add a default cluster-wide node selector with a label that does not match any node label:
   $ oc edit scheduler cluster
   apiVersion: config.openshift.io/v1
   kind: Scheduler
   metadata:
     name: cluster
   ...
   spec:
     defaultNodeSelector: node-role.kubernetes.io/role=app
     mastersSchedulable: false
2. Delete the migrator pod running in openshift-kube-storage-version-migrator:
   $ oc delete pod migrator-6b78665974-zqd47 -n openshift-kube-storage-version-migrator
3. Check whether the migrator pod comes up in a Running state or not:
   $ oc get pods -n openshift-kube-storage-version-migrator
   NAME                        READY   STATUS    RESTARTS   AGE
   migrator-6b78665974-j4jwp   0/1     Pending   0          2m41s
Actual results:
The pod goes into the pending state because it tries to get scheduled on the node having label `node-role.kubernetes.io/role=app`.
Expected results:
The pod should come up in a Running state; it should not be affected by the cluster-wide node selector.
Additional info:
Setting the annotation `openshift.io/node-selector=` into the `openshift-kube-storage-version-migrator` project and then deleting the pending migrator pod helps in bringing the pod up.
The expectation with this bug is that the project `openshift-kube-storage-version-migrator` should have the annotation `openshift.io/node-selector=`, so that the pod running on this project will not get affected by the wrong cluster-wide node-selector configuration.
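In manifest form, the expectation amounts to the following annotation on the namespace (a minimal sketch; only the annotation matters here):

  apiVersion: v1
  kind: Namespace
  metadata:
    name: openshift-kube-storage-version-migrator
    annotations:
      openshift.io/node-selector: ""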
Description of problem:
Various jobs are failing in e2e-gcp-operator due to the LoadBalancer-type Service not going "ready", which means it is most likely not getting an IP address. Tests affected so far:
- TestUnmanagedDNSToManagedDNSInternalIngressController
- TestScopeChange
- TestInternalLoadBalancerGlobalAccessGCP
- TestInternalLoadBalancer
- TestAllowedSourceRanges
For example, in TestInternalLoadBalancer, the Load Balancer never comes back ready:
   operator_test.go:1454: Expected conditions: map[Admitted:True Available:True DNSManaged:True DNSReady:True LoadBalancerManaged:True LoadBalancerReady:True]
   Current conditions: map[Admitted:True Available:False DNSManaged:True DNSReady:False Degraded:True DeploymentAvailable:True DeploymentReplicasAllAvailable:True DeploymentReplicasMinAvailable:True DeploymentRollingOut:False EvaluationConditionsDetected:False LoadBalancerManaged:True LoadBalancerProgressing:False LoadBalancerReady:False Progressing:False Upgradeable:True]
Note DNSReady:False and LoadBalancerReady:False.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
10% of the time
Steps to Reproduce:
1. Run e2e-gcp-operator many times until you see one of these failures
Actual results:
Test Failure
Expected results:
Tests should not fail
Additional info:
Search.CI Links:
TestScopeChange
TestInternalLoadBalancerGlobalAccessGCP & TestInternalLoadBalancer
This does not seem related to https://issues.redhat.com/browse/OCPBUGS-6013. The DNS E2E tests actually pass this same condition check.
Description of problem:
When we merged https://github.com/openshift/cluster-control-plane-machine-set-operator/pull/229, it changed the way failure domains were injected for Azure so that additional fields could be accounted for. However, the CPMS failure domains have Azure zones as a string (which they should be), and the machine v1beta1 spec has them as a string pointer. This means that the CPMS now detects the difference between a nil zone and an empty string, even though every other piece of code in OpenShift treats them the same.
We should update the machine v1beta1 type to remove the pointer. This will be a no-op in terms of the data stored in etcd, since the type is unstructured anyway. It will then require updates to the MAPZ, CPMS, MAO and installer repositories to update their generation.
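To make the mismatch concrete, a hedged sketch of the two serialized forms that are treated as equivalent everywhere except in the CPMS comparison (the surrounding providerSpec layout is illustrative, not copied from a real Azure machine):

  # form A: zone omitted (decodes to a nil *string)
  providerSpec:
    value: {}          # no zone key at all
  # form B: zone present but empty (decodes to a non-nil *string pointing at "")
  providerSpec:
    value:
      zone: ""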
Version-Release number of selected component (if applicable):
4.14 nightlies from the merge of 229 onwards
How reproducible:
This only affects regions in Azure where there are no zones; currently in CI it's affecting about 20% of events.
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The node debug console is not available on all nodes when deploying hypershift on kubevirt using the 'hypershift create cluster kubevirt' default root-volume-size (16 GB).
(.venv) [kni@ocp-edge77 ocp-edge-auto_cluster]$ oc version
Client Version: 4.12.0-0.nightly-2023-04-01-095001
Kustomize Version: v4.5.7
Server Version: 4.12.8
Kubernetes Version: v1.25.7+eab9cc9
happens all the time.
(.venv) [kni@ocp-edge77 ocp-edge-auto_cluster]$ oc debug node/hyper-1-kd7sm
Temporary namespace openshift-debug-5cctb is created for debugging node...
Starting pod/hyper-1-kd7sm-debug ...
To use host binaries, run `chroot /host`
Removing debug pod ...
Temporary namespace openshift-debug-5cctb was removed.
Error from server (BadRequest): container "container-00" in pod "hyper-1-kd7sm-debug" is not available
(.venv) [kni@ocp-edge77 ocp-edge-auto_cluster]$
(.venv) [kni@ocp-edge77 ocp-edge-auto_cluster]$ oc debug node/hyper-1-rkkkm
Temporary namespace openshift-debug-v6xr8 is created for debugging node...
Starting pod/hyper-1-rkkkm-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.128.2.76
If you don't see a command prompt, try pressing enter.
sh-4.4#
1. In the output of the following, note that the node reports DiskPressure:
(.venv) [kni@ocp-edge77 ocp-edge-auto_cluster]$ oc describe node hyper-1-kd7sm
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Sun, 23 Apr 2023 17:27:02 +0300 Sun, 02 Apr 2023 19:45:20 +0300 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure True Sun, 23 Apr 2023 17:27:02 +0300 Sat, 15 Apr 2023 00:10:46 +0300 KubeletHasDiskPressure kubelet has disk pressure
PIDPressure False Sun, 23 Apr 2023 17:27:02 +0300 Sun, 02 Apr 2023 19:45:20 +0300 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Sun, 23 Apr 2023 17:27:02 +0300 Sun, 02 Apr 2023 19:47:53 +0300 KubeletReady kubelet is posting ready status
2. Deploying with a non-default value for --root-volume-size=64 works fine (see the command sketch after this list).
3. [root@ocp-edge44 ~]# oc get catalogsource -n openshift-marketplace
NAME DISPLAY TYPE PUBLISHER AGE
certified-operators Certified Operators grpc Red Hat 27h
community-operators Community Operators grpc Red Hat 27h
mce-custom-registry 2.2.4-DOWNANDBACK-2023-04-20-19-04-35 grpc Red Hat 26h
redhat-marketplace Red Hat Marketplace grpc Red Hat 27h
redhat-operators Red Hat Operators grpc Red Hat 27h
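For item 2, the workaround amounts to passing the flag explicitly at creation time; a hedged sketch of the command, where the other flags are placeholders for whatever the deployment normally uses, not a complete working invocation:

  hypershift create cluster kubevirt \
    --name hyper-1 \
    --node-pool-replicas 3 \
    --pull-secret /path/to/pull-secret.json \
    --root-volume-size 64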
As IBM running HCs, I want to upgrade an existing 4.12 HC suffering from https://issues.redhat.com/browse/OCPBUGS-13639 towards 4.13 and let the private link endpoint use the right security group.
There are automated/documented steps for the HC to end up with the endpoint pointing to the right SG.
A possible semi-automated path would be to manually delete and detach the endpoint from the service, so the next reconciliation loop resets the status: https://github.com/openshift/hypershift/blob/7d24b30c6f79be052404bf23ede7783342f0d0e5/control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go#L410-L444
The one after that would then recreate the endpoint with the right security group: https://github.com/openshift/hypershift/blob/7d24b30c6f79be052404bf23ede7783342f0d0e5/control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go#L470-L525
Note this would produce connectivity downtime while reconciliation happens.
Alternatively we could codify a path to update the endpoint SG when we detect a discrepancy with the hypershift SG.
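If the manual path is chosen, it could be driven with the standard AWS CLI, roughly as below; the endpoint ID is a placeholder, and as noted above connectivity is interrupted until the controller recreates the endpoint:

  # locate the interface endpoint currently attached to the HC's private link service
  aws ec2 describe-vpc-endpoints --filters Name=vpc-endpoint-type,Values=Interface
  # delete it so the next awsprivatelink reconcile resets status and recreates it with the right SG
  aws ec2 delete-vpc-endpoints --vpc-endpoint-ids vpce-0123456789abcdef0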
Description of problem:
The Samples tab is not visible when the Sample Deployment is created, whereas the Snippets tab is visible when `snippet: true` is added to the Sample Deployment. Check the attached file for exact details.
Version-Release number of selected component (if applicable):
4.11.x
How reproducible:
Always
Steps to Reproduce:
1. On CLI, create the Sample Deployment.
2. On the web console, create a Deployment.
3. The Deployment will be created with the details mentioned in the Sample Deployment.
4. The Samples tab must be visible in the YAML view on the web console.
5. Screenshots are attached for reference.
Actual results:
When a Sample Deployment is created with `kind: ConsoleYAMLSample` and `snippet: true`, the Snippets tab shows up. When a Sample Deployment is created with the same details but without `snippet: true`, the "Samples" tab does not show up.
Expected results:
When a Sample Deployment is created with the `kind: ConsoleYAMLSample` and NO `snippet:true`, the "Samples" tab must show up.
Additional info:
When a Sample Deployment is created with `kind: ConsoleYAMLSample`, the "Samples" tab shows up in OCP cluster version 4.10.x; however, it doesn't show up in OCP cluster version 4.11.x. NOTE: The attached file has all the required details.
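For context, a minimal ConsoleYAMLSample of the kind described above with no snippet field set, which is the case where the Samples tab fails to appear on 4.11.x (names and values are illustrative):

  apiVersion: console.openshift.io/v1
  kind: ConsoleYAMLSample
  metadata:
    name: example-deployment-sample
  spec:
    targetResource:
      apiVersion: apps/v1
      kind: Deployment
    title: Example Deployment
    description: A sample Deployment shown in the YAML view.
    yaml: |
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: example
      spec:
        replicas: 1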
Description of problem:
OLMv0 over-uses listers and consumes too much memory. Also, $GOMEMLIMIT is not used and the runtime overcommits on RSS. See the following doc for more detail: https://docs.google.com/document/d/11J7lv1HtEq_c3l6fLTWfsom8v1-7guuG4DziNQDU6cY/edit#heading=h.ttj9tfltxgzt
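As a rough illustration of the GOMEMLIMIT half of this, the Go runtime reads a soft memory limit from an environment variable, so wiring it into the deployment could look like the sketch below; the container name and values are placeholders, not the actual OLM manifests:

  containers:
  - name: olm-operator            # placeholder
    resources:
      limits:
        memory: 200Mi
    env:
    - name: GOMEMLIMIT
      value: "180MiB"             # kept slightly below the container memory limit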
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Currently the 'dump cluster' command requires public access to the guest cluster to dump its contents. It should be possible for it to access the guest cluster via the kube-apiserver service on the mgmt cluster. This would enable it for private clusters as well.
Currently we save to the filesystem every installer binary we have ever needed. When users use many different versions, the pod reaches its storage limit, since each binary is ~500 MB.
We should add a TTL to the installer cache and remove binaries that are no longer used.
We need to validate that we are able to recover a hosted cluster's etcd (backed by storage such as LVM or HPP) when an underlying management cluster node disappears.
In this scenario, we need to understand what happens when an etcd instance fails and the underlying PVC is permanently gone. Will the etcd operator be able to detect this and recover, or will the etcd cluster in question remain in a degraded state indefinitely? Those are the types of questions that need answers, which will help guide the next steps for supporting local storage for etcd.
In the interest of shipping 4.13, we landed a snapshot of nmstate code with some logic for NIC name pinning.
In https://github.com/nmstate/nmstate/commit/03c7b03bd4c9b0067d3811dbbf72635201519356 a few changes were made.
TODO elaborate in this issue what bugs are fixed
This issue is tracking the merge of https://github.com/openshift/machine-config-operator/pull/3685 which was also aiming to ensure 4.14 is compatible.
I recently noticed that the cluster-autoscaler pod in the hosted control plane namespace is going through continuous restarts. Upon investigating the issue, I found that the liveness and readiness probes are failing on this pod.
Also, checking the logs of this pod further points to missing RBAC for the cluster-autoscaler in this case. Please see the log trace for reference.
E0215 14:52:59.936182 1 reflector.go:140] k8s.io/client-go/dynamic/dynamicinformer/informer.go:91: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: agentmachinetemplates.capi-provider.agent-install.openshift.io is forbidden: User "system:serviceaccount:clusters-hcp01:cluster-autoscaler" cannot list resource "agentmachinetemplates" in API group "capi-provider.agent-install.openshift.io" in the namespace "clusters-hcp01"
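The forbidden error suggests the fix is to grant the autoscaler's service account read access to that resource in the hosted control plane namespace; a hedged sketch of what such RBAC could look like (the object names are illustrative, and the real grant belongs in the HyperShift operator's generated manifests):

  apiVersion: rbac.authorization.k8s.io/v1
  kind: Role
  metadata:
    name: cluster-autoscaler-agentmachinetemplates   # illustrative name
    namespace: clusters-hcp01
  rules:
  - apiGroups: ["capi-provider.agent-install.openshift.io"]
    resources: ["agentmachinetemplates"]
    verbs: ["get", "list", "watch"]
  ---
  apiVersion: rbac.authorization.k8s.io/v1
  kind: RoleBinding
  metadata:
    name: cluster-autoscaler-agentmachinetemplates
    namespace: clusters-hcp01
  roleRef:
    apiGroup: rbac.authorization.k8s.io
    kind: Role
    name: cluster-autoscaler-agentmachinetemplates
  subjects:
  - kind: ServiceAccount
    name: cluster-autoscaler
    namespace: clusters-hcp01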
Description of problem:
Business Automation operands fail to load in the uninstall operator modal, with the alert message "Cannot load Operands. There was an error loading operands for this operator. Operands will need to be deleted manually...". The "Delete all operand instances for this operator__checkbox" is not shown, so the test fails. https://search.ci.openshift.org/?search=Testing+uninstall+of+Business+Automation+Operator&maxAge=168h&context=1&type=junit&name=pull-ci-openshift-console-master-e2e-gcp-console&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Need to follow up HOSTEDCP-1065 with an e2e to test ControlPlaneRelease functionality:
Test should:
`ec2:ReleaseAddress` is documented as a required permission for the NodePool management policy: https://github.com/openshift/hypershift/blob/main/api/v1beta1/hostedcluster_types.go#L1285
This is too permissive and the permission will at least need a condition to scope it. However, it may not be used by the NodePool controller at all. In that case, this permission should be removed.
Done Criteria:
DoD:
Either enforce immutability in the API via CEL, or add first-class support for mutability, i.e. enable node rollout when the field changes.
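If the CEL route is taken, the usual pattern is a transition rule on the field in the CRD schema; a minimal sketch, with a placeholder field name standing in for whichever field this story covers:

  # inside the CRD's openAPIV3Schema
  properties:
    exampleField:               # placeholder field name
      type: string
      x-kubernetes-validations:
      - rule: "self == oldSelf"
        message: "exampleField is immutable"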
This is a clone of issue OCPBUGS-19052. The following is the description of the original issue:
—
Description of problem:
With OCPBUGS-18274 we had to update the etcdctl binary. Unfortunately, the script does not attempt to update the binary if it's already found in the path: https://github.com/openshift/cluster-etcd-operator/blob/master/bindata/etcd/etcd-common-tools#L16-L24
This causes confusion, as the binary might not be the latest that we're shipping with etcd. Pulling the binary shouldn't be a big deal: etcd is running locally anyway, and the local image should already be cached just fine. We should always replace the binary.
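A hedged sketch of the proposed behaviour, not the actual etcd-common-tools content: drop the "already installed" short-circuit and always copy the binary out of the locally cached etcd image (the image variable and in-image path are assumptions here):

  # instead of: command -v etcdctl >/dev/null && echo "etcdctl is already installed" && return
  id=$(podman create "${ETCD_IMAGE}")      # ETCD_IMAGE assumed to be resolved earlier in the script
  podman cp "${id}:/usr/bin/etcdctl" /usr/local/bin/etcdctl
  podman rm -f "${id}" >/dev/null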
Version-Release number of selected component (if applicable):
any currently supported release
How reproducible:
always
Steps to Reproduce:
1. Run cluster-backup.sh to download the binary
2. Update the etcd image (take a different version or so)
3. Run cluster-backup.sh again
Actual results:
cluster-backup.sh will simply print "etcdctl is already installed"
Expected results:
etcdctl should always be pulled
Additional info:
I have a console extension (https://github.com/gnunn1/dev-console-plugin) that simply adds the Topology and Add+ views to the Admin perspective but otherwise should expose no modules. However, if I try to build this extension without any exposedModules, the webpack assembly fails with the stack trace below.
As a workaround, I'm leaving in the example module from the template and just not adding it to the OpenShift menu.
$ yarn run build main
yarn run v1.22.19
$ yarn clean && NODE_ENV=production yarn ts-node node_modules/.bin/webpack
$ rm -rf dist
$ ts-node -O '{"module":"commonjs"}' node_modules/.bin/webpack
[webpack-cli] HookWebpackError: Called Compilation.updateAsset for not existing filename plugin-entry.js
    at makeWebpackError (/home/gnunn/Development/openshift/dev-console-plugin/node_modules/webpack/lib/HookWebpackError.js:48:9)
    at /home/gnunn/Development/openshift/dev-console-plugin/node_modules/webpack/lib/Compilation.js:3058:12
    at eval (eval at create (/home/gnunn/Development/openshift/dev-console-plugin/node_modules/tapable/lib/HookCodeFactory.js:33:10), <anonymous>:41:1)
    at fn (/home/gnunn/Development/openshift/dev-console-plugin/node_modules/webpack/lib/Compilation.js:479:17)
    at _next0 (eval at create (/home/gnunn/Development/openshift/dev-console-plugin/node_modules/tapable/lib/HookCodeFactory.js:33:10), <anonymous>:39:1)
    at eval (eval at create (/home/gnunn/Development/openshift/dev-console-plugin/node_modules/tapable/lib/HookCodeFactory.js:33:10), <anonymous>:52:1)
    at eval (eval at create (/home/gnunn/Development/openshift/dev-console-plugin/node_modules/tapable/lib/HookCodeFactory.js:33:10), <anonymous>:13:1)
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
-- inner error --
Error: Called Compilation.updateAsset for not existing filename plugin-entry.js
    at Compilation.updateAsset (/home/gnunn/Development/openshift/dev-console-plugin/node_modules/webpack/lib/Compilation.js:4298:10)
    at /home/gnunn/Development/openshift/dev-console-plugin/node_modules/src/webpack/ConsoleAssetPlugin.ts:82:23
    at fn (/home/gnunn/Development/openshift/dev-console-plugin/node_modules/webpack/lib/Compilation.js:477:10)
    at _next0 (eval at create (/home/gnunn/Development/openshift/dev-console-plugin/node_modules/tapable/lib/HookCodeFactory.js:33:10), <anonymous>:39:1)
    at eval (eval at create (/home/gnunn/Development/openshift/dev-console-plugin/node_modules/tapable/lib/HookCodeFactory.js:33:10), <anonymous>:52:1)
    at eval (eval at create (/home/gnunn/Development/openshift/dev-console-plugin/node_modules/tapable/lib/HookCodeFactory.js:33:10), <anonymous>:13:1)
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
caused by plugins in Compilation.hooks.processAssets
Error: Called Compilation.updateAsset for not existing filename plugin-entry.js
    at Compilation.updateAsset (/home/gnunn/Development/openshift/dev-console-plugin/node_modules/webpack/lib/Compilation.js:4298:10)
    at /home/gnunn/Development/openshift/dev-console-plugin/node_modules/src/webpack/ConsoleAssetPlugin.ts:82:23
    at fn (/home/gnunn/Development/openshift/dev-console-plugin/node_modules/webpack/lib/Compilation.js:477:10)
    at _next0 (eval at create (/home/gnunn/Development/openshift/dev-console-plugin/node_modules/tapable/lib/HookCodeFactory.js:33:10), <anonymous>:39:1)
    at eval (eval at create (/home/gnunn/Development/openshift/dev-console-plugin/node_modules/tapable/lib/HookCodeFactory.js:33:10), <anonymous>:52:1)
    at eval (eval at create (/home/gnunn/Development/openshift/dev-console-plugin/node_modules/tapable/lib/HookCodeFactory.js:33:10), <anonymous>:13:1)
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
error Command failed with exit code 2.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
error Command failed with exit code 2.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
We enabled balance similar node groups via https://issues.redhat.com/browse/OCPBUGS-15769
We should include a validation for this behaviour in our e2e autoscaler testing.
We can probably reuse what we do in the Machine API test: https://github.com/openshift/cluster-api-actuator-pkg/blob/77764237f2e6160d95990dc905b8e87662bc4d16/pkg/autoscaler/autoscaler.go#L437
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.