Jump to: Complete Features | Incomplete Features | Complete Epics | Incomplete Epics | Other Complete | Other Incomplete |
Note: this page shows the Feature-Based Change Log for a release
These features were completed when this image was assembled
Feature Overview
Insights Advisor for OpenShift is integrated within OpenShift Cluster Manager. This has some limitations for adding new features and also for sharing codebase between RHEL Advisor and OCM Insights Advisor tab. Insights Advisor for OpenShift lacks certain features from the RHEL UI, the codebase is not 1:1 clone.
As a customer of Insights I will have same/very similar user experience with Insights for OpenShift and Insights for RHEL. The workflows will share the main concepts, the UI elements will be same and features introduced to Advisor will be automatically considered for both all supported platforms.
As OpenShift users I will still see integrations of Insights Advisor within OpenShift Cluster Manager that shows aggregated information for customer account and single cluster view on Advisor data. These integration will point to new Insights Advisor for OpenShift app that will be tightly integrated into OpenShift Cluster Manager.
Goals
Requirements
Benefits
Questions to answer...
Out of Scope
Background, and strategic fit
Documentation Considerations
OCP WebConsole, in the main dashboard, has an Insights Advisor widget, which has been redirecting users to OCM. Due to the Insights Advisor tab decommission in OCM, the links should point to Advisor instead.
4.10 code freeze = 28 January (marking the task as urgent)
Problem:
Certain Insights Advisor features differentiate between RHEL and OCP advisor
Goal:
Address top priority UI misalignments between RHEL and OCP advisor. Address UI features dropped from Insights ADvisor for OCP GA.
Scope:
Specific tasks and priority of them tracked in https://issues.redhat.com/browse/CCXDEV-7432
This contains all the Insights Advisor widget deliverables for the OCP release 4.11.
Scope
It covers only minor bug fixes and improvements:
Scenario: Check if the Insights Advisor widget in the OCP WebConsole UI shows the time of the last data analysis Given: OCP WebConsole UI and the cluster dashboard is accessible And: CCX external data pipeline is in a working state And: administrator A1 has access to his cluster's dashboard And: Insights Operator for this cluster is sending archives When: administrator A1 clicks on the Insights Advisor widget Then: the results of the last analysis are showed in the Insights Advisor widget And: the time of the last analysis is shown in the Insights Advisor widget
Acceptance criteria:
max_over_time(timestamp(changes(insightsclient_request_send_total\{status_code="202"}[1m]) > 0)[24h:1m])
Show the error message (mocked in CCXDEV-5868) if the Prometheus metrics `cluster_operator_conditions{name="insights"}` contain two true conditions: UploadDegraded and Degraded at the same time. This state occurs if there was an IO archive upload error = problems with the pipeline.
Expected for 4.11 OCP release.
Cloning the existing rule should end up with a new rule in the same namespace.
Modifications can now be done to the new rule.
(Optional) You can silence the existing rule.
Create a new PrometheusRule object inside the namespace that includes the metrics you need to form the alerting rule.
CMO should reconcile the platform Prometheus configuration with the alert-relabel-config resources.
DoD
CMO should reconcile the platform Prometheus configuration with the AlertingRule resources.
DoD
Managing PVs at scale for a fleet creates difficulties where "one size does not fit all". The ability for SRE to deploy prometheus with PVs and have retention based an on a desired size would enable easier management of these volumes across the fleet.
The prometheus-operator exposes retentionSize.
Field | Description |
---|---|
retentionSize | Maximum amount of disk space used by blocks. Supported units: B, KB, MB, GB, TB, PB, EB. Ex: 512MB. |
This is a feature request to enable this configuration option via CMO cluster-monitoring-config ConfigMap.
Today, all configuration for setting individual, for example, routing configuration is done via a single configuration file that only admins have access to. If an environment uses multiple tenants and each tenant, for example, has different systems that they are using to notify teams in case of an issue, then someone needs to file a request w/ an admin to add the required settings.
That can be bothersome for individual teams, since requests like that usually disappear in the backlog of an administrator. At the same time, administrators might get tons of requests that they have to look at and prioritize, which takes them away from more crucial work.
We would like to introduce a more self service approach whereas individual teams can create their own configuration for their needs w/o the administrators involvement.
Last but not least, since Monitoring is deployed as a Core service of OpenShift there are multiple restrictions that the SRE team has to apply to all OSD and ROSA clusters. One restriction is the ability for customers to use the central Alertmanager that is owned and managed by the SRE team. They can't give access to the central managed secret due to security concerns so that users can add their own routing information.
Provide a new API (based on the Operator CRD approach) as part of the Prometheus Operator that allows creating a subset of the Alertmanager configuration without touching the central Alertmanager configuration file.
Please note that we do not plan to support additional individual webhooks with this work. Customers will need to deploy their own version of the third party webhooks.
Team A wants to send all their important notifications to a specific Slack channel.
As described in https://github.com/openshift/enhancements/blob/ba3dc219eecc7799f8216e1d0234fd846522e88f/enhancements/monitoring/multi-tenant-alerting.md#distinction-between-platform-and-user-alerts, cluster admins want to distinguish platform alerts from user alerts. For this purpose, CMO should provision an external label (openshift_io_alert_source="platform") on prometheus-k8s instances.
* CI - CI is running, tests are automated and merged.
* Release Enablement <link to Feature Enablement Presentation>
* DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
* DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
* DEV - Downstream build attached to advisory: <link to errata>
* QE - Test plans in Polarion: <link or reference to Polarion>
* QE - Automated tests merged: <link or reference to automated tests>
* DOC - Downstream documentation merged: <link to meaningful PR>
Now that upstream supports AlertmanagerConfig v1beta1 (see MON-2290 and https://github.com/prometheus-operator/prometheus-operator/pull/4709), it should be deployed by CMO.
DoD:
DoD
DoD
Copy/paste from [_https://github.com/openshift-cs/managed-openshift/issues/60_]
Which service is this feature request for?
OpenShift Dedicated and Red Hat OpenShift Service on AWS
What are you trying to do?
Allow ROSA/OSD to integrate with AWS Managed Prometheus.
Describe the solution you'd like
Remote-write of metrics is supported in OpenShift but it does not work with AWS Managed Prometheus since AWS Managed Prometheus requires AWS SigV4 auth.
Describe alternatives you've considered
There is the workaround to use the "AWS SigV4 Proxy" but I'd think this is not properly supported by RH.
https://mobb.ninja/docs/rosa/cluster-metrics-to-aws-prometheus/
Additional context
The customer wants to use an open and portable solution to centralize metrics storage and analysis. If they also deploy to other clouds, they don't want to have to re-configure. Since most clouds offer a Prometheus service (or it's easy to self-manage Prometheus), app migration should be simplified.
The cluster monitoring operator should allow OpenShift customers to configure remote write with all authentication methods supported by upstream Prometheus.
We will extend CMO's configuration API to support the following authentications with remote write:
Customers want to send metrics to AWS Managed Prometheus that require sigv4 authentication (see https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-secure-metric-ingestion.html#AMP-secure-auth).
Prometheus and Prometheus operator already support custom Authorization for remote write. This should be possible to configure the same in the CMO configuration:
apiVersion: v1
kind: ConfigMap
metadata:
name: cluster-monitoring-config
namespace: openshift-monitoring
data:
config.yaml: |
prometheusK8s:
remoteWrite:
- url: "https://remote-write.endpoint"
Authorization:
type: Bearer
credentials:
name: credentials
key: token
DoD:
Prometheus and Prometheus operator already support sigv4 authentication for remote write. This should be possible to configure the same in the CMO configuration:
apiVersion: v1
kind: ConfigMap
metadata:
name: cluster-monitoring-config
namespace: openshift-monitoring
data:
config.yaml: |
prometheusK8s:
remoteWrite:
- url: "https://remote-write.endpoint"
sigv4:
accessKey:
name: aws-credentialss
key: access
secretKey:
name: aws-credentials
key: secret
profile: "SomeProfile"
roleArn: "SomeRoleArn"
DoD:
As WMCO user, I want to make sure containerd logging information has been updated in documents and scripts.
Configure audit logging to capture login, logout and login failure details
TODO(PM): update this
Customer who needs login, logout and login failure details inside the openshift container platform.
I have checked for this on my test cluster but the audit logs do not contain any user name specifying login or logout details. For successful logins or logout, on CLI and openshift console as well we can see 'Login successful' or 'Invalid credentials'.
Expected results: Login, logout and login failures should be captured in audit logging.
The apiserver pods today have ´/var/log/<kube|oauth|openshift>-apiserver` mounted from the host and create audit files there using the upstream audit event format (JSON lines following https://github.com/kubernetes/apiserver/blob/92392ef22153d75b3645b0ae339f89c12767fb52/pkg/apis/audit/v1/types.go#L72). These events are apiserver specific, but as oauth authentication flow events are also requests, we can use the apiserver event format to log logins, login failures and logouts. Hence, we propose to make oauth-server to create /var/log/oauth-server/audit.log files on the master nodes using that format.
When the login flow does not finish within a certain time (e.g. 10min), we can artificially create an event to show a login failure in the audit logs.
Right now there's no way to generate audit logs from this.
Right now there's no way to generate audit logs from this.
Let the Cluster Authentication Operator deliver the policy to OAuthServer.
In order to know if authn events should be logged, OAuthServer needs to be aware of it.
* Stanislav LázničkaCreate an observer to deliver the audit policy to the oauth server
Make the authentication-operator react to the new audit field in the oauth.config/cluster object. Write an observer watching this field, such an observer will translate the top-level configuration into oauth-server config and add it to the rest of the observed config.
OCP/Telco Definition of Done
Feature Template descriptions and documentation.
Early customer feedback is that they see SNO as a great solution covering smaller footprint deployment, but are wondering what is the evolution story OpenShift is going to provide where more capacity or high availability are needed in the future.
While migration tooling (moving workload/config to new cluster) could be a mid-term solution, customer desire is not to include extra hardware to be involved in this process.
For Telecommunications Providers, at the Far Edge they intend to start small and then grow. Many of these operators will start with a SNO-based DU deployment as an initial investment, but as DUs evolve, different segments of the radio spectrum are added, various radio hardware is provisioned and features delivered to the Far Edge, the Telecommunication Providers desire the ability for their Far Edge deployments to scale up from 1 node to 2 nodes to n nodes. On the opposite side of the spectrum from SNO is MMIMO where there is a robust cluster and workloads use HPA.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Questions to be addressed:
This is a ticket meant to track all the all the OCP PRs that are involved in the implementation of the SNO + workers enhancement
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
Rebase openshift/builder to k8s 1.24
4.11 MVP Requirements
Out of scope use cases (that are part of the Kubeframe/factory project):
Questions to be addressed:
As a deployer, I want to be able to:
so that I can achieve
Currently the Assisted Service generates the credentials by running the ignition generation step of the oepnshift-installer. This is why the credentials are only retrievable from the REST API towards the end of the installation.
In the BILLI usage, which takes down assisted service before the installation is complete there is no obvious point at which to alert the user that they should retrieve the credentials. This means that we either need to:
This requires/does not require a design proposal.
This requires/does not require a feature gate.
The AWS-specific code added in OCPPLAN-6006 needs to become GA and with this we want to introduce a couple of Day2 improvements.
Currently the AWS tags are defined and applied at installation time only and saved in the infrastructure CRD's status field for further operator use, which in turn just add the tags during creation.
Saving in the status field means it's not included in Velero backups, which is a crucial feature for customers and Day2.
Thus the status.resourceTags field should be deprecated in favour of a newly created spec.resourceTags with the same content. The installer should only populate the spec, consumers of the infrastructure CRD must favour the spec over the status definition if both are supplied, otherwise the status should be honored and a warning shall be issued.
Being part of the spec, the behaviour should also tag existing resources that do not have the tags yet and once the tags in the infrastructure CRD are changed all the AWS resources should be updated accordingly.
On AWS this can be done without re-creating any resources (the behaviour is basically an upsert by tag key) and is possible without service interruption as it is a metadata operation.
Tag deletes continue to be out of scope, as the customer can still have custom tags applied to the resources that we do not want to delete.
Due to the ongoing intree/out of tree split on the cloud and CSI providers, this should not apply to clusters with intree providers (!= "external").
Once confident we have all components updated, we should introduce an end2end test that makes sure we never create resources that are untagged.
After that, we can remove the experimental flag and make this a GA feature.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
List any affected packages or components.
RFE-1101 described user defined tags for AWS resources provisioned by an OCP cluster. Currently user can define tags which are added to the resources during creation. These tags cannot be updated subsequently. The propagation of the tags is controlled using experimental flag. Before this feature goes GA we should define and implement a mechanism to exclude any experimental flags. Day2 operations and deletion of tags is not in the scope.
RFE-2012 aims to make the user-defined resource tags feature GA. This means that user defined tags should be updatable.
Currently the user-defined tags during install are passed directly as parameters of the Machine and Machineset resources for the master and worker. As a result these tags cannot be updated by consulting the Infrastructure resource of the cluster where the user defined tags are written.
The MCO should be changed such that during provisioning the MCO looks up the values of the tags in the Infrastructure resource and adds the tags during creation of the EC2 resources. The MCO should also watch the infrastructure resource for changes and when the resource tags are updated it should update the tags on the EC2 instances without restarts.
Acceptance Criteria:
OCP/Telco Definition of Done
Feature Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Feature --->
<--- Remove the descriptive text as appropriate --->
Problem
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Questions to be addressed:
Running the OPCT with the latest version (v0.1.0) on OCP 4.11.0, the openshift-tests is reporting an incorrect counter for the "total" field.
In the example below, after the 1127th test, the total follows the same counter of executed. I also would assume that the total is incorrect before that point as the test continues the execution increases both counters.
openshift-tests output format: [failed/executed/total]
started: (0/1126/1127) "[sig-storage] PersistentVolumes-expansion loopback local block volume should support online expansion on node [Suite:openshift/conformance/parallel] [Suite:k8s]" passed: (38s) 2022-08-09T17:12:21 "[sig-storage] In-tree Volumes [Driver: nfs] [Testpattern: Dynamic PV (default fs)] provisioning should provision storage with mount options [Suite:openshift/conformance/parallel] [Suite:k8s]" started: (0/1127/1127) "[sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: tmpfs] [Testpattern: Generic Ephemeral-volume (block volmode) (late-binding)] ephemeral should support two pods which have the same volume definition [Suite:openshift/conformance/parallel] [Suite:k8s]" passed: (6.6s) 2022-08-09T17:12:21 "[sig-storage] Downward API volume should provide container's memory request [NodeConformance] [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]" started: (0/1128/1128) "[sig-storage] In-tree Volumes [Driver: cinder] [Testpattern: Dynamic PV (immediate binding)] topology should fail to schedule a pod which has topologies that conflict with AllowedTopologies [Suite:openshift/conformance/parallel] [Suite:k8s]" skip [k8s.io/kubernetes@v1.24.0/test/e2e/storage/framework/testsuite.go:116]: Driver local doesn't support GenericEphemeralVolume -- skipping Ginkgo exit error 3: exit with code 3 skipped: (400ms) 2022-08-09T17:12:21 "[sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: tmpfs] [Testpattern: Generic Ephemeral-volume (block volmode) (late-binding)] ephemeral should support two pods which have the same volume definition [Suite:openshift/conformance/parallel] [Suite:k8s]" started: (0/1129/1129) "[sig-storage] In-tree Volumes [Driver: emptydir] [Testpattern: Dynamic PV (default fs)] capacity provides storage capacity information [Suite:openshift/conformance/parallel] [Suite:k8s]"
OPCT output format [executed/total (failed failures)]
Tue, 09 Aug 2022 14:12:13 -03> Global Status: running JOB_NAME | STATUS | RESULTS | PROGRESS | MESSAGE openshift-conformance-validated | running | | 1112/1127 (0 failures) | status=running openshift-kube-conformance | complete | | 352/352 (0 failures) | waiting for post-processor... Tue, 09 Aug 2022 14:12:23 -03> Global Status: running JOB_NAME | STATUS | RESULTS | PROGRESS | MESSAGE openshift-conformance-validated | running | | 1120/1127 (0 failures) | status=running openshift-kube-conformance | complete | | 352/352 (0 failures) | waiting for post-processor... Tue, 09 Aug 2022 14:12:33 -03> Global Status: running JOB_NAME | STATUS | RESULTS | PROGRESS | MESSAGE openshift-conformance-validated | running | | 1139/1139 (0 failures) | status=running openshift-kube-conformance | complete | | 352/352 (0 failures) | waiting for post-processor... Tue, 09 Aug 2022 14:12:43 -03> Global Status: running JOB_NAME | STATUS | RESULTS | PROGRESS | MESSAGE openshift-conformance-validated | running | | 1185/1185 (0 failures) | status=running openshift-kube-conformance | complete | | 352/352 (0 failures) | waiting for post-processor... Tue, 09 Aug 2022 14:12:53 -03> Global Status: running JOB_NAME | STATUS | RESULTS | PROGRESS | MESSAGE openshift-conformance-validated | running | | 1188/1188 (0 failures) | status=running openshift-kube-conformance | complete | | 352/352 (0 failures) | waiting for post-processor...
Enable sharing ConfigMap and Secret across namespaces
Requirement | Notes | isMvp? |
---|---|---|
Secrets and ConfigMaps can get shared across namespaces | YES |
NA
NA
Consumption of RHEL entitlements has been a challenge on OCP 4 since it moved to a cluster-based entitlement model compared to the node-based (RHEL subscription manager) entitlement mode. In order to provide a sufficiently similar experience to OCP 3, the entitlement certificates that are made available on the cluster (OCPBU-93) should be shared across namespaces in order to prevent the need for cluster admin to copy these entitlements in each namespace which leads to additional operational challenges for updating and refreshing them.
Questions to be addressed:
* What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
* Does this feature have doc impact?
* New Content, Updates to existing content, Release Note, or No Doc Impact
* If unsure and no Technical Writer is available, please contact Content Strategy.
* What concepts do customers need to understand to be successful in [action]?
* How do we expect customers will use the feature? For what purpose(s)?
* What reference material might a customer want/need to complete [action]?
* Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
* What is the doc impact (New Content, Updates to existing content, or Release Note)?
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
As a developer using OpenShift
I want to mount a Simple Content Access certificate into my build
So that I can access RHEL content within a Docker strategy build.
As a application developer or administrator
I want to share credentials across namespaces
So that I don't need to copy credentials to every workspace
As a cluster admin
I want the cluster storage operator to install the shared resources CSI driver
So that I can test the shared resources CSI driver on my cluster
Docs will need to identify how to install the shared resources CSI driver (by enabling the tech preview feature set)
Tasks:
Note that to be able to test all of this on any cloud provider, we need STOR-616 to be implemented. We can work around this by making the CSI driver installable on AWS or GCP for testing purposes.
The cluster storage operator has cluster-admin permissions. However, no other CSI driver managed by the operator includes a CRD for its API.
As an OpenShift engineer
I want to know which clusters are using the Shared Resource CSI Driver
So that I can be proactive in supporting customers who are using this tech preview feature
None - metrics exported to telemetry are not formally documented.
QE can verify that the query/recording rule for cluster monitoring operator returns data if the cluster has the Shared Resource CSI driver installed and utilizes a SharedSecret or SharedConfigMap in a pod/workload.
Insights rules can potentially be created off of these exported metrics. This would allow CEE to identify which clusters are using SharedSecrets or SharedConfigMaps, especially if we are exporting mount failure metrics.
To implement, a prometheus query/recording rule needs to be added to the cluster monitoring operator. Once approved by the monitoring team, the metric data will be available on DataHub once 4.10 clusters are installed with the updated version of the monitoring operator.
This Feature is a general "catch all" for the time being. There are a number of existing priorities from Q1 that should be aligned with existing priorities below but if not, assign to this feature as needed.
In order to get a better overall portfolio view, we'll leverage this Feature to gather work that doesn't fall into other existing priorities on this board. As this list grows, the portfolio priority grooming team will look to split out or handle appropriately.
A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.
requirement | Notes | isMvp |
---|---|---|
< How will the user interact with this feature? >
< Which users will use this and when will they use it? >
< Is this feature used as part of current user interface? >
< What does the person writing code, testing, documenting need to know? >
< Are there assumptions being made regarding prerequisites and dependencies?>
< Are there assumptions about hardware, software or people resources?>
< Are there specific customer environments that need to be considered (such as working with existing h/w and software)?>
< What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)? >
<What does success look like?>
< Does this feature have doc impact? Possible values are: New Content, Updates to existing content, Release Note, or No Doc Impact?>
<If unsure and no Technical Writer is available, please contact Content Strategy. If yes, complete the following.>
Question | Outcome |
Console provides support UI for operators which is dynamically enabled when the operator is installed; by using feature flags against presence of CRDs. While operators have their own release cadence separately from OpenShift which makes for alignment of UI to API difficult. As new features are released for the operator, the UI becomes out of sync with APIs and customers must wait till the following OpenShift release to get any new UI.
Console extensions:
https://docs.google.com/document/d/1HW5_cl6cOX5P14PQN-1_8c60o9dMY6HbFDRftH6aTno/edit
Dynamic Plugins:
https://docs.google.com/document/d/19BAFo_8BtMZVvKsU-bE61bZpSydeYONkCMWntMU9NgE/edit
Enhancement proposal:
https://github.com/openshift/enhancements/pull/441
When this image was assembled, these features were not yet completed. Therefore, only the Jira Cards included here are part of this release
We drive OpenShift cross-market customer success and new customer adoption with constant improvements and feature additions to the existing capabilities of our OpenShift Core Networking (SDN and Network Edge). This feature captures that natural progression of the product.
There are definitely grey areas, but in general:
Questions to be addressed:
User Story: As a customer in a highly regulated environment, I need the ability to secure DNS traffic when forwarding requests to upstream resolvers so that I can ensure additional DNS traffic and data privacy.
Create a PR in openshift/cluster-ingress-operator to implement configurable router probe timeouts.
The PR should include the following:
tldr: three basic claims, the rest is explanation and one example
While bugs are an important metric, fixing bugs is different than investing in maintainability and debugability. Investing in fixing bugs will help alleviate immediate problems, but doesn't improve the ability to address future problems. You (may) get a code base with fewer bugs, but when you add a new feature, it will still be hard to debug problems and interactions. This pushes a code base towards stagnation where it gets harder and harder to add features.
One alternative is to ask teams to produce ideas for how they would improve future maintainability and debugability instead of focusing on immediate bugs. This would produce designs that make problem determination, bug resolution, and future feature additions faster over time.
I have a concrete example of one such outcome of focusing on bugs vs quality. We have resolved many bugs about communication failures with ingress by finding problems with point-to-point network communication. We have fixed the individual bugs, but have not improved the code for future debugging. In so doing, we chase many hard to diagnose problem across the stack. The alternative is to create a point-to-point network connectivity capability. this would immediately improve bug resolution and stability (detection) for kuryr, ovs, legacy sdn, network-edge, kube-apiserver, openshift-apiserver, authentication, and console. Bug fixing does not produce the same impact.
We need more investment in our future selves. Saying, "teams should reserve this" doesn't seem to be universally effective. Perhaps an approach that directly asks for designs and impacts and then follows up by placing the items directly in planning and prioritizing against PM feature requests would give teams the confidence to invest in these areas and give broad exposure to systemic problems.
Relevant links:
Per the 4.6.30 Monitoring DNS Post Mortem, we should add E2E tests to openshift/cluster-dns-operator to reduce the risk that changes to our CoreDNS configuration break DNS resolution for clients.
To begin with, we add E2E DNS testing for 2 or 3 client libraries to establish a framework for testing DNS resolvers; the work of adding additional client libraries to this framework can be left for follow-up stories. Two common libraries are Go's resolver and glibc's resolver. A somewhat common library that is known to have quirks is musl libc's resolver, which uses a shorter timeout value than glibc's resolver and reportedly has issues with the EDNS0 protocol extension. It would also make sense to test Java or other popular languages or runtimes that have their own resolvers.
Additionally, as talked about in our DNS Issue Retro & Testing Coverage meeting on Feb 28th 2024, we also decided to add a test for testing a non-EDNS0 query for a larger than 512 byte record, as once was an issue in bug OCPBUGS-27397.
The ultimate goal is that the test will inform us when a change to OpenShift's DNS or networking has an effect that may impact end-user applications.
In OCP 4.8 the router was changed to use the "random" balancing algorithm for non-passthrough routes by default. It was previously "leastconn".
Bug https://bugzilla.redhat.com/show_bug.cgi?id=2007581 shows that using "random" by default incurs significant memory overhead for each backend that uses it.
PR https://github.com/openshift/cluster-ingress-operator/pull/663
reverted the change and made "leastconn" the default again (OCP 4.8 onwards).
The analysis in https://bugzilla.redhat.com/show_bug.cgi?id=2007581#c40 shows that the default haproxy behaviour is to multiply the weight (specified in the route CR) by 16 as it builds its data structures for each backend. If no weight is specified then openshift-router sets the weight to 256. If you have many, many thousands of routes then this balloons quickly and leads to a significant increase in memory usage, as highlighted by customer cases attached to BZ#2007581.
The purpose of this issue is to both explore changing the openshift-router default weight (i.e., 256) to something smaller, or indeed unset (assuming no explicit weight has been requested), and to measure the memory usage within the context of the existing perf&scale tests that we use for vetting new haproxy releases.
It may be that the low-hanging change is to not default to weight=256 for backends that only have one pod replica (i.e., if no value specified, and there is only 1 pod replica, then don't default to 256 for that single server entry).
Outcome: does changing the [default] weight value make it feasible to switch back to "random" as the default balancing algorithm for a future OCP release.
Revert router to using "random" once again in 4.11 once analysis is done on impact of weight and static memory allocation.
Plugin teams need a mechanism to extend the OCP console that is decoupled enough so they can deliver at the cadence of their projects and not be forced in to the OCP Console release timelines.
The OCP Console Dynamic Plugin Framework will enable all our plugin teams to do the following:
Requirement | Notes | isMvp? |
---|---|---|
UI to enable and disable plugins | YES | |
Dynamic Plugin Framework in place | YES | |
Testing Infra up and running | YES | |
Docs and read me for creating and testing Plugins | YES | |
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
Documentation Considerations
Questions to be addressed:
Currently, webpack tree shakes PatternFly and only includes the components used by console in its vendor bundle. We need to expose all of the core PatternFly components for use in dynamic plugin, which means we have to disable tree shaking for PatternFly. We should expose this as a separate bundle. This will allow browsers to cache more efficiently and only need to load the PF bundle again when we upgrade PatternFly.
Open Questions
What parts of PatternFly do we consider core?
Acceptance Criteria
Requirement | Notes | isMvp? |
---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Questions to be addressed:
When viewing the Installed Operators list set to 'All projects' and then selecting an operator that is available in 'All namespaces' (globally installed,) upon clicking the operator to view its details the user is taken into the details of that operator in installed namespace (project selector will switch to the install namespace.)
This can be disorienting then to look at the lists of custom resource instances and see them all blank, since the lists are showing instances only in the currently selected project (the install namespace) and not across all namespaces the operator is available in.
It is likely that making use of the new Operator resource will improve this experience (CONSOLE-2240,) though that may still be some releases away. it should be considered if it's worth a "short term" fix in the meantime.
Note: The informational alert was not implemented. It was decided that since "All namespaces" is displayed in the radio button, the alert was not needed.
Goal
Add support for PDB (Pod Disruption Budget) to the console.
Requirements:
Designs:
During master nodes upgrade when nodes are getting drained there's currently no protection from two or more operands going down. If your component is required to be available during upgrade or other voluntary disruptions, please consider deploying PDB to protect your operands.
The effort is tracked in https://issues.redhat.com/browse/WRKLDS-293.
Example:
Acceptance Criteria:
1. Create PDB controller in console-operator for both console and downloads pods
2. Add e2e tests for PDB in single node and multi node cluster
Note: We should consider to backport this to 4.10
As a user, I want the ability to run a pod in debug mode.
This should be the equivalent of running: oc debug pod
Acceptance Criteria for MVP
Assets
Designs (WIP): https://docs.google.com/document/d/1b2n9Ox4xDNJ6AkVsQkXc5HyG8DXJIzU8tF6IsJCiowo/edit#
Customers are asking for improvements to the upgrade experience (both over-the-air and disconnected). This is a feature tracking epics required to get that work done.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
Goal
Add the ability to choose between a full cluster upgrade (which exists today) or control plane upgrade (which will pause all worker pools) in the console.
Background
Currently in the console, users only have the ability to complete a full cluster upgrade. For many customers, upgrades take longer than what their maintenance window allows. Users need the ability to upgrade the control plane independently of the other worker nodes.
Ex. Upgrades of huge clusters may take too long so admins may do the control plane this weekend, worker-pool-A next weekend, worker-pool-B the weekend after, etc. It is all at a pool level, they will not be able to choose specific hosts.
Requirements
Design deliverables:
Goal
Improve the UX on the machine config pool page to reflect the new enhancements on the cluster settings that allows users to select the ability to update the control plane only.
Background
Currently in the console, users only have the ability to complete a full cluster upgrade. For many customers, upgrades take longer than what their maintenance window allows. Users need the ability to upgrade the control plane independently of the other worker nodes.
Ex. Upgrades of huge clusters may take too long so admins may do the control plane this weekend, worker-pool-A next weekend, worker-pool-B the weekend after, etc. It is all at a pool level, they will not be able to choose specific hosts.
Requirements
Design deliverables:
OCP/Telco Definition of Done
Feature Template descriptions and documentation.
Feature Overview
Enable customers to access Google services from workloads on OpenShift clusters using Google Workload Identity (aka WIF)
https://cloud.google.com/kubernetes-engine/docs/concepts/workload-identity
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Questions to be addressed:
Dependencies (internal and external)
We need to ensure following things in the openshift operators
1) Make sure to operator uses v0.0.0-20210218202405-ba52d332ba99 or later version of the golang.org/x/oauth2 module
2) Mount the oidc token in the operator pod, this needs to go in the deployment. We have done it for cluster-image-registry-operator here
3) For workload identity to work, gco credentials that the operator pod uses should be of external_account type (not service_account). The external_account credentials type have path to oidc token along, url of the service account to impersonate along with other details. These type of credentials can be generated from gcp console or programmatically (supported by ccoctl). The operator pod can then consume it from a kube secret. Make appropriate code changes to the operators so that can consume these new credentials
Following repos need one or more of above changes
Upstream Kuberenetes is following other SIGs by moving it's intree cloud providers to an out of tree plugin format, Cloud Controller Manager, at some point in a future Kubernetes release. OpenShift needs to be ready to action this change
Bring together all the cloud controller managers (AWS, GCP, Azure), complete testing and prepare for final GA
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. Initial completion during Refinement status.
Which other projects and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
Initial work was started there: https://github.com/lobziik/cluster-cloud-controller-manager-operator/pull/1/files
Need to isolate provider specific code in respective packages and introduce interface to leverage it (regular and bootstrap manifests rendering should be there atm)
DoD:
This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were completed when this image was assembled
https://issues.redhat.com/browse/AUTH-2 revealed that, in prinicipal, Pod Security Admission is possible to integrate into OpenShift while retaining SCC functionality.
This epic is about the concrete steps to enable Pod Security Admission by default in OpenShift
Enhancement - https://github.com/openshift/enhancements/pull/1010
dns-operator must comply to restricted pod security level. The current audit warning is:
{ "objectRef": "openshift-dns-operator/deployments/dns-operator", "pod-security.kubernetes.io/audit-violations": "would violate PodSecurity \"restricted:latest\": allowPrivilegeEscalation != false (containers \"dns-operator\", \"kube-rbac-proxy\" must set securityContext.allowPrivilegeEscalation=false), unre stricted capabilities (containers \"dns-operator\", \"kube-rbac-proxy\" must set securityContext.capabilities.drop=[\"ALL\"]), runAsNonRoot != true (pod or containers \"dns-operator\", \"kube-rbac-proxy\" must set securityContext.runAsNonRoot=tr ue), seccompProfile (pod or containers \"dns-operator\", \"kube-rbac-proxy\" must set securityContext.seccompProfile.type to \"RuntimeDefault\" or \"Localhost\")" }ingress-operator must comply to pod security. The current audit warning is:
{ "objectRef": "openshift-ingress-operator/deployments/ingress-operator", "pod-security.kubernetes.io/audit-violations": "would violate PodSecurity \"restricted:latest\": allowPrivilegeEscalation != false (containers \"ingress-operator\", \"kube-rbac-proxy\" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (containers \"ingress-operator\", \"kube-rbac-proxy\" must set securityContext.capabilities.drop=[\"ALL\"]), runAsNonRoot != true (pod or containers \"ingress-operator\", \"kube-rbac-proxy\" must set securityContext.run AsNonRoot=true), seccompProfile (pod or containers \"ingress-operator\", \"kube-rbac-proxy\" must set securityContext.seccompProfile.type to \"RuntimeDefault\" or \"Localhost\")" }
An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but can tend to fall by the wayside.
Update console from Cypress 6.0.0 to 8.5.0. Changes that impact us:
https://docs.cypress.io/guides/references/migration-guide#Migrating-to-Cypress-8-0
As an adopter of the @openshift-console/dynamic-plugin-sdk I want to easily integrate into my development pipeline so that I can extend the OCP console.
Trying to pull in the dynamic-plugin-sdk into ACM is proving to be problematic. We would have to move to older dependencies. Integrating with webpack and typescript requires a very specific setup.
The dynamic-plugin-sdk has only really been used internally by OCP and is strongly tied to the setup and dependencies of OCP. For the dynamic-plugin-sdk to be externally consumable by adopters, it should be as easy to use as other webpack plugins such as HtmlWebpackPlugin or CompressionPlugin.
The console has many instances of old variables, $grid-float-breakpoint and $grid-gutter-width, controlling margins/padding and responsive breakpoints throughout the Admin and Dev Console. These do not provide spacing and behaviors consistent with Patternfly components which use their own variables, $pf-global-gutter-md, $pf-global-gutter, and $pf-global-breakpoint-{size}. By replacing these, the intent it to bring the console closer to a pure Patternfly structure and behavior, requiring less overrides and customizations.
Update webpack to the latest 4.x and update webpack loaders. This will help prepare us to move to webpack 5.
HyperShift provisions OpenShift clusters with externally managed control-planes. It follows a slightly different process for provisioning clusters. For example, HyperShift uses cluster API as a backend and moves all the machine management bits to the management cluster.
showing machine management/cluster auto-scaling tabs in the console is likely to confuse users and cause unnecessary side effects.
See Design Doc: https://docs.google.com/document/d/1k76JtRRHBdCCEjHPqKcYvbNVsuaGmRhWDLESWIm0mbo/edit#
It's based on the SERVER_FLAG controlPlaneTopology being set to External is really the driving factor here; this can be done in one of two ways:
To test work related to cluster upgrade process, use a 4.10.3 cluster set on the candidate-4.10 upgrade channel using 4.11 frontend code.
Based on Cesar's comment we should be removing the `Control Plane` section, if the infrastructure.status.controlplanetopology being "External".
If the Infrastructure.Status.ControlPlaneTopology is set to 'External', the console-operator will pass this information via the console-config.yaml co the console. Console pod will get re-deployed and will store the topology mode information as a SERVER_FLAG. Based on that value we need to remove the ability to “Add identity providers” under “Set up your Cluster”. In addition to the getting started card, we should remove the ability to update a cluster on the details card when applicable (anything that changes a cluster version should be read only).
Summary of changes to the overview page:
Check section 03 for more info: https://docs.google.com/document/d/1k76JtRRHBdCCEjHPqKcYvbNVsuaGmRhWDLESWIm0mbo/edit#
If the Infrastructure.Status.ControlPlaneTopology is set to 'External', the console-operator will pass this information via the console-config.yaml to the console. Console pod will get re-deployed and will store the topology mode information as a SERVER_FLAG. Based on that value we need to suspend kubeadmin notifier, from the global notifications, since it contain link for updating the cluster OAuth configuration (see attachment).
If the Infrastructure.Status.ControlPlaneTopology is set to 'External', the console-operator will pass this information via the console-config.yaml to the console. Console pod will get re-deployed and will store the topology mode information as a SERVER_FLAG. Based on that value we need to suspend these notifications:
For these we will need to check `ControlPlaneTopology`, if it's set to 'External' and also check if the user can edit cluster version(either by creating a hook or an RBAC call, eg. `canEditClusterVersion`)
Check section 05 for more info: https://docs.google.com/document/d/1k76JtRRHBdCCEjHPqKcYvbNVsuaGmRhWDLESWIm0mbo/edit#
If the Infrastructure.Status.ControlPlaneTopology is set to 'External', the console-operator will pass this information via the console-config.yaml co the console. Console pod will get re-deployed and will store the topology mode information as a SERVER_FLAG. Based on that value we need surface a message that the control plane is externally managed and add following changes:
In general, anything that changes a cluster version should be read only.
Check section 02 for more info: https://docs.google.com/document/d/1k76JtRRHBdCCEjHPqKcYvbNVsuaGmRhWDLESWIm0mbo/edit#
PatternFly Dark Theme Handbook: https://docs.google.com/document/d/1mRYEfUoOjTsSt7hiqjbeplqhfo3_rVDO0QqMj2p67pw/edit
Admin Console -> Workloads & Pods
Dev Console -> Gotcha pages: Observe Dashboard and Metrics, Add, Pipelines: builder, list, log, and run
As a developer, I want to be able to scope the changes needed to enable dark mode for the admin console. As such, I need to investigate how much of the console will display dark mode using PF variables and also define a list of gotcha pages/components which will need special casing above and beyond PF variable settings.
Acceptance criteria:
As a developer, I want to be able to fix remaining issues from the spreadsheet of issues generated after the initial pass and spike of adding dark theme to the console.. As such, I need to make sure to either complete all remaining issues for the spreadsheet, or, create a bug or future story for any remaining issues in these two documents.
Acceptance criteria:
An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but can tend to fall by the wayside.
The Cluster Dashboard Details Card Protractor integration test was failing at high rate, and despite multiple attempts to fix, was never fully resolved, so it was disabled as a way to fix https://bugzilla.redhat.com/show_bug.cgi?id=2068594. Migrating this entire file to Cypress should give us better debugging capability, which is what was done to fix a similarly problematic project dashboard Protractor test.
In the 4.11 release, a console.openshift.io/default-i18next-namespace annotation is being introduced. The annotation indicates whether the ConsolePlugin contains localization resources. If the annotation is set to "true", the localization resources from the i18n namespace named after the dynamic plugin (e.g. plugin__kubevirt), are loaded. If the annotation is set to any other value or is missing on the ConsolePlugin resource, localization resources are not loaded.
In case these resources are not present in the dynamic plugin, the initial console load will be slowed down. For more info check BZ#2015654
AC:
Follow up of https://issues.redhat.com/browse/CONSOLE-3159
We need to provide a base for running integration tests using the dynamic plugins. The tests should initially
Once the basic framework is in place, we can update the demo plugin and add new integration tests when we add new extension points.
https://github.com/openshift/console/tree/master/frontend/dynamic-demo-plugin
https://github.com/openshift/enhancements/blob/master/enhancements/console/dynamic-plugins.md
https://github.com/openshift/console/tree/master/frontend/packages/console-plugin-sdk
Currently, enabled plugins can fail to load for a variety of reasons. For instance, plugins don't load if the plugin name in the manifest doesn't match the ConsolePlugin name or the plugin has an invalid codeRef. There is no indication in the UI that something has gone wrong. We should explore ways to report this problem in the UI to cluster admins. Depending on the nature of the issue, an admin might be able to resolve the issue or at least report a bug against the plugin.
The message about failing could appear in the notification drawer and/or console plugins tab on the operator config. We could also explore creating an alert if a plugin is failing.
AC:
We have a Timestamp component for consistent display of dates and times that we should expose through the SDK. We might also consider a hook that formats dates and times for places were you don't want or cant use the component, eg. times on a chart.
This will become important when we add a user preference for dates so that plugins show consistent dates and times as console. If I set my user preference to UTC dates, console should show UTC dates everywhere.
AC:
Currently, you need to navigate to
Cluster Settings ->
Global configuration ->
Console (operator) config ->
Console plugins
to see and managed plugins. This takes a lot of clicks and is not discoverable. We should look at surfacing plugin details where they're easier to find – perhaps on the Cluster Settings page – or at least provide a more convenient link somewhere in the UI.
AC: Add the Dynamic Plugins section to the Status Card in the overview that will contain:
Goal
Background
RFE: for 4.10, Cincinnati and the cluster-version operator are adding conditional updates (a.k.a. targeted edge blocking): https://issues.redhat.com/browse/OTA-267
High-level plans in https://github.com/openshift/enhancements/blob/master/enhancements/update/targeted-update-edge-blocking.md#update-client-support-for-the-enhanced-schema
Example of what the oc adm upgrade UX will be in https://github.com/openshift/enhancements/blob/master/enhancements/update/targeted-update-edge-blocking.md#cluster-administrator.
The oc implementation landed via https://github.com/openshift/oc/pull/961.
Design
See design doc: https://docs.google.com/document/d/1Nja4whdsI5dKmQNS_rXyN8IGtRXDJ8gXuU_eSxBLMIY/edit#
See marvel: https://marvelapp.com/prototype/h3ehaa4/screen/86077932
Update the cluster settings page to inform the user when the latest available update is supported but not recommended. Add an informational popover to the latest version in update path visualization.
The "Update Version" modal on the cluster settings page should be updated to give users information about recommended, not recommended, and blocked update versions.
In the image-registry, we have packages origin-common and kubernetes-common. The problem is that this code doesn't get updates. We can replace them with more supported library-go.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
Story: As an administrator I want to rely on a default configuration that spreads image registry pods across topology zones so that I don't suffer from a long recovery time (>6 mins) in case of a complete zone failure if all pods are impacted.
Background: The image registry currently uses affinity/anti-affinity rules to spread registry pods across different hosts. However this might cause situations in which all pods end up on hosts of a single zone, leading to a long recovery time of the registry if that zone is lost entirely. However due to problems in the past with the preferred setting of anti-affinity rule adherence the configuration was forced instead with required and the rules became constraints. With zones as constraints the internal registry would not have deployed anymore in environments with a single zone, e.g. internal CI environment. Pod topology constraints is a new API that is supported in OCP which can also relax constraints in case they cannot be satisfied. Details here: https://docs.openshift.com/container-platform/4.7/nodes/scheduling/nodes-scheduler-pod-topology-spread-constraints.html
Acceptance criteria:
Open Questions:
As an OpenShift administrator
I want to provide the registry operator with a custom certificate authority for S3 storage
so that I can use a third-party S3 storage provider.
Remove Jenkins from the OCP Payload.
See epic linking - need alternative non payload image available to provide relatively seamless migration
Also, the EP for this is approved and merged at https://github.com/openshift/enhancements/blob/master/enhancements/builds/remove-jenkins-payload.md
PARTIAL ANSWER ^^: confirmed with Ben Parees in https://coreos.slack.com/archives/C014MHHKUSF/p1646683621293839 that EP merging is currently sufficient OCP "technical leadership" approval.
assuming none
As maintainers of the OpenShift jenkins component, we need run Jenkins CI for PR testing against openshift/jenkins, openshift/jenkins-sync-plugin, openshift/jenkins-client-plugin, openshift/jenkins-openshift-login-plugin, using images built in the CI pipeline but not injected into CI test clusters via sample operator overriding the jenkins sample imagestream with the jenkins payload image.
As maintainers of the OpenShift Jenkins component, we need Jenkins periodics for the client and sync plugins to run against the latest non payload, CPaas image, promoted to CI's image locations on quay.io, for the current release in development.
As maintainers of the OpenShift Jenkins component, we need Jenkins related tests outside of very basic Jenkins Pipieline Strategy Build Config verification, removed from openshift-tests in OpenShift Origin, using a non-payload, CPaas image pertinent to the branch in question.
High Level, we ideally want to vet the new CPaas image via CI and periodics BEFORE we start changing the samples operator so that it does not manipulate the jenkins imagestream (our tests will override the samples operator override)
NONE ... QE should wait until JNKS-254
NONE
NONE
Dependencies identified
Blockers noted and expected delivery timelines set
Design is implementable
Acceptance criteria agreed upon
Story estimated
Possible staging
1) before CPaas is available, we can validate images generated by PRs to openshift/jenkins, openshift/jenkins-sync-plugin, openshift/jenkins-client-plugin by taking the image built by the image (where the info needed to get the right image from the CI registry is in the IMAGE_FORMAT env var) and then doing an `oc tag --source=docker <PR image ref> openshift/jenkins:2` to replace the use of the payload image in the jenkins imagestream in the openshift namespace with the PRs image
2) insert 1) in https://github.com/openshift/release/blob/master/ci-operator/step-registry/jenkins/sync-plugin/e2e/jenkins-sync-plugin-e2e-commands.sh and https://github.com/openshift/release/blob/master/ci-operator/step-registry/jenkins/client-plugin/tests/jenkins-client-plugin-tests-commands.sh where you test for IMAGE_FORMAT being set
3) or instead of 2) you update the Makefiles for the plugins to call a script that does the same sort of thing, see what is in IMAGE_FORMAT, and if it has something, do the `oc tag`
https://github.com/openshift/release/pull/26979 is a prototype of how to stick the image built from a PR and conceivably the periodics to get the image built from it and tag it into the jenkins imagestream in the openshift namespace in the test cluster
After installing or upgrading to the latest OCP version, the existing OpenShift route to the prometheus-k8s service is updated to be a path-based route to '/api/v1'.
DoD:
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
After investigating a complex Bugzilla involving many applications making queries to prometheus-adapter, we've noticed that we were lacking insights on the requests made to prometheus-adapter. To have such information for an aggregated API, the best would be to have audit logs for prometheus-adapter. This wasn't configurable before, but with https://github.com/kubernetes-sigs/custom-metrics-apiserver/pull/92, upstream users should now be able to configure it.
Since this would greatly help in investigating prometheus-adapter Bugzilla in the future, it would be great if we allowed OpenShift users to configure the audit logs so that they could provide them to us.
Note for the assignee: as of the time of the creation of this ticket, the upstream PR hasn't been merged in custom-metrics-apiserver and thus wasn't synced in prometheus-adapter. So we will have to wait a bit before starting looking into this ticket.
DoD:
Following up on https://issues.redhat.com/browse/MON-1320, we added three new CLI flags to Prometheus to apply different limits on the samples' labels. These new flags are available starting from Prometheus v2.27.0, which will most likely be shipped in OpenShift 4.9.
The limits that we want to look into for OCP are the following ones:
# Per-scrape limit on number of labels that will be accepted for a sample. If # more than this number of labels are present post metric-relabeling, the # entire scrape will be treated as failed. 0 means no limit. [ label_limit: <int> | default = 0 ] # Per-scrape limit on length of labels name that will be accepted for a sample. # If a label name is longer than this number post metric-relabeling, the entire # scrape will be treated as failed. 0 means no limit. [ label_name_length_limit: <int> | default = 0 ] # Per-scrape limit on length of labels value that will be accepted for a sample. # If a label value is longer than this number post metric-relabeling, the # entire scrape will be treated as failed. 0 means no limit. [ label_value_length_limit: <int> | default = 0 ]
We could benefit from them by setting relatively high values that could only induce unbound cardinality and thus reject the targets completely if they happened to breach our constrainst.
DoD:
When users configure CMO to interact with systems outside of an OpenShift cluster, we want to provide an easy way to add the cluster ID to the data send.
Technically this can be achieved today, by adding an identifying label to the remote_write configuration for a given cluster. The operator adding the remote_write integration needs to take care that the label is unique over the managed fleet of clusters. This however adds management complexity. Any given cluster already has a pseudo-unique datum, that can be used for this purpose.
Expose a flag in the CMO configuration, that is false by default (keeps backward compatibility) and when set to true will add the _id label to a remote_write configuration. More specifically it will be added to the top of a remote_write relabel_config list via the replace action. This will add the label as expect, but additionally a user could alter this label in a later relabel config to suit any specific requirements (say rename the label or add additional information to the value).
The location of this flag is the remote_write Spec, so this can be set for individual remote_write configurations.
We currently use a sample app to e2e test remote write in CMO.
In order to test the addition of the cluster_id relabel config, we need to confirm that the metrics send actually have the expected label.
For this test we should use Prometheus as the remote_write target. This allows us to query the metrics send via remote write and confirm they have the expected label.
Add an optional boolean flag to CMOs definition of RemoteWriteSpec that if true adds an entry in the specs WriteRelabelConfigs list.
I went with adding the relabel config to all user-supplied remote_write configurations. This path has no risk for backwards compatibility (unless users use the {}tmp_openshift_cluster_id{} label, seems unlikely) and reduces overall complexity, as well as documentation complexity.
The entry should look like what is already added to the telemetry remote write config and it should be added as the first entry in the list, before any user supplied relabel configs.
The potential target ServiceMonitors are:
The console requires to know the network type capabilities to show/hide some Network Policy form fields.
As a result of https://issues.redhat.com/browse/NETOBSERV-27, this logic is implemented as a features document inside the console code. The console fetches the network type from the network operator and checks the supported features towards this document.
However, this limits the feature to admin users, as other logged-in users do not have permissions to fetch the network type.
This task aims to modify the current Cluster Network Operator to expose the network capabilities as an `sdn-public` Config Map, writeable only by the SDN, readable by any `system:authenticated` user.
Enhancement Proposal PR: https://github.com/openshift/enhancements/pull/875
We want to configure 'default' and 'allowed' values in validation webhook for Guest Accelerators field in GCPProviderSpec. Also revendor it to include newly added Guest Accelerators field.
This can be done after https://github.com/openshift/cluster-api-provider-gcp/pull/172 is merged.
DoD:
Description:
Openshift on RHV is composed of the following subproject the team maintains:
Each of those projects currently uses the generated oVirt API project go-ovirt.
This leads to a number of issues:
Then came go-ovirt-client, go-ovirt-client-log, go-ovirt-client-log-klog and k8sOVirtCredentialsMonitor to the rescue!
The go-ovirt-client is a wrapper around the go-ovirt which contains all the error handling/retry logic/logs/tests needed to provide a decent user experience and an easy-to-use API to the oVirt engine.
go-ovirt-client-log is a library to unify the logging logic between the projects, it is used by go-ovirt-client and should be used by all the sub-projects.
go-ovirt-client-log-klog is a companion library to go-ovirt-client-log enabling logging via the Kubernetes "klog" facility.
k8sOVirtCredentialsMonitor is a utility for monitoring the oVirt credentials secret, which will automatically update the ovirt credentials is they are changed.
We aim to move all projects which are using the go-ovirt to use go-ovirt-client, go-ovirt-client-log and k8sOVirtCredentialsMonitor instead.
Benefits for the eng:
Benefits for the customers:
Acceptance criteria:
How to test:
Description:
Acceptance:
ovirt-csi-driver uses go-ovirt-client for 95% percent of all oVirt related logic.
As a user, I want to understand which service bindings connected a service to a component successfully or not. Currently it's really difficult to understand and needs inspection into each ServiceBinding resource (yaml).
See also https://docs.google.com/document/d/1OzE74z2RGO5LPjtDoJeUgYBQXBSVmD5tCC7xfJotE00/edit
As a user, I want the topology view to be less cluttered as I doom out showing only information that I can discern and still be able to get a feel for the status of my project.
T-shirt size: M
Provide an easy and successful experience for front end developers to build and deploy their applications
Currently, the front end dev experience is not positive. It's much easier for them to use other platforms. Improving the front end dev experience will enable us to gain more marketshare
Although we provide the ability for 2 & 3 today, the current journey does not match with the mental model of the front end developer
Desired UX experience
As a user, I want have the option to add additional labels to a Route, as I could do in OCP3. See RFE-622
The additional labels should only be added to the route, not the service or other components. The advanced option "Labels" should not be touched and these labels are added to all components.
As an small additional we should also show always the "Target port" since it also defines the Service port and to make this more clear, the "Target port" should be shown before the "Create a route to the Application" checkbox.
The following changes should be applied to the Import flow (from Git, from Container, ...) and to the Edit page as well:
This epic is mainly focused on the 4.10 Release QE activities
1. Identify the scenarios for automation
2. Segregate the test Scenarios into smoke, Regression and other user stories
a. Update the https://docs.jboss.org/display/ODC/Automation+Status+Report
3. Align with layered operator teams for updating scripts
3. Work closely with dev team for epic automation
4. Create the automation scripts using cypress
5. Implement CI for nightly builds
6. Execute scripts on sprint basis
To the track the QE progress at one place in 4.10 Release Confluence page
Acceptance criteria:
There are different code spots which maps the old action items "From Git", "From Dockerfile" and "From Devfile" to the new action "Import from Git".
We should avoid mapping different strings to the new version and instead update our tests so that the feature and page object files matches the latest frontend code.
Code areas I found are marked with
// TODO (ODC-6455): Tests should use latest UI labels like "Import from Git" instead of mapping strings
This epic covers a number of customer requests(RFEs) as well as increases usability.
Customer satisfaction as well as improved usability.
None
As a user, I want to use a form to create Deployments
Edit deployment form ODC-5007
As a user, I should be able to switch between the form and yaml editor while creating the ProjectHelmChartRepository CR.
Form component https://github.com/openshift/console/pull/11227
Currently we are only able to get limited telemetry from the Dev Sandbox, but not from any of our managed clusters or on prem clusters.
In order to improve properly analyze usage and the user experience, we need to be able to gather as much data as possible.
// JS type
telemetry?: Record<string, string>
./bin/bridge --telemetry SEGMENT_API_KEY=a-key-123-xzy ./bin/bridge --telemetry CONSOLE_LOG=debug
Goal:
Enhance oc adm release new (and related verbs info, extract, mirror) with heterogeneous architecture support
tl;dr
oc adm release new (and related verbs info, extract, mirror) would be enhanced to optionally allow the creation of manifest list release payloads. The manifest list flow would be triggered whenever the CVO image in an imagestream was a manifest list. If the CVO image is a standard manifest, the generated release payload will also be a manifest. If the CVO image is a manifest list, the generated release payload would be a manifest list (containing a manifest for each arch possessed by the CVO manifest list).
In either case, oc adm release new would permit non-CVO component images to be manifest or manifest lists and pass them through directly to the resultant release manifest(s).
If a manifest list release payload is generated, each architecture specific release payload manifest will reference the same pullspecs provided in the input imagestream.
More details in Option 1 of https://docs.google.com/document/d/1BOlPrmPhuGboZbLZWApXszxuJ1eish92NlOeb03XEdE/edit#heading=h.eldc1ppinjjh
This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were not completed when this image was assembled
Please read: migrating-protractor-tests-to-cypress
Protractor test to migrate: `frontend/integration-tests/tests/oauth.scenario.ts`
Large but straight forward
47) OAuth 48) BasicAuth IDP ✔ creates a Basic Authentication IDP ✔ shows the BasicAuth IDP on the OAuth settings page 49) GitHub IDP ✔ creates a GitHub IDP ✔ shows the GitHub IDP on the OAuth settings page 50) GitLab IDP ✔ creates a GitLab IDP ✔ shows the GitLab IDP on the OAuth settings page 51) Google IDP ✔ creates a Google IDP ✔ shows the Google IDP on the OAuth settings page 52) Keystone IDP ✔ creates a Keystone IDP ✔ shows the Keystone IDP on the OAuth settings page 53) LDAP IDP ✔ creates a LDAP IDP ✔ shows the LDAP IDP on the OAuth settings page 54) OpenID IDP ✔ creates a OpenID IDP ✔ shows the OpenID IDP on the OAuth settings page
Accpetance Criteria
I asked Zvonko Kaiser and he seemed open to it. I need to confirm with Shiva Merla
Rename Provider to Infrastructure Provider
Add GPU Provider
https://miro.com/app/board/uXjVOeUB2B4=/?moveToWidget=3458764514332229879&cot=14
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
As a developer building container images on OpenShift
I want to specify that my build should run without elevated privileges
So that builds do not run as root from the host's perspective with elevated privileges
No QE required for Dev Preview. OpenShift regression testing will verify that existing behavior is not impacted.
We will need to document how to enable this feature, with sufficient warnings regarding Dev Preview.
This likely warrants an OpenShift blog post, potentially?
As a follow up to OCPCLOUD-693, we need to, once all of the API definitions are present in openshift/api, migrate the existing code bases to use the new API locations.
This will include:
Complete all the 4.9 epic features automation user stories and merge it to master branch.
4.9 epics automation completion
Tech debt should be completed
Create the pr's for 4.9 epic user stories automation
Review it
Merge it to 4.10 master branch and 4.9 master branch
As a user, I want to store my delivery pipelines in a Git repository as the source of truth and execute the pipeline on OpenShift on Git events, so that I can version and trace changes to the delivery pipelines in Git.
This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were completed when this image was assembled
This is a clone of issue OCPBUGS-6018. The following is the description of the original issue:
—
This is a public clone of OCPBUGS-3821
The MCO can sometimes render a rendered-config in the middle of an upgrade with old MCs, e.g.:
This will cause the render controller to create a new rendered MC that uses the OLD kubeletconfig-MC, which at best is a double reboot for 1 node, and at worst block the update and break maxUnavailable nodes per pool.
The two modules that are auto generated for the CLI docs need to add ":_content-type: REFERENCE" to the top of the files. Update the doc generation templates to add these.
Just like kube proxy, ovnk should expose port 10256 on every node, so that cloud LBs can send health checks and know which nodes are available. This is relevant for services with externalTrafficPolicy=Cluster.
The current integration of prometheus-adapter in OpenShift uses the platform Prometheus as a backend to get metrics. The problem with this design is that we are getting metrics from 2 different Prometheus instances which don't have replicated data, so two queries sent at the same time to prometheus-adapter might yield different results since the underlying promQL queries executed by prometheus-adapter might be on different Prometheus servers. The consequence is that we end up having inconsistent data across multiple autoscaling requests.
This can be easily tested by running:
$ while true ; do date; oc adm top pod -n openshift-monitoring prometheus-k8s-0 ; echo; sleep 1 ;done Mon Jul 26 03:55:07 EDT 2021 NAME CPU(cores) MEMORY(bytes) prometheus-k8s-0 208m 4879Mi Mon Jul 26 03:55:08 EDT 2021 NAME CPU(cores) MEMORY(bytes) prometheus-k8s-0 246m 4877Mi Mon Jul 26 03:55:09 EDT 2021 NAME CPU(cores) MEMORY(bytes) prometheus-k8s-0 208m 4879Mi Mon Jul 26 03:55:10 EDT 2021 NAME CPU(cores) MEMORY(bytes) prometheus-k8s-0 246m 4877Mi
This isn't a bug in itself since it was designed that way, but we could do better by using thanos-querier as a backend instead of the platform Prometheus because it will duplicate the metrics from both instances and serve one consistent result based on the data that it will get from the Prometheuses.
DoD:
4.12 will have an option in cri-o: add_inheritable_capabilities which will allow a user to opt-out of dropping inheritable capabilities (which comes as a fix for CVE-2022-27652). We should add it by default as a drop-in in 4.11 so clusters that upgrade from it inherit the old behavior
Description of problem:
In a 4.11 cluster with only openshift-samples enabled, the 4.12 introduced optional COs console and insights are installed. While upgrading to 4.12, CVO considers them to be disabled explicitly and skips reconciling them. So these COs are not upgraded to 4.12. Installed COs cannot be disabled, so CVO is supposed to implicitly enable them. $ oc get clusterversion -oyaml { "apiVersion": "config.openshift.io/v1", "kind": "ClusterVersion", "metadata": { "creationTimestamp": "2022-09-30T05:02:31Z", "generation": 3, "name": "version", "resourceVersion": "134808", "uid": "bd95473f-ffda-402d-8fe3-74f852a9d6eb" }, "spec": { "capabilities": { "additionalEnabledCapabilities": [ "openshift-samples" ], "baselineCapabilitySet": "None" }, "channel": "stable-4.11", "clusterID": "8eda5167-a730-4b39-be1d-214a80506d34", "desiredUpdate": { "force": true, "image": "registry.ci.openshift.org/ocp/release@sha256:2c8e617830f84ac1ee1bfcc3581010dec4ae5d9cad7a54271574e8d91ef5ecbc", "version": "" } }, "status": { "availableUpdates": null, "capabilities": { "enabledCapabilities": [ "openshift-samples" ], "knownCapabilities": [ "Console", "Insights", "Storage", "baremetal", "marketplace", "openshift-samples" ] }, "conditions": [ { "lastTransitionTime": "2022-09-30T05:02:33Z", "message": "Unable to retrieve available updates: currently reconciling cluster version 4.12.0-0.nightly-2022-09-28-204419 not found in the \"stable-4.11\" channel", "reason": "VersionNotFound", "status": "False", "type": "RetrievedUpdates" }, { "lastTransitionTime": "2022-09-30T05:02:33Z", "message": "Capabilities match configured spec", "reason": "AsExpected", "status": "False", "type": "ImplicitlyEnabledCapabilities" }, { "lastTransitionTime": "2022-09-30T05:02:33Z", "message": "Payload loaded version=\"4.12.0-0.nightly-2022-09-28-204419\" image=\"registry.ci.openshift.org/ocp/release@sha256:2c8e617830f84ac1ee1bfcc3581010dec4ae5d9cad7a54271574e8d91ef5ecbc\" architecture=\"amd64\"", "reason": "PayloadLoaded", "status": "True", "type": "ReleaseAccepted" }, { "lastTransitionTime": "2022-09-30T05:23:18Z", "message": "Done applying 4.12.0-0.nightly-2022-09-28-204419", "status": "True", "type": "Available" }, { "lastTransitionTime": "2022-09-30T07:05:42Z", "status": "False", "type": "Failing" }, { "lastTransitionTime": "2022-09-30T07:41:53Z", "message": "Cluster version is 4.12.0-0.nightly-2022-09-28-204419", "status": "False", "type": "Progressing" } ], "desired": { "image": "registry.ci.openshift.org/ocp/release@sha256:2c8e617830f84ac1ee1bfcc3581010dec4ae5d9cad7a54271574e8d91ef5ecbc", "version": "4.12.0-0.nightly-2022-09-28-204419" }, "history": [ { "completionTime": "2022-09-30T07:41:53Z", "image": "registry.ci.openshift.org/ocp/release@sha256:2c8e617830f84ac1ee1bfcc3581010dec4ae5d9cad7a54271574e8d91ef5ecbc", "startedTime": "2022-09-30T06:42:01Z", "state": "Completed", "verified": false, "version": "4.12.0-0.nightly-2022-09-28-204419" }, { "completionTime": "2022-09-30T05:23:18Z", "image": "registry.ci.openshift.org/ocp/release@sha256:5a6f6d1bf5c752c75d7554aa927c06b5ea0880b51909e83387ee4d3bca424631", "startedTime": "2022-09-30T05:02:33Z", "state": "Completed", "verified": false, "version": "4.11.0-0.nightly-2022-09-29-191451" } ], "observedGeneration": 3, "versionHash": "CSCJ2fxM_2o=" } } $ oc get co NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE authentication 4.12.0-0.nightly-2022-09-28-204419 True False False 93m cloud-controller-manager 4.12.0-0.nightly-2022-09-28-204419 True False False 3h56m cloud-credential 4.12.0-0.nightly-2022-09-28-204419 True False False 3h59m cluster-autoscaler 4.12.0-0.nightly-2022-09-28-204419 True False False 3h53m config-operator 4.12.0-0.nightly-2022-09-28-204419 True False False 3h54m console 4.11.0-0.nightly-2022-09-29-191451 True False False 3h45m control-plane-machine-set 4.12.0-0.nightly-2022-09-28-204419 True False False 117m csi-snapshot-controller 4.12.0-0.nightly-2022-09-28-204419 True False False 3h54m dns 4.12.0-0.nightly-2022-09-28-204419 True False False 3h53m etcd 4.12.0-0.nightly-2022-09-28-204419 True False False 3h52m image-registry 4.12.0-0.nightly-2022-09-28-204419 True False False 3h46m ingress 4.12.0-0.nightly-2022-09-28-204419 True False False 151m insights 4.11.0-0.nightly-2022-09-29-191451 True False False 3h48m kube-apiserver 4.12.0-0.nightly-2022-09-28-204419 True False False 3h50m kube-controller-manager 4.12.0-0.nightly-2022-09-28-204419 True False False 3h51m kube-scheduler 4.12.0-0.nightly-2022-09-28-204419 True False False 3h51m kube-storage-version-migrator 4.12.0-0.nightly-2022-09-28-204419 True False False 91m machine-api 4.12.0-0.nightly-2022-09-28-204419 True False False 3h50m machine-approver 4.12.0-0.nightly-2022-09-28-204419 True False False 3h54m machine-config 4.12.0-0.nightly-2022-09-28-204419 True False False 3h52m monitoring 4.12.0-0.nightly-2022-09-28-204419 True False False 3h44m network 4.12.0-0.nightly-2022-09-28-204419 True False False 3h55m node-tuning 4.12.0-0.nightly-2022-09-28-204419 True False False 113m openshift-apiserver 4.12.0-0.nightly-2022-09-28-204419 True False False 3h48m openshift-controller-manager 4.12.0-0.nightly-2022-09-28-204419 True False False 113m openshift-samples 4.12.0-0.nightly-2022-09-28-204419 True False False 116m operator-lifecycle-manager 4.12.0-0.nightly-2022-09-28-204419 True False False 3h54m operator-lifecycle-manager-catalog 4.12.0-0.nightly-2022-09-28-204419 True False False 3h54m operator-lifecycle-manager-packageserver 4.12.0-0.nightly-2022-09-28-204419 True False False 3h48m service-ca 4.12.0-0.nightly-2022-09-28-204419 True False False 3h54m storage 4.12.0-0.nightly-2022-09-28-204419 True False False 3h54m
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-09-28-204419
How reproducible:
Always
Steps to Reproduce:
1. Install a 4.11 cluster with only openshift-samples enabled 2. Upgrade to 4.12 3.
Actual results:
The 4.12 introduced optional CO console and insights are not upgraded to 4.12
Expected results:
All the installed COs get upgraded
Additional info:
This bug is a backport clone of [Bugzilla Bug 2118318](https://bugzilla.redhat.com/show_bug.cgi?id=2118318). The following is the description of the original bug:
—
+++ This bug was initially created as a clone of Bug #2117569 +++
Description of problem:
The garbage collector resource quota controller must ignore ALL events; otherwise, if a rogue controller or a workload causes unbound event creation, performance will degrade as it has to process the events.
Fix: https://github.com/kubernetes/kubernetes/pull/110939
This bug is to track fix in master (4.12) and also allow to backport to 4.11.1
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
— Additional comment from Michal Fojtik on 2022-08-11 10:52:28 UTC —
I'm using FastFix here as we need to backport this to 4.11.1 to avoid support churn for busy clusters or clusters doing upgrades.
— Additional comment from ART BZ Bot on 2022-08-11 15:13:32 UTC —
Elliott changed bug status from MODIFIED to ON_QA.
This bug is expected to ship in the next 4.12 release.
— Additional comment from zhou ying on 2022-08-12 03:03:34 UTC —
checked the payload commit id , the payload 4.12.0-0.nightly-2022-08-11-191750 has container the fixed pr .
oc adm release info registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-08-11-191750 --commit-urls |grep hyperkube
Warning: the default reading order of registry auth file will be changed from "${HOME}/.docker/config.json" to podman registry config locations in the future version. "${HOME}/.docker/config.json" is deprecated, but can still be used for storing credentials as a fallback. See https://github.com/containers/image/blob/main/docs/containers-auth.json.5.md for the order of podman registry config locations.
hyperkube https://github.com/openshift/kubernetes/commit/da80cd038ee5c3b45ba36d4b48b42eb8a74439a3
commit da80cd038ee5c3b45ba36d4b48b42eb8a74439a3 (HEAD -> master, origin/release-4.13, origin/release-4.12, origin/master, origin/HEAD)
Merge: a9d6306a701 055b96e614a
Author: OpenShift Merge Robot <openshift-merge-robot@users.noreply.github.com>
Date: Thu Aug 11 15:13:05 2022 +0000
Merge pull request #1338 from benluddy/openshift-pick-110888
Bug 2117569: UPSTREAM: 110888: feat: fix a bug thaat not all event be ignored by gc controller
We have created a fix in 4.12 that fetches instance type information from Azure API instead of updating the lists. We feel that backporting that fix is too risky, but agreed to update the list in older versions.
Description of problem:
Add the following instance types to azure_instance_types list[1]:
Version-Release number of selected component (if applicable):
OCP 4.8
Steps to Reproduce:
1. Migrate worker/infra nodes to above mentioned (missing) v5 instance types
2. "Failed to set autoscaling from zero annotations, instance type unknown"
Actual results:
Expected results:
The new instance types are available in the azure_instance_types list[1] and no errors/warnings are observed after migrating:
Additional info:
The related v4 instance types are already available[1] - I suspect adding the mentioned v5 instance types is a minor update:
1) azure_instance_types.go
https://github.com/openshift/cluster-api-provider-azure/blob/release-4.8/pkg/cloud/azure/actuators/machineset/azure_instance_types.go
Description of problem:
When creating a pod with an additional network that contains a `spec.config.ipam.exclude` range, any address within the excluded range is still iterated while searching for a suitable IP candidate. As a result, pod creation times out when large exclude ranges are used.
Version-Release number of selected component (if applicable):
How reproducible:
with big exclude ranges, 100%
Steps to Reproduce:
1. create network-attachment-definition with a large range: $ cat <<EOF| oc apply -f - apiVersion: k8s.cni.cncf.io/v1 kind: NetworkAttachmentDefinition metadata: name: nad-w-excludes spec: config: |- { "cniVersion": "0.3.1", "name": "macvlan-net", "type": "macvlan", "master": "ens3", "mode": "bridge", "ipam": { "type": "whereabouts", "range": "fd43:01f1:3daa:0baa::/64", "exclude": [ "fd43:01f1:3daa:0baa::/100" ], "log_file": "/tmp/whereabouts.log", "log_level" : "debug" } } EOF 2. create a pod with the network attached: $ cat <<EOF|oc apply -f - apiVersion: v1 kind: Pod metadata: name: pod-with-exclude-range annotations: k8s.v1.cni.cncf.io/networks: nad-w-excludes spec: containers: - name: pod-1 image: openshift/hello-openshift EOF 3. check pod status, event log and whereabouts logs after a while: $ oc get pods NAME READY STATUS RESTARTS AGE pod-with-exclude-range 0/1 ContainerCreating 0 2m23s $ oc get events <...> 6m39s Normal Scheduled pod/pod-with-exclude-range Successfully assigned default/pod-with-exclude-range to <worker-node> 6m37s Normal AddedInterface pod/pod-with-exclude-range Add eth0 [10.129.2.49/23] from openshift-sdn 2m39s Warning FailedCreatePodSandBox pod/pod-with-exclude-range Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded $ oc debug node/<worker-node> - tail /host/tmp/whereabouts.log Starting pod/<worker-node>-debug ... To use host binaries, run `chroot /host` 2022-10-27T14:14:50Z [debug] Finished leader election 2022-10-27T14:14:50Z [debug] IPManagement: {fd43:1f1:3daa:baa::1 ffffffffffffffff0000000000000000} , <nil> 2022-10-27T14:14:59Z [debug] Used defaults from parsed flat file config @ /etc/kubernetes/cni/net.d/whereabouts.d/whereabouts.conf 2022-10-27T14:14:59Z [debug] ADD - IPAM configuration successfully read: {Name:macvlan-net Type:whereabouts Routes:[] Datastore:kubernetes Addresses:[] OmitRanges:[fd43:01f1:3daa:0baa::/80] DNS: {Nameservers:[] Domain: Search:[] Options:[]} Range:fd43:1f1:3daa:baa::/64 RangeStart:fd43:1f1:3daa:baa:: RangeEnd:<nil> GatewayStr: EtcdHost: EtcdUsername: EtcdPassword:********* EtcdKeyFile: EtcdCertFile: EtcdCACertFile: LeaderLeaseDuration:1500 LeaderRenewDeadline:1000 LeaderRetryPeriod:500 LogFile:/tmp/whereabouts.log LogLevel:debug OverlappingRanges:true SleepForRace:0 Gateway:<nil> Kubernetes: {KubeConfigPath:/etc/kubernetes/cni/net.d/whereabouts.d/whereabouts.kubeconfig K8sAPIRoot:} ConfigurationPath:PodName:pod-with-exclude-range PodNamespace:default} 2022-10-27T14:14:59Z [debug] Beginning IPAM for ContainerID: f4ffd0e07d6c1a2b6ffb0fa29910c795258792bb1a1710ff66f6b48fab37af82 2022-10-27T14:14:59Z [debug] Started leader election 2022-10-27T14:14:59Z [debug] OnStartedLeading() called 2022-10-27T14:14:59Z [debug] Elected as leader, do processing 2022-10-27T14:14:59Z [debug] IPManagement - mode: 0 / containerID:f4ffd0e07d6c1a2b6ffb0fa29910c795258792bb1a1710ff66f6b48fab37af82 / podRef: default/pod-with-exclude-range 2022-10-27T14:14:59Z [debug] IterateForAssignment input >> ip: fd43:1f1:3daa:baa:: | ipnet: {fd43:1f1:3daa:baa:: ffffffffffffffff0000000000000000} | first IP: fd43:1f1:3daa:baa::1 | last IP: fd43:1f1:3daa:baa:ffff:ffff:ffff:ffff
Actual results:
Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Expected results:
additional network gets attached to the pod
Additional info:
Description of problem:
This is a clone of https://bugzilla.redhat.com/show_bug.cgi?id=2074299 for backporting purposes.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
When creating a ProjectHelmChartRepository (with or without the form) and setting a display name (as `spec.name`), this value is not used in the developer catalog / Helm Charts catalog filter sidebar.
It shows (and watches) the display names of `HelmChartRepository` resources.
Version-Release number of selected component (if applicable):
4.11
How reproducible:
Always
Steps to Reproduce:
1. Switch to Developer Perspective 2. Navigate to Add > "Helm Chart repositories" 3. Enter "ibm-charts" as "Chart repository name" 4. Enter URL https://raw.githubusercontent.com/IBM/charts/master/repo/community/index.yaml as URL) 5. Press on create 6. Open the YAML editor and change the `spec.name` attribute to "IBM Charts" 7. Save the change 8. Navigate to Add > "Helm Chart"
Actual results:
The filter navigation on the left side shows "Chart Repositories" "Ibm Chart". A camel case version of the resource name.
Expected results:
It should show the "spec.name" "IBM Charts" if defined and fallback to the current implementation if the optional spec.name is not defined.
Additional info:
There is a bug discussing that the display name could not be entered directly, https://bugzilla.redhat.com/show_bug.cgi?id=2106366. This bug here is only about the catalog output.
OCPBUGS-1251 landed an admin-ack gate in 4.11.z to help admins prepare for Kubernetes 1.25 API removals which are coming in OpenShift 4.12. Poking around in a 4.12.0-ec.2 cluster where APIRemovedInNextReleaseInUse is firing:
$ oc --as system:admin adm must-gather -- /usr/bin/gather_audit_logs $ zgrep -h v1beta1/poddisruptionbudget must-gather.local.1378724704026451055/quay*/audit_logs/kube-apiserver/*.log.gz | jq -r '.verb + " " + (.user | .username + " " + (.extra["authentication.kubernetes.io/pod-name"] | tostr ing))' | sort | uniq -c parse error: Invalid numeric literal at line 29, column 6 28 watch system:serviceaccount:openshift-machine-api:cluster-autoscaler ["cluster-autoscaler-default-5cf997b8d6-ptgg7"]
Finding the source for that container:
$ oc --as system:admin -n openshift-machine-api get -o json pod cluster-autoscaler-default-5cf997b8d6-ptgg7 | jq -r '.status.containerStatuses[].image' quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f81ab7ce0c851ba5e5169bba717cb54716ce5457cbe89d159c97a5c25fd820ed $ oc image info quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f81ab7ce0c851ba5e5169bba717cb54716ce5457cbe89d159c97a5c25fd820ed | grep github SOURCE_GIT_URL=https://github.com/openshift/kubernetes-autoscaler io.openshift.build.commit.url=https://github.com/openshift/kubernetes-autoscaler/commit/1dac0311b9842958ec630273428b74703d51c1c9 io.openshift.build.source-location=https://github.com/openshift/kubernetes-autoscaler
Poking about in the source:
$ git clone --depth 30 --branch master https://github.com/openshift/kubernetes-autoscaler.git
$ cd kubernetes-autoscaler
$ find . -name vendor
./addon-resizer/vendor
./cluster-autoscaler/vendor
./vertical-pod-autoscaler/e2e/vendor
./vertical-pod-autoscaler/vendor
Lots of vendoring. I haven't checked to see how new the client code is in the various vendor packages. But the main issue seems to be the v1beta1 in:
$ git grep policy cluster-autoscaler/core cluster-autoscaler/utils | grep policy.*v1beta1 cluster-autoscaler/core/scaledown/actuation/actuator_test.go: policyv1beta1 "k8s.io/api/policy/v1beta1" cluster-autoscaler/core/scaledown/actuation/actuator_test.go: eviction := createAction.GetObject().(*policyv1beta1.Eviction) cluster-autoscaler/core/scaledown/actuation/drain.go: policyv1 "k8s.io/api/policy/v1beta1" cluster-autoscaler/core/scaledown/actuation/drain_test.go: policyv1 "k8s.io/api/policy/v1beta1" cluster-autoscaler/core/scaledown/legacy/legacy.go: policyv1 "k8s.io/api/policy/v1beta1" cluster-autoscaler/core/scaledown/legacy/wrapper.go: policyv1 "k8s.io/api/policy/v1beta1" cluster-autoscaler/core/scaledown/scaledown.go: policyv1 "k8s.io/api/policy/v1beta1" cluster-autoscaler/core/static_autoscaler_test.go: policyv1 "k8s.io/api/policy/v1beta1" cluster-autoscaler/utils/drain/drain.go: policyv1 "k8s.io/api/policy/v1beta1" cluster-autoscaler/utils/drain/drain_test.go: policyv1 "k8s.io/api/policy/v1beta1" cluster-autoscaler/utils/kubernetes/listers.go: policyv1 "k8s.io/api/policy/v1beta1" cluster-autoscaler/utils/kubernetes/listers.go: v1policylister "k8s.io/client-go/listers/policy/v1beta1"
The main change from v1beta1 to v1 involves spec.selector; I dunno if that's relevant to the autoscaler use-case or not.
Do we run autoscaler CI? I was poking around a bit, but did not find a 4.12 periodic excercising the autoscaler that might have turned up this alert and issue.
This is a clone of issue OCPBUGS-2508. The following is the description of the original issue:
—
Description of problem:
Installer fails due to Neutron policy error when creating Openstack servers for OCP master nodes. $ oc get machines -A NAMESPACE NAME PHASE TYPE REGION ZONE AGE openshift-machine-api ostest-kwtf8-master-0 Running 23h openshift-machine-api ostest-kwtf8-master-1 Running 23h openshift-machine-api ostest-kwtf8-master-2 Running 23h openshift-machine-api ostest-kwtf8-worker-0-g7nrw Provisioning 23h openshift-machine-api ostest-kwtf8-worker-0-lrkvb Provisioning 23h openshift-machine-api ostest-kwtf8-worker-0-vwrsk Provisioning 23h $ oc -n openshift-machine-api logs machine-api-controllers-7454f5d65b-8fqx2 -c machine-controller [...] E1018 10:51:49.355143 1 controller.go:317] controller/machine_controller "msg"="Reconciler error" "error"="error creating Openstack instance: Failed to create port err: Request forbidden: [POST https://overcloud.redhat.local:13696/v2.0/ports], error message: {\"NeutronError\": {\"type\": \"PolicyNotAuthorized\", \"message\": \"(rule:create_port and (rule:create_port:allowed_address_pairs and (rule:create_port:allowed_address_pairs:ip_address and rule:create_port:allowed_address_pairs:ip_address))) is disallowed by policy\", \"detail\": \"\"}}" "name"="ostest-kwtf8-worker-0-lrkvb" "namespace"="openshift-machine-api"
Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2022-10-14-023020
How reproducible:
Always
Steps to Reproduce:
1. Install 4.10 within provider networks (in primary or secondary interface)
Actual results:
Installation failure: 4.10.0-0.nightly-2022-10-14-023020: some cluster operators have not yet rolled out
Expected results:
Successful installation
Additional info:
Please find must-gather for installation on primary interface link here and for installation on secondary interface link here.
Acceptance criteria:
This is a clone of issue OCPBUGS-2083. The following is the description of the original issue:
—
Description of problem:
Currently we are running VMWare CSI Operator in OpenShift 4.10.33. After running vulnerability scans, the operator was discovered to be running a known weak cipher 3DES. We are attempting to upgrade or modify the operator to customize the ciphers available. We were looking at performing a manual upgrade via Quay.io but can't seem to pull the image and was trying to steer away from performing a custom install from scratch. Looking for any suggestions into mitigated the weak cipher in the kube-rbac-proxy under VMware CSI Operator.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-7960. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-7780. The following is the description of the original issue:
—
Description of problem:
4.9 and 4.10 oc calls to oc adm upgrade channel ... for 4.11+ clusters would clear spec.capabilities. Not all that many clusters try to restrict capabilities, but folks will need to bump their channel for at least every other minor (if their using EUS channels), and while we recommend folks use an oc from the 4.y they're heading towards, we don't have anything in place to enforce that.
Version-Release number of selected component (if applicable):
4.9 and 4.10 oc are exposed vs. the new-in-4.11 spec.capabilities. Newer oc could theoretically be exposed vs. any new ClusterVersion spec capabilities.
How reproducible:
100%
Steps to Reproduce:
1. Install a 4.11+ cluster with None capabilities.
2. Set the channel with a 4.10.51 oc, like oc adm upgrade channel fast-4.11.
3. Check the capabilities with oc get -o json clusterversion version | jq -c .spec.capabilities.
Actual results:
null
Expected results:
{"baselineCapabilitySet":"None"}
This is a clone of issue OCPBUGS-6517. The following is the description of the original issue:
—
Description of problem:
When the cluster is configured with Proxy the swift client in the image registry operator is not using the proxy to authenticate with OpenStack, so it's unable to reach the OpenStack API. This issue became evident since recently the support was added to not fallback to cinder in case swift is available[1].
[1]https://github.com/openshift/cluster-image-registry-operator/pull/819
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Deploy a cluster with proxy and restricted installation 2. 3.
Actual results:
Expected results:
Additional info:
This fix contains the following changes coming from updated version of kubernetes up to v1.24.10:
Changelog:
v1.24.11: https://github.com/kubernetes/kubernetes/blob/release-1.24/CHANGELOG/CHANGELOG-1.24.md#changelog-since-v12410
v1.24.10: https://github.com/kubernetes/kubernetes/blob/release-1.24/CHANGELOG/CHANGELOG-1.24.md#changelog-since-v1249
v1.24.9: https://github.com/kubernetes/kubernetes/blob/release-1.24/CHANGELOG/CHANGELOG-1.24.md#changelog-since-v1248
v1.24.8: https://github.com/kubernetes/kubernetes/blob/release-1.24/CHANGELOG/CHANGELOG-1.24.md#changelog-since-v1247
v1.24.7: https://github.com/kubernetes/kubernetes/blob/release-1.24/CHANGELOG/CHANGELOG-1.24.md#changelog-since-v1246
Description of problem:
The metal3 Pod of openshift-machine-api is in CrashLoopBackOff status.
Version-Release number of selected component (if applicable):
4.10.31
How reproducible:
Always reproduce in IPv6
Steps to Reproduce:
1. Preparing to configure ipi on the provisioning node - RHEL 8 ( haproxy, named, mirror registry, rhcos_cache_server ..) 2. configuring the install-config.yaml (attached) - provisioningNetwork: disabled - machine network: only IPv6 - disconnected installation 3. deploy the cluster
Actual results:
It is not possible to add worker nodes because metal3 does not start normally.
Expected results:
metal3 starts normally in IPv6 environment
Additional info:
1. attached must-gather https://drive.google.com/file/d/1GKxj3syROIMnURx_PYzOYhJdEuXBNXVW/view?usp=sharing 2. pod status [kni@prov ~]$ oc get pod NAME READY STATUS RESTARTS AGE cluster-autoscaler-operator-6656bfd7b9-bt4j8 2/2 Running 0 35h cluster-baremetal-operator-6bbdd6758-rmxgq 2/2 Running 0 35h machine-api-controllers-55fb545b56-kl5sj 7/7 Running 0 34h machine-api-operator-845b6cf855-q7gdd 2/2 Running 0 35h metal3-574876cfdb-98fmz 6/7 CrashLoopBackOff 179 (3m17s ago) 14h metal3-image-cache-5mq2w 1/1 Running 0 14h metal3-image-cache-nftpj 1/1 Running 0 14h metal3-image-cache-t7whh 1/1 Running 0 14h metal3-image-customization-68d4d6d99b-dbqgn 1/1 Running 0 15h [kni@prov ~]$ oc logs metal3-574876cfdb-98fmz -c metal3-httpd AH00526: Syntax error on line 8 of /etc/httpd/conf.d/vmedia.conf: The port number "2001:feed:101::102" is outside the appropriate range (i.e., 1..65535).
Description of problem:
[OVN][OSP] After reboot egress node, egress IP cannot be applied anymore.
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-11-07-181244
How reproducible:
Frequently happened in automation. But didn't reproduce it in manual.
Steps to Reproduce:
1. Label one node as egress node 2. Config one egressIP object STEP: Check one EgressIP assigned in the object. Nov 8 15:28:23.591: INFO: egressIPStatus: [{"egressIP":"192.168.54.72","node":"huirwang-1108c-pg2mt-worker-0-2fn6q"}] 3. Reboot the node, wait for the node ready.
Actual results:
EgressIP cannot be applied anymore. Waited more than 1 hour. oc get egressip NAME EGRESSIPS ASSIGNED NODE ASSIGNED EGRESSIPS egressip-47031 192.168.54.72
Expected results:
The egressIP should be applied correctly.
Additional info:
Some logs E1108 07:29:41.849149 1 egressip.go:1635] No assignable nodes found for EgressIP: egressip-47031 and requested IPs: [192.168.54.72] I1108 07:29:41.849288 1 event.go:285] Event(v1.ObjectReference{Kind:"EgressIP", Namespace:"", Name:"egressip-47031", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'NoMatchingNodeFound' no assignable nodes for EgressIP: egressip-47031, please tag at least one node with label: k8s.ovn.org/egress-assignable W1108 07:33:37.401149 1 egressip_healthcheck.go:162] Could not connect to huirwang-1108c-pg2mt-worker-0-2fn6q (10.131.0.2:9107): context deadline exceeded I1108 07:33:37.401348 1 master.go:1364] Adding or Updating Node "huirwang-1108c-pg2mt-worker-0-2fn6q" I1108 07:33:37.437465 1 egressip_healthcheck.go:168] Connected to huirwang-1108c-pg2mt-worker-0-2fn6q (10.131.0.2:9107)
After this log, seems like no logs related to "192.168.54.72" happened.
As per [1], the jsonnet code for managing thanos-ruler resources should reuse the upstream kube-thanos project.
This is a clone of issue OCPBUGS-1765. The following is the description of the original issue:
—
Description of problem:
If a customer creates a machine with a networks section like this networks: - filter: {} noAllowedAddressPairs: false subnets: - filter: {} uuid: primary-subnet-uuid - filter: {} noAllowedAddressPairs: true subnets: - filter: {} uuid: other-subnet-uuid primarySubnet: primary-subnet-uuid Then all the ports are created without the allowed address pairs. Doing some research in the source code, I have found that: - For each entry on the networks: section, networks are filtered as per its filter: section[1] - Then, if the subnets: section of the network entry is not empty, for each of the network IDs found above[2], 2 things are done that are relevant for this situatoin: - The net ID is saved on a netsWithoutAllowedAddressPairs[3]. That map is later checked while creating any port[4]. - For each subnet entry that matches the network ID, a port is created[5]. So, the problematic behavior happens due to the following: - Both entries in the networks array have empty filters. This means that both entries selected all the neutron networks. - This configuration results in one port per subnet as expected because, in the later traversal of the subnets array of each entry[5], it is filtering by subnet and creating a single port as expected. - However, the entry with "noAllowedAddressPairs: true" is selecting all the neutron networks, so it adds all of them to the netsWithoutAllowedAddressPairs map[3], regardless of the subnets filtering. - As all the networks are in noAllowedAddressPairs: true array, all the ports created for the VM have their allowed address pairs removed[4]. Why do we consider this behavior undesired? I understand that, if we create a port for a network that has no allowed pairs, we create all the other ports in the same networks without the pairs. However, it is surprising that a port in a network is removed the allowed address pairs due to a setting in an entry that yielded no port on that network. In other words, one would expect that the same subnet filtering that happens on each network entry in what regards yielding ports for the VM would also work for the noAllowedPairs parameter.
Version-Release number of selected component (if applicable):
4.10.30
How reproducible:
Always
Steps to Reproduce:
1. Create a machineset like in the description 2. 3.
Actual results:
All ports have no address pairs
Expected results:
Only the port on the secondary subnet has no address pairs.
Additional info:
A simple workaround would be to just fill the filter so that a single network is selected for each network entry. References: [1] - https://github.com/openshift/cluster-api-provider-openstack/blob/f6b51710d4f395ded401347589447f5f41dd5c4c/pkg/cloud/openstack/clients/machineservice.go#L576 [2] - https://github.com/openshift/cluster-api-provider-openstack/blob/f6b51710d4f395ded401347589447f5f41dd5c4c/pkg/cloud/openstack/clients/machineservice.go#L580 [3] - https://github.com/openshift/cluster-api-provider-openstack/blob/f6b51710d4f395ded401347589447f5f41dd5c4c/pkg/cloud/openstack/clients/machineservice.go#L581-L583 [4] - https://github.com/openshift/cluster-api-provider-openstack/blob/f6b51710d4f395ded401347589447f5f41dd5c4c/pkg/cloud/openstack/clients/machineservice.go#L658-L660 [5] - https://github.com/openshift/cluster-api-provider-openstack/blob/f6b51710d4f395ded401347589447f5f41dd5c4c/pkg/cloud/openstack/clients/machineservice.go#L610-L625
Description of problem: This is a follow-up to OCPBUGS-2795 and OCPBUGS-2941.
The installer fails to destroy the cluster when the OpenStack object storage omits 'content-type' from responses. This can happen on responses with HTTP status code 204, where a reverse proxy is truncating content-related headers (see this nginX bug report). In such cases, the Installer errors with:
level=error msg=Bulk deleting of container "5ifivltb-ac890-chr5h-image-registry-fnxlmmhiesrfvpuxlxqnkoxdbl" objects failed: Cannot extract names from response with content-type: []
Listing container object suffers from the same issue as listing the containers and this one isn't fixed in latest versions of gophercloud. I've reported https://github.com/gophercloud/gophercloud/issues/2509 and fixing it with https://github.com/gophercloud/gophercloud/issues/2510, however we likely won't be able to backport the bump to gophercloud master back to release-4.8 so we'll have to look for alternatives.
I'm setting the priority to critical as it's causing all our jobs to fail in master.
Version-Release number of selected component (if applicable):
4.8.z
How reproducible:
Likely not happening in customer environments where Swift is exposed directly. We're seeing the issue in our CI where we're using a non-RHOSP managed cloud.
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
-----------------------
On dualstack baremetal IPI cluster next error message is present in ovnkube logs:
oc logs -n openshift-ovn-kubernetes ovnkube-node-rvggh -c ovnkube-node
...
E0810 02:12:46.343460 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
E0810 02:13:16.347603 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
E0810 02:13:46.351108 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
E0810 02:14:16.355047 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
E0810 02:14:46.358950 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
I0810 02:15:13.313945 353971 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Service total 9 items received
E0810 02:15:16.362737 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
E0810 02:15:46.366490 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
E0810 02:16:16.369963 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
I0810 02:16:24.306561 353971 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Endpoints total 560 items received
E0810 02:16:46.373482 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
E0810 02:17:16.377497 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
E0810 02:17:46.380726 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
I0810 02:18:15.325871 353971 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Node total 50 items received
E0810 02:18:16.384732 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
I0810 02:18:38.299738 353971 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Pod total 9 items received
E0810 02:18:46.388162 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
E0810 02:19:16.391669 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
Version-Release number of selected component (if applicable):
-------------------------------------------------------------
OCP-4.10.26
ovn-2021-21.12.0-58.el8fdp.x86_64
ovn-2021-host-21.12.0-58.el8fdp.x86_64
ovn-2021-central-21.12.0-58.el8fdp.x86_64
ovn-2021-vtep-21.12.0-58.el8fdp.x86_64
How reproducible:
-----------------
so far spotted on 2 different clusters
Steps to Reproduce:
-------------------
1. Deploy dualstack baremetal IPI cluster with OVNKubernetesHybrid network(add next to cluster's config before running cluster install):
defaultNetwork:
type: OVNKubernetes
ovnKubernetesConfig:
hybridOverlayConfig:
hybridClusterNetwork: []
Actual results:
---------------
Error message in logs
Expected results:
-----------------
No error message in logs
Additional info:
----------------
Baremetal dualstack setup with 3 masters and 4 workers, bonding configured for baremetal network on masters and workers
This is a clone of issue OCPBUGS-1805. The following is the description of the original issue:
—
The vSphere CSI cloud.conf lists the single datacenter from platform workspace config but in a multi-zone setup (https://github.com/openshift/enhancements/pull/918 ) there may be more than the one datacenter.
This issue is resulting in PVs failing to attach because the virtual machines can't be find in any other datacenter. For example:
0s Warning FailedAttachVolume pod/image-registry-85b5d5db54-m78vp AttachVolume.Attach failed for volume "pvc-ab1a0611-cb3b-418d-bb3b-1e7bbe2a69ed" : rpc error: code = Internal desc = failed to find VirtualMachine for node:"rbost-zonal-ghxp2-worker-3-xm7gw". Error: virtual machine wasn't found
The machine above lives in datacenter-2 but the CSI cloud.conf is only aware of the datacenter IBMCloud.
$ oc get cm vsphere-csi-config -o yaml -n openshift-cluster-csi-drivers | grep datacenters
datacenters = "IBMCloud"
This is a clone of issue OCPBUGS-7474. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-6714. The following is the description of the original issue:
—
Description of problem:
Traffic from egress IPs was interrupted after Cluster patch to Openshift 4.10.46
a customer cluster was patched. It is an Openshift 4.10.46 cluster with SDN.
More description about issue is available in private comment below since it contains customer data.
This is a clone of issue OCPBUGS-4311. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-4305. The following is the description of the original issue:
—
Description of problem:
Please add an option to DISABLE debug in ironic-api. Presently it is enabled by default and there is no way to disable it or reduce log level
https://github.com/metal3-io/ironic-image/blob/main/ironic-config/ironic.conf.j2#L3
Version-Release number of selected component (if applicable): none
How reproducible: Every time
Steps to Reproduce:
Please check source code here: https://github.com/metal3-io/ironic-image/blob/main/ironic-config/ironic.conf.j2#L3
It is enabled by default and there is no way to disable it or reduce log level
Actual results:
Please check Case: 03371411, the log file grew to 409 GB
Expected results: Need a way to disable debug
Additional info: Case 03371411. A cluster must gather and log file can be found in the case.
Before platformStatus, the operator used to get information about AWS and GCP from the install-config config map. This code can be removed.
In order to delete the correct GCP cloud resources, the "--credentials-requests-dir" parameter must be passed to "ccoctl gcp delete". This was fixed for 4.12 as part of https://github.com/openshift/cloud-credential-operator/pull/489 but must be backported for previous releases. See https://github.com/openshift/cloud-credential-operator/pull/489#issuecomment-1248733205 for discussion regarding this bug.
To reproduce, create GCP infrastructure with a name parameter that is a subset of another set of GCP infrastructure's name parameter. I will "ccoctl gcp create all" with "name=abutcher-gcp" and "name=abutcher-gcp1".
$ ./ccoctl gcp create-all \ --name=abutcher-gcp \ --region=us-central1 \ --project=openshift-hive-dev \ --credentials-requests-dir=./credrequests $ ./ccoctl gcp create-all \ --name=abutcher-gcp1 \ --region=us-central1 \ --project=openshift-hive-dev \ --credentials-requests-dir=./credrequests
Running "ccoctl gcp delete --name=abutcher-gcp" will result in GCP infrastructure for both "abutcher-gcp" and "abutcher-gcp1" being deleted.
$ ./ccoctl gcp delete --name abutcher-gcp --project openshift-hive-dev 2022/10/24 11:30:06 Credentials loaded from file "/home/abutcher/.gcp/osServiceAccount.json" 2022/10/24 11:30:06 Deleted object .well-known/openid-configuration from bucket abutcher-gcp-oidc 2022/10/24 11:30:07 Deleted object keys.json from bucket abutcher-gcp-oidc 2022/10/24 11:30:07 OIDC bucket abutcher-gcp-oidc deleted 2022/10/24 11:30:09 IAM Service account abutcher-gcp-openshift-image-registry-gcs deleted 2022/10/24 11:30:10 IAM Service account abutcher-gcp-openshift-gcp-ccm deleted 2022/10/24 11:30:11 IAM Service account abutcher-gcp1-openshift-cloud-network-config-controller-gcp deleted 2022/10/24 11:30:12 IAM Service account abutcher-gcp-openshift-machine-api-gcp deleted 2022/10/24 11:30:13 IAM Service account abutcher-gcp-openshift-ingress-gcp deleted 2022/10/24 11:30:15 IAM Service account abutcher-gcp-openshift-gcp-pd-csi-driver-operator deleted 2022/10/24 11:30:16 IAM Service account abutcher-gcp1-openshift-ingress-gcp deleted 2022/10/24 11:30:17 IAM Service account abutcher-gcp1-openshift-image-registry-gcs deleted 2022/10/24 11:30:19 IAM Service account abutcher-gcp-cloud-credential-operator-gcp-ro-creds deleted 2022/10/24 11:30:20 IAM Service account abutcher-gcp1-openshift-gcp-pd-csi-driver-operator deleted 2022/10/24 11:30:21 IAM Service account abutcher-gcp1-openshift-gcp-ccm deleted 2022/10/24 11:30:22 IAM Service account abutcher-gcp1-cloud-credential-operator-gcp-ro-creds deleted 2022/10/24 11:30:24 IAM Service account abutcher-gcp1-openshift-machine-api-gcp deleted 2022/10/24 11:30:25 IAM Service account abutcher-gcp-openshift-cloud-network-config-controller-gcp deleted 2022/10/24 11:30:25 Workload identity pool abutcher-gcp deleted
This is a clone of issue OCPBUGS-1677. The following is the description of the original issue:
—
Description of problem:
pkg/devfile/sample_test.go fails after devfile registry was updated (https://github.com/devfile/registry/pull/126)
This issue is about updating our assertion so that the CI job runs successfully again. We might want to backport this as well.
OCPBUGS-1678 is about updating the code that the test should use a mock response instead of the latest registry content OR check some specific attributes instead of comparing the full JSON response.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Always
Steps to Reproduce:
1. Clone openshift/console
2. Run ./test-backend.sh
Actual results:
Unit tests fail
Expected results:
Unit tests should pass again
Additional info:
Description of problem:
Follow-up of: https://issues.redhat.com/browse/SDN-2988
This failure is perma-failing in the e2e-metal-ipi-ovn-dualstack-local-gateway jobs.
Example: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.13-e2e-metal-ipi-ovn-dualstack-local-gateway/1597574181430497280
Search CI: https://search.ci.openshift.org/?search=when+using+openshift+ovn-kubernetes+should+ensure+egressfirewall+is+created&maxAge=336h&context=1&type=junit&name=e2e-metal-ipi-ovn-dualstack-local-gateway&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Sippy: https://sippy.dptools.openshift.org/sippy-ng/jobs/4.13/analysis?filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22equals%22%2C%22value%22%3A%22periodic-ci-openshift-release-master-nightly-4.13-e2e-metal-ipi-ovn-dualstack-local-gateway%22%7D%5D%7D
Version-Release number of selected component (if applicable):
4.12,4.13
How reproducible:
Every time
Steps to Reproduce:
1. Setup dualstack KinD cluster 2. Create egress fw policy with spec Spec: Egress: To: Cidr Selector: 0.0.0.0/0 Type: Deny 3. create a pod and ping to 1.1.1.1
Actual results:
Egress policy does not block flows to external IP
Expected results:
Egress policy blocks flows to external IP
Additional info:
It seems mixing ip4 and ip6 operands in ACL matchs doesnt work
This is a clone of issue OCPBUGS-3114. The following is the description of the original issue:
—
Description of problem:
When running a Hosted Cluster on Hypershift the cluster-networking-operator never progressed to Available despite all the components being up and running
Version-Release number of selected component (if applicable):
quay.io/openshift-release-dev/ocp-release:4.11.11-x86_64 for the hosted clusters hypershift operator is quay.io/hypershift/hypershift-operator:4.11 4.11.9 management cluster
How reproducible:
Happened once
Steps to Reproduce:
1. 2. 3.
Actual results:
oc get co network reports False availability
Expected results:
oc get co network reports True availability
Additional info:
Description of problem:
AWS tagging - when applying user defined tags you cannot add more than 10
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Whereabouts doesn't allow the use of network interface names that are not preceded by the prefix "net", see https://github.com/k8snetworkplumbingwg/whereabouts/issues/130.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Define two Pods, one with the interface name 'port1' and the other with 'net-port1':
test-ip-removal-port1: k8s.v1.cni.cncf.io/networks: [ { "name": "test-sriovnd", "interface": "port1", "namespace": "default" } ] test-ip-removal-net-port1: k8s.v1.cni.cncf.io/networks: [ { "name": "test-sriovnd", "interface": "net-port1", "namespace": "default" } ]
2. IP allocated in the IPPool:
kind: IPPool ... spec: allocations: "16": id: ... podref: test-ecoloma-1/test-ip-removal-port1 "17": id: ... podref: test-ecoloma-1/test-ip-removal-net-port1
3. When the ip-reconciler job is run, the allocation for the port with the interface name 'port1' is removed:
[13:29][]$ oc get cronjob -n openshift-multus
NAME SCHEDULE SUSPEND ACTIVE LAST SCHEDULE AGE
ip-reconciler */15 * * * * False 0 14m 11d
[13:29][]$ oc get ippools.whereabouts.cni.cncf.io -n openshift-multus 2001-1b70-820d-2610---64 -o yaml
apiVersion: whereabouts.cni.cncf.io/v1alpha1
kind: IPPool
metadata:
...
spec:
allocations:
"17":
id: ...
podref: test-ecoloma-1/test-ip-removal-net-port1
range: 2001:1b70:820d:2610::/64
[13:30][]$ oc get cronjob -n openshift-multus
NAME SCHEDULE SUSPEND ACTIVE LAST SCHEDULE AGE
ip-reconciler */15 * * * * False 0 9s 11d
Actual results:
The network interface with a name that doesn't have a 'net' prefix is removed from the ip-reconciler cronjob.
Expected results:
The network interface must not be removed, regardless of the name.
Additional info:
Upstream PR @ https://github.com/k8snetworkplumbingwg/whereabouts/pull/147 master PR @ https://github.com/openshift/whereabouts-cni/pull/94
This bug was initially created as a copy of
Bug #2096605
I am copying this bug because: the parent bug solved the validation aspect of diskType but now the description of diskType in
https://github.com/openshift/installer/blob/master/data/data/install.openshift.io_installconfigs.yaml#L2914-L2923
needs to be updated.
Version: 4.11.0-0.nightly-2022-06-06-201913
Platform: vSphere IPI
What happened?
1. If user inputs an invalid value for platform.vsphere.diskType in install-config.yaml file, there is no validation checking for diskType and doesn't exit with error, but continues the installation, which is not the same behavior as in 4.10.
After all vms are provisioned, I checked that the disk provision type is thick.
2. If user doesn't set platform.vsphere.diskType in install-config.yaml file, the default disk provision type is thick, but not the vSphere default storage policy. On VMC, the default policy is thin, so maybe the description of diskType should also need to be updated.
$ ./openshift-install explain installconfig.platform.vsphere.diskType
KIND: InstallConfig
VERSION: v1
RESOURCE: <string>
Valid Values: "","thin","thick","eagerZeroedThick"
DiskType is the name of the disk provisioning type, valid values are thin, thick, and eagerZeroedThick. When not specified, it will be set according to the default storage policy of vsphere.
What did you expect to happen?
validation for diskType
How to reproduce it (as minimally and precisely as possible)?
set diskType to invalid value in install-config.yaml and install the cluster
Goal
We have several use cases where dynamic plugins need to proxy to another service on the cluster. One example is the Helm plugin. We would like to move the backend code for Helm to a separate service on the cluster, and the Helm plugin could proxy to that service for its requests. This is required to make Helm a dynamic plugin. Similarly if we want to have ACM contribute any views through dynamic plugins, we will need a way for ACM to proxy to its services (e.g., for Search).
It's possible for plugins to make requests to services exposed through routes today, but that has several problems:
Plugins need a way to declare in-cluster services that they need to connect to. The console backend will need to set up proxies to those services on console load. This also requires that the console operator be updated to pass the configuration to the console backend.
This work will apply only to single clusters.
Open Questions
Acceptance Criteria
cc Ali Mobrem [~christianmvogt]
Description of problem:
Provisioning interface on master node not getting ipv4 dhcp ip address from bootstrap dhcp server on OCP 4.10.16 IPI BareMetal install.
Customer is performing an OCP 4.10.16 IPI BareMetal install and bootstrap node provisions just fine, but when master nodes are booted for provisioning, they are not getting an ipv4 address via dhcp. As such, the install is not moving forward at this point.
Version-Release number of selected component (if applicable):
OCP 4.10.16
How reproducible:
Perform OCP 4.10.16 IPI BareMetal install.
Actual results:
provisioning interface comes up (as evidenced by ipv6 address) but is not getting an ipv4 address via dhcp. OCP install / provisioning fails at this point.
Expected results:
provisioning interface successfully received an ipv4 ip address and successfully provisioned master nodes (and subsequently worker nodes as well.)
Additional info:
As a troubleshooting measure, manually adding an ipv4 ip address did allow the coreos image on the bootstrap node to be reached via curl.
Further, the kernel boot line for the first master node was updated for a static ip addresss assignment for further confirmation that the master node would successfully image this way which further confirming that the issue is the provisioning interface not receiving an ipv4 ip address from the dhcp server.
Description of problem:
Users search a resource (for example, Pod) with Name filter applied and input a text to the filter field then the search results filtered accordingly.
Once the results are shown, when the user clear the value in one-shot (i.e. select whole filter text from the field and clear it using delete or backspace key) from the field,
then the search result doesn't clear accordingly and the previous result stays on the page.
Version-Release number of selected components (if applicable):
4.11.0-0.nightly-2022-08-16-194731 & works fine with OCP 4.12 latest version.
How reproducible:
Always
Steps to Reproduce:
Actual results:
Search result doesn't clear when user clears name filter in one-shot for any resources.
Expected results:
Search results should clear when the user clears name filters in one-shot for any resources.
Additional info:
Reproduced in both chrome[103.0.5060.114 (Official Build) (64-bit)] and firefox[91.11.0esr (64-bit)] browsers.
Attached screen share for the same issue. SearchIssues.mp4
Description of problem:
Upgrade OCP 4.11 --> 4.12 fails with one 'NotReady,SchedulingDisabled' node and MachineConfigDaemonFailed.
Version-Release number of selected component (if applicable):
Upgrade from OCP 4.11.0-0.nightly-2022-09-19-214532 on top of OSP RHOS-16.2-RHEL-8-20220804.n.1 to 4.12.0-0.nightly-2022-09-20-040107. Network Type: OVNKubernetes
How reproducible:
Twice out of two attempts.
Steps to Reproduce:
1. Install OCP 4.11.0-0.nightly-2022-09-19-214532 (IPI) on top of OSP RHOS-16.2-RHEL-8-20220804.n.1. The cluster is up and running with three workers: $ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.11.0-0.nightly-2022-09-19-214532 True False 51m Cluster version is 4.11.0-0.nightly-2022-09-19-214532 2. Run the OC command to upgrade to 4.12.0-0.nightly-2022-09-20-040107: $ oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-09-20-040107 --allow-explicit-upgrade --force=true warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures. Requesting update to release image registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-09-20-040107 3. The upgrade is not succeeds: [0] $ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.11.0-0.nightly-2022-09-19-214532 True True 17h Unable to apply 4.12.0-0.nightly-2022-09-20-040107: wait has exceeded 40 minutes for these operators: network One node degrided to 'NotReady,SchedulingDisabled' status: $ oc get nodes NAME STATUS ROLES AGE VERSION ostest-9vllk-master-0 Ready master 19h v1.24.0+07c9eb7 ostest-9vllk-master-1 Ready master 19h v1.24.0+07c9eb7 ostest-9vllk-master-2 Ready master 19h v1.24.0+07c9eb7 ostest-9vllk-worker-0-4x4pt NotReady,SchedulingDisabled worker 18h v1.24.0+3882f8f ostest-9vllk-worker-0-h6kcs Ready worker 18h v1.24.0+3882f8f ostest-9vllk-worker-0-xhz9b Ready worker 18h v1.24.0+3882f8f $ oc get pods -A | grep -v -e Completed -e Running NAMESPACE NAME READY STATUS RESTARTS AGE openshift-openstack-infra coredns-ostest-9vllk-worker-0-4x4pt 0/2 Init:0/1 0 18h $ oc get events LAST SEEN TYPE REASON OBJECT MESSAGE 7m15s Warning OperatorDegraded: MachineConfigDaemonFailed /machine-config Unable to apply 4.12.0-0.nightly-2022-09-20-040107: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [timed out waiting for the condition, daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 5, unavailable: 1)] 7m15s Warning MachineConfigDaemonFailed /machine-config Cluster not available for [{operator 4.11.0-0.nightly-2022-09-19-214532}]: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [timed out waiting for the condition, daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 5, unavailable: 1)] $ oc get co NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE authentication 4.12.0-0.nightly-2022-09-20-040107 True False False 18h baremetal 4.12.0-0.nightly-2022-09-20-040107 True False False 19h cloud-controller-manager 4.12.0-0.nightly-2022-09-20-040107 True False False 19h cloud-credential 4.12.0-0.nightly-2022-09-20-040107 True False False 19h cluster-autoscaler 4.12.0-0.nightly-2022-09-20-040107 True False False 19h config-operator 4.12.0-0.nightly-2022-09-20-040107 True False False 19h console 4.12.0-0.nightly-2022-09-20-040107 True False False 18h control-plane-machine-set 4.12.0-0.nightly-2022-09-20-040107 True False False 17h csi-snapshot-controller 4.12.0-0.nightly-2022-09-20-040107 True False False 19h dns 4.12.0-0.nightly-2022-09-20-040107 True True False 19h DNS "default" reports Progressing=True: "Have 5 available node-resolver pods, want 6." etcd 4.12.0-0.nightly-2022-09-20-040107 True False False 19h image-registry 4.12.0-0.nightly-2022-09-20-040107 True True False 18h Progressing: The registry is ready... ingress 4.12.0-0.nightly-2022-09-20-040107 True False False 18h insights 4.12.0-0.nightly-2022-09-20-040107 True False False 19h kube-apiserver 4.12.0-0.nightly-2022-09-20-040107 True True False 18h NodeInstallerProgressing: 1 nodes are at revision 11; 2 nodes are at revision 13 kube-controller-manager 4.12.0-0.nightly-2022-09-20-040107 True False False 19h kube-scheduler 4.12.0-0.nightly-2022-09-20-040107 True False False 19h kube-storage-version-migrator 4.12.0-0.nightly-2022-09-20-040107 True False False 19h machine-api 4.12.0-0.nightly-2022-09-20-040107 True False False 19h machine-approver 4.12.0-0.nightly-2022-09-20-040107 True False False 19h machine-config 4.11.0-0.nightly-2022-09-19-214532 False True True 16h Cluster not available for [{operator 4.11.0-0.nightly-2022-09-19-214532}]: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [timed out waiting for the condition, daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 5, unavailable: 1)] marketplace 4.12.0-0.nightly-2022-09-20-040107 True False False 19h monitoring 4.12.0-0.nightly-2022-09-20-040107 True False False 18h network 4.12.0-0.nightly-2022-09-20-040107 True True True 19h DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - last change 2022-09-20T14:16:13Z... node-tuning 4.12.0-0.nightly-2022-09-20-040107 True False False 17h openshift-apiserver 4.12.0-0.nightly-2022-09-20-040107 True False False 18h openshift-controller-manager 4.12.0-0.nightly-2022-09-20-040107 True False False 17h openshift-samples 4.12.0-0.nightly-2022-09-20-040107 True False False 17h operator-lifecycle-manager 4.12.0-0.nightly-2022-09-20-040107 True False False 19h operator-lifecycle-manager-catalog 4.12.0-0.nightly-2022-09-20-040107 True False False 19h operator-lifecycle-manager-packageserver 4.12.0-0.nightly-2022-09-20-040107 True False False 19h service-ca 4.12.0-0.nightly-2022-09-20-040107 True False False 19h storage 4.12.0-0.nightly-2022-09-20-040107 True True False 19h ManilaCSIDriverOperatorCRProgressing: ManilaDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods... [0] http://pastebin.test.redhat.com/1074531
Actual results:
OCP 4.11 --> 4.12 upgrade fails.
Expected results:
OCP 4.11 --> 4.12 upgrade success.
Additional info:
Attached logs of the NotReady node - [^journalctl_ostest-9vllk-worker-0-4x4pt.log.tar.gz]
we need to make sure that the ironic containers use the latest available bugfix versions
This is a clone of issue OCPBUGS-6766. The following is the description of the original issue:
—
This is a clone of https://bugzilla.redhat.com/show_bug.cgi?id=2083087 (OCPBUGSM-44070) to backport this issue.
Description of problem:
"Delete dependent objects of this resource" is a bit of confusing for some users because when creating the Application in Dev console not only the deployment but also IS, route, svc, secret objects will be created as well. When deleting the Application (in fact it is deployment), there is an option called "Delete dependent objects of this resource" and some users might think this means the IS, route, svc and any other objects which are created alongside with the deployment will be deleted as well
Version-Release number of selected component (if applicable):
4.8
How reproducible:
Always
Steps to Reproduce:
1. Create Application in Dev console
2. Delete the deployment
3. Check "Delete dependent objects of this resource"
Actual results:
Only deployment will be deleted and IS, svc, route will not be deleted
Expected results:
We either change the description of this option, or we really delete IS, svc, route and any other objects created under this Application.
Additional info:
This is a clone of issue OCPBUGS-2181. The following is the description of the original issue:
—
Description of problem:
E2E test Installs Red Hat Integration - 3scale operator test is failing due to change of Operator name
While running a PerfScale test we noticed that the hosted ovnkube-master pods always initially error on deployment. They eventually succeed on retry however.
This is running quay.io/openshift-release-dev/ocp-release:4.11.11-x86_64 for the hosted clusters and the hypershift operator is quay.io/hypershift/hypershift-operator:4.11 on a 4.11.9 management cluster.
An example of the error in the ovnkube-master container:
```
F1102 13:27:51.935600 1 ovnkube.go:133] error when trying to initialize libovsdb SB client: unable to connect to any endpoints: failed to connect to ssl:ovnkube-master-0.ovnkube-master-internal.clusters-perf-pqd-0021.svc.cluster.local:9642: failed to open connection: dial tcp 10.131.8.25:9642: connect: connection refused. failed to connect to ssl:ovnkube-master-1.ovnkube-master-internal.clusters-perf-pqd-0021.svc.cluste
```
This is a clone of issue OCPBUGS-4696. The following is the description of the original issue:
—
Description of problem:
metal3 pod does not come up on SNO when creating Provisioning with provisioningNetwork set to Disabled The issue is that on SNO, there is no Machine, and no BareMetalHost, it is looking of Machine objects to populate the provisioningMacAddresses field. However, when provisioningNetwork is Disabled, provisioningMacAddresses is not used anyway. You can work around this issue by populating provisioningMacAddresses with a dummy address, like this: kind: Provisioning metadata: name: provisioning-configuration spec: provisioningMacAddresses: - aa:aa:aa:aa:aa:aa provisioningNetwork: Disabled watchAllNamespaces: true
Version-Release number of selected component (if applicable):
4.11.17
How reproducible:
Try to bring up Provisioning on SNO in 4.11.17 with provisioningNetwork set to Disabled apiVersion: metal3.io/v1alpha1 kind: Provisioning metadata: name: provisioning-configuration spec: provisioningNetwork: Disabled watchAllNamespaces: true
Steps to Reproduce:
1. 2. 3.
Actual results:
controller/provisioning "msg"="Reconciler error" "error"="machines with cluster-api-machine-role=master not found" "name"="provisioning-configuration" "namespace"="" "reconciler group"="metal3.io" "reconciler kind"="Provisioning"
Expected results:
metal3 pod should be deployed
Additional info:
This issue is a result of this change: https://github.com/openshift/cluster-baremetal-operator/pull/307 See this Slack thread: https://coreos.slack.com/archives/CFP6ST0A3/p1670530729168599
This is a clone of issue OCPBUGS-5191. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-5164. The following is the description of the original issue:
—
Description of problem:
It looks like the ODC doesn't register KNATIVE_SERVING and KNATIVE_EVENTING flags. Those are based on KnativeServing and KnativeEventing CRs, but they are looking for v1alpha1 version of those: https://github.com/openshift/console/blob/f72519fdf2267ad91cc0aa51467113cc36423a49/frontend/packages/knative-plugin/console-extensions.json#L6-L8
This PR https://github.com/openshift-knative/serverless-operator/pull/1695 moved the CRs to v1beta1, and that breaks that ODC discovery.
Version-Release number of selected component (if applicable):
Openshift 4.8, Serverless Operator 1.27
Additional info:
https://coreos.slack.com/archives/CHGU4P8UU/p1671634903447019
Description of problem:
When a pod runs to a completed state, we typically rely on the update event that will indicate to us that this pod is completed. At that point the pod IP is released and the port configuration is removed in OVN. The subsequent delete event for this pod will be ignored because it should have been cleaned up in the previous update. However, there can be cases where the update event is missed with pod completed. In this case we will only receive a delete with pod completed event, and ignore tearing down the pod. The end result is the pod is not cleaned up in OVN and the IP address remains allocated, reducing the amount of address range available to launch another pod. This can lead to exhausting all IP addresses available for pod allocation on a node.
Version-Release number of selected component (if applicable):
4.10.24
How reproducible:
Not sure how to reproduce this. I'm guessing some lag in kapi updates can cause the completed update event and the final delete event to be combined into a single event.
Steps to Reproduce:
1. 2. 3.
Actual results:
Port still exists in OVN, IP remains allocated for a deleted pod.
Expected results:
IP should be freed, port should be removed from OVN.
Additional info:
Node healthz server was added in 4.13 with https://github.com/openshift/ovn-kubernetes/commit/c8489e3ff9c321e77f265dc9d484ed2549df4a6b and https://github.com/openshift/ovn-kubernetes/commit/9a836e3a547f3464d433ce8b9eef336624d51858. We need to configure it by default on 0.0.0.0:10256 on CNO for ovnk, just like we do for sdn.
Description of problem: Backport owners for 4.11 (only)
This is a clone of issue OCPBUGS-533. The following is the description of the original issue:
—
Description of problem:
customer is using Azure AD as openid provider and groups synchronization from the provider.
The scenario is the following:
1)
2)
3)
The groups memberships are the same in step 2 and 3.
The cluster role bindings of the groups have never changed.
the only way to have user A again the admin rights is to delete the membership from the group and have user A login again.
I have not managed to reproduce this using RH SSO. Neither Azure AD.
But my configuration is not exactly the same yet.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-3235. The following is the description of the original issue:
—
Frequently we see the loading state of the topology view, even when there aren't many resources in the project.
Including an example
topology will sometimes hang with the loading indicator showing indefinitely
topology should load consistently without fail
intermittent
4.9
Description of problem:
With every pod update we are executing a mutate operation to add the pod port to the port group or add the pod IP to an address set. This functionally doesn't hurt, since mutate will not add duplicate values to the same set. However, this is bad for performance. For example, with a 730 network policies affecting a pod, and issuing 7 pod updates would result in over 5k transactions.
During initial backporting, due to a number of other colliding commits in upstream, the cobra commands facilitating caching did not get downstreamed.
This is to downstream those two lines.
Description of problem:
intra namespace allow network policy doesn't work after applying ingress&egress deny all network policy
Version-Release number of selected component (if applicable):
OpenShift 4.10.12
How reproducible:
Always
Steps to Reproduce:
1. Define deny all network policy for egress an ingress in a namespace:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-all
spec:
podSelector: {}
policyTypes:
- Ingress
- Egress
2. Define the following network policy to allow the traffic between the pods in the namespace:
apiVersion: networking.k8s.io/v1 kind: NetworkPolicy metadata: name: allow-intra-namespace-001 spec: egress: - to: - podSelector: {} ingress: - from: - podSelector: {} podSelector: {} policyTypes: - Ingress - Egress
3. Test the connectivity between two pods from the namespace.
Actual results:
The connectivity is not allowed
Expected results:
The connectivity should be allowed between pods from the same namespace.
Additional info:
After performing a test and analyzing SDN flows for the namespace:
sh-4.4# ovs-ofctl dump-flows -O OpenFlow13 br0 | grep --color 0x964376 cookie=0x0, duration=99375.342s, table=20, n_packets=14, n_bytes=588, priority=100,arp,in_port=21,arp_spa=10.128.2.20,arp_sha=00:00:0a:80:02:14/00:00:ff:ff:ff:ff actions=load:0x964376->NXM_NX_REG0[],goto_table:30 cookie=0x0, duration=1681.845s, table=20, n_packets=11, n_bytes=462, priority=100,arp,in_port=24,arp_spa=10.128.2.23,arp_sha=00:00:0a:80:02:17/00:00:ff:ff:ff:ff actions=load:0x964376->NXM_NX_REG0[],goto_table:30 cookie=0x0, duration=99375.342s, table=20, n_packets=135610, n_bytes=759239814, priority=100,ip,in_port=21,nw_src=10.128.2.20 actions=load:0x964376->NXM_NX_REG0[],goto_table:27 cookie=0x0, duration=1681.845s, table=20, n_packets=2006, n_bytes=12684967, priority=100,ip,in_port=24,nw_src=10.128.2.23 actions=load:0x964376->NXM_NX_REG0[],goto_table:27 cookie=0x0, duration=99375.342s, table=25, n_packets=0, n_bytes=0, priority=100,ip,nw_src=10.128.2.20 actions=load:0x964376->NXM_NX_REG0[],goto_table:27 cookie=0x0, duration=1681.845s, table=25, n_packets=0, n_bytes=0, priority=100,ip,nw_src=10.128.2.23 actions=load:0x964376->NXM_NX_REG0[],goto_table:27 cookie=0x0, duration=975.129s, table=27, n_packets=0, n_bytes=0, priority=150,reg0=0x964376,reg1=0x964376 actions=goto_table:30 cookie=0x0, duration=99375.342s, table=70, n_packets=145260, n_bytes=11722173, priority=100,ip,nw_dst=10.128.2.20 actions=load:0x964376->NXM_NX_REG1[],load:0x15->NXM_NX_REG2[],goto_table:80 cookie=0x0, duration=1681.845s, table=70, n_packets=2336, n_bytes=191079, priority=100,ip,nw_dst=10.128.2.23 actions=load:0x964376->NXM_NX_REG1[],load:0x18->NXM_NX_REG2[],goto_table:80 cookie=0x0, duration=975.129s, table=80, n_packets=0, n_bytes=0, priority=150,reg0=0x964376,reg1=0x964376 actions=output:NXM_NX_REG2[]
We see that the following rule doesn't match because `reg1` hasn't been defined:
cookie=0x0, duration=975.129s, table=27, n_packets=0, n_bytes=0, priority=150,reg0=0x964376,reg1=0x964376 actions=goto_table:30
This is a clone of issue OCPBUGS-501. The following is the description of the original issue:
—
Description of problem:
Version-Release number of selected component (if applicable): 4.10.16
How reproducible: Always
Steps to Reproduce:
1. Edit the apiserver resource and add spec.audit.customRules field
$ oc get apiserver cluster -o yaml
spec:
audit:
customRules:
2. Allow the kube-apiserver pods to rollout new revision.
3. Once the kube-apiserver pods are in new revision execute $ oc get dc
Actual results:
Error from server (InternalError): an error on the server ("This request caused apiserver to panic. Look in the logs for details.") has prevented the request from succeeding (get deploymentconfigs.apps.openshift.io)
Expected results: The command "oc get dc" should display the deploymentconfig without any error.
Additional info:
Description of problem:
The alertmanager pod is stuck on OCP 4.11 with OVN in container Creating State From oc describe alertmanager pod: ... Events: Type Reason Age From Message ---- ------ ---- ---- ------- Warning FailedCreatePodSandBox 16s (x459 over 17h) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_alertmanager-managed-ocs-alertmanager-0_openshift-storage_3a55ed54-4eaa-4f65-8a10-e5d21fad1ebc_0(88575547dc0b210307b89dd2bb8e379ece0962b607ac2707a1c2cf630b1aaa78): error adding pod openshift-storage_alertmanager-managed-ocs-alertmanager-0 to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [openshift-storage/alertmanager-managed-ocs-alertmanager-0/3a55ed54-4eaa-4f65-8a10-e5d21fad1ebc:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-storage/alertmanager-managed-ocs-alertmanager-0 88575547dc0b210307b89dd2bb8e379ece0962b607ac2707a1c2cf630b1aaa78] [openshift
Version-Release number of selected component (if applicable):
OCP 4.11 with OVN
How reproducible:
100%
Steps to Reproduce:
1. Terminate the node on which alertmanager pod is running 2. pod will get stuck in container Creating state 3.
Actual results:
AlertManager pod is stuck in container Creating state
Expected results:
Alertmanager pod is ready
Additional info:
The workaround would be to terminate the alertmanager pod
Description of problem:
In order to understand what is going on with OCPBUGS-5379 we want to add more logs
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-6850. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-6503. The following is the description of the original issue:
—
Description of problem:
While looking into OCPBUGS-5505 I discovered that some 4.10->4.11 upgrade job runs perform an Admin Ack check, while some do not. 4.11 has a ack-4.11-kube-1.25-api-removals-in-4.12 gate, so these upgrade jobs sometimes test that Upgradeable goes false after the ugprade, and sometimes they do not. This is only determined by the polling race condition: the check is executed once per 10 minutes, and we cancel the polling after upgrade is completed. This means that in some cases we are lucky and manage to run one check before the cancel, and sometimes we are not and only check while still on the base version.
Example job that checked admin acks post-upgrade:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-cluster-version-operator-880-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1611444032104304640
$ curl --silent https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/openshift-cluster-version-operator-880-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1611444032104304640/artifacts/e2e-azure-upgrade/openshift-e2e-test/artifacts/e2e.log | grep 'Waiting for Upgradeable to be AdminAckRequired' Jan 6 21:16:40.153: INFO: Waiting for Upgradeable to be AdminAckRequired ...
Example job that did not check admin acks post-upgrade:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-cluster-version-operator-880-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1611444033509396480
$ curl --silent https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/openshift-cluster-version-operator-880-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1611444033509396480/artifacts/e2e-azure-upgrade/openshift-e2e-test/artifacts/e2e.log | grep 'Waiting for Upgradeable to be AdminAckRequired'
Version-Release number of selected component (if applicable):
4.11+ openshift-tests
How reproducible:
nondeterministic, wild guess is ~30% of upgrade jobs
Steps to Reproduce:
1. Inspect the E2E test log of an upgrade jobs and compare the time of the update ("Completed upgrade") with the time of the last check ( "Skipping admin ack", "Gate .* not applicable to current version", "Admin Ack verified') done by the admin ack test
Actual results:
Jan 23 00:47:43.842: INFO: Admin Ack verified Jan 23 00:57:43.836: INFO: Admin Ack verified Jan 23 01:07:43.839: INFO: Admin Ack verified Jan 23 01:17:33.474: INFO: Completed upgrade to registry.build01.ci.openshift.org/ci-op-z09ll8fw/release@sha256:322cf67dc00dd6fa4fdd25c3530e4e75800f6306bd86c4ad1418c92770d58ab8
No check done after the upgrade
Expected results:
Jan 23 00:57:37.894: INFO: Admin Ack verified Jan 23 01:07:37.894: INFO: Admin Ack verified Jan 23 01:16:43.618: INFO: Completed upgrade to registry.build01.ci.openshift.org/ci-op-z8h5x1c5/release@sha256:9c4c732a0b4c2ae887c73b35685e52146518e5d2b06726465d99e6a83ccfee8d Jan 23 01:17:57.937: INFO: Admin Ack verified
One or more checks done after upgrade
This is a clone of issue OCPBUGS-3022. The following is the description of the original issue:
—
Description of problem:
** europe-west8 ** europe-west9 ** europe-southwest1 ** southamerica-west1 ** us-east5 ** us-south1 are not listed in the survey
Steps to Reproduce:
1. run survey (openshift-install create install-config without an install config file) 2. go through prompts until regions 3.
Actual results:
listed regions are missing
Expected results:
regions above are listed (and install succeeds in the region)
Description of problem:
For some reason, the LSP of a pod is not properly added to the port group where the ACL of a NetworkPolicy is applied. This results on the networkpolicy not being applied to the pod and communication not possible.
Version-Release number of selected component (if applicable):
4.10
How reproducible:
Always with a concrete pod at customer environment.
Steps to Reproduce:
(not known exactly yet)
Actual results:
LSP not in port group. ACL not applied. Netpol not in effect.
Expected results:
LSP in port group. ACL applied. Netpol in effect.
Additional info:
Details in private comments, as they involve sensitive data. Deleting the pod does nothing, but it is possible that this has something to do with the pod being recreated with the same name (although the LSPs UUIDs are different in each incarnation).
Description of problem:
During restart egress firewall acls will be deleted and re-created from scratch, meaning that egress firewall rules won't be applied for some time during restart
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-4851. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-4850. The following is the description of the original issue:
—
Description of problem:
Kuryr might take a while to create Pods because it has to create Neutron ports for the pods. If a pod gets deleted while this is being processed, a warning Event will be generated causing the "[sig-network] pods should successfully create sandboxes by adding pod to network" to fail.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The relevant code in ironic-image was not updated to support TLS, so it still uses the old port and explicit http://
This is a clone of issue OCPBUGS-2895. The following is the description of the original issue:
—
Description of problem:
Current validation will not accept Resource Groups or DiskEncryptionSets which have upper-case letters.
Version-Release number of selected component (if applicable):
4.11
How reproducible:
Attempt to create a cluster/machineset using a DiskEncryptionSet with an RG or Name with upper-case letters
Steps to Reproduce:
1. Create cluster with DiskEncryptionSet with upper-case letters in DES name or in Resource Group name
Actual results:
See error message: encountered error: [controlPlane.platform.azure.defaultMachinePlatform.osDisk.diskEncryptionSet.resourceGroup: Invalid value: \"v4-e2e-V62447568-eastus\": invalid resource group format, compute[0].platform.azure.defaultMachinePlatform.osDisk.diskEncryptionSet.resourceGroup: Invalid value: \"v4-e2e-V62447568-eastus\": invalid resource group format]
Expected results:
Create a cluster/machineset using the existing and valid DiskEncryptionSet
Additional info:
I have submitted a PR for this already, but it needs to be reviewed and backported to 4.11: https://github.com/openshift/installer/pull/6513
Description of problem:
To address: 'Static Pod is managed but errored" err="managed container xxx does not have Resource.Requests'
Version-Release number of selected component (if applicable):
4.11
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Already merged in https://github.com/openshift/cluster-kube-controller-manager-operator/pull/660
This is a clone of issue OCPBUGS-5879. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-5505. The following is the description of the original issue:
—
The upgradeability check in CVO is throttled (essentially cached) for a nondeterministic period of time, same as the minimal sync period computed at runtime. The period can be up to 4 minutes, determined at CVO start time as 2minutes * (0..1 + 1). We agreed with Trevor that such throttling is unnecessarily aggressive (the check is not that expensive). It also causes CI flakes, because the matching test only has 3 minutes timeout. Additionally, the non-determinism and longer throttling results makes UX worse by actions done in the cluster may have their observable effect delayed.
discovered in 4.10 -> 4.11 upgrade jobs
The test seems to flake ~10% of 4.10->4.11 Azure jobs (sippy). There does not seem to be that much impact on non-Azure jobs though which is a bit weird.
Inspect the CVO log and E2E logs from failing jobs with the provided [^check-cvo.py] helper:
$ ./check-cvo.py cvo.log && echo PASS || echo FAIL
Preferably, inspect CVO logs of clusters that just underwent an upgrade (upgrades makes the original problematic behavior more likely to surface)
$ ./check-cvo.py openshift-cluster-version_cluster-version-operator-5b6966c474-g4kwk_cluster-version-operator.log && echo PASS || echo FAIL FAIL: Cache hit at 11:59:55.332339 0:03:13.665006 after check at 11:56:41.667333 FAIL: Cache hit at 12:06:22.663215 0:03:13.664964 after check at 12:03:08.998251 FAIL: Cache hit at 12:12:49.997119 0:03:13.665598 after check at 12:09:36.331521 FAIL: Cache hit at 12:19:17.328510 0:03:13.664906 after check at 12:16:03.663604 FAIL: Cache hit at 12:25:44.662290 0:03:13.666759 after check at 12:22:30.995531 Upgradeability checks: 5 Upgradeability check cache hits: 12 FAIL
Note that the bug is probabilistic, so not all unfixed clusters will exhibit the behavior. My guess of the incidence rate is about 30-40%.
$ ./check-cvo.py openshift-cluster-version_cluster-version-operator-7b8f85d455-mk9fs_cluster-version-operator.log && echo PASS || echo FAIL Upgradeability checks: 12 Upgradeability check cache hits: 11 PASS
The actual numbers are not relevant (unless the upgradeabilily check count is zero, which means the test is not conclusive, the script warns about that), lack of failure is.
$ curl --silent https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1607602927633960960/artifacts/e2e-azure-upgrade/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-7b7d4b5bbd-zjqdt_cluster-version-operator.log | grep upgradeable.go ... I1227 06:50:59.023190 1 upgradeable.go:122] Cluster current version=4.10.46 I1227 06:50:59.042735 1 upgradeable.go:42] Upgradeable conditions were recently checked, will try later. I1227 06:51:14.024345 1 upgradeable.go:42] Upgradeable conditions were recently checked, will try later. I1227 06:53:23.080768 1 upgradeable.go:42] Upgradeable conditions were recently checked, will try later. I1227 06:56:59.366010 1 upgradeable.go:122] Cluster current version=4.11.0-0.ci-2022-12-26-193640 $ curl --silent https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1607602927633960960/artifacts/e2e-azure-upgrade/openshift-e2e-test/artifacts/e2e.log | grep 'Kubernetes 1.25 and therefore OpenShift 4.12' Dec 27 06:51:15.319: INFO: Waiting for Upgradeable to be AdminAckRequired for "Kubernetes 1.25 and therefore OpenShift 4.12 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/6955381 for details and instructions." ... Dec 27 06:54:15.413: FAIL: Error while waiting for Upgradeable to complain about AdminAckRequired with message "Kubernetes 1.25 and therefore OpenShift 4.12 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/6955381 for details and instructions.": timed out waiting for the condition
The test passes. Also, the "Upgradeable conditions were recently checked, will try later." messages in CVO logs should never occur after a deterministic, short amount of time (I propose 1 minute) after upgradeability was checked.
I tested the throttling period in https://github.com/openshift/cluster-version-operator/pull/880. With the period of 15m, the test passrate was 4 of 9. Wiht the period of 1m, the test did not fail at all.
Some context in Slack thread
This is a clone of issue OCPBUGS-10622. The following is the description of the original issue:
—
Description of problem:
Unit test failing === RUN TestNewAppRunAll/app_generation_using_context_dir newapp_test.go:907: app generation using context dir: Error mismatch! Expected <nil>, got supplied context directory '2.0/test/rack-test-app' does not exist in 'https://github.com/openshift/sti-ruby' --- FAIL: TestNewAppRunAll/app_generation_using_context_dir (0.61s)
Version-Release number of selected component (if applicable):
How reproducible:
100
Steps to Reproduce:
see for example https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_oc/1376/pull-ci-openshift-oc-master-images/1638172620648091648
Actual results:
unit tests fail
Expected results:
TestNewAppRunAll unit test should pass
Additional info:
Since 4.11 OCP comes with OperatorHub definition which declares a capability
and enables all catalog sources. For OKD we want to enable just community-operators
as users may not have Red Hat pull secret set.
This commit would ensure that OKD version of marketplace operator gets
its own OperatorHub manifest with a custom set of operator catalogs enabled
This is a clone of issue OCPBUGS-3021. The following is the description of the original issue:
—
Description of problem:
me-west1 is not listed in the survey
Steps to Reproduce:
1. run survey (openshift-install create install-config without an install config file) 2. go through prompts until regions 3.
Actual results:
me-west1 region is missing
Expected results:
me-west1 region is listed (and install succeeds in the region)
Description of problem:
Version-Release number of selected component (if applicable):
4.11
How reproducible:
Always
Steps to Reproduce:
1. Enable UWM + dedicated UWM Alertmanager
2. Deploy an application + service monitor + alerting rule which fires always
3. Go to the OCP dev console and silence the alert.
Actual results:
Nothing happens
Expected results:
The alert notification is muted.
Additional info:
Copied from https://bugzilla.redhat.com/show_bug.cgi?id=2100860
Add a Makefile rule in CMO to execute all the different rule that are used for verification and validation. Currenctly, some of them might not be at the right place, for example `check-assets` which is part of `generate` despite not being responsible of any generation. https://github.com/openshift/cluster-monitoring-operator/pull/1151/files#r629371735
DoD:
Description of problem:
Availability Set will be created when vmSize is invalid in a region which has zones, but Availability Set should only be created in a region which don’t have zones.
Version-Release number of selected component (if applicable):
4.11.0-0.nightly-2022-10-07-174524 4.10.0-0.nightly-2022-10-07-205844
How reproducible:
Always
Steps to Reproduce:
1.Set up a cluster in a region which has zones. liuhuali@Lius-MacBook-Pro huali-test % oc get machine NAME PHASE TYPE REGION ZONE AGE huliu-az410-99qcm-master-0 Running Standard_D8s_v3 eastus 2 34m huliu-az410-99qcm-master-1 Running Standard_D8s_v3 eastus 3 34m huliu-az410-99qcm-master-2 Running Standard_D8s_v3 eastus 1 34m huliu-az410-99qcm-worker-eastus1-xld58 Running Standard_D4s_v3 eastus 1 27m huliu-az410-99qcm-worker-eastus2-chzg8 Running Standard_D4s_v3 eastus 2 27m huliu-az410-99qcm-worker-eastus3-7g2mw Running Standard_D4s_v3 eastus 3 27m 2.Create a machineset with invalid vmSize liuhuali@Lius-MacBook-Pro huali-test % oc create -f ms4.yaml machineset.machine.openshift.io/huliu-az410-99qcm-1 created liuhuali@Lius-MacBook-Pro huali-test % oc get machine NAME PHASE TYPE REGION ZONE AGE huliu-az410-99qcm-1-cfw6w Failed 8s huliu-az410-99qcm-master-0 Running Standard_D8s_v3 eastus 2 35m huliu-az410-99qcm-master-1 Running Standard_D8s_v3 eastus 3 35m huliu-az410-99qcm-master-2 Running Standard_D8s_v3 eastus 1 35m huliu-az410-99qcm-worker-eastus1-xld58 Running Standard_D4s_v3 eastus 1 28m huliu-az410-99qcm-worker-eastus2-chzg8 Running Standard_D4s_v3 eastus 2 28m huliu-az410-99qcm-worker-eastus3-7g2mw Running Standard_D4s_v3 eastus 3 28m liuhuali@Lius-MacBook-Pro huali-test % oc get machine huliu-az410-99qcm-1-cfw6w -o yaml apiVersion: machine.openshift.io/v1beta1 kind: Machine metadata: annotations: machine.openshift.io/instance-state: Unknown creationTimestamp: "2022-10-08T07:42:28Z" finalizers: - machine.machine.openshift.io generateName: huliu-az410-99qcm-1- generation: 2 labels: machine.openshift.io/cluster-api-cluster: huliu-az410-99qcm machine.openshift.io/cluster-api-machine-role: worker machine.openshift.io/cluster-api-machine-type: worker machine.openshift.io/cluster-api-machineset: huliu-az410-99qcm-1 name: huliu-az410-99qcm-1-cfw6w namespace: openshift-machine-api ownerReferences: - apiVersion: machine.openshift.io/v1beta1 blockOwnerDeletion: true controller: true kind: MachineSet name: huliu-az410-99qcm-1 uid: bf8f7518-1fa9-4704-bdd7-6d0fde54e38e resourceVersion: "31287" uid: 303cf672-a2fa-44f3-8793-59801bb78902 spec: lifecycleHooks: {} metadata: {} providerSpec: value: apiVersion: machine.openshift.io/v1beta1 credentialsSecret: name: azure-cloud-credentials namespace: openshift-machine-api image: offer: "" publisher: "" resourceID: /resourceGroups/huliu-az410-99qcm-rg/providers/Microsoft.Compute/images/huliu-az410-99qcm sku: "" version: "" kind: AzureMachineProviderSpec location: eastus managedIdentity: huliu-az410-99qcm-identity metadata: creationTimestamp: null name: huliu-az410-99qcm networkResourceGroup: huliu-az410-99qcm-rg osDisk: diskSettings: {} diskSizeGB: 128 managedDisk: storageAccountType: Premium_LRS osType: Linux publicIP: false publicLoadBalancer: huliu-az410-99qcm resourceGroup: huliu-az410-99qcm-rg spotVMOptions: {} subnet: huliu-az410-99qcm-worker-subnet userDataSecret: name: worker-user-data vmSize: invalidStandard_D4s_v3 vnet: huliu-az410-99qcm-vnet zone: "3" status: conditions: - lastTransitionTime: "2022-10-08T07:42:28Z" status: "True" type: Drainable - lastTransitionTime: "2022-10-08T07:42:28Z" message: Instance has not been created reason: InstanceNotCreated severity: Warning status: "False" type: InstanceExists - lastTransitionTime: "2022-10-08T07:42:28Z" status: "True" type: Terminable errorMessage: 'failed to reconcile machine "huliu-az410-99qcm-1-cfw6w": failed to create vm huliu-az410-99qcm-1-cfw6w: failure sending request for machine huliu-az410-99qcm-1-cfw6w: cannot create vm: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Code="BadRequest" Message="Virtual Machine cannot be created because both Availability Zone and Availability Set were specified. Deploying an Availability Set to an Availability Zone isn’t supported."' errorReason: InvalidConfiguration lastUpdated: "2022-10-08T07:42:35Z" phase: Failed providerStatus: conditions: - lastProbeTime: "2022-10-08T07:42:35Z" lastTransitionTime: "2022-10-08T07:42:35Z" message: 'failed to create vm huliu-az410-99qcm-1-cfw6w: failure sending request for machine huliu-az410-99qcm-1-cfw6w: cannot create vm: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Code="BadRequest" Message="Virtual Machine cannot be created because both Availability Zone and Availability Set were specified. Deploying an Availability Set to an Availability Zone isn’t supported."' reason: MachineCreationFailed status: "True" type: MachineCreated metadata: {}
Actual results:
Created Availability Set for it.
Expected results:
Should not create Availability Set, as the region has zones.
Additional info:
If provided correct vmSize, the machine get Running and will not create Availability Set for it. Not sure why it will create Availability Set for it when vmSize is invalid. The issue can be reproduced both on 4.11 and 4.10 version, as Availability Set is introduced in 4.10. On 4.12, there is bug https://issues.redhat.com/browse/OCPBUGS-1871, will also check this on 4.12 when this bug get verified.
Description of problem:
To address: 'Static Pod is managed but errored" err="managed container xxx does not have Resource.Requests'
Version-Release number of selected component (if applicable):
4.11
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Already merged in https://github.com/openshift/cluster-kube-apiserver-operator/pull/1398
This is a clone of issue OCPBUGS-1130. The following is the description of the original issue:
—
The test results in sippy look really bad on our less common platforms, but still pretty unacceptable even on core clouds. It's reasonably often the only test that fails. We need to decide what to do here, and we're going to need input from the etcd team.
As of Sep 13th:
Even on some major variant combos, a 4-8% failure rate is too high.
On Sep 13 arch call (no etcd present), Damien mentioned this might be an upstream alert that just isn't well suited for OpenShift's use cases, is this the case and it needs tuning?
Has the problem been getting worse?
I believe this link https://datastudio.google.com/s/urkKwmmzvgo indicates that this may be the case for 4.12, AWS and Azure are both getting worse in ways that I don't see if we change the release to 4.11 where it looks consistent. gcp seems fine on 4.12. We do not have data for vsphere for some reason.
This link shows the grpc_methods most commonly involved: https://search.ci.openshift.org/?search=etcdGRPCRequestsSlow+was+at+or+above&maxAge=48h&context=7&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
At a glance: LeaseGrant, MemberList, Txn, Status, Range.
Broken out of TRT-401
For linking with sippy:
[bz-etcd][invariant] alert/etcdGRPCRequestsSlow should not be at or above info
[sig-arch][bz-etcd][Late] Alerts alert/etcdGRPCRequestsSlow should not be at or above info [Suite:openshift/conformance/parallel]
Description of problem:
Setting a telemeter proxy in the cluster-monitoring-config config map does not work as expected
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
the following KCS details steps to add a proxy.
The steps have been verified at 4.7 but do not work at 4.8, 4.9 or 4.10
https://access.redhat.com/solutions/6172402
When testing at 4.8, 4.9 and 4.10 the proxy setting where also nested under `telemeterClient`
which triggered a telemeter restart but the proxy setting do not get set in the deployment as they do in 4.7
Actual results:
4.8, 4.9 and 4.10 without the nested `telemeterClient`
does not trigger a restart of the telemeter pod
Expected results:
I think the proxy setting should be nested under telemeterClient
but should set the environment variables in the deployment
Additional info:
This is a backport of https://bugzilla.redhat.com/show_bug.cgi?id=2116382 from 4.12 to 4.11.z. Creating manually because as seen in https://github.com/openshift/cluster-monitoring-operator/pull/1743 `/cherry-pick` doesn't work for bugs originally created in bugzilla
Backport clone of https://issues.redhat.com/browse/OCPBUGSM-24281
openshift-4 tracking bug for telemeter-container: see the bugs linked in the "Blocks" field of this bug for full details of the security issue(s).
This bug is never intended to be made public, please put any public notes in the blocked bugs.
Impact: Moderate
Public Date: 11-Jan-2021
PM Fix/Wontfix Decision By: 04-May-2021
Resolve Bug By: 11-Jan-2022
In case the dates above are already past, please evaluate this bug in your next prioritization review and make a decision then. Remember to explicitly set CLOSED:WONTFIX if you decide not to fix this bug.
Please see the Security Errata Policy for further details: https://docs.engineering.redhat.com/x/9RBqB
This is a clone of issue OCPBUGS-1717. The following is the description of the original issue:
—
Description of problem:
Image registry pods panic while deploying OCP in me-central-1 AWS region
Version-Release number of selected component (if applicable):
4.11.2
How reproducible:
Deploy OCP in AWS me-central-1 region
Steps to Reproduce:
Deploy OCP in AWS me-central-1 region
Actual results:
panic: Invalid region provided: me-central-1
Expected results:
Image registry pods should come up with no errors
Additional info:
This is a clone of issue OCPBUGS-1549. The following is the description of the original issue:
—
Description of problem:
The cluster-dns-operator does not reconcile the openshift-dns namespace, which has been exposed as an issue in 4.12 due to the requirement for the namespace to have pod-security labels. If a cluster has been incrementally updated from a version less than or equal to 4.9, the openshift-dns namespace will most likely not contain the required pod-security labels since the namespace was statically created when the cluster was installed with old namespace configuration.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Always if cluster originally installed with v4.9 or less
Steps to Reproduce:
1. Install v4.9 2. Upgrade to v4.12 (incrementally if required for upgrade path) 3. openshift-dns namespace will be missing pod-security labels
Actual results:
"oc get ns openshift-dns -o yaml" will show missing pod-security labels: apiVersion: v1 kind: Namespace metadata: annotations: openshift.io/node-selector: "" openshift.io/sa.scc.mcs: s0:c15,c0 openshift.io/sa.scc.supplemental-groups: 1000210000/10000 openshift.io/sa.scc.uid-range: 1000210000/10000 creationTimestamp: "2020-05-21T19:36:15Z" labels: kubernetes.io/metadata.name: openshift-dns olm.operatorgroup.uid/3d42c0c1-01cd-4c55-bf88-864f041c7e7a: "" openshift.io/cluster-monitoring: "true" openshift.io/run-level: "0" name: openshift-dns resourceVersion: "3127555382" uid: 0fb4571e-952f-4bea-bc45-461beec54369 spec: finalizers: - kubernetes
Expected results:
pod-security labels should exist: labels: kubernetes.io/metadata.name: openshift-dns olm.operatorgroup.uid/3d42c0c1-01cd-4c55-bf88-864f041c7e7a: "" openshift.io/cluster-monitoring: "true" openshift.io/run-level: "0" pod-security.kubernetes.io/audit: privileged pod-security.kubernetes.io/enforce: privileged pod-security.kubernetes.io/warn: privileged
Additional info:
Issue found in CI during upgrade
https://coreos.slack.com/archives/C03G7REB4JV/p1663676443155839
Description of problem:
The reconciler removes the overlappingrangeipreservations.whereabouts.cni.cncf.io resources whether the pod is alive or not.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create pods and check the overlappingrangeipreservations.whereabouts.cni.cncf.io resources:
$ oc get overlappingrangeipreservations.whereabouts.cni.cncf.io -A NAMESPACE NAME AGE openshift-multus 2001-1b70-820d-4b0