Experience: Cloud Databases and On-Premises Proxies


As a follow-up to Using Oracle PDBs in Continuous Integration pipelines, I have been struggling with one of those “works on my machine” nightmares. The ask was simple: centralise the ad-hoc testing from the developers’ machines into an automated CI pipeline using the Oracle Cloud Infrastructure (OCI) Oracle Base Database Service (OBDS) as the backend.

I proceeded to:

  • Replicate the local database’s version, initialisation parameters, profiles, users, tablespaces, etc. into OBDS
  • Create an image of the local client setup
  • Create the PDB-based pipeline to automate the tests on a code push

After completing the above, a manual CI run was performed using a subset of the tests focused on the database. The test results were identical between the local tests and the cloud infrastructure tests… Job Done!

Unfortunately, after enabling the full suite of tests, things went from “alright” to “FFS”:

Full Local Tests: Executed 2707 of 2878 specs [(1 ERROR) (837 FAILED) (171 PENDING)]

Full Cloud Tests: Executed 2707 of 2878 specs [(99 ERRORS) (1976 FAILED) (171 PENDING)]

Ouch!

The Network

One main difference between the testing architectures that I could not replicate was the network. Locally, network traffic never traversed outside the developer’s machine when testing, whereas with the database in the cloud, traffic was going between the CI/CD runner and OCI… sprinkled with the extra flavour of a proxy. So I revisited the network.

The proxy setup was good; communication between the runner and the database, through the proxy, was working as expected (and DB tests proved that). What about network isolation? The developers were running tests in an isolated networking environment, while in the OCI setup, they were “sharing” a bridged network when concurrent tests ran. Was it possible concurrently running tests were impacting each other?

it’s never the database, is it?

Disabling concurrent CI jobs did seem to have an impact, not as much as I hoped, but enough to shift my focus to network isolation between the client containers. After continually tweaking the container engine’s network, then running and re-running tests, I hit a brick wall. Nothing was improving… time to shift focus again. I revisited the client, the CI pipeline, the CI/CD software, and finally the database (it’s never the database, is it?).

The Database

I took a step back and asked for the full logs of the test results from a developer’s local test run to compare with those from the CI automation. After poring over the differences in the results (a task in itself), one thing that stood out was: ORA-03114: not connected to ORACLE. As the error appeared in both sets of results, I initially discounted it as insignificant, but it did occur considerably more in the cloud tests. ORA-03114 is a generic error that can be triggered by anything from a code problem to a database bug that terminates a session. The alert.log showed no errors, the listener log showed no errors, so I ticked it off as a code race condition, exposed by the more performant database, which would need the developers’ attention.

Sessions came in, did stuff, went idle, re-engaged… business as usual

As a last-ditch effort, I decided to enable SQL*Net tracing on both the database server and the client, restarted the full suite of tests, and manually logged into the database from the CI runner (via the proxy) to watch the sessions by querying different dictionary views. Sessions came in, did stuff, went idle, re-engaged… business as usual. Giving my bleeding eyes some rest, I stepped away to look at something less tedious.
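
For anyone wanting to do the same, neither step is exotic; something along these lines does the job. The trace level and the query columns are illustrative rather than exactly what I used, and <TEST_USER> is a placeholder:

# sqlnet.ora on both the client and the database server
TRACE_LEVEL_CLIENT=SUPPORT # client-side SQL*Net tracing
TRACE_LEVEL_SERVER=SUPPORT # server-side SQL*Net tracing

-- From a separate session, keep an eye on the test sessions
SELECT sid, serial#, username, status, last_call_et, machine, program
  FROM v$session
 WHERE username = '<TEST_USER>';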

On coming back to my SQL session 5 minutes later, my next query resulted in ORA-03114: not connected to ORACLE! Again, nothing in the alert.log or listener.log indicated a database problem, but I could now reproduce the error without running a full 2-hour test suite. The SQL*Net traces didn’t shed any light either; they had simply stopped writing at some point, and only the client trace produced any output when I ran the query that resulted in the ORA-03114.

I know ORA-03114 shouldn’t be triggered by the IDLE_TIME profile parameter, but I couldn’t discount anything. IDLE_TIME was set to UNLIMITED, so not that. Maybe it was the server-side SQLNET.EXPIRE_TIME, which implements Dead Connection Detection. On OBDS, this is set to 10 minutes, and I had only stepped away for 5 minutes, so it shouldn’t be that either.
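
For completeness, confirming the limit is a quick dictionary query (a sketch; filter it down to the profile your test user actually has):

-- Check the IDLE_TIME limit per profile
SELECT profile, resource_name, limit
  FROM dba_profiles
 WHERE resource_name = 'IDLE_TIME';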

With a bit of experimenting, I determined the error occurred after 2 minutes of idle time, but I couldn’t think of anything set to end connections after 2 minutes. The a-ha moment came when I tried to reproduce the error from the database machine itself: 2 minutes of idleness came and went, no errors…
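
For the record, reproducing it needs nothing more than SQL*Plus. The snippet below is a sketch of the sequence, not a captured transcript:

-- From the CI runner, in a session connected through the proxy
SELECT sysdate FROM dual; -- returns a row
-- leave the session untouched for a little over 2 minutes, then:
SELECT sysdate FROM dual; -- fails with ORA-03114
-- the same sequence run directly on the database server never fails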

The Proxy

In hindsight, the proxy was an obvious major infrastructure difference that should have been given more attention from the start, but the clients were configured to go through the proxy, tested, and they all worked.

The subset of CI tests which focused on the database worked; however, being focused on the database, they didn’t leave sessions idling. Concurrent CI runs, with their varying results, were the consequence of client resource contention on the CI/CD runner, which affected the time between tests. One run would take 1.5 minutes between tests while another took 2.1 minutes… one success, one failure of the same test.

Admittedly, I had sort of ignored the proxy: it was not a component I had any control over, so I looked at everything else instead. Now that I know it is the proxy, what can I do to resolve the issue of it timing out idle sessions?

Dead Connection Detection (DCD)

A Workaround: This is not the end solution, but a stopgap until the application code is updated to catch the ORA-03114 and deal with it. The workaround allows us to progress with moving CI tests off local developer machines to a centralised CI/CD solution while that fix is being worked on; it will eventually be reverted.

The workaround takes advantage of the pre-12c DCD in which the database “pings” the client based on SQLNET.EXPIRE_TIME and the client responds indicating it is still alive. This ensures that there is traffic between the database and client, keeping the link alive until the client intentionally disconnects. It is important to note that DCD changed its behaviour in 12c+ and the new mechanism does not work for this purpose. To revert to the pre-12c DCD mechanism in 12c+ databases, set USE_NS_PROBES_FOR_DCD.

On the database server, update the sqlnet.ora and adjust the DCD settings:

SQLNET.EXPIRE_TIME=1 # Send a ping every minute
USE_NS_PROBES_FOR_DCD=true # Use the pre-12c DCD mechanism

The connect string from the client to the server also needs a small modification, notably the (enable=broken):

  (DESCRIPTION=
    (enable=broken)
    (ADDRESS=
      (PROTOCOL=TCP)(HOST=<CDB_HOST>)(PORT=1521)
      (https_proxy=<PROXY_HOST>)(https_proxy_port=<PROXY_PORT>)
    )
    (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=<PDB>))
  )

(BTW, the above also shows how to connect SQL*Net through a proxy)
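
If pasting the full descriptor around becomes tiresome, it drops straight into a tnsnames.ora entry; the alias name below is made up:

<PDB_ALIAS>=
  (DESCRIPTION=
    (enable=broken)
    (ADDRESS=
      (PROTOCOL=TCP)(HOST=<CDB_HOST>)(PORT=1521)
      (https_proxy=<PROXY_HOST>)(https_proxy_port=<PROXY_PORT>)
    )
    (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=<PDB>))
  )

after which the client connects as usual, e.g. sqlplus <USER>@<PDB_ALIAS>.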

Conclusion

Executed 2707 of 2878 specs [(0 ERROR) (730 FAILED) (179 PENDING)]

Result! (and, of course, it never is the database)

