Change Tracking Impact – Heavy OLTP Workload (spoiler, there is impact)

(cheeky chap)

Currently, I am interested in the impact of enabling change tracking on an environment with a heavy OLTP workload. The current scope is to determine the reduction in performance and the additional storage required by any underlying internal objects associated with change tracking. As per my earlier post about Extended Events, I will perform some HammerDB tests. As I will not be using “keying and thinking time”, this will represent a best effort workload where HammerDB drives SQL Server to its peak performance. This is likely not representative of many OLTP workloads, as not every workload attempts to drive SQL Server to its limit in an artificial manner. Still, we will measure the impact regardless.

High-Level Goals

The high-level goals were as follows:

  • Measure the performance impact on the HammerDB results
  • Measure (roughly) the storage consumed by any internal objects associated with Change Tracking

I will also preface the above with the disclaimer that I am not an expert in Change Tracking, so some config or detail may have been missed. As always, make sure you put in the work and do your own testing on your application before enabling change tracking.

The Workload

Due to the DeWitt clause, the actual Transactions Per Minute and New Orders Per Minute results will not be published; only the performance change over a baseline workload with change tracking (CT) enabled.

Here is the set of tests:

  • 1 Virtual user – (No keying and thinking time)
  • 4 Virtual users – (No keying and thinking time)
  • 8 Virtual users – (No keying and thinking time)
  • 16 Virtual users – (No keying and thinking time)
  • 32 Virtual users – (No keying and thinking time)

For each configuration, I ran five workloads to give me statistically valid results using a 95% confidence interval. The TPCC database was built specifying 640 warehouses and was around 90GB in size. The two VMs involved in the testing were an application server with 8 vCPUs and 24GB of RAM, and the database server was a 16-core box with 32GB of memory. The storage was NVME. I specified 1 minute of ramp-up time with a test period of 4 minutes, with a checkpoint occurring at the end. I also built in an extra minute after each run, as I was using autopilot and wanted the servers to settle down between runs.

The version of SQL Server was SQL Server 2022 (16.0.4215.2).

Change Tracking Configuration

Once the warehouse was built and the HammerDB best practice config changes were deployed, the following CT config was setup:

ALTER DATABASE TPCC
SET CHANGE_TRACKING = ON
(CHANGE_RETENTION = 7 DAYS, AUTO_CLEANUP = ON);

ALTER TABLE district
ENABLE CHANGE_TRACKING
WITH (TRACK_COLUMNS_UPDATED = ON);

ALTER TABLE item
ENABLE CHANGE_TRACKING
WITH (TRACK_COLUMNS_UPDATED = ON);

ALTER TABLE stock
ENABLE CHANGE_TRACKING
WITH (TRACK_COLUMNS_UPDATED = ON);

ALTER TABLE warehouse
ENABLE CHANGE_TRACKING
WITH (TRACK_COLUMNS_UPDATED = ON);

Note that only some of the tables were eligible for CT as a primary key is required.

Impact on the Application

Below is the impact on the HammerDB workload with CT enabled. Both Transactions Per Minute and New Orders Per Minute were recorded:

As we can see, the impact on performance was massive once the number of Virtual Users went past 4. Again, I emphasise that this is an artificial workload and your application may not be hit as hard.

Whilst the testing was in progress, it was abundantly clear that performance was much worse, so at 32 Virtual Users, I decided to run sp_blitzfirst over a 5-minute test period to see what SQL Server was waiting on:

Looks like we have excessive PAGELATCH_** waits. As a sanity check, I turned off CT and re-ran a 32 Virtual User workload to check that this is not present on the baseline run:

As we can see above, the buffer latch waits are absent when CT was switched off.

I decided to bust open the page at the heart of the contention using DBCC PAGE:

-- I identified the resource using this query (link at the end of the post)

SELECT wt.session_id,
       wt.wait_type,
       er.last_wait_type AS last_wait_type,
       wt.wait_duration_ms,
       wt.blocking_session_id,
       wt.blocking_exec_context_id,
       resource_description
FROM sys.dm_os_waiting_tasks AS wt
     INNER JOIN sys.dm_exec_sessions AS es
         ON wt.session_id = es.session_id
     INNER JOIN sys.dm_exec_requests AS er
         ON wt.session_id = er.session_id
WHERE es.is_user_process = 1
      AND wt.wait_type <> 'SLEEP_TASK'
ORDER BY wt.wait_duration_ms DESC;

--Looking at the page

DBCC TRACEON(3604); 
DBCC PAGE(5, 1, 5631744, 3) WITH TABLERESULTS;

The contents of the page looked like change tracking’s internal tables, so I have decided to put this new contention down to the CT.

Other metrics I captured via sp_blitzfirst were as follows:

  • CT off CPU – 58%
  • CT on CPU – 18%
  • CT off batch reqs/sec – 25808.44
  • CT on batch reqs/sec – 4295.64

The amount of work SQL Server is getting through has tanked.

Impact on Storage

I must confess I spent very little time on this aspect once I realised how epic the actual impact of CT on the application was. Nevertheless, I ran this query courtesy of Brent Ozar:

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
GO
select sct1.name as CT_schema,
sot1.name as CT_table,
ps1.row_count as CT_rows,
ps1.reserved_page_count*8./1024. as CT_reserved_MB,
sct2.name as tracked_schema,
sot2.name as tracked_name,
ps2.row_count as tracked_rows,
ps2.reserved_page_count*8./1024. as tracked_base_table_MB,
change_tracking_min_valid_version(sot2.object_id) as min_valid_version
FROM sys.internal_tables it
JOIN sys.objects sot1 on it.object_id=sot1.object_id
JOIN sys.schemas AS sct1 on
sot1.schema_id=sct1.schema_id
JOIN sys.dm_db_partition_stats ps1 on
it.object_id = ps1. object_id
and ps1.index_id in (0,1)
LEFT JOIN sys.objects sot2 on it.parent_object_id=sot2.object_id
LEFT JOIN sys.schemas AS sct2 on
sot2.schema_id=sct2.schema_id
LEFT JOIN sys.dm_db_partition_stats ps2 on
sot2.object_id = ps2. object_id
and ps2.index_id in (0,1)
WHERE it.internal_type IN (209, 210);
GO

Here is the output after 2 plus hours of load testing:

  • District table is 50MB but has 1792MB of CT reserved storage
  • Stock table is 27GB but has 11GB of CT reserved storage
  • Warhouse table is 5MB but has 965MB of CT reserved storage

As I have not really used CT that much, I do not intuitively know if these are extreme values or not compared to the base tables. Please draw your own conclusions. My hunch says this is excessive (but HammerDB is an extreme load test).

Does cleanup impact things? Absolutely no idea, as I did not get that far. However, I have seen out in the field that the cleanup job can get blocked by an application. It is also likely that the reverse is also possible, so ensure you test things.

Things to do differently next time

Here are the things I would do if I had more time:

  • Track the CPU usage properly
  • Use keying and thinking time to simulate a more gentle workload
  • Explore the cleanup task
    • What sort of impact does it cause (if any)
  • Measure the impact on the transaction log

Conclusion

On a heavy OLTP workload, Change Tracking impacted performance in escalating severities as the number of concurrent processes increased. At extreme levels, excessive latch contention was observed on internal objects associated with Change Tracking.

My main takeaway is that your application needs to be tested in a reasonably representative environment before enabling CT, especially on busy tables.

Resources

https://www.brentozar.com/archive/2014/06/performance-tuning-sql-server-change-tracking

https://www.brentozar.com/askbrent

https://learn.microsoft.com/en-us/sql/relational-databases/diagnose-resolve-latch-contention?view=sql-server-ver17

One comment

Leave a Reply

Discover more from eheaton.com

Subscribe now to keep reading and get access to the full archive.

Continue reading