SQL Server Transactional Replication Latency Check

SQL Server Replication Latency Checks

Table of Contents

SQL Server Transactional Replication Latency Check

SQL Server Replication latency checks are often unreliable.  Sure, you can automate tokens and the like, but in the end, you need to know “Am I synchronizing regularly?” and “Do I have a large number of undistributed commands?”

Requirements:

  • The Publication Settings for immediate_sync and allow_anonymous must be set to false
  • SQL Mail must be configured and working properly to send the notifications
  • SQL 2012 or higher.  This might work on prior versions of SQL, but I have not tested it on versions that old
  • Manual update of the Repl_Notification_Settings table when a publication is added or dropped

Setup:

The first step is to establish a table to hold the metadata for the checks we are doing. This table should be created on your Distributor server, in a utility database.  I call mine DB_Administration.

CREATE TABLE [dbo].[Repl_Notification_Settings](
	[DBName] [varchar](255) NULL,
	[PubName] [varchar](255) NULL,
	[Customer_Contact] [varchar](1000) NULL,
	[Internal_Contact] [varchar](1000) NULL,
	[Notify_CMD_Threshold] [int] NULL,
	[Notify_Time_Threshold_Sec] [int] NULL
) ON [PRIMARY]
  • You need the database name and publication name to identify the publication you would like to track. 
  • Customer_Contact is a list of e-mail addresses (separated by semicolons, just like normal) you wish to send if you want to notify folks outside of your organization. 
  • Internal_Contact is the list of e-mails internal to your organization you would like notified. 
  • Notify_CMD_Threshold is the number of undistributed commands that will trigger an alert. 
  • The Notify_Time_Threshold_Sec is the number of seconds without a successful synchronization that will trigger an alert.

Step two is to establish a view that will find the relevant data for you:

  CREATE VIEW [dbo].[Repl_UndistCmds]
  AS
 
  SELECT @@SERVERNAME as ServerName, ( CASE
           WHEN Mdh.Runstatus = '1' THEN 'Start'
           WHEN Mdh.Runstatus = '2' THEN 'Succeed'
           WHEN Mdh.Runstatus = '3' THEN 'InProgress'
           WHEN Mdh.Runstatus = '4' THEN 'Idle'
           WHEN Mdh.Runstatus = '5' THEN 'Retry'
           WHEN Mdh.Runstatus = '6' THEN 'Fail'
         END )                                 [Run status],
       Mda.Subscriber_db                       [Subscriber db],
       Mda.Publication                         [Pub name],
       Mdh.[Time]                              [Lastsynchronized],
       Und.Undelivcmdsindistdb                 [Undistcom],
       Mdh.Comments                            [Comments],
       DATEDIFF(SECOND, Mdh.[Time], GETDATE()) Lastrefresh
FROM   DISTRIBUTION.DBO.Msdistribution_agents Mda
       LEFT JOIN DISTRIBUTION.DBO.Msdistribution_history Mdh
              ON Mdh.Agent_id = Mda.Id
       JOIN (SELECT S.AGENT_ID,
                    Maxagentvalue.[Time],
                    SUM(CASE
                          WHEN Xact_seqno > Maxagentvalue.Maxseq THEN 1
                          ELSE 0
                        END	
						) AS Undelivcmdsindistdb
             FROM   DISTRIBUTION.DBO.Msrepl_commands T (NOLOCK)
                    JOIN DISTRIBUTION.DBO.Mssubscriptions AS S (NOLOCK)
                      ON ( T.ARTICLE_ID = S.ARTICLE_ID
                           AND T.PUBLISHER_DATABASE_ID = S.PUBLISHER_DATABASE_ID
                         )
                    JOIN (SELECT Hist.AGENT_ID,
                                 MAX(Hist.[TIME]) AS [Time],
                                 H.Maxseq
                          FROM   DISTRIBUTION.DBO.Msdistribution_history Hist (NOLOCK)
                                 JOIN (SELECT Agent_id,
                                              ISNULL(MAX(Xact_seqno), 0x0) AS
                                              Maxseq
                                       FROM
                                 DISTRIBUTION.DBO.Msdistribution_history (NOLOCK)
								 where runstatus in (2,3)
                                       GROUP  BY Agent_id
						 ) AS H
                                   ON ( Hist.AGENT_ID = H.Agent_id
                                        AND H.Maxseq = Hist.XACT_SEQNO )
                          GROUP  BY Hist.AGENT_ID,
                                    H.Maxseq
				) AS Maxagentvalue
                      ON Maxagentvalue.Agent_id = S.AGENT_ID
             GROUP  BY S.AGENT_ID,
                       Maxagentvalue.[Time]) Und
         ON Mda.Id = Und.Agent_id
            AND Und.[Time] = Mdh.[Time]

Finally, you need a job that will review this info and alert you on it at regular intervals.  This is the job step command that I use:

DECLARE Subs CURSOR FOR SELECT 
[subscriber db], pubname, lastsynchronized, undistcom, notify_cmd_threshold, 
notify_time_threshold_sec, internal_contact, customer_contact 
FROM db_administration.dbo.repl_undistcmds 
JOIN db_administration.dbo.Repl_Notification_Settings ON [pub name] = pubname 
OPTION(MAXDOP 1) -- Intra-Query Deadlocks sometimes got me without MAXDOP 1

DECLARE 
  @sub VARCHAR(255),
  @pub VARCHAR(255),
  @lastsync DATETIME,
  @undist BIGINT,
  @c_thr BIGINT,
  @t_thr BIGINT,
  @int_cont VARCHAR(max),
  @cust_cont VARCHAR(max),
  @subj VARCHAR(255)

OPEN subs
FETCH NEXT FROM subs INTO @sub, @pub, @lastsync, @undist, @c_thr, @t_thr, @int_cont, @cust_cont
WHILE @@FETCH_STATUS = 0
BEGIN
  DECLARE @msg VARCHAR(max)
  IF @undist> @c_thr OR datediff(ss, @lastsync, getdate()) > @t_thr
  BEGIN
	SET @subj = 'Replication Latency Alert for ' + @pub
    SET @MSG = 'The publication '  + @pub + ' with Subscriber DB Name ' + @sub +' has exceeded either the threshold for undistributed commands of ' 
	+ CONVERT(VARCHAR,@c_thr) + ' or the time threshold for last synchornization of ' +
    CONVERT(VARCHAR,dateadd(ss,-1*@t_thr,getdate())) + '.  
    Current undistributed commands = ' + CONVERT(VARCHAR,@undist) + ' and last synchronization was at ' + CONVERT(VARCHAR,@lastsync)
    EXEC msdb.dbo.sp_send_dbmail @recipients= @int_cont, @copy_recipients = @cust_cont, @subject = @subj, @body = @msg, @body_format = 'HTML'
                
  END
    FETCH NEXT FROM subs INTO @sub, @pub, @lastsync, @undist, @c_thr, @t_thr, @int_cont, @cust_cont
END

I named the job “Check for Undistributed Commands” and it runs on the distributor.  Give it a reasonable schedule so that you are notified before things become too dire.  The job will send a separate notification for each publication that is out of sync. 

I use the statement below to initially populate the metadata table when there is a large number of publications (make sure you replace the e-mail address):

insert Repl_Notification_Settings 
select distinct subscriber_db, publication, null, 'youremail@domain.com', 250, 600
from distribution..MSsubscriptions s join distribution..mspublications p on s.publication_id = p.publication_id
and publication not in (select PubName from Repl_Notification_Settings) and subscriber_db <> 'virtual'

I opt to set the thresholds low to start and notify only me.  Then once I feel like I have the settings dialed in, I adjust the notification e-mail addresses.

A note about replication agents:

By default, all of the replication agents are set to retry at 1-minute intervals for about 800 years (the retry number is the max int value).  This means that if one of those agents is failing and you rely on job failure alerts, you will never get alerted because the job itself never fails, it just keeps retrying.

There are a couple of options to work around this issue.  The first is to use the SQL Agent Alert module to set up alerts for Replication Agent Failure events for your Log Reader and your Distribution agents.  This works fine, but many shops already have some tooling built in to send notifications on job failures.  Be aware that if not adjusted from the default settings, this will send you a notification every minute when there is an issue (the retry interval for the jobs)

If you’d like to have the notifications for job failures do the work, adjust the jobs for Log Reader and Distribution agents as follows: 

                Add a new schedule that runs the job every 10 minutes.

                Schedule it for 10 retries, once per minute

This will provide continuous alerting that there is a problem without completely overwhelming your system.

Facebook
Twitter
LinkedIn
Email
Picture of Joe Fleming (Farmington, MI)
Joe Fleming (Farmington, MI)

Your SQL Server Lifeline: Immediate Fixes, Ongoing Optimization | Save Costs & Boost Efficiency | Custom Packages | Veteran Expertise

FREE CONSULTATION
Let SQL Tailor craft the perfect solution for your business.
Subscribe to Stay in Touch