|
It’s easy when you think about it !
Calculated Columns, persistence and indexes
Background:
v The code belongs to a process which runs every 10 minutes, 24 hours a day. v I picked this up from my monitoring as it was producing pronounced spikes on my performance graphs. v To explain – I have extensive monitoring on my production systems and although the production system is used 24 x 7 activity at weekends is much less, it was by my normal checks that I discovered this process. v ( See graph extracts below ) v The reality was that I identified a process running 6 times an hour and asked who’d instigated a process on that schedule – I was then able to examine the procedure ( see below ) v Note 1 - The code has been sanitised. v Note 2 - Some indexing was changed but made no significant improvement. v The rowcounts in the underlying tables are smallish, 10’s of thousands at most v In this instance we’re trying to eliminate any Regus emails because that will be an internal user, and we’re checking to see if there was an error when last processed.
There were two distinct problems with this query, it took too long to run and it used excessive io.
|
|
|
|
|
|
What’s the problem to resolve ? well here’s the sanitised query ( to protect the innocent !! )
( problem areas highlighted )
The column ErrorDescription is a ntext column.
NOTE: There’s actually several other issues within this query but I’m looking at the highlighted SARGS which can be resolved.
|
|
select con.PersonID ,con.Mail ,eer.RoleListCsv ,eer.AppListCsv ,@ImportDate ,ct.communicationMethodId ,ct.communicationTypeId ,1 ,NULL ,NULL from dbo.Priority eep inner join dbo.Rulez eer on eep.PriorityId = eer.PriorityId inner join dbo.Extraction cfee on cfee.CountryID = eer.CountryID inner join dbo.Companycv cov on cov.CompanyID = cfee.CompanyID inner join dbo.Company co on co.CompanyID = cov.CompanyID and cov.CompanyVersion = co.CompanyVersion and co.CtypeID = eep.CtypeID and co.ClassID = eep.ClassID inner join dbo.Personcv conv on conv.PersonID = cfee.PersonID inner join dbo.Person con on con.PersonID = conv.PersonID and con.PersonVersion = conv.PersonVersion inner join dbo.CommType ct on ct.CommTypeID = eer.CommTypeID where eep.Priority = @RulePriority and eep.ActiveFromDate <= getDate() and eer.ActiveFromDate <= getDate() and co.IsAccountBlocked = 0 and co.IsActive = 1 and con.IsActive = 1 and not exists (select * from dbo.PersonLogon cl where cl.PersonID = con.PersonID) and not exists (select * from dbo.PersonLogonImport cli where cli.PersonID = con.PersonID and Imported = 0 and ErrorDescription is not null) and len(con.Mail) > 6 and con.Mail not like '%@regus.com';
|
|
Now let’s work through an example you can run yourself.
1. A calculated column on a text column based around null
LOB reads can be expensive, generally you will be doubling up ( at least ) the io for each row, this is even more galling if you don’t actually want the data. You could also be bringing more data then you need into cache which in turn could be flushing data you actually need. In the production system the column ErrorDescription is of type NText
|
|
|
|
Create table |
|
CREATE TABLE [dbo].[TestTable1]( [NumKey] [int] IDENTITY(1,1) NOT FOR REPLICATION NOT NULL, [cGuid] [uniqueidentifier] NULL, [cSguid] [uniqueidentifier] NOT NULL, [cBig] [bigint] NULL, [cInt] [int] NULL, [cDate1] [datetime] NOT NULL, [cDate2] [datetime] NULL, [cVchar1] [varchar](50) NULL, [cChar1] [char](20) NULL, [cVchar2] [varchar](500) NULL, [cChar2] [char](100) NULL, [cText] [text] NULL, CONSTRAINT [PK_TestTable1] PRIMARY KEY CLUSTERED ( [NumKey] ASC )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY] ) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]; GO ALTER TABLE [dbo].[TestTable1] ADD CONSTRAINT [DF_TestTable1_cGuid] DEFAULT (newid()) FOR [cGuid]; GO ALTER TABLE [dbo].[TestTable1] ADD CONSTRAINT [DF_TestTable1_cSguid] DEFAULT (newid()) FOR [cSguid];
|
|
Populate the table with 500,000 rows. |
|
declare @num bigint,@count int = 1,@count2 int = 12, @count3 int =1; -- while @count<500001 begin IF (@count%2)=0 begin insert into dbo.TestTable1(cBig,cInt,cDate1,cDate2,cVchar1,cChar1,cVchar2,cChar2,cText) values( 36854775807+(36854775807/@count),@count2*3,dateadd(dd,@count3,'25 MAY 2007'),dateadd(mi,@count,getdate()), reverse(convert(varchar(50),dateadd(dd,@count3,'25 MAY 2007'))),'0000'+convert(varchar(7),@count)+convert(varchar(4),@count3), replicate(convert(char(36),newid()),10),convert(varchar(20),datename(dw,getdate()+@count)+datename(month,getdate()+@count))+'blahblahblahblahblahblahblahblahblahblahblahblah' ,convert(char(6000),getdate())) end else begin insert into dbo.TestTable1(cBig,cInt, cDate1,cDate2,cVchar1,cChar1,cVchar2,cChar2,cText) values( 854775807-(36854775807/@count),@count2*2,dateadd(dd,@count3,'5 dec 2006'),dateadd(mi,@count,getdate()), reverse(convert(varchar(50),dateadd(dd,@count3,'25 MAY 2007'))),'0000'+convert(varchar(7),@count)+convert(varchar(4),@count3), replicate(convert(char(36),newid()),10),convert(varchar(20),datename(dw,getdate())+datename(month,getdate()))+'blahblahblahblahblahblahblahblahblahblahblahblah' ,null) end set @count = @count+1; set @count2 = @count2+2; IF @count2>50 set @count2 = 12; IF (@count%3000)=0 set @count3 = @count3+1;
end; go
|
|
v The table will have 500,000 rows with half of them having null for column cText
v If we simulate the ErrorDescription is not null part of the query
|
|
select top 1000 NumKey from dbo.TestTable1 where cText is not null order by NumKey asc;
|
|
(1000 row(s) affected) Table 'TestTable1'. Scan count 1, logical reads 164, physical reads 0, read-ahead reads 0, lob logical reads 1000, lob physical reads 0, lob read-ahead reads 0
|
|
|
|
|
|
The actual table reads are minimal but the LOB reads are significant, the 1,000 integers returned in the result set actually require 609MB in buffer cache, 1,000 ints technically only require 4k of storage. Hmm 4kb vs 609mb, not good.
Now if we add a calculated column to hold a 1 when cText is null this should be able to optimise the query.
|
|
-- -- this is wrong – but do you know why? -- alter table dbo.TestTable1 add ErrorIsNull as ISNULL(cText,'1') persisted; -- |
|
You cannot directly specify the data type for a calculated column, here you can see that I actually end up with another text column bacause the calculated column takes the underlying data type from the query which is still a Text column. Further more only the nulls are changed to ‘1’ the results in the non null columns are exactly as in cText column.
|
|
|
|
What I actually want is a bit column set to 1 or 0, here’s what’s required
|
|
alter Table dbo.TestTable1 add ErrorIsNull as convert(bit,case when cText IS NULL then '1' else '0' end) persisted;
|
|
|
|
v Because I want to index the column, even if 50% of the results are null, the column must be persisted.
v I want to create an index on the NumKey and persisted column because that will avoid using the primary key
|
|
create index idx_TestTable1_ErrorIsNull on dbo.TestTable1(ErrorIsNull,Numkey);
|
|
select top 1000 NumKey from dbo.TestTable1 where cText is not null order by NumKey asc;
|
|
(1000 row(s) affected) Table 'TestTable1'. Scan count 1, logical reads 7, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
(1 row(s) affected)
|
|
|
|
|
|
What this shows is that when you’re looking at a query, the page reads are not always the whole story. The original query required 609 mb in cache, the calculated column 3 mb, even on a server with 64GB of ram this difference could be significant. If your database doesn’t all sit in cache then queries like this can cause increased physical io and degrade performance For this simple test there’s no real difference in execution time, in the production query there was a measurable decrease in execution time when implementing the persisted columns.
Here’s how to handle the email address sarg
|
|
alter table dbo.TestTable1 add [IsRegusEmail] AS (CONVERT([bit],case when [email] like '%@regus.com' then '1' else '0' end,0)) PERSISTED;
|
|
Query to return top 20 indexes in buffer cache |
|
-- -- top 20 indexes in buffer cache for the given database. -- this reports for the database you run this against -- SELECT top 20 obj.[name]as "Table Name" ,obj.index_id ,si.name,convert(numeric(10,2),(count(*)*8)/1024.0) AS "cached size (mb)" FROM sys.dm_os_buffer_descriptors AS bd INNER JOIN ( SELECT object_name(object_id) AS name ,index_id ,allocation_unit_id FROM sys.allocation_units AS au INNER JOIN sys.partitions AS p ON au.container_id = p.hobt_id AND (au.type = 1 OR au.type = 3) UNION ALL SELECT object_name(object_id) AS name ,index_id, allocation_unit_id FROM sys.allocation_units AS au INNER JOIN sys.partitions AS p ON au.container_id = p.hobt_id AND au.type = 2 ) AS obj ON bd.allocation_unit_id = obj.allocation_unit_id join sys.indexes si on si.index_id = obj.index_id and si.[object_id] = object_id(obj.name) WHERE bd.database_id = db_id() GROUP BY obj.name, obj.index_id,si.name ORDER BY "cached size (mb)" DESC;
© www.grumpyolddba.co.uk September 2010
|