I’ve noticed that of late I’ve become a bit more critical of a well known publication that I suspect many DBA’s read. I have subscriptions to a number of publications and for the ones that I pay for I’m generally quite content.
I like paper / hard copy because I can make use of time where it’s just not practical to have a laptop or PC – and I spend my working day in front of a screen so sometimes it’s nice to escape to the garden or the pub or on the train, say travelling to a SSUG meeting without a PC.
So what’s set GrumpyOldDba off now you may ask? Well there’s an article been published about removing duplicate indexes – the Microsoft guys blogged this, with code, ages ago and I spent some time looking at this type of analysis as part of my index analysis series.
However it’s not quite so easy as it sounds and leaving aside duplicated single column non clustered indexes, which are obviously bad, the actual matter of where indexes have the same columns but in different orders or contained within is quite a complex matter and worth a whole lot of time on its own.
( Here’s the original blog from June 2007 http://blogs.msdn.com/mssqlisv/archive/2007/06/29/detecting-overlapping-indexes-in-sql-server-2005.aspx)
|
Idx_anytable_index1 |
SalesID |
PropertyID |
SaleStartDate |
SaleEndDate |
UserID |
IsCurrent |
|
Idx_anytable_index2 |
SalesID |
UserID |
PropertyID |
SaleStartDate |
SaleEndDate |
IsCurrent |
|
Idx_anytable_index3 |
SalesID |
PropertyID |
SaleEndDate |
IsCurrent |
|
|
|
Idx_anytable_index4 |
PropertyID |
SalesID |
SaleStartDate |
SaleEndDate |
UserID |
IsCurrent |
Taking the above indexes to be all standard non clustered indexes we can say that there’s a fair bit of duplication and in fact index 3 is already contained within index 1
Sadly with anything like this it just depends, and generally I’d say that index 3 is probably unwanted – however it may cover a query exactly, in which case index 1 might be ignored, and in certain joins one index may actually be preferred over another.
So what I’m saying is that other than basic duplicates it’s actually quite tricky, obviously you can check the dmv’s ( sql 2005 onwards ) to actually see if the index is being used, if you’re using 2000 or 7.0 then you can’t.
So what really got me worked up was the statement that creating a secondary index over a clustered index was a bad idea, now to me that such an article appears in this magazine is very disappointing and I’m also disappointed with the author – to me it shows a lack of editorial knowledge and like the review of the backup utilities last month isn’t what I’d expect from such a respected magazine. I’m also frankly becoming less and less impressed by the fact that many articles are only partly published and you have to go to the website for the full article. Maybe it’s a sign of the times that we don’t actually want paper copy any more.
Anyway here’s a very simple example of secondary indexes over the top of clustered indexes. The table I used has 2 million rows and is just under 16gb in size ( Let’s do real world examples here <grin> ) It’s a variation of the table in used in “Analysing Index Part 4 – Size does matter”
http://www.grumpyolddba.co.uk/sql2005/working%20with%20indexes%204.mht
This is the test table and the code used to populate it. You might want to find other ways of populating the table if you’re not running on a server as I figure this will run for a few hours on a laptop! There were distinct reasons why I wanted to generate data this way for some previous tests, the only important factor for this tests is that each row should exceed half a page so as to force one row per page.
|
create table dbo.TestTable ( NumKey int identity(1,1)not null, cGuid uniqueidentifier default newid()null, cSguid uniqueidentifier default NEWSEQUENTIALID() not null, cBig bigint null, cInt int null, cDate1 datetime not null, cDate2 datetime null, cVchar1 varchar(50) null, cChar1 char(20) null, cVchar2 varchar(500) null, cChar2 char(100), Pad1 char(4000) not null ); go -- -- set nocount on -- declare @num bigint,@count int,@count2 int,@count3 int set @count = 1 set @count2 = 12 set @count3 = 1 while @count<2000001 begin IF (@count%2)=0 begin insert into dbo.TestTable(cBig,cInt,cDate1,cDate2,cVchar1,cChar1,cVchar2,cChar2,pad1) values( 36854775807+(36854775807/@count),@count2*3,dateadd(dd,@count3,'25 MAY 2007'),dateadd(mi,@count,getdate()), reverse(convert(varchar(50),dateadd(dd,@count3,'25 MAY 2007'))),'0000'+convert(varchar(7),@count)+convert(varchar(4),@count3), replicate(convert(char(36),newid()),10),convert(varchar(20),datename(dw,getdate()+@count)+datename(month,getdate()+@count)) +'blahblahblahblahblahblahblahblahblahblahblahblah', replicate(convert(char(36),newid()),110)) end else begin insert into dbo.TestTable(cBig,cInt, cDate1,cDate2,cVchar1,cChar1,cVchar2,cChar2,pad1) values( 854775807-(36854775807/@count),@count2*2,dateadd(dd,@count3,'5 dec 2006'),dateadd(mi,@count,getdate()), reverse(convert(varchar(50),dateadd(dd,@count3,'25 MAY 2007'))),'0000'+convert(varchar(7),@count)+convert(varchar(4),@count3), replicate(convert(char(36),newid()),10),convert(varchar(20),datename(dw,getdate())+datename(month,getdate())) +'blahblahblahblahblahblahblahblahblahblahblahblah', replicate(convert(char(36),newid()),110)) end set @count = @count+1; set @count2 = @count2+2; IF @count2>50 set @count2 = 12; IF (@count%3000)=0 set @count3 = @count3+1;
end
|
Ø Here’s the output from sp_spaceused
![]()
Ø and from sys.dm_db_index_physical_stats
![]()
Ø Here’s my table after adding a clustered primary key on the integer identity column, NumKey
![]()
Ø Extracting index information from sys.dm_db_index_physical_stats
|
select index_id,index_type_desc,index_depth,index_level,page_count,record_count,min_record_size_in_bytes, max_record_size_in_bytes,avg_record_size_in_bytes from sys.dm_db_index_physical_stats (db_id(), object_id('testtable'), null ,null ,'detailed' )
|
And the results
|
index_id |
index_type_desc |
index_depth |
index_level |
page_count |
record_count |
min_record_size_in_bytes |
max_record_size_in_bytes |
avg_record_size_in_bytes |
|
1 |
CLUSTERED |
4 |
0 |
2000000 |
2000000 |
4577 |
4577 |
4577 |
|
1 |
CLUSTERED |
4 |
1 |
3218 |
2000000 |
11 |
11 |
11 |
|
1 |
CLUSTERED |
4 |
2 |
8 |
3218 |
11 |
11 |
11 |
|
1 |
CLUSTERED |
4 |
3 |
1 |
8 |
11 |
11 |
11 |
Select 50 rows from the table by PK
select * from dbo.TestTable where NumKey between 50 and 100
(51 row(s) affected)
Table 'TestTable'. Scan count 1, logical reads 56, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
This is fair enough because at 4,577 bytes per row it requires 51 page requests to return the data rows
|
The entire row |
Just the Numkey column |
The in statement
|
Ø If we just select the integer PK column it still requires the same io to return the result set
select numkey from dbo.TestTable where NumKey between 50 and 100
(51 row(s) affected)
Table 'TestTable'. Scan count 1, logical reads 56, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Ø Just out of pure interest and as part of another test I was running I dropped the 51 values into an IN statement
select numkey from dbo.testtable where numkey
in (50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,
71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,
92,93,94,95,96,97,98,99,100)
(51 row(s) affected)
Table 'TestTable'. Scan count 51, logical reads 204, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Ø The same for two values?
select numkey from dbo.testtable where numkey in (50,51)
(2 row(s) affected)
Table 'TestTable'. Scan count 2, logical reads 8, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
So that’s making use of the clustered PK on the integer column, returning just the integer value.
Now create a non clustered secondary index over the PK
create index idx_TestTable_Numkey on dbo.TestTable(Numkey);
go
|
index_id |
index_type_desc |
index_depth |
index_level |
page_count |
record_count |
min_record_size_in_bytes |
max_record_size_in_bytes |
avg_record_size_in_bytes |
|
1 |
CLUSTERED INDEX |
4 |
0 |
2000000 |
2000000 |
4577 |
4577 |
4577 |
|
1 |
CLUSTERED INDEX |
4 |
1 |
3218 |
2000000 |
11 |
11 |
11 |
|
1 |
CLUSTERED INDEX |
4 |
2 |
8 |
3218 |
11 |
11 |
11 |
|
1 |
CLUSTERED INDEX |
4 |
3 |
1 |
8 |
11 |
11 |
11 |
|
2 |
NONCLUSTERED INDEX |
3 |
0 |
1732 |
2000000 |
5 |
5 |
5 |
|
2 |
NONCLUSTERED INDEX |
3 |
1 |
4 |
1732 |
11 |
11 |
11 |
|
2 |
NONCLUSTERED INDEX |
3 |
2 |
1 |
4 |
11 |
11 |
11 |
Now the original query
select numkey from dbo.TestTable where NumKey between 50 and 100
(51 row(s) affected)
Table 'TestTable'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Now in the overall scheme of things 56 to 3 io may not seem very much, but if this query runs maybe several times a second then there could be some big savings.
select numkey from dbo.testtable where numkey
in (50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,
71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,
92,93,94,95,96,97,98,99,100)
(51 row(s) affected)Table 'TestTable'. Scan count 51, logical reads 153, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
And for the IN Down 25%.
|
|
Conclusion: To assume that the Primary Key will only be used for singleton selects is “strange” – and to assume that the PK will always be the clustered index is also unhelpful. In real world applications it is not uncommon to find wide Primary Keys and wide clustered indexes, all on wide tables. Which ever way you cut it the clustered index is always the width of the entire row and io will reflect that. Where the Clustered Primary Key is several columns wide it’s more than likely you’ll find range selects against it and here a covering non clustered index can save you considerable io. Not all databases are the same and what holds true for one may certainly not hold true for another, but sweeping generalisations are a problem, especially when they’re in a respected publication. It’s also difficult for the DBA who has to try to explain to a Manager why what’s in such an article is wrong, I remember unsuccessfully spending over a year trying to explain what the lock waits counters actually represented to senior management because someone had read they actually meant something else and they had convinced themselves this was true – it wasn’t ! |
© www.grumpyolddba.co.uk