SQL Staging Tables: Primary Key Clustered or Heap

.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;

up vote
5
down vote

favorite

We are taking legacy flat txt files and inserting them into stage tables with SSIS.The question arose whether table should have primary clustered key index or not. This is direct flat file import with no transformation.

create table dbo.CustomerTransaction
(
 CustomerName varchar(255),
 PurchaseLocation varchar(255),
 Productid int,
 AmountSold float,
 CustomerAddress varchar(50)
)

create table dbo.CustomerTransaction
(
 -- discussion for adding this column
 CustomerTransactionId int primary key clustered identity(1,1) 

 CustomerName varchar(255),
 PurchaseLocation varchar(255),
 Productid int,
 AmountSold float,
 CustomerAddress varchar(50)
)

-- both tables have nonclustered indexes
create nonclustered index idx_ProductId on dbo.CustomerTransaction(ProductId)
create nonclustered index idx_CustomerAddress on dbo.CustomerTransaction(CustomerAddress)

-- Actually have more indexes, tables above are just for sample

1) Before ETL, the staging tables are truncated. There are No Deletes and No Updates. Only Inserts.

truncate table dbo.[CustomerTransaction]

2) Then disable all indexes before ETL.

alter index all on dbo.[CustomerTransaction] DISABLE

3) We conduct SSIS data flow with default fast load, which I read is equivalent to bulk insert. No transformations occur here.

4) Then reenable all indexes after import is done.

alter index all on dbo.[CustomerTransaction] REBUILD

5) The staging tables are then selected on join and where clauses, and placed into datawarehouse. This we why we have nonclustered indexes. After data warehouse is loaded, we truncate the staging tables.

We are hearing information that ETL Stage tables are good as heaps. However, also learning of fragmentation and performance issues with heaps. Reading all the articles below

I am reading conflicting opinions. One says Binary tree clustered are maintenance headaches for import ETL. Other says Heaps have performance issues with fragmentation. Our performance testing does not show much difference, but our data may change later. So we need to make a good design decision.

https://sqlsunday.com/2016/09/01/compelling-case-for-heaps/

https://www.mssqltips.com/sqlservertip/4961/sql-server-insert-performance-for-clustered-indexes-vs-heap-tables/

http://kejser.org/clustered-indexes-vs-heaps/

https://www.red-gate.com/simple-talk/sql/database-administration/sql-server-heaps-and-their-fragmentation/

We know a good reason to have identity is for row labelling, however question is mostly about internals and performance.

edited Nov 10 at 16:57

asked Nov 10 at 9:29

knightbob472

6613

add a comment |

up vote
5
down vote

favorite

create table dbo.CustomerTransaction
(
 CustomerName varchar(255),
 PurchaseLocation varchar(255),
 Productid int,
 AmountSold float,
 CustomerAddress varchar(50)
)

create table dbo.CustomerTransaction
(
 -- discussion for adding this column
 CustomerTransactionId int primary key clustered identity(1,1) 

 CustomerName varchar(255),
 PurchaseLocation varchar(255),
 Productid int,
 AmountSold float,
 CustomerAddress varchar(50)
)

-- both tables have nonclustered indexes
create nonclustered index idx_ProductId on dbo.CustomerTransaction(ProductId)
create nonclustered index idx_CustomerAddress on dbo.CustomerTransaction(CustomerAddress)

-- Actually have more indexes, tables above are just for sample

1) Before ETL, the staging tables are truncated. There are No Deletes and No Updates. Only Inserts.

truncate table dbo.[CustomerTransaction]

2) Then disable all indexes before ETL.

alter index all on dbo.[CustomerTransaction] DISABLE

3) We conduct SSIS data flow with default fast load, which I read is equivalent to bulk insert. No transformations occur here.

4) Then reenable all indexes after import is done.

alter index all on dbo.[CustomerTransaction] REBUILD

We are hearing information that ETL Stage tables are good as heaps. However, also learning of fragmentation and performance issues with heaps. Reading all the articles below

https://sqlsunday.com/2016/09/01/compelling-case-for-heaps/

https://www.mssqltips.com/sqlservertip/4961/sql-server-insert-performance-for-clustered-indexes-vs-heap-tables/

http://kejser.org/clustered-indexes-vs-heaps/

https://www.red-gate.com/simple-talk/sql/database-administration/sql-server-heaps-and-their-fragmentation/

We know a good reason to have identity is for row labelling, however question is mostly about internals and performance.

edited Nov 10 at 16:57

asked Nov 10 at 9:29

knightbob472

6613

add a comment |

up vote
5
down vote

favorite

create table dbo.CustomerTransaction
(
 CustomerName varchar(255),
 PurchaseLocation varchar(255),
 Productid int,
 AmountSold float,
 CustomerAddress varchar(50)
)

create table dbo.CustomerTransaction
(
 -- discussion for adding this column
 CustomerTransactionId int primary key clustered identity(1,1) 

 CustomerName varchar(255),
 PurchaseLocation varchar(255),
 Productid int,
 AmountSold float,
 CustomerAddress varchar(50)
)

-- both tables have nonclustered indexes
create nonclustered index idx_ProductId on dbo.CustomerTransaction(ProductId)
create nonclustered index idx_CustomerAddress on dbo.CustomerTransaction(CustomerAddress)

-- Actually have more indexes, tables above are just for sample

1) Before ETL, the staging tables are truncated. There are No Deletes and No Updates. Only Inserts.

truncate table dbo.[CustomerTransaction]

2) Then disable all indexes before ETL.

alter index all on dbo.[CustomerTransaction] DISABLE

3) We conduct SSIS data flow with default fast load, which I read is equivalent to bulk insert. No transformations occur here.

4) Then reenable all indexes after import is done.

alter index all on dbo.[CustomerTransaction] REBUILD

We are hearing information that ETL Stage tables are good as heaps. However, also learning of fragmentation and performance issues with heaps. Reading all the articles below

https://sqlsunday.com/2016/09/01/compelling-case-for-heaps/

https://www.mssqltips.com/sqlservertip/4961/sql-server-insert-performance-for-clustered-indexes-vs-heap-tables/

http://kejser.org/clustered-indexes-vs-heaps/

https://www.red-gate.com/simple-talk/sql/database-administration/sql-server-heaps-and-their-fragmentation/

We know a good reason to have identity is for row labelling, however question is mostly about internals and performance.

edited Nov 10 at 16:57

asked Nov 10 at 9:29

knightbob472

6613

create table dbo.CustomerTransaction
(
 CustomerName varchar(255),
 PurchaseLocation varchar(255),
 Productid int,
 AmountSold float,
 CustomerAddress varchar(50)
)

create table dbo.CustomerTransaction
(
 -- discussion for adding this column
 CustomerTransactionId int primary key clustered identity(1,1) 

 CustomerName varchar(255),
 PurchaseLocation varchar(255),
 Productid int,
 AmountSold float,
 CustomerAddress varchar(50)
)

-- both tables have nonclustered indexes
create nonclustered index idx_ProductId on dbo.CustomerTransaction(ProductId)
create nonclustered index idx_CustomerAddress on dbo.CustomerTransaction(CustomerAddress)

-- Actually have more indexes, tables above are just for sample

1) Before ETL, the staging tables are truncated. There are No Deletes and No Updates. Only Inserts.

truncate table dbo.[CustomerTransaction]

2) Then disable all indexes before ETL.

alter index all on dbo.[CustomerTransaction] DISABLE

3) We conduct SSIS data flow with default fast load, which I read is equivalent to bulk insert. No transformations occur here.

4) Then reenable all indexes after import is done.

alter index all on dbo.[CustomerTransaction] REBUILD

We are hearing information that ETL Stage tables are good as heaps. However, also learning of fragmentation and performance issues with heaps. Reading all the articles below

https://sqlsunday.com/2016/09/01/compelling-case-for-heaps/

https://www.mssqltips.com/sqlservertip/4961/sql-server-insert-performance-for-clustered-indexes-vs-heap-tables/

http://kejser.org/clustered-indexes-vs-heaps/

https://www.red-gate.com/simple-talk/sql/database-administration/sql-server-heaps-and-their-fragmentation/

We know a good reason to have identity is for row labelling, however question is mostly about internals and performance.

sql-server database-design sql-server-2016 performance-tuning etl

edited Nov 10 at 16:57

asked Nov 10 at 9:29

knightbob472

6613

edited Nov 10 at 16:57

asked Nov 10 at 9:29

knightbob472

6613

edited Nov 10 at 16:57

asked Nov 10 at 9:29

knightbob472

6613

asked Nov 10 at 9:29

knightbob472

6613

asked Nov 10 at 9:29

knightbob472

6613

add a comment |

2 Answers
2

active

oldest

votes

up vote
2
down vote

accepted

We had a similar scenario and recently switched our staging tables from clustered indexes to heaps. The first big advantage for us was that we wanted to allow concurrent SSIS loads into the same staging table. You can do that with a clustered index, but you'll likely run into a lot of blocking, especially with an identity column. The second big advantage was cutting down on the overhead of loading the staging tables. We found that our loads went much faster on heaps compared to clustered indexes.

Our performance testing does not show much difference, but our data
may change later. So we need to make a good design decision.

Are you sure that this is true? In the question you say that you truncate your staging tables before the load. If some part of your load process changes, it should be very straightforward to add or remove a clustered index while the tables are empty. There's no data movement involved. It doesn't sound like you would get any benefit from a clustered index, so I would try it out as a heap and monitor performance.

answered Nov 11 at 16:51

Joe Obbish

20.1k32779

add a comment |

up vote
5
down vote

Having an identity column doesn’t force you to use it as a clustered index key.

You’re right that heaps work well here. I would consider Thomas Kejser to be the authority on the subject, and it’s good you’ve listed him as one of your resources.

As for fragmentation in heaps - doesn’t happen on insert-only.

Edit: Go through this article about parallel insert, and notice the comparisons between heaps and clustered indexes. https://blogs.msdn.microsoft.com/sqlcat/2016/07/21/real-world-parallel-insert-what-else-you-need-to-know/

edited Nov 10 at 21:59

answered Nov 10 at 10:33

Rob Farley

13.2k12446

@knightbob472 Are you doing 30 of these flat file imports simultaneously, or just one? A guid won't really help distribute I/O for a straight insert unless (a) you are inserting enough rows to parallelize from the source and (b) you're actually partitioned in a way that the GUIDs will all land on separate I/O devices. Generally Kejser's advice is quite good, but some of it only applies to the very, very high-end.
– Aaron Bertrand♦
Nov 10 at 16:53

Thomas’s stuff includes hotspot avoidance caused by having a clustered index on an identity column. If you’re pushing data into a table across multiple threads, you can see the same problem, even if you’re not doing it at the extreme.
– Rob Farley
Nov 10 at 21:52

add a comment |

Your Answer

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "182"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f222247%2fsql-staging-tables-primary-key-clustered-or-heap%23new-answer', 'question_page');

);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

up vote
2
down vote

accepted

Our performance testing does not show much difference, but our data
may change later. So we need to make a good design decision.

answered Nov 11 at 16:51

Joe Obbish

20.1k32779

add a comment |

up vote
2
down vote

accepted

Our performance testing does not show much difference, but our data
may change later. So we need to make a good design decision.

answered Nov 11 at 16:51

Joe Obbish

20.1k32779

add a comment |

up vote
2
down vote

accepted

Our performance testing does not show much difference, but our data
may change later. So we need to make a good design decision.

answered Nov 11 at 16:51

Joe Obbish

20.1k32779

Our performance testing does not show much difference, but our data
may change later. So we need to make a good design decision.

answered Nov 11 at 16:51

Joe Obbish

20.1k32779

answered Nov 11 at 16:51

Joe Obbish

20.1k32779

answered Nov 11 at 16:51

Joe Obbish

20.1k32779

answered Nov 11 at 16:51

Joe Obbish

20.1k32779

add a comment |

up vote
5
down vote

Having an identity column doesn’t force you to use it as a clustered index key.

You’re right that heaps work well here. I would consider Thomas Kejser to be the authority on the subject, and it’s good you’ve listed him as one of your resources.

As for fragmentation in heaps - doesn’t happen on insert-only.

edited Nov 10 at 21:59

answered Nov 10 at 10:33

Rob Farley

13.2k12446

@knightbob472 Are you doing 30 of these flat file imports simultaneously, or just one? A guid won't really help distribute I/O for a straight insert unless (a) you are inserting enough rows to parallelize from the source and (b) you're actually partitioned in a way that the GUIDs will all land on separate I/O devices. Generally Kejser's advice is quite good, but some of it only applies to the very, very high-end.
– Aaron Bertrand♦
Nov 10 at 16:53

Thomas’s stuff includes hotspot avoidance caused by having a clustered index on an identity column. If you’re pushing data into a table across multiple threads, you can see the same problem, even if you’re not doing it at the extreme.
– Rob Farley
Nov 10 at 21:52

add a comment |

up vote
5
down vote

Having an identity column doesn’t force you to use it as a clustered index key.

You’re right that heaps work well here. I would consider Thomas Kejser to be the authority on the subject, and it’s good you’ve listed him as one of your resources.

As for fragmentation in heaps - doesn’t happen on insert-only.

edited Nov 10 at 21:59

answered Nov 10 at 10:33

Rob Farley

13.2k12446

@knightbob472 Are you doing 30 of these flat file imports simultaneously, or just one? A guid won't really help distribute I/O for a straight insert unless (a) you are inserting enough rows to parallelize from the source and (b) you're actually partitioned in a way that the GUIDs will all land on separate I/O devices. Generally Kejser's advice is quite good, but some of it only applies to the very, very high-end.
– Aaron Bertrand♦
Nov 10 at 16:53

Thomas’s stuff includes hotspot avoidance caused by having a clustered index on an identity column. If you’re pushing data into a table across multiple threads, you can see the same problem, even if you’re not doing it at the extreme.
– Rob Farley
Nov 10 at 21:52

add a comment |

up vote
5
down vote

Having an identity column doesn’t force you to use it as a clustered index key.

You’re right that heaps work well here. I would consider Thomas Kejser to be the authority on the subject, and it’s good you’ve listed him as one of your resources.

As for fragmentation in heaps - doesn’t happen on insert-only.

edited Nov 10 at 21:59

answered Nov 10 at 10:33

Rob Farley

13.2k12446

Having an identity column doesn’t force you to use it as a clustered index key.

You’re right that heaps work well here. I would consider Thomas Kejser to be the authority on the subject, and it’s good you’ve listed him as one of your resources.

As for fragmentation in heaps - doesn’t happen on insert-only.

edited Nov 10 at 21:59

answered Nov 10 at 10:33

Rob Farley

13.2k12446

edited Nov 10 at 21:59

answered Nov 10 at 10:33

Rob Farley

13.2k12446

answered Nov 10 at 10:33

Rob Farley

13.2k12446

answered Nov 10 at 10:33

Rob Farley

13.2k12446

@knightbob472 Are you doing 30 of these flat file imports simultaneously, or just one? A guid won't really help distribute I/O for a straight insert unless (a) you are inserting enough rows to parallelize from the source and (b) you're actually partitioned in a way that the GUIDs will all land on separate I/O devices. Generally Kejser's advice is quite good, but some of it only applies to the very, very high-end.
– Aaron Bertrand♦
Nov 10 at 16:53

Thomas’s stuff includes hotspot avoidance caused by having a clustered index on an identity column. If you’re pushing data into a table across multiple threads, you can see the same problem, even if you’re not doing it at the extreme.
– Rob Farley
Nov 10 at 21:52

add a comment |

@knightbob472 Are you doing 30 of these flat file imports simultaneously, or just one? A guid won't really help distribute I/O for a straight insert unless (a) you are inserting enough rows to parallelize from the source and (b) you're actually partitioned in a way that the GUIDs will all land on separate I/O devices. Generally Kejser's advice is quite good, but some of it only applies to the very, very high-end.
– Aaron Bertrand♦
Nov 10 at 16:53

Thomas’s stuff includes hotspot avoidance caused by having a clustered index on an identity column. If you’re pushing data into a table across multiple threads, you can see the same problem, even if you’re not doing it at the extreme.
– Rob Farley
Nov 10 at 21:52

@knightbob472 Are you doing 30 of these flat file imports simultaneously, or just one? A guid won't really help distribute I/O for a straight insert unless (a) you are inserting enough rows to parallelize from the source and (b) you're actually partitioned in a way that the GUIDs will all land on separate I/O devices. Generally Kejser's advice is quite good, but some of it only applies to the very, very high-end.
– Aaron Bertrand♦
Nov 10 at 16:53

Thomas’s stuff includes hotspot avoidance caused by having a clustered index on an identity column. If you’re pushing data into a table across multiple threads, you can see the same problem, even if you’re not doing it at the extreme.
– Rob Farley
Nov 10 at 21:52

add a comment |

draft saved

draft discarded

draft saved

draft discarded

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Odtnhj