SQL Staging Tables: Primary Key Clustered or Heap



.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;







up vote
5
down vote

favorite
1












We are taking legacy flat txt files and inserting them into stage tables with SSIS.The question arose whether table should have primary clustered key index or not. This is direct flat file import with no transformation.



create table dbo.CustomerTransaction
(
CustomerName varchar(255),
PurchaseLocation varchar(255),
Productid int,
AmountSold float,
CustomerAddress varchar(50)
)

create table dbo.CustomerTransaction
(
-- discussion for adding this column
CustomerTransactionId int primary key clustered identity(1,1)

CustomerName varchar(255),
PurchaseLocation varchar(255),
Productid int,
AmountSold float,
CustomerAddress varchar(50)
)

-- both tables have nonclustered indexes
create nonclustered index idx_ProductId on dbo.CustomerTransaction(ProductId)
create nonclustered index idx_CustomerAddress on dbo.CustomerTransaction(CustomerAddress)

-- Actually have more indexes, tables above are just for sample


1) Before ETL, the staging tables are truncated. There are No Deletes and No Updates. Only Inserts.



truncate table dbo.[CustomerTransaction]


2) Then disable all indexes before ETL.



alter index all on dbo.[CustomerTransaction] DISABLE


3) We conduct SSIS data flow with default fast load, which I read is equivalent to bulk insert. No transformations occur here.



4) Then reenable all indexes after import is done.



alter index all on dbo.[CustomerTransaction] REBUILD


5) The staging tables are then selected on join and where clauses, and placed into datawarehouse. This we why we have nonclustered indexes. After data warehouse is loaded, we truncate the staging tables.



We are hearing information that ETL Stage tables are good as heaps. However, also learning of fragmentation and performance issues with heaps. Reading all the articles below



I am reading conflicting opinions. One says Binary tree clustered are maintenance headaches for import ETL. Other says Heaps have performance issues with fragmentation. Our performance testing does not show much difference, but our data may change later. So we need to make a good design decision.



https://sqlsunday.com/2016/09/01/compelling-case-for-heaps/



https://www.mssqltips.com/sqlservertip/4961/sql-server-insert-performance-for-clustered-indexes-vs-heap-tables/



http://kejser.org/clustered-indexes-vs-heaps/



https://www.red-gate.com/simple-talk/sql/database-administration/sql-server-heaps-and-their-fragmentation/



We know a good reason to have identity is for row labelling, however question is mostly about internals and performance.










share|improve this question





























    up vote
    5
    down vote

    favorite
    1












    We are taking legacy flat txt files and inserting them into stage tables with SSIS.The question arose whether table should have primary clustered key index or not. This is direct flat file import with no transformation.



    create table dbo.CustomerTransaction
    (
    CustomerName varchar(255),
    PurchaseLocation varchar(255),
    Productid int,
    AmountSold float,
    CustomerAddress varchar(50)
    )

    create table dbo.CustomerTransaction
    (
    -- discussion for adding this column
    CustomerTransactionId int primary key clustered identity(1,1)

    CustomerName varchar(255),
    PurchaseLocation varchar(255),
    Productid int,
    AmountSold float,
    CustomerAddress varchar(50)
    )

    -- both tables have nonclustered indexes
    create nonclustered index idx_ProductId on dbo.CustomerTransaction(ProductId)
    create nonclustered index idx_CustomerAddress on dbo.CustomerTransaction(CustomerAddress)

    -- Actually have more indexes, tables above are just for sample


    1) Before ETL, the staging tables are truncated. There are No Deletes and No Updates. Only Inserts.



    truncate table dbo.[CustomerTransaction]


    2) Then disable all indexes before ETL.



    alter index all on dbo.[CustomerTransaction] DISABLE


    3) We conduct SSIS data flow with default fast load, which I read is equivalent to bulk insert. No transformations occur here.



    4) Then reenable all indexes after import is done.



    alter index all on dbo.[CustomerTransaction] REBUILD


    5) The staging tables are then selected on join and where clauses, and placed into datawarehouse. This we why we have nonclustered indexes. After data warehouse is loaded, we truncate the staging tables.



    We are hearing information that ETL Stage tables are good as heaps. However, also learning of fragmentation and performance issues with heaps. Reading all the articles below



    I am reading conflicting opinions. One says Binary tree clustered are maintenance headaches for import ETL. Other says Heaps have performance issues with fragmentation. Our performance testing does not show much difference, but our data may change later. So we need to make a good design decision.



    https://sqlsunday.com/2016/09/01/compelling-case-for-heaps/



    https://www.mssqltips.com/sqlservertip/4961/sql-server-insert-performance-for-clustered-indexes-vs-heap-tables/



    http://kejser.org/clustered-indexes-vs-heaps/



    https://www.red-gate.com/simple-talk/sql/database-administration/sql-server-heaps-and-their-fragmentation/



    We know a good reason to have identity is for row labelling, however question is mostly about internals and performance.










    share|improve this question

























      up vote
      5
      down vote

      favorite
      1









      up vote
      5
      down vote

      favorite
      1






      1





      We are taking legacy flat txt files and inserting them into stage tables with SSIS.The question arose whether table should have primary clustered key index or not. This is direct flat file import with no transformation.



      create table dbo.CustomerTransaction
      (
      CustomerName varchar(255),
      PurchaseLocation varchar(255),
      Productid int,
      AmountSold float,
      CustomerAddress varchar(50)
      )

      create table dbo.CustomerTransaction
      (
      -- discussion for adding this column
      CustomerTransactionId int primary key clustered identity(1,1)

      CustomerName varchar(255),
      PurchaseLocation varchar(255),
      Productid int,
      AmountSold float,
      CustomerAddress varchar(50)
      )

      -- both tables have nonclustered indexes
      create nonclustered index idx_ProductId on dbo.CustomerTransaction(ProductId)
      create nonclustered index idx_CustomerAddress on dbo.CustomerTransaction(CustomerAddress)

      -- Actually have more indexes, tables above are just for sample


      1) Before ETL, the staging tables are truncated. There are No Deletes and No Updates. Only Inserts.



      truncate table dbo.[CustomerTransaction]


      2) Then disable all indexes before ETL.



      alter index all on dbo.[CustomerTransaction] DISABLE


      3) We conduct SSIS data flow with default fast load, which I read is equivalent to bulk insert. No transformations occur here.



      4) Then reenable all indexes after import is done.



      alter index all on dbo.[CustomerTransaction] REBUILD


      5) The staging tables are then selected on join and where clauses, and placed into datawarehouse. This we why we have nonclustered indexes. After data warehouse is loaded, we truncate the staging tables.



      We are hearing information that ETL Stage tables are good as heaps. However, also learning of fragmentation and performance issues with heaps. Reading all the articles below



      I am reading conflicting opinions. One says Binary tree clustered are maintenance headaches for import ETL. Other says Heaps have performance issues with fragmentation. Our performance testing does not show much difference, but our data may change later. So we need to make a good design decision.



      https://sqlsunday.com/2016/09/01/compelling-case-for-heaps/



      https://www.mssqltips.com/sqlservertip/4961/sql-server-insert-performance-for-clustered-indexes-vs-heap-tables/



      http://kejser.org/clustered-indexes-vs-heaps/



      https://www.red-gate.com/simple-talk/sql/database-administration/sql-server-heaps-and-their-fragmentation/



      We know a good reason to have identity is for row labelling, however question is mostly about internals and performance.










      share|improve this question















      We are taking legacy flat txt files and inserting them into stage tables with SSIS.The question arose whether table should have primary clustered key index or not. This is direct flat file import with no transformation.



      create table dbo.CustomerTransaction
      (
      CustomerName varchar(255),
      PurchaseLocation varchar(255),
      Productid int,
      AmountSold float,
      CustomerAddress varchar(50)
      )

      create table dbo.CustomerTransaction
      (
      -- discussion for adding this column
      CustomerTransactionId int primary key clustered identity(1,1)

      CustomerName varchar(255),
      PurchaseLocation varchar(255),
      Productid int,
      AmountSold float,
      CustomerAddress varchar(50)
      )

      -- both tables have nonclustered indexes
      create nonclustered index idx_ProductId on dbo.CustomerTransaction(ProductId)
      create nonclustered index idx_CustomerAddress on dbo.CustomerTransaction(CustomerAddress)

      -- Actually have more indexes, tables above are just for sample


      1) Before ETL, the staging tables are truncated. There are No Deletes and No Updates. Only Inserts.



      truncate table dbo.[CustomerTransaction]


      2) Then disable all indexes before ETL.



      alter index all on dbo.[CustomerTransaction] DISABLE


      3) We conduct SSIS data flow with default fast load, which I read is equivalent to bulk insert. No transformations occur here.



      4) Then reenable all indexes after import is done.



      alter index all on dbo.[CustomerTransaction] REBUILD


      5) The staging tables are then selected on join and where clauses, and placed into datawarehouse. This we why we have nonclustered indexes. After data warehouse is loaded, we truncate the staging tables.



      We are hearing information that ETL Stage tables are good as heaps. However, also learning of fragmentation and performance issues with heaps. Reading all the articles below



      I am reading conflicting opinions. One says Binary tree clustered are maintenance headaches for import ETL. Other says Heaps have performance issues with fragmentation. Our performance testing does not show much difference, but our data may change later. So we need to make a good design decision.



      https://sqlsunday.com/2016/09/01/compelling-case-for-heaps/



      https://www.mssqltips.com/sqlservertip/4961/sql-server-insert-performance-for-clustered-indexes-vs-heap-tables/



      http://kejser.org/clustered-indexes-vs-heaps/



      https://www.red-gate.com/simple-talk/sql/database-administration/sql-server-heaps-and-their-fragmentation/



      We know a good reason to have identity is for row labelling, however question is mostly about internals and performance.







      sql-server database-design sql-server-2016 performance-tuning etl






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Nov 10 at 16:57

























      asked Nov 10 at 9:29









      knightbob472

      6613




      6613




















          2 Answers
          2






          active

          oldest

          votes

















          up vote
          2
          down vote



          accepted










          We had a similar scenario and recently switched our staging tables from clustered indexes to heaps. The first big advantage for us was that we wanted to allow concurrent SSIS loads into the same staging table. You can do that with a clustered index, but you'll likely run into a lot of blocking, especially with an identity column. The second big advantage was cutting down on the overhead of loading the staging tables. We found that our loads went much faster on heaps compared to clustered indexes.




          Our performance testing does not show much difference, but our data
          may change later. So we need to make a good design decision.




          Are you sure that this is true? In the question you say that you truncate your staging tables before the load. If some part of your load process changes, it should be very straightforward to add or remove a clustered index while the tables are empty. There's no data movement involved. It doesn't sound like you would get any benefit from a clustered index, so I would try it out as a heap and monitor performance.






          share|improve this answer



























            up vote
            5
            down vote













            Having an identity column doesn’t force you to use it as a clustered index key.



            You’re right that heaps work well here. I would consider Thomas Kejser to be the authority on the subject, and it’s good you’ve listed him as one of your resources.



            As for fragmentation in heaps - doesn’t happen on insert-only.



            Edit: Go through this article about parallel insert, and notice the comparisons between heaps and clustered indexes. https://blogs.msdn.microsoft.com/sqlcat/2016/07/21/real-world-parallel-insert-what-else-you-need-to-know/






            share|improve this answer






















            • @knightbob472 Are you doing 30 of these flat file imports simultaneously, or just one? A guid won't really help distribute I/O for a straight insert unless (a) you are inserting enough rows to parallelize from the source and (b) you're actually partitioned in a way that the GUIDs will all land on separate I/O devices. Generally Kejser's advice is quite good, but some of it only applies to the very, very high-end.
              – Aaron Bertrand
              Nov 10 at 16:53











            • Thomas’s stuff includes hotspot avoidance caused by having a clustered index on an identity column. If you’re pushing data into a table across multiple threads, you can see the same problem, even if you’re not doing it at the extreme.
              – Rob Farley
              Nov 10 at 21:52










            Your Answer








            StackExchange.ready(function()
            var channelOptions =
            tags: "".split(" "),
            id: "182"
            ;
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function()
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled)
            StackExchange.using("snippets", function()
            createEditor();
            );

            else
            createEditor();

            );

            function createEditor()
            StackExchange.prepareEditor(
            heartbeatType: 'answer',
            convertImagesToLinks: false,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: null,
            bindNavPrevention: true,
            postfix: "",
            imageUploader:
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            ,
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            );



            );













             

            draft saved


            draft discarded


















            StackExchange.ready(
            function ()
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f222247%2fsql-staging-tables-primary-key-clustered-or-heap%23new-answer', 'question_page');

            );

            Post as a guest















            Required, but never shown

























            2 Answers
            2






            active

            oldest

            votes








            2 Answers
            2






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes








            up vote
            2
            down vote



            accepted










            We had a similar scenario and recently switched our staging tables from clustered indexes to heaps. The first big advantage for us was that we wanted to allow concurrent SSIS loads into the same staging table. You can do that with a clustered index, but you'll likely run into a lot of blocking, especially with an identity column. The second big advantage was cutting down on the overhead of loading the staging tables. We found that our loads went much faster on heaps compared to clustered indexes.




            Our performance testing does not show much difference, but our data
            may change later. So we need to make a good design decision.




            Are you sure that this is true? In the question you say that you truncate your staging tables before the load. If some part of your load process changes, it should be very straightforward to add or remove a clustered index while the tables are empty. There's no data movement involved. It doesn't sound like you would get any benefit from a clustered index, so I would try it out as a heap and monitor performance.






            share|improve this answer
























              up vote
              2
              down vote



              accepted










              We had a similar scenario and recently switched our staging tables from clustered indexes to heaps. The first big advantage for us was that we wanted to allow concurrent SSIS loads into the same staging table. You can do that with a clustered index, but you'll likely run into a lot of blocking, especially with an identity column. The second big advantage was cutting down on the overhead of loading the staging tables. We found that our loads went much faster on heaps compared to clustered indexes.




              Our performance testing does not show much difference, but our data
              may change later. So we need to make a good design decision.




              Are you sure that this is true? In the question you say that you truncate your staging tables before the load. If some part of your load process changes, it should be very straightforward to add or remove a clustered index while the tables are empty. There's no data movement involved. It doesn't sound like you would get any benefit from a clustered index, so I would try it out as a heap and monitor performance.






              share|improve this answer






















                up vote
                2
                down vote



                accepted







                up vote
                2
                down vote



                accepted






                We had a similar scenario and recently switched our staging tables from clustered indexes to heaps. The first big advantage for us was that we wanted to allow concurrent SSIS loads into the same staging table. You can do that with a clustered index, but you'll likely run into a lot of blocking, especially with an identity column. The second big advantage was cutting down on the overhead of loading the staging tables. We found that our loads went much faster on heaps compared to clustered indexes.




                Our performance testing does not show much difference, but our data
                may change later. So we need to make a good design decision.




                Are you sure that this is true? In the question you say that you truncate your staging tables before the load. If some part of your load process changes, it should be very straightforward to add or remove a clustered index while the tables are empty. There's no data movement involved. It doesn't sound like you would get any benefit from a clustered index, so I would try it out as a heap and monitor performance.






                share|improve this answer












                We had a similar scenario and recently switched our staging tables from clustered indexes to heaps. The first big advantage for us was that we wanted to allow concurrent SSIS loads into the same staging table. You can do that with a clustered index, but you'll likely run into a lot of blocking, especially with an identity column. The second big advantage was cutting down on the overhead of loading the staging tables. We found that our loads went much faster on heaps compared to clustered indexes.




                Our performance testing does not show much difference, but our data
                may change later. So we need to make a good design decision.




                Are you sure that this is true? In the question you say that you truncate your staging tables before the load. If some part of your load process changes, it should be very straightforward to add or remove a clustered index while the tables are empty. There's no data movement involved. It doesn't sound like you would get any benefit from a clustered index, so I would try it out as a heap and monitor performance.







                share|improve this answer












                share|improve this answer



                share|improve this answer










                answered Nov 11 at 16:51









                Joe Obbish

                20.1k32779




                20.1k32779






















                    up vote
                    5
                    down vote













                    Having an identity column doesn’t force you to use it as a clustered index key.



                    You’re right that heaps work well here. I would consider Thomas Kejser to be the authority on the subject, and it’s good you’ve listed him as one of your resources.



                    As for fragmentation in heaps - doesn’t happen on insert-only.



                    Edit: Go through this article about parallel insert, and notice the comparisons between heaps and clustered indexes. https://blogs.msdn.microsoft.com/sqlcat/2016/07/21/real-world-parallel-insert-what-else-you-need-to-know/






                    share|improve this answer






















                    • @knightbob472 Are you doing 30 of these flat file imports simultaneously, or just one? A guid won't really help distribute I/O for a straight insert unless (a) you are inserting enough rows to parallelize from the source and (b) you're actually partitioned in a way that the GUIDs will all land on separate I/O devices. Generally Kejser's advice is quite good, but some of it only applies to the very, very high-end.
                      – Aaron Bertrand
                      Nov 10 at 16:53











                    • Thomas’s stuff includes hotspot avoidance caused by having a clustered index on an identity column. If you’re pushing data into a table across multiple threads, you can see the same problem, even if you’re not doing it at the extreme.
                      – Rob Farley
                      Nov 10 at 21:52














                    up vote
                    5
                    down vote













                    Having an identity column doesn’t force you to use it as a clustered index key.



                    You’re right that heaps work well here. I would consider Thomas Kejser to be the authority on the subject, and it’s good you’ve listed him as one of your resources.



                    As for fragmentation in heaps - doesn’t happen on insert-only.



                    Edit: Go through this article about parallel insert, and notice the comparisons between heaps and clustered indexes. https://blogs.msdn.microsoft.com/sqlcat/2016/07/21/real-world-parallel-insert-what-else-you-need-to-know/






                    share|improve this answer






















                    • @knightbob472 Are you doing 30 of these flat file imports simultaneously, or just one? A guid won't really help distribute I/O for a straight insert unless (a) you are inserting enough rows to parallelize from the source and (b) you're actually partitioned in a way that the GUIDs will all land on separate I/O devices. Generally Kejser's advice is quite good, but some of it only applies to the very, very high-end.
                      – Aaron Bertrand
                      Nov 10 at 16:53











                    • Thomas’s stuff includes hotspot avoidance caused by having a clustered index on an identity column. If you’re pushing data into a table across multiple threads, you can see the same problem, even if you’re not doing it at the extreme.
                      – Rob Farley
                      Nov 10 at 21:52












                    up vote
                    5
                    down vote










                    up vote
                    5
                    down vote









                    Having an identity column doesn’t force you to use it as a clustered index key.



                    You’re right that heaps work well here. I would consider Thomas Kejser to be the authority on the subject, and it’s good you’ve listed him as one of your resources.



                    As for fragmentation in heaps - doesn’t happen on insert-only.



                    Edit: Go through this article about parallel insert, and notice the comparisons between heaps and clustered indexes. https://blogs.msdn.microsoft.com/sqlcat/2016/07/21/real-world-parallel-insert-what-else-you-need-to-know/






                    share|improve this answer














                    Having an identity column doesn’t force you to use it as a clustered index key.



                    You’re right that heaps work well here. I would consider Thomas Kejser to be the authority on the subject, and it’s good you’ve listed him as one of your resources.



                    As for fragmentation in heaps - doesn’t happen on insert-only.



                    Edit: Go through this article about parallel insert, and notice the comparisons between heaps and clustered indexes. https://blogs.msdn.microsoft.com/sqlcat/2016/07/21/real-world-parallel-insert-what-else-you-need-to-know/







                    share|improve this answer














                    share|improve this answer



                    share|improve this answer








                    edited Nov 10 at 21:59

























                    answered Nov 10 at 10:33









                    Rob Farley

                    13.2k12446




                    13.2k12446











                    • @knightbob472 Are you doing 30 of these flat file imports simultaneously, or just one? A guid won't really help distribute I/O for a straight insert unless (a) you are inserting enough rows to parallelize from the source and (b) you're actually partitioned in a way that the GUIDs will all land on separate I/O devices. Generally Kejser's advice is quite good, but some of it only applies to the very, very high-end.
                      – Aaron Bertrand
                      Nov 10 at 16:53











                    • Thomas’s stuff includes hotspot avoidance caused by having a clustered index on an identity column. If you’re pushing data into a table across multiple threads, you can see the same problem, even if you’re not doing it at the extreme.
                      – Rob Farley
                      Nov 10 at 21:52
















                    • @knightbob472 Are you doing 30 of these flat file imports simultaneously, or just one? A guid won't really help distribute I/O for a straight insert unless (a) you are inserting enough rows to parallelize from the source and (b) you're actually partitioned in a way that the GUIDs will all land on separate I/O devices. Generally Kejser's advice is quite good, but some of it only applies to the very, very high-end.
                      – Aaron Bertrand
                      Nov 10 at 16:53











                    • Thomas’s stuff includes hotspot avoidance caused by having a clustered index on an identity column. If you’re pushing data into a table across multiple threads, you can see the same problem, even if you’re not doing it at the extreme.
                      – Rob Farley
                      Nov 10 at 21:52















                    @knightbob472 Are you doing 30 of these flat file imports simultaneously, or just one? A guid won't really help distribute I/O for a straight insert unless (a) you are inserting enough rows to parallelize from the source and (b) you're actually partitioned in a way that the GUIDs will all land on separate I/O devices. Generally Kejser's advice is quite good, but some of it only applies to the very, very high-end.
                    – Aaron Bertrand
                    Nov 10 at 16:53





                    @knightbob472 Are you doing 30 of these flat file imports simultaneously, or just one? A guid won't really help distribute I/O for a straight insert unless (a) you are inserting enough rows to parallelize from the source and (b) you're actually partitioned in a way that the GUIDs will all land on separate I/O devices. Generally Kejser's advice is quite good, but some of it only applies to the very, very high-end.
                    – Aaron Bertrand
                    Nov 10 at 16:53













                    Thomas’s stuff includes hotspot avoidance caused by having a clustered index on an identity column. If you’re pushing data into a table across multiple threads, you can see the same problem, even if you’re not doing it at the extreme.
                    – Rob Farley
                    Nov 10 at 21:52




                    Thomas’s stuff includes hotspot avoidance caused by having a clustered index on an identity column. If you’re pushing data into a table across multiple threads, you can see the same problem, even if you’re not doing it at the extreme.
                    – Rob Farley
                    Nov 10 at 21:52

















                     

                    draft saved


                    draft discarded















































                     


                    draft saved


                    draft discarded














                    StackExchange.ready(
                    function ()
                    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f222247%2fsql-staging-tables-primary-key-clustered-or-heap%23new-answer', 'question_page');

                    );

                    Post as a guest















                    Required, but never shown





















































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown

































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown







                    這個網誌中的熱門文章

                    What does pagestruct do in Eviews?

                    Dutch intervention in Lombok and Karangasem

                    Channel Islands