How to model “dimension” tables in TiDB?










0















I would like to designate certain tables as replicated to all TiKV stores such that they are always available to join with locally (thereby reducing expensive distributed joins at the TiDB level). This would allow the TiKV coprocessor to join locally to this table because it's always available
(ie: replicated to every TiKV). In the OLAP terminology of "dimensions" and "facts", this is a dimension table. In this scenario, I'd like to shard facts and replicate dimensions. It appears that TiDB treats everything as a sharded fact. Can this be done? If not, can it be approximated with some other technique? How amenable is the code base to allowing this type of feature?










share|improve this question


























    0















    I would like to designate certain tables as replicated to all TiKV stores such that they are always available to join with locally (thereby reducing expensive distributed joins at the TiDB level). This would allow the TiKV coprocessor to join locally to this table because it's always available
    (ie: replicated to every TiKV). In the OLAP terminology of "dimensions" and "facts", this is a dimension table. In this scenario, I'd like to shard facts and replicate dimensions. It appears that TiDB treats everything as a sharded fact. Can this be done? If not, can it be approximated with some other technique? How amenable is the code base to allowing this type of feature?










    share|improve this question
























      0












      0








      0








      I would like to designate certain tables as replicated to all TiKV stores such that they are always available to join with locally (thereby reducing expensive distributed joins at the TiDB level). This would allow the TiKV coprocessor to join locally to this table because it's always available
      (ie: replicated to every TiKV). In the OLAP terminology of "dimensions" and "facts", this is a dimension table. In this scenario, I'd like to shard facts and replicate dimensions. It appears that TiDB treats everything as a sharded fact. Can this be done? If not, can it be approximated with some other technique? How amenable is the code base to allowing this type of feature?










      share|improve this question














      I would like to designate certain tables as replicated to all TiKV stores such that they are always available to join with locally (thereby reducing expensive distributed joins at the TiDB level). This would allow the TiKV coprocessor to join locally to this table because it's always available
      (ie: replicated to every TiKV). In the OLAP terminology of "dimensions" and "facts", this is a dimension table. In this scenario, I'd like to shard facts and replicate dimensions. It appears that TiDB treats everything as a sharded fact. Can this be done? If not, can it be approximated with some other technique? How amenable is the code base to allowing this type of feature?







      distributed-database tidb






      share|improve this question













      share|improve this question











      share|improve this question




      share|improve this question










      asked Nov 15 '18 at 7:00









      Jeremy NorrisJeremy Norris

      11




      11






















          1 Answer
          1






          active

          oldest

          votes


















          0














          At present, TiDB splits each table into regions, and do the replication in the region level. It's hard to replicate a table into each TiKV server, even if it only contains one region. For example there are 100 nodes in the TiKV cluster but the configured number of region replica is 5.



          We don't need to do the join operation in the TiKV coprocessor. We can read each dimension table from TiKV to multiply TiDB nodes and associate each involved TiDB node a portion of the fact table according to the data distribution of the fact table. Thus the join operation is done in the TiDB layer.



          The technique described in the above is not implemented yet. But it's already on our roadmap.






          share|improve this answer























          • Thanks for the response. Isn't what you describe how the join is done in TiDB today (2.1.0)? If not, how is this future technique different? This approach seems a lot slower than if this join could be pushed down to TiKV, otherwise we end up with a large distributed join at the TiDB layer (especially as the dimension table becomes large).

            – Jeremy Norris
            Nov 16 '18 at 5:41











          • @JeremyNorris, No, the join operation I described above has not been implemented in the current TiDB today(v2.1.0 or the next release).

            – Jian Zhang
            Nov 17 '18 at 6:13











          • TiKV itself use a lot of memory to buffer the table or index data to avoid scanning disk for some requests. I think this is another reason that the join is better not to be pushed to the TiKV layer, because the join operation also consumes a huge amount of memory. If the join is pushed to the TiKV layer, the risk of OOM on the TiKV server is increased.

            – Jian Zhang
            Nov 17 '18 at 6:24











          Your Answer






          StackExchange.ifUsing("editor", function ()
          StackExchange.using("externalEditor", function ()
          StackExchange.using("snippets", function ()
          StackExchange.snippets.init();
          );
          );
          , "code-snippets");

          StackExchange.ready(function()
          var channelOptions =
          tags: "".split(" "),
          id: "1"
          ;
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function()
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled)
          StackExchange.using("snippets", function()
          createEditor();
          );

          else
          createEditor();

          );

          function createEditor()
          StackExchange.prepareEditor(
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: true,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: 10,
          bindNavPrevention: true,
          postfix: "",
          imageUploader:
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          ,
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          );



          );













          draft saved

          draft discarded


















          StackExchange.ready(
          function ()
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53314007%2fhow-to-model-dimension-tables-in-tidb%23new-answer', 'question_page');

          );

          Post as a guest















          Required, but never shown

























          1 Answer
          1






          active

          oldest

          votes








          1 Answer
          1






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          0














          At present, TiDB splits each table into regions, and do the replication in the region level. It's hard to replicate a table into each TiKV server, even if it only contains one region. For example there are 100 nodes in the TiKV cluster but the configured number of region replica is 5.



          We don't need to do the join operation in the TiKV coprocessor. We can read each dimension table from TiKV to multiply TiDB nodes and associate each involved TiDB node a portion of the fact table according to the data distribution of the fact table. Thus the join operation is done in the TiDB layer.



          The technique described in the above is not implemented yet. But it's already on our roadmap.






          share|improve this answer























          • Thanks for the response. Isn't what you describe how the join is done in TiDB today (2.1.0)? If not, how is this future technique different? This approach seems a lot slower than if this join could be pushed down to TiKV, otherwise we end up with a large distributed join at the TiDB layer (especially as the dimension table becomes large).

            – Jeremy Norris
            Nov 16 '18 at 5:41











          • @JeremyNorris, No, the join operation I described above has not been implemented in the current TiDB today(v2.1.0 or the next release).

            – Jian Zhang
            Nov 17 '18 at 6:13











          • TiKV itself use a lot of memory to buffer the table or index data to avoid scanning disk for some requests. I think this is another reason that the join is better not to be pushed to the TiKV layer, because the join operation also consumes a huge amount of memory. If the join is pushed to the TiKV layer, the risk of OOM on the TiKV server is increased.

            – Jian Zhang
            Nov 17 '18 at 6:24
















          0














          At present, TiDB splits each table into regions, and do the replication in the region level. It's hard to replicate a table into each TiKV server, even if it only contains one region. For example there are 100 nodes in the TiKV cluster but the configured number of region replica is 5.



          We don't need to do the join operation in the TiKV coprocessor. We can read each dimension table from TiKV to multiply TiDB nodes and associate each involved TiDB node a portion of the fact table according to the data distribution of the fact table. Thus the join operation is done in the TiDB layer.



          The technique described in the above is not implemented yet. But it's already on our roadmap.






          share|improve this answer























          • Thanks for the response. Isn't what you describe how the join is done in TiDB today (2.1.0)? If not, how is this future technique different? This approach seems a lot slower than if this join could be pushed down to TiKV, otherwise we end up with a large distributed join at the TiDB layer (especially as the dimension table becomes large).

            – Jeremy Norris
            Nov 16 '18 at 5:41











          • @JeremyNorris, No, the join operation I described above has not been implemented in the current TiDB today(v2.1.0 or the next release).

            – Jian Zhang
            Nov 17 '18 at 6:13











          • TiKV itself use a lot of memory to buffer the table or index data to avoid scanning disk for some requests. I think this is another reason that the join is better not to be pushed to the TiKV layer, because the join operation also consumes a huge amount of memory. If the join is pushed to the TiKV layer, the risk of OOM on the TiKV server is increased.

            – Jian Zhang
            Nov 17 '18 at 6:24














          0












          0








          0







          At present, TiDB splits each table into regions, and do the replication in the region level. It's hard to replicate a table into each TiKV server, even if it only contains one region. For example there are 100 nodes in the TiKV cluster but the configured number of region replica is 5.



          We don't need to do the join operation in the TiKV coprocessor. We can read each dimension table from TiKV to multiply TiDB nodes and associate each involved TiDB node a portion of the fact table according to the data distribution of the fact table. Thus the join operation is done in the TiDB layer.



          The technique described in the above is not implemented yet. But it's already on our roadmap.






          share|improve this answer













          At present, TiDB splits each table into regions, and do the replication in the region level. It's hard to replicate a table into each TiKV server, even if it only contains one region. For example there are 100 nodes in the TiKV cluster but the configured number of region replica is 5.



          We don't need to do the join operation in the TiKV coprocessor. We can read each dimension table from TiKV to multiply TiDB nodes and associate each involved TiDB node a portion of the fact table according to the data distribution of the fact table. Thus the join operation is done in the TiDB layer.



          The technique described in the above is not implemented yet. But it's already on our roadmap.







          share|improve this answer












          share|improve this answer



          share|improve this answer










          answered Nov 16 '18 at 2:30









          Jian ZhangJian Zhang

          11




          11












          • Thanks for the response. Isn't what you describe how the join is done in TiDB today (2.1.0)? If not, how is this future technique different? This approach seems a lot slower than if this join could be pushed down to TiKV, otherwise we end up with a large distributed join at the TiDB layer (especially as the dimension table becomes large).

            – Jeremy Norris
            Nov 16 '18 at 5:41











          • @JeremyNorris, No, the join operation I described above has not been implemented in the current TiDB today(v2.1.0 or the next release).

            – Jian Zhang
            Nov 17 '18 at 6:13











          • TiKV itself use a lot of memory to buffer the table or index data to avoid scanning disk for some requests. I think this is another reason that the join is better not to be pushed to the TiKV layer, because the join operation also consumes a huge amount of memory. If the join is pushed to the TiKV layer, the risk of OOM on the TiKV server is increased.

            – Jian Zhang
            Nov 17 '18 at 6:24


















          • Thanks for the response. Isn't what you describe how the join is done in TiDB today (2.1.0)? If not, how is this future technique different? This approach seems a lot slower than if this join could be pushed down to TiKV, otherwise we end up with a large distributed join at the TiDB layer (especially as the dimension table becomes large).

            – Jeremy Norris
            Nov 16 '18 at 5:41











          • @JeremyNorris, No, the join operation I described above has not been implemented in the current TiDB today(v2.1.0 or the next release).

            – Jian Zhang
            Nov 17 '18 at 6:13











          • TiKV itself use a lot of memory to buffer the table or index data to avoid scanning disk for some requests. I think this is another reason that the join is better not to be pushed to the TiKV layer, because the join operation also consumes a huge amount of memory. If the join is pushed to the TiKV layer, the risk of OOM on the TiKV server is increased.

            – Jian Zhang
            Nov 17 '18 at 6:24

















          Thanks for the response. Isn't what you describe how the join is done in TiDB today (2.1.0)? If not, how is this future technique different? This approach seems a lot slower than if this join could be pushed down to TiKV, otherwise we end up with a large distributed join at the TiDB layer (especially as the dimension table becomes large).

          – Jeremy Norris
          Nov 16 '18 at 5:41





          Thanks for the response. Isn't what you describe how the join is done in TiDB today (2.1.0)? If not, how is this future technique different? This approach seems a lot slower than if this join could be pushed down to TiKV, otherwise we end up with a large distributed join at the TiDB layer (especially as the dimension table becomes large).

          – Jeremy Norris
          Nov 16 '18 at 5:41













          @JeremyNorris, No, the join operation I described above has not been implemented in the current TiDB today(v2.1.0 or the next release).

          – Jian Zhang
          Nov 17 '18 at 6:13





          @JeremyNorris, No, the join operation I described above has not been implemented in the current TiDB today(v2.1.0 or the next release).

          – Jian Zhang
          Nov 17 '18 at 6:13













          TiKV itself use a lot of memory to buffer the table or index data to avoid scanning disk for some requests. I think this is another reason that the join is better not to be pushed to the TiKV layer, because the join operation also consumes a huge amount of memory. If the join is pushed to the TiKV layer, the risk of OOM on the TiKV server is increased.

          – Jian Zhang
          Nov 17 '18 at 6:24






          TiKV itself use a lot of memory to buffer the table or index data to avoid scanning disk for some requests. I think this is another reason that the join is better not to be pushed to the TiKV layer, because the join operation also consumes a huge amount of memory. If the join is pushed to the TiKV layer, the risk of OOM on the TiKV server is increased.

          – Jian Zhang
          Nov 17 '18 at 6:24




















          draft saved

          draft discarded
















































          Thanks for contributing an answer to Stack Overflow!


          • Please be sure to answer the question. Provide details and share your research!

          But avoid


          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.

          To learn more, see our tips on writing great answers.




          draft saved


          draft discarded














          StackExchange.ready(
          function ()
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53314007%2fhow-to-model-dimension-tables-in-tidb%23new-answer', 'question_page');

          );

          Post as a guest















          Required, but never shown





















































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown

































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown







          這個網誌中的熱門文章

          How to read a connectionString WITH PROVIDER in .NET Core?

          In R, how to develop a multiplot heatmap.2 figure showing key labels successfully

          Museum of Modern and Contemporary Art of Trento and Rovereto