Memory requirements for back propagation - why not use the mean activation?










1















I need help understanding the memory requirements of a neural network and their differences between training and evaluation processes. More specifically, the memory requirements of the training process (I'm using a Keras API running on top of TensorFlow).



For a CNN that contains N weights, when using a batch of size x, there is a constant amount of memory required for the weights themselves and the input data. During the forward pass the GPU needs additional x*N units of memory (the specific required amount is not crucial to the question) for passing all the samples simultaneously and calculating the activation of each neuron.



My question is regarding the back propagation process, it seems that the process requires additional x*N units of memory(*) for the specific gradient of every weight for every sample. According to my understanding, it means that the algorithm calculates the specific gradients of each sample and then sums them up for the back-propagation to the previous layer.



Q. Since there is only a single update step per batch, why isn't the gradient calculation performed on the mean activation of each neuron? That way the additional required memory for training will only be (x+1)*N and not 2*x*N.



(*) This is according to my own little experiment of the maximal allowed batch size during evaluation (~4200) and training (~1200). Obviously it is a very simplified way of looking at the memory requirments










share|improve this question






















  • I have no answer to you're question but i am interested in what you are saying. Could you provide your test data ?

    – JanWillem Huising
    Nov 13 '18 at 11:28











  • There was nothing special in what I did. Used a slightly modified AlexNet on a CIFAR10 dataset.

    – Mark.F
    Dec 2 '18 at 16:54















1















I need help understanding the memory requirements of a neural network and their differences between training and evaluation processes. More specifically, the memory requirements of the training process (I'm using a Keras API running on top of TensorFlow).



For a CNN that contains N weights, when using a batch of size x, there is a constant amount of memory required for the weights themselves and the input data. During the forward pass the GPU needs additional x*N units of memory (the specific required amount is not crucial to the question) for passing all the samples simultaneously and calculating the activation of each neuron.



My question is regarding the back propagation process, it seems that the process requires additional x*N units of memory(*) for the specific gradient of every weight for every sample. According to my understanding, it means that the algorithm calculates the specific gradients of each sample and then sums them up for the back-propagation to the previous layer.



Q. Since there is only a single update step per batch, why isn't the gradient calculation performed on the mean activation of each neuron? That way the additional required memory for training will only be (x+1)*N and not 2*x*N.



(*) This is according to my own little experiment of the maximal allowed batch size during evaluation (~4200) and training (~1200). Obviously it is a very simplified way of looking at the memory requirments










share|improve this question






















  • I have no answer to you're question but i am interested in what you are saying. Could you provide your test data ?

    – JanWillem Huising
    Nov 13 '18 at 11:28











  • There was nothing special in what I did. Used a slightly modified AlexNet on a CIFAR10 dataset.

    – Mark.F
    Dec 2 '18 at 16:54













1












1








1








I need help understanding the memory requirements of a neural network and their differences between training and evaluation processes. More specifically, the memory requirements of the training process (I'm using a Keras API running on top of TensorFlow).



For a CNN that contains N weights, when using a batch of size x, there is a constant amount of memory required for the weights themselves and the input data. During the forward pass the GPU needs additional x*N units of memory (the specific required amount is not crucial to the question) for passing all the samples simultaneously and calculating the activation of each neuron.



My question is regarding the back propagation process, it seems that the process requires additional x*N units of memory(*) for the specific gradient of every weight for every sample. According to my understanding, it means that the algorithm calculates the specific gradients of each sample and then sums them up for the back-propagation to the previous layer.



Q. Since there is only a single update step per batch, why isn't the gradient calculation performed on the mean activation of each neuron? That way the additional required memory for training will only be (x+1)*N and not 2*x*N.



(*) This is according to my own little experiment of the maximal allowed batch size during evaluation (~4200) and training (~1200). Obviously it is a very simplified way of looking at the memory requirments










share|improve this question














I need help understanding the memory requirements of a neural network and their differences between training and evaluation processes. More specifically, the memory requirements of the training process (I'm using a Keras API running on top of TensorFlow).



For a CNN that contains N weights, when using a batch of size x, there is a constant amount of memory required for the weights themselves and the input data. During the forward pass the GPU needs additional x*N units of memory (the specific required amount is not crucial to the question) for passing all the samples simultaneously and calculating the activation of each neuron.



My question is regarding the back propagation process, it seems that the process requires additional x*N units of memory(*) for the specific gradient of every weight for every sample. According to my understanding, it means that the algorithm calculates the specific gradients of each sample and then sums them up for the back-propagation to the previous layer.



Q. Since there is only a single update step per batch, why isn't the gradient calculation performed on the mean activation of each neuron? That way the additional required memory for training will only be (x+1)*N and not 2*x*N.



(*) This is according to my own little experiment of the maximal allowed batch size during evaluation (~4200) and training (~1200). Obviously it is a very simplified way of looking at the memory requirments







tensorflow memory keras neural-network backpropagation






share|improve this question













share|improve this question











share|improve this question




share|improve this question










asked Nov 13 '18 at 10:16









Mark.FMark.F

477113




477113












  • I have no answer to you're question but i am interested in what you are saying. Could you provide your test data ?

    – JanWillem Huising
    Nov 13 '18 at 11:28











  • There was nothing special in what I did. Used a slightly modified AlexNet on a CIFAR10 dataset.

    – Mark.F
    Dec 2 '18 at 16:54

















  • I have no answer to you're question but i am interested in what you are saying. Could you provide your test data ?

    – JanWillem Huising
    Nov 13 '18 at 11:28











  • There was nothing special in what I did. Used a slightly modified AlexNet on a CIFAR10 dataset.

    – Mark.F
    Dec 2 '18 at 16:54
















I have no answer to you're question but i am interested in what you are saying. Could you provide your test data ?

– JanWillem Huising
Nov 13 '18 at 11:28





I have no answer to you're question but i am interested in what you are saying. Could you provide your test data ?

– JanWillem Huising
Nov 13 '18 at 11:28













There was nothing special in what I did. Used a slightly modified AlexNet on a CIFAR10 dataset.

– Mark.F
Dec 2 '18 at 16:54





There was nothing special in what I did. Used a slightly modified AlexNet on a CIFAR10 dataset.

– Mark.F
Dec 2 '18 at 16:54












1 Answer
1






active

oldest

votes


















0














The short answer is: that is just the way the mini-batch SGD back-propagation algorithm works.
Looking back at its origins and difference between using the standard SGD and mini-batch SGD it is clearer why.



The standard stochastic gradient decent algorithm passes a single sample thru the model, then back-propagates its gradients and updates model weights before repeating the process with the next sample. The main downside is that it is a serial process (can't run samples simultaneously because the each sample needs to run on a model that was already updated by the previous sample), so it is very computationally expensive. In addition using just a single sample for each update results in a very noisy gradient.



The mini-batch SGD utilizes the same principle, with one difference - the gradients are accumulated from multiple samples and an update is only performed once every x samples. This helps to get a smooth gradient during training and enables passing multiple samples thru the model in parallel. This is the algorithm which is used when training with keras/tensorflow in mini-batches (commonly called batches but that term actually means using the batch gradient decent which is slightly different algorithm).



I haven't found any work regarding using the mean of the gradients in each layer for the update. It is interesting to check the results of such an algorithm. It would be more memory efficient however it is likely that it will also be less capable of reaching good minimum points.






share|improve this answer






















    Your Answer






    StackExchange.ifUsing("editor", function ()
    StackExchange.using("externalEditor", function ()
    StackExchange.using("snippets", function ()
    StackExchange.snippets.init();
    );
    );
    , "code-snippets");

    StackExchange.ready(function()
    var channelOptions =
    tags: "".split(" "),
    id: "1"
    ;
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function()
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled)
    StackExchange.using("snippets", function()
    createEditor();
    );

    else
    createEditor();

    );

    function createEditor()
    StackExchange.prepareEditor(
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: true,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: 10,
    bindNavPrevention: true,
    postfix: "",
    imageUploader:
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    ,
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    );



    );













    draft saved

    draft discarded


















    StackExchange.ready(
    function ()
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53278677%2fmemory-requirements-for-back-propagation-why-not-use-the-mean-activation%23new-answer', 'question_page');

    );

    Post as a guest















    Required, but never shown

























    1 Answer
    1






    active

    oldest

    votes








    1 Answer
    1






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    0














    The short answer is: that is just the way the mini-batch SGD back-propagation algorithm works.
    Looking back at its origins and difference between using the standard SGD and mini-batch SGD it is clearer why.



    The standard stochastic gradient decent algorithm passes a single sample thru the model, then back-propagates its gradients and updates model weights before repeating the process with the next sample. The main downside is that it is a serial process (can't run samples simultaneously because the each sample needs to run on a model that was already updated by the previous sample), so it is very computationally expensive. In addition using just a single sample for each update results in a very noisy gradient.



    The mini-batch SGD utilizes the same principle, with one difference - the gradients are accumulated from multiple samples and an update is only performed once every x samples. This helps to get a smooth gradient during training and enables passing multiple samples thru the model in parallel. This is the algorithm which is used when training with keras/tensorflow in mini-batches (commonly called batches but that term actually means using the batch gradient decent which is slightly different algorithm).



    I haven't found any work regarding using the mean of the gradients in each layer for the update. It is interesting to check the results of such an algorithm. It would be more memory efficient however it is likely that it will also be less capable of reaching good minimum points.






    share|improve this answer



























      0














      The short answer is: that is just the way the mini-batch SGD back-propagation algorithm works.
      Looking back at its origins and difference between using the standard SGD and mini-batch SGD it is clearer why.



      The standard stochastic gradient decent algorithm passes a single sample thru the model, then back-propagates its gradients and updates model weights before repeating the process with the next sample. The main downside is that it is a serial process (can't run samples simultaneously because the each sample needs to run on a model that was already updated by the previous sample), so it is very computationally expensive. In addition using just a single sample for each update results in a very noisy gradient.



      The mini-batch SGD utilizes the same principle, with one difference - the gradients are accumulated from multiple samples and an update is only performed once every x samples. This helps to get a smooth gradient during training and enables passing multiple samples thru the model in parallel. This is the algorithm which is used when training with keras/tensorflow in mini-batches (commonly called batches but that term actually means using the batch gradient decent which is slightly different algorithm).



      I haven't found any work regarding using the mean of the gradients in each layer for the update. It is interesting to check the results of such an algorithm. It would be more memory efficient however it is likely that it will also be less capable of reaching good minimum points.






      share|improve this answer

























        0












        0








        0







        The short answer is: that is just the way the mini-batch SGD back-propagation algorithm works.
        Looking back at its origins and difference between using the standard SGD and mini-batch SGD it is clearer why.



        The standard stochastic gradient decent algorithm passes a single sample thru the model, then back-propagates its gradients and updates model weights before repeating the process with the next sample. The main downside is that it is a serial process (can't run samples simultaneously because the each sample needs to run on a model that was already updated by the previous sample), so it is very computationally expensive. In addition using just a single sample for each update results in a very noisy gradient.



        The mini-batch SGD utilizes the same principle, with one difference - the gradients are accumulated from multiple samples and an update is only performed once every x samples. This helps to get a smooth gradient during training and enables passing multiple samples thru the model in parallel. This is the algorithm which is used when training with keras/tensorflow in mini-batches (commonly called batches but that term actually means using the batch gradient decent which is slightly different algorithm).



        I haven't found any work regarding using the mean of the gradients in each layer for the update. It is interesting to check the results of such an algorithm. It would be more memory efficient however it is likely that it will also be less capable of reaching good minimum points.






        share|improve this answer













        The short answer is: that is just the way the mini-batch SGD back-propagation algorithm works.
        Looking back at its origins and difference between using the standard SGD and mini-batch SGD it is clearer why.



        The standard stochastic gradient decent algorithm passes a single sample thru the model, then back-propagates its gradients and updates model weights before repeating the process with the next sample. The main downside is that it is a serial process (can't run samples simultaneously because the each sample needs to run on a model that was already updated by the previous sample), so it is very computationally expensive. In addition using just a single sample for each update results in a very noisy gradient.



        The mini-batch SGD utilizes the same principle, with one difference - the gradients are accumulated from multiple samples and an update is only performed once every x samples. This helps to get a smooth gradient during training and enables passing multiple samples thru the model in parallel. This is the algorithm which is used when training with keras/tensorflow in mini-batches (commonly called batches but that term actually means using the batch gradient decent which is slightly different algorithm).



        I haven't found any work regarding using the mean of the gradients in each layer for the update. It is interesting to check the results of such an algorithm. It would be more memory efficient however it is likely that it will also be less capable of reaching good minimum points.







        share|improve this answer












        share|improve this answer



        share|improve this answer










        answered Dec 2 '18 at 17:23









        Mark.FMark.F

        477113




        477113



























            draft saved

            draft discarded
















































            Thanks for contributing an answer to Stack Overflow!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid


            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.

            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function ()
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53278677%2fmemory-requirements-for-back-propagation-why-not-use-the-mean-activation%23new-answer', 'question_page');

            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            這個網誌中的熱門文章

            How to read a connectionString WITH PROVIDER in .NET Core?

            Guadeloupe

            Node.js Script on GitHub Pages or Amazon S3