Memory requirements for back propagation - why not use the mean activation?
I need help understanding the memory requirements of a neural network and their differences between training and evaluation processes. More specifically, the memory requirements of the training process (I'm using a Keras API running on top of TensorFlow).
For a CNN that contains N weights, when using a batch of size x, there is a constant amount of memory required for the weights themselves and the input data. During the forward pass the GPU needs additional x*N units of memory (the specific required amount is not crucial to the question) for passing all the samples simultaneously and calculating the activation of each neuron.
My question is regarding the back propagation process, it seems that the process requires additional x*N units of memory(*) for the specific gradient of every weight for every sample. According to my understanding, it means that the algorithm calculates the specific gradients of each sample and then sums them up for the back-propagation to the previous layer.
Q. Since there is only a single update step per batch, why isn't the gradient calculation performed on the mean activation of each neuron? That way the additional required memory for training will only be (x+1)*N and not 2*x*N.
(*) This is according to my own little experiment of the maximal allowed batch size during evaluation (~4200) and training (~1200). Obviously it is a very simplified way of looking at the memory requirments
tensorflow memory keras neural-network backpropagation
add a comment |
I need help understanding the memory requirements of a neural network and their differences between training and evaluation processes. More specifically, the memory requirements of the training process (I'm using a Keras API running on top of TensorFlow).
For a CNN that contains N weights, when using a batch of size x, there is a constant amount of memory required for the weights themselves and the input data. During the forward pass the GPU needs additional x*N units of memory (the specific required amount is not crucial to the question) for passing all the samples simultaneously and calculating the activation of each neuron.
My question is regarding the back propagation process, it seems that the process requires additional x*N units of memory(*) for the specific gradient of every weight for every sample. According to my understanding, it means that the algorithm calculates the specific gradients of each sample and then sums them up for the back-propagation to the previous layer.
Q. Since there is only a single update step per batch, why isn't the gradient calculation performed on the mean activation of each neuron? That way the additional required memory for training will only be (x+1)*N and not 2*x*N.
(*) This is according to my own little experiment of the maximal allowed batch size during evaluation (~4200) and training (~1200). Obviously it is a very simplified way of looking at the memory requirments
tensorflow memory keras neural-network backpropagation
I have no answer to you're question but i am interested in what you are saying. Could you provide your test data ?
– JanWillem Huising
Nov 13 '18 at 11:28
There was nothing special in what I did. Used a slightly modified AlexNet on a CIFAR10 dataset.
– Mark.F
Dec 2 '18 at 16:54
add a comment |
I need help understanding the memory requirements of a neural network and their differences between training and evaluation processes. More specifically, the memory requirements of the training process (I'm using a Keras API running on top of TensorFlow).
For a CNN that contains N weights, when using a batch of size x, there is a constant amount of memory required for the weights themselves and the input data. During the forward pass the GPU needs additional x*N units of memory (the specific required amount is not crucial to the question) for passing all the samples simultaneously and calculating the activation of each neuron.
My question is regarding the back propagation process, it seems that the process requires additional x*N units of memory(*) for the specific gradient of every weight for every sample. According to my understanding, it means that the algorithm calculates the specific gradients of each sample and then sums them up for the back-propagation to the previous layer.
Q. Since there is only a single update step per batch, why isn't the gradient calculation performed on the mean activation of each neuron? That way the additional required memory for training will only be (x+1)*N and not 2*x*N.
(*) This is according to my own little experiment of the maximal allowed batch size during evaluation (~4200) and training (~1200). Obviously it is a very simplified way of looking at the memory requirments
tensorflow memory keras neural-network backpropagation
I need help understanding the memory requirements of a neural network and their differences between training and evaluation processes. More specifically, the memory requirements of the training process (I'm using a Keras API running on top of TensorFlow).
For a CNN that contains N weights, when using a batch of size x, there is a constant amount of memory required for the weights themselves and the input data. During the forward pass the GPU needs additional x*N units of memory (the specific required amount is not crucial to the question) for passing all the samples simultaneously and calculating the activation of each neuron.
My question is regarding the back propagation process, it seems that the process requires additional x*N units of memory(*) for the specific gradient of every weight for every sample. According to my understanding, it means that the algorithm calculates the specific gradients of each sample and then sums them up for the back-propagation to the previous layer.
Q. Since there is only a single update step per batch, why isn't the gradient calculation performed on the mean activation of each neuron? That way the additional required memory for training will only be (x+1)*N and not 2*x*N.
(*) This is according to my own little experiment of the maximal allowed batch size during evaluation (~4200) and training (~1200). Obviously it is a very simplified way of looking at the memory requirments
tensorflow memory keras neural-network backpropagation
tensorflow memory keras neural-network backpropagation
asked Nov 13 '18 at 10:16
Mark.FMark.F
477113
477113
I have no answer to you're question but i am interested in what you are saying. Could you provide your test data ?
– JanWillem Huising
Nov 13 '18 at 11:28
There was nothing special in what I did. Used a slightly modified AlexNet on a CIFAR10 dataset.
– Mark.F
Dec 2 '18 at 16:54
add a comment |
I have no answer to you're question but i am interested in what you are saying. Could you provide your test data ?
– JanWillem Huising
Nov 13 '18 at 11:28
There was nothing special in what I did. Used a slightly modified AlexNet on a CIFAR10 dataset.
– Mark.F
Dec 2 '18 at 16:54
I have no answer to you're question but i am interested in what you are saying. Could you provide your test data ?
– JanWillem Huising
Nov 13 '18 at 11:28
I have no answer to you're question but i am interested in what you are saying. Could you provide your test data ?
– JanWillem Huising
Nov 13 '18 at 11:28
There was nothing special in what I did. Used a slightly modified AlexNet on a CIFAR10 dataset.
– Mark.F
Dec 2 '18 at 16:54
There was nothing special in what I did. Used a slightly modified AlexNet on a CIFAR10 dataset.
– Mark.F
Dec 2 '18 at 16:54
add a comment |
1 Answer
1
active
oldest
votes
The short answer is: that is just the way the mini-batch SGD back-propagation algorithm works.
Looking back at its origins and difference between using the standard SGD and mini-batch SGD it is clearer why.
The standard stochastic gradient decent algorithm passes a single sample thru the model, then back-propagates its gradients and updates model weights before repeating the process with the next sample. The main downside is that it is a serial process (can't run samples simultaneously because the each sample needs to run on a model that was already updated by the previous sample), so it is very computationally expensive. In addition using just a single sample for each update results in a very noisy gradient.
The mini-batch SGD utilizes the same principle, with one difference - the gradients are accumulated from multiple samples and an update is only performed once every x samples. This helps to get a smooth gradient during training and enables passing multiple samples thru the model in parallel. This is the algorithm which is used when training with keras/tensorflow in mini-batches (commonly called batches but that term actually means using the batch gradient decent which is slightly different algorithm).
I haven't found any work regarding using the mean of the gradients in each layer for the update. It is interesting to check the results of such an algorithm. It would be more memory efficient however it is likely that it will also be less capable of reaching good minimum points.
add a comment |
Your Answer
StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53278677%2fmemory-requirements-for-back-propagation-why-not-use-the-mean-activation%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
The short answer is: that is just the way the mini-batch SGD back-propagation algorithm works.
Looking back at its origins and difference between using the standard SGD and mini-batch SGD it is clearer why.
The standard stochastic gradient decent algorithm passes a single sample thru the model, then back-propagates its gradients and updates model weights before repeating the process with the next sample. The main downside is that it is a serial process (can't run samples simultaneously because the each sample needs to run on a model that was already updated by the previous sample), so it is very computationally expensive. In addition using just a single sample for each update results in a very noisy gradient.
The mini-batch SGD utilizes the same principle, with one difference - the gradients are accumulated from multiple samples and an update is only performed once every x samples. This helps to get a smooth gradient during training and enables passing multiple samples thru the model in parallel. This is the algorithm which is used when training with keras/tensorflow in mini-batches (commonly called batches but that term actually means using the batch gradient decent which is slightly different algorithm).
I haven't found any work regarding using the mean of the gradients in each layer for the update. It is interesting to check the results of such an algorithm. It would be more memory efficient however it is likely that it will also be less capable of reaching good minimum points.
add a comment |
The short answer is: that is just the way the mini-batch SGD back-propagation algorithm works.
Looking back at its origins and difference between using the standard SGD and mini-batch SGD it is clearer why.
The standard stochastic gradient decent algorithm passes a single sample thru the model, then back-propagates its gradients and updates model weights before repeating the process with the next sample. The main downside is that it is a serial process (can't run samples simultaneously because the each sample needs to run on a model that was already updated by the previous sample), so it is very computationally expensive. In addition using just a single sample for each update results in a very noisy gradient.
The mini-batch SGD utilizes the same principle, with one difference - the gradients are accumulated from multiple samples and an update is only performed once every x samples. This helps to get a smooth gradient during training and enables passing multiple samples thru the model in parallel. This is the algorithm which is used when training with keras/tensorflow in mini-batches (commonly called batches but that term actually means using the batch gradient decent which is slightly different algorithm).
I haven't found any work regarding using the mean of the gradients in each layer for the update. It is interesting to check the results of such an algorithm. It would be more memory efficient however it is likely that it will also be less capable of reaching good minimum points.
add a comment |
The short answer is: that is just the way the mini-batch SGD back-propagation algorithm works.
Looking back at its origins and difference between using the standard SGD and mini-batch SGD it is clearer why.
The standard stochastic gradient decent algorithm passes a single sample thru the model, then back-propagates its gradients and updates model weights before repeating the process with the next sample. The main downside is that it is a serial process (can't run samples simultaneously because the each sample needs to run on a model that was already updated by the previous sample), so it is very computationally expensive. In addition using just a single sample for each update results in a very noisy gradient.
The mini-batch SGD utilizes the same principle, with one difference - the gradients are accumulated from multiple samples and an update is only performed once every x samples. This helps to get a smooth gradient during training and enables passing multiple samples thru the model in parallel. This is the algorithm which is used when training with keras/tensorflow in mini-batches (commonly called batches but that term actually means using the batch gradient decent which is slightly different algorithm).
I haven't found any work regarding using the mean of the gradients in each layer for the update. It is interesting to check the results of such an algorithm. It would be more memory efficient however it is likely that it will also be less capable of reaching good minimum points.
The short answer is: that is just the way the mini-batch SGD back-propagation algorithm works.
Looking back at its origins and difference between using the standard SGD and mini-batch SGD it is clearer why.
The standard stochastic gradient decent algorithm passes a single sample thru the model, then back-propagates its gradients and updates model weights before repeating the process with the next sample. The main downside is that it is a serial process (can't run samples simultaneously because the each sample needs to run on a model that was already updated by the previous sample), so it is very computationally expensive. In addition using just a single sample for each update results in a very noisy gradient.
The mini-batch SGD utilizes the same principle, with one difference - the gradients are accumulated from multiple samples and an update is only performed once every x samples. This helps to get a smooth gradient during training and enables passing multiple samples thru the model in parallel. This is the algorithm which is used when training with keras/tensorflow in mini-batches (commonly called batches but that term actually means using the batch gradient decent which is slightly different algorithm).
I haven't found any work regarding using the mean of the gradients in each layer for the update. It is interesting to check the results of such an algorithm. It would be more memory efficient however it is likely that it will also be less capable of reaching good minimum points.
answered Dec 2 '18 at 17:23
Mark.FMark.F
477113
477113
add a comment |
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53278677%2fmemory-requirements-for-back-propagation-why-not-use-the-mean-activation%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
I have no answer to you're question but i am interested in what you are saying. Could you provide your test data ?
– JanWillem Huising
Nov 13 '18 at 11:28
There was nothing special in what I did. Used a slightly modified AlexNet on a CIFAR10 dataset.
– Mark.F
Dec 2 '18 at 16:54