Memory requirements for back propagation - why not use the mean activation?

I need help understanding the memory requirements of a neural network and how they differ between training and evaluation. More specifically, I am interested in the memory requirements of the training process (I'm using the Keras API running on top of TensorFlow).

For a CNN that contains N weights, when using a batch of size x, a constant amount of memory is required for the weights themselves and the input data. During the forward pass the GPU needs an additional x*N units of memory (the exact amount is not crucial to the question) to pass all the samples simultaneously and calculate the activation of each neuron.

My question is about the back-propagation process. It seems that it requires an additional x*N units of memory (*) for the specific gradient of every weight for every sample. As I understand it, this means that the algorithm calculates the specific gradients of each sample and then sums them up for back-propagation to the previous layer.

Q. Since there is only a single update step per batch, why isn't the gradient calculation performed on the mean activation of each neuron? That way the additional required memory for training will only be (x+1)*N and not 2*x*N.

(*) This is based on my own small experiment measuring the maximal allowed batch size during evaluation (~4200) and training (~1200). Obviously it is a very simplified way of looking at the memory requirements.

asked Nov 13 '18 at 10:16

Mark.F

477113

I have no answer to your question, but I am interested in what you are saying. Could you provide your test data?

– JanWillem Huising
Nov 13 '18 at 11:28

There was nothing special in what I did. Used a slightly modified AlexNet on a CIFAR10 dataset.

– Mark.F
Dec 2 '18 at 16:54

tensorflow memory keras neural-network backpropagation

1 Answer

The short answer is: that is just the way the mini-batch SGD back-propagation algorithm works.
Looking back at its origins and at the difference between standard SGD and mini-batch SGD makes it clearer why.

The standard stochastic gradient descent algorithm passes a single sample through the model, then back-propagates its gradients and updates the model weights before repeating the process with the next sample. The main downside is that it is a serial process (samples can't be run simultaneously, because each sample needs to run on a model that was already updated by the previous sample), so it is very computationally expensive. In addition, using just a single sample for each update results in a very noisy gradient.
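To make the serial dependency concrete, here is a minimal sketch in plain Python. It assumes a toy one-weight model y_hat = w * x with a squared-error loss; the function name `sgd_per_sample`, the data, and the learning rate are made up for illustration and have nothing to do with how Keras implements training.

```python
def sgd_per_sample(samples, w=0.0, lr=0.1):
    """Plain SGD on the toy model y_hat = w * x (illustrative assumption).

    The weight is updated after every single sample, so sample i+1 is
    evaluated on a model that was already changed by sample i.
    """
    for x, y in samples:
        error = w * x - y      # forward pass for one sample
        grad = error * x       # d/dw of 0.5 * error**2
        w -= lr * grad         # immediate update: this is what forces serial execution
    return w


# Toy data generated by y = 2 * x, so the ideal weight is 2.
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w = 0.0
for _ in range(50):            # 50 passes over the data
    w = sgd_per_sample(samples, w)
print(w)
```

Each iteration of the inner loop reads the `w` written by the previous one, which is why the samples cannot simply be processed side by side.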

Mini-batch SGD uses the same principle, with one difference: the gradients are accumulated over multiple samples and an update is performed only once every x samples. This gives a smoother gradient during training and allows multiple samples to be passed through the model in parallel. This is the algorithm used when training with Keras/TensorFlow in mini-batches (commonly just called batches, although that term strictly refers to batch gradient descent, a slightly different algorithm that computes each update from the entire training set).
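The same toy setup can be adapted to show the mini-batch variant. Again, this is only a sketch under assumptions: the one-weight model y_hat = w * x, the function name `minibatch_step`, and the choice to average (rather than sum) the accumulated gradient are all illustrative; summing would just rescale the learning rate.

```python
def minibatch_step(batch, w, lr=0.1):
    """One mini-batch update for the toy model y_hat = w * x (illustrative)."""
    grad_sum = 0.0
    for x, y in batch:
        # Every sample is evaluated with the SAME w, so these iterations
        # are independent of each other and could run in parallel.
        grad_sum += (w * x - y) * x
    # A single weight update per batch, using the averaged gradient.
    return w - lr * grad_sum / len(batch)


# Toy data generated by y = 2 * x, so the ideal weight is 2.
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w = 0.0
for _ in range(100):           # 100 updates, each using the whole mini-batch
    w = minibatch_step(samples, w)
print(w)
```

Note that each per-sample gradient is built from that sample's own forward-pass values before the accumulation happens, which is why those values have to be kept around until the backward pass.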

I haven't found any work on computing the update from the mean activation of each layer, as the question proposes, instead of from the per-sample values. It would be interesting to check the results of such an algorithm. It would be more memory efficient; however, it would likely also be less capable of reaching good minima.
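One reason to expect worse results: the gradient of a weight is a product of the incoming activation and the back-propagated error, and the mean of a product is generally not the product of the means. The sketch below illustrates this for a single weight with z = w * a and a squared-error loss; the two helper functions and the numbers are hypothetical and chosen only to make the gap obvious.

```python
def mean_of_gradients(acts, targets, w):
    """What mini-batch SGD does: average the per-sample gradients."""
    grads = [(w * a - t) * a for a, t in zip(acts, targets)]
    return sum(grads) / len(grads)


def gradient_of_means(acts, targets, w):
    """The proposal from the question: one gradient from mean activations."""
    a_bar = sum(acts) / len(acts)
    t_bar = sum(targets) / len(targets)
    return (w * a_bar - t_bar) * a_bar


# Two samples whose activations and targets cancel out on average.
acts = [-1.0, 1.0]
targets = [-2.0, 2.0]
w = 1.0

print(mean_of_gradients(acts, targets, w))   # -1.0
print(gradient_of_means(acts, targets, w))   # 0.0
```

In this constructed case both samples agree that `w` should increase, yet the mean-activation version sees a zero gradient and would not update the weight at all. Real networks will rarely cancel this exactly, but the two quantities still differ whenever activations and errors are correlated across the batch.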

answered Dec 2 '18 at 17:23

Mark.F

477113
