tf.train.shuffle_batch returning NaNs after a random number of iterations
While training a pretty standard convolutional net, I ran into a weird bug. Everything starts out fine with a nice loss curve, but then the loss suddenly goes to NaN. I was able to trace the NaNs all the way back to the input pipeline.
As you can see, I am printing the mean of the images before and after they are batched using tf.train.shuffle_batch(). The second print comes up as NaN, and this propagates all the way through the network.
What might be causing this? I have played around with different values of capacity, number of threads, etc.
The code and context are below. NaNs appear in the batched before/after images (before_images, after_images), but not in the individual before/after images (before_image, after_image).
I should note that the tfrecord files each contain an arbitrary number of examples, but I believe this shouldn't matter for the enqueue/dequeue operations.
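One way to narrow this down is to scan for non-finite values example by example instead of relying on printed means. The sketch below is plain Python on made-up data, and the helper names `non_finite_indices` and `mean` are hypothetical. It only illustrates that a single NaN value in one example is enough to make the mean of a whole batch NaN, while the means of the other examples still look fine. It does not identify the cause in the pipeline.

```python
import math


def non_finite_indices(batch):
    """Return the indices of examples that contain NaN or +/-inf.

    `batch` is a list of examples; each example is a flat list of floats
    (a stand-in for a flattened image tensor).
    """
    return [i for i, example in enumerate(batch)
            if any(not math.isfinite(v) for v in example)]


def mean(values):
    """Arithmetic mean; the result is NaN if any input value is NaN."""
    values = list(values)
    return sum(values) / len(values)


# Hypothetical data: three tiny "images", one with a single NaN pixel.
batch = [
    [0.1, 0.2, 0.3],
    [0.4, float("nan"), 0.6],
    [0.7, 0.8, 0.9],
]

per_example_means = [mean(example) for example in batch]
batch_mean = mean(v for example in batch for v in example)

print(per_example_means)          # only the second entry is nan
print(batch_mean)                 # nan
print(non_finite_indices(batch))  # [1]
```

With several reader threads printing one line per example, a single NaN line before the shuffle is easy to miss. Inside the graph, wrapping before_image and after_image in tf.check_numerics before batching would probably be the closest equivalent of this check.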
import random

import tensorflow as tf


def input_pipeline(self, filenames, batch_size, num_epochs=None):
    """Function that creates a highly abstracted input pipeline consisting
    of a bunch of threads and queues given a few simple parameters.
    See https://www.tensorflow.org/versions/r0.10/how_tos/reading_data/index.html#multiple-input-pipelines
    for more information and in-depth explanations.
    Args:
    - filenames: a list of filenames of tfrecords files
    """
    # Shuffle the file list in place before handing it to the filename queue.
    random.shuffle(filenames)
    train_filenames = filenames
    train_filename_queue = (
        tf.train.string_input_producer(train_filenames,
                                       num_epochs=num_epochs,
                                       shuffle=True,
                                       seed=1))
    # Read and decode a single example (three image tensors) from the queue.
    before_image, after_image, mask_image = (
        self._read_and_decode_reach_tfrecords(train_filename_queue))
    # min_after_dequeue defines how big a buffer we will randomly sample
    # from -- bigger means better shuffling but slower start up and more
    # memory used.
    # capacity must be larger than min_after_dequeue and the amount larger
    # determines the maximum we will prefetch. Recommendation:
    # min_after_dequeue + (num_threads + a safety margin) * batch_size
    min_after_dequeue = 1000
    safety_margin = 3
    capacity = min_after_dequeue + (3 + safety_margin) * batch_size
    # Hard-coded override of the value computed above.
    capacity = 2000
    # Print per-example means before batching.
    before_image = tf.Print(before_image, [tf.reduce_mean(before_image + after_image)], "pre_shuffle: ")
    mask_image = tf.Print(mask_image, [tf.reduce_mean(mask_image)], "pre_shuffle_mask: ")
    before_images, after_images, mask_images = (
        tf.train.shuffle_batch(
            [before_image, after_image, mask_image], batch_size=batch_size,
            capacity=capacity, min_after_dequeue=min_after_dequeue,
            num_threads=5, seed=1))
    # Print the batch-level mean after batching; this is the value that turns NaN.
    before_images = tf.Print(before_images, [tf.reduce_mean(before_images + after_images)], "post_shuffle: ")
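The sizing rule quoted in the comments above can be worked through with the numbers from this function. This is only an arithmetic sketch: `recommended_capacity` is a hypothetical helper, and `batch_size = 32` is an assumed value, since the actual batch size is not given. Note that the formula in the code uses 3 for the thread count while shuffle_batch is called with num_threads=5, and that the result is then overridden with a fixed 2000.

```python
def recommended_capacity(min_after_dequeue, num_threads, safety_margin, batch_size):
    """Capacity per the rule quoted in the code comments:
    min_after_dequeue + (num_threads + safety_margin) * batch_size
    """
    return min_after_dequeue + (num_threads + safety_margin) * batch_size


batch_size = 32            # assumed; not stated in the question
min_after_dequeue = 1000   # value used in the function
safety_margin = 3          # value used in the function

# Formula as written in the function (3 in place of the thread count).
as_written = recommended_capacity(min_after_dequeue, 3, safety_margin, batch_size)

# Same formula with the num_threads=5 actually passed to shuffle_batch.
with_five_threads = recommended_capacity(min_after_dequeue, 5, safety_margin, batch_size)

# Elements that can be prefetched beyond the shuffle buffer with capacity = 2000.
prefetch_headroom = 2000 - min_after_dequeue

print(as_written)         # 1192
print(with_five_threads)  # 1256
print(prefetch_headroom)  # 1000
```

Under this assumed batch size, the fixed capacity of 2000 is larger than either computed value, so it satisfies the requirement that capacity exceed min_after_dequeue.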
python tensorflow deep-learning
asked Nov 14 '18 at 1:57
Michael Vander Meiden