tf.train.shuffle_batch returning nan's after random number of iterations

While training a pretty standard convolutional net, I discovered a weird bug. Everything starts out fine with a nice loss curve, but suddenly the loss goes to nan. I was able to trace the nans back all the way to the input pipeline.

As you can see, I am printing the mean of the images before and after they are batched using tf.train.shuffle_batch(). The second print comes up as nan, and this is propagated all the way through the rest of the network.

What might be causing this? I have played around with different values of capacity, threads, etc.

The code and context are below. NaNs are appearing in the batched before/after images, but not in the individual before/after images.

I should note that the tfrecord files have an arbitrary number of examples in them, but I believe that this shouldn't matter for the enqueue/dequeue operations.
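One property of the printed means is worth keeping in mind when reading the logs: a mean over a whole batch is NaN as soon as a single element in that batch is NaN. The following is a minimal pure-Python sketch of that effect, not part of the pipeline itself. Plain lists stand in for tensors, and the helper names `mean` and `nan_example_indices` are hypothetical. It does not explain where a NaN comes from, only how a per-example check could narrow down which example carries it.

```python
import math


def mean(values):
    """Arithmetic mean of a flat list of floats."""
    return sum(values) / len(values)


def nan_example_indices(batch):
    """Return the indices of examples that contain at least one NaN."""
    return [i for i, example in enumerate(batch)
            if any(math.isnan(v) for v in example)]


# Three flattened stand-in "images"; only the second one has a NaN pixel.
batch = [[0.1, 0.2], [0.3, float("nan")], [0.5, 0.6]]

# Mean of each example on its own, and mean over the whole batch.
per_example_means = [mean(example) for example in batch]
batch_mean = mean([v for example in batch for v in example])

print(math.isnan(batch_mean))      # the whole-batch mean is poisoned
print(nan_example_indices(batch))  # only one example is responsible
```

In the real pipeline the equivalent check would have to run on the tensors themselves; the sketch only illustrates why a per-example view can be more informative than a batch-level mean.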

import random

import tensorflow as tf  # TF 1.x queue-based input API


def input_pipeline(self, filenames, batch_size, num_epochs=None):
    """Function that creates a highly abstracted input pipeline consisting
    of a bunch of threads and queues given a few simple parameters.

    See https://www.tensorflow.org/versions/r0.10/how_tos/reading_data/index.html#multiple-input-pipelines
    for more information and in-depth explanations.

    Args:
      - filenames: a list of filenames of tfrecords files
    """
    # Shuffle the file list in place before handing it to the filename queue.
    random.shuffle(filenames)

    train_filenames = filenames
    train_filename_queue = tf.train.string_input_producer(
        train_filenames,
        num_epochs=num_epochs,
        shuffle=True,
        seed=1)

    # Read and decode one (before, after, mask) example from the queue.
    before_image, after_image, mask_image = (
        self._read_and_decode_reach_tfrecords(train_filename_queue))

    # min_after_dequeue defines how big a buffer we will randomly sample
    # from -- bigger means better shuffling but slower start up and more
    # memory used.
    # capacity must be larger than min_after_dequeue and the amount larger
    # determines the maximum we will prefetch. Recommendation:
    # min_after_dequeue + (num_threads + a safety margin) * batch_size
    min_after_dequeue = 1000
    safety_margin = 3
    capacity = min_after_dequeue + (3 + safety_margin) * batch_size
    # Note: this overrides the value computed on the previous line.
    capacity = 2000

    # Print per-example means before batching.
    before_image = tf.Print(
        before_image, [tf.reduce_mean(before_image + after_image)],
        "pre_shuffle: ")
    mask_image = tf.Print(
        mask_image, [tf.reduce_mean(mask_image)], "pre_shuffle_mask: ")

    before_images, after_images, mask_images = tf.train.shuffle_batch(
        [before_image, after_image, mask_image],
        batch_size=batch_size,
        capacity=capacity,
        min_after_dequeue=min_after_dequeue,
        num_threads=5,
        seed=1)

    # Print the batch-level mean after shuffle_batch.
    before_images = tf.Print(
        before_images, [tf.reduce_mean(before_images + after_images)],
        "post_shuffle: ")
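
The capacity recommendation quoted in the code comment above can be written out as plain arithmetic. This is a small sketch with a hypothetical helper name, `recommended_capacity`, and an assumed batch size of 32 (the question does not state the actual batch size). It shows that the code above plugs in 3 for the thread count while passing num_threads=5 to shuffle_batch, and then replaces the result with a fixed 2000.

```python
def recommended_capacity(min_after_dequeue, num_threads, batch_size,
                         safety_margin=3):
    """min_after_dequeue + (num_threads + safety margin) * batch_size."""
    return min_after_dequeue + (num_threads + safety_margin) * batch_size


batch_size = 32  # assumed value, for illustration only

# Using the thread count actually passed to shuffle_batch (5).
with_five_threads = recommended_capacity(1000, 5, batch_size)

# Using the hard-coded 3 that appears in the capacity expression.
with_three_threads = recommended_capacity(1000, 3, batch_size)

print(with_five_threads)   # 1000 + (5 + 3) * 32
print(with_three_threads)  # 1000 + (3 + 3) * 32
```

For this assumed batch size, both values are below the fixed capacity of 2000 that the code ends up using.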

asked Nov 14 '18 at 1:57

Michael Vander Meiden

python tensorflow deep-learning
