tf.train.shuffle_batch returning NaNs after a random number of iterations

While training a pretty standard convolutional net, I ran into a strange bug. Everything starts out fine with a nice loss curve, but then the loss suddenly goes to NaN. I was able to trace the NaNs all the way back to the input pipeline.



As you can see in the code, I print the mean pixel value before and after batching with tf.train.shuffle_batch(). The second print comes up as NaN, and the NaN then propagates through the rest of the network.



What might be causing this? I have played around with different values of capacity, threads, etc.



The code and context are below. NaNs appear in the batched before/after images, but not in the individual before/after images.



I should note that the TFRecord files each contain an arbitrary number of examples, but I don't believe this should matter for the enqueue/dequeue operations.



import random

import tensorflow as tf  # TF 1.x queue-based input API


def input_pipeline(self, filenames, batch_size, num_epochs=None):
    """Create an input pipeline of threads and queues from a few simple parameters.

    See https://www.tensorflow.org/versions/r0.10/how_tos/reading_data/index.html#multiple-input-pipelines
    for more information and in-depth explanations.

    Args:
        filenames: a list of TFRecord filenames.
        batch_size: number of examples per batch.
        num_epochs: number of passes over the files (None repeats forever).

    Returns:
        Batched (before_images, after_images, mask_images) tensors.
    """
    random.shuffle(filenames)

    train_filenames = filenames
    train_filename_queue = tf.train.string_input_producer(
        train_filenames, num_epochs=num_epochs, shuffle=True, seed=1)

    before_image, after_image, mask_image = (
        self._read_and_decode_reach_tfrecords(train_filename_queue))

    # min_after_dequeue defines how big a buffer we will randomly sample
    # from -- bigger means better shuffling but slower start up and more
    # memory used.
    # capacity must be larger than min_after_dequeue and the amount larger
    # determines the maximum we will prefetch. Recommendation:
    # min_after_dequeue + (num_threads + a safety margin) * batch_size
    min_after_dequeue = 1000
    num_threads = 5
    safety_margin = 3
    capacity = min_after_dequeue + (num_threads + safety_margin) * batch_size

    # Debug print (per example, before batching): mean of before_image + after_image.
    before_image = tf.Print(
        before_image, [tf.reduce_mean(before_image + after_image)], "pre_shuffle: ")
    mask_image = tf.Print(mask_image, [tf.reduce_mean(mask_image)], "pre_shuffle_mask: ")

    before_images, after_images, mask_images = tf.train.shuffle_batch(
        [before_image, after_image, mask_image], batch_size=batch_size,
        capacity=capacity, min_after_dequeue=min_after_dequeue,
        num_threads=num_threads, seed=1)

    # Same debug print after batching (one line per batch).
    before_images = tf.Print(
        before_images, [tf.reduce_mean(before_images + after_images)], "post_shuffle: ")

    return before_images, after_images, mask_images

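For reference, here is the capacity recommendation from the code comment worked through with the question's values (min_after_dequeue = 1000, num_threads = 5 as passed to shuffle_batch, safety margin 3). This is a minimal sketch; batch_size = 32 is an assumed value, since the question does not state one, and `recommended_capacity` is a hypothetical helper.

```python
def recommended_capacity(min_after_dequeue, num_threads, safety_margin, batch_size):
    """Queue capacity suggested by the comment in the question's code:
    min_after_dequeue + (num_threads + safety_margin) * batch_size."""
    return min_after_dequeue + (num_threads + safety_margin) * batch_size


# min_after_dequeue, num_threads and safety_margin come from the question;
# batch_size = 32 is a hypothetical value.
capacity = recommended_capacity(1000, num_threads=5, safety_margin=3, batch_size=32)
print(capacity)  # 1256
```

The only hard constraint the comment describes is capacity > min_after_dequeue; the extra (num_threads + safety_margin) * batch_size slots just bound how much gets prefetched.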

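The question notes that NaNs show up in the batched tensors but not in the per-example prints. One property worth keeping in mind when reading those prints: a mean taken over a whole batch is NaN if any single value in any example is NaN, so one bad example among many clean ones is enough, and that example's own pre-shuffle print is only one line among thousands. The sketch below illustrates this in plain Python (a stand-in for tf.reduce_mean, not TensorFlow code); `batch_mean` and `nan_example_indices` are hypothetical helpers, and examples are assumed to be flat lists of floats.

```python
import math


def batch_mean(batch):
    """Mean over every value in a batch (a list of flat lists of floats)."""
    values = [v for example in batch for v in example]
    return sum(values) / len(values)


def nan_example_indices(batch):
    """Indices of the examples in a batch that contain at least one NaN."""
    return [i for i, example in enumerate(batch) if any(math.isnan(v) for v in example)]


clean = [0.2, 0.4, 0.6]
bad = [0.2, float("nan"), 0.6]
batch = [clean] * 31 + [bad]  # 32 examples, only the last one is bad

print(math.isnan(batch_mean([clean])))  # False
print(math.isnan(batch_mean(batch)))    # True
print(nan_example_indices(batch))       # [31]
```

With a real batch fetched via sess.run, the same per-example check could be done on the returned NumPy array with numpy.isnan.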

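On the point that the number of examples per TFRecord file shouldn't matter for enqueue/dequeue: the simplified model below only moves examples between a buffer and the output batches, so file sizes affect ordering, not values. It is an assumption-laden sketch, NOT TensorFlow's RandomShuffleQueue implementation; `shuffle_batches` is a hypothetical function, and the file sizes are made up.

```python
import random


def shuffle_batches(examples, batch_size, min_after_dequeue, seed=1):
    """Simplified model of a shuffling queue (not TensorFlow's implementation).

    Examples are appended to a buffer; once it holds at least
    min_after_dequeue + batch_size items, a batch of randomly chosen items
    is removed. Values are only moved around, never modified.
    """
    rng = random.Random(seed)
    buffer = []
    for example in examples:
        buffer.append(example)
        if len(buffer) >= min_after_dequeue + batch_size:
            yield [buffer.pop(rng.randrange(len(buffer))) for _ in range(batch_size)]
    # Input exhausted: drain the remaining full batches and drop the remainder,
    # roughly like a closed queue with allow_smaller_final_batch=False.
    rng.shuffle(buffer)
    while len(buffer) >= batch_size:
        yield [buffer.pop() for _ in range(batch_size)]


# Hypothetical "files" holding different numbers of examples.
file_sizes = [7, 120, 3, 50]
examples = [float(i) for i in range(sum(file_sizes))]

batches = list(shuffle_batches(examples, batch_size=16, min_after_dequeue=20))
seen = [x for batch in batches for x in batch]
print(len(batches))                # 11
print(set(seen) <= set(examples))  # True
```

Under this model, 180 examples yield 11 full batches of 16, the last 4 examples are dropped, and every dequeued value is one that was enqueued.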

python tensorflow deep-learning

asked Nov 14 '18 at 1:57 by Michael Vander Meiden