Bulk Unload from Redshift to S3 Interrupted

I wrote a Python script that bulk-unloads every table in a schema to S3; the data scales to petabytes. The script was running fine until it was interrupted by a network disconnection.



Now I'm in the middle of an unload job, unsure how to resume from the last point of failure. I'm debating rerunning everything from the start (this time on a Jenkins agent), but I don't want to throw away the hours of unloading that have already completed. Is there a way to resume from where it stopped? My concern is that it's hard to identify the resume point when the files land in S3 as zipped parts.



What are the best practices I could use to avoid this in the future?



Are there any strategies for picking up from where it left off? The one idea I have so far is sketched after the code below.



The part of my script that performs the unload:



import sys

# Function to unload the data from tables into S3.
def unloadData(passed_tables, schema, cur):
    # Loop through each table from the specified schema.
    for table in passed_tables:
        # Extract the table name string from the result row.
        i = table[0]
        try:
            # UNLOAD query to migrate the table to S3.
            unload = '''unload ('select * from {0}.{1}')
                        to 's3://<bucket>/{2}/{3}/'
                        iam_role 'arn:aws:iam::*****:role/***';'''.format(schema, i, schema, i)
            cur.execute(unload)
            print("Unload in progress! Check the S3 bucket in a while to confirm.")
        except Exception as e:
            print("Failed to unload data! Try again or check the query.")
            print("Exception: %s" % str(e), file=sys.stderr)
            sys.exit(1)
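For resuming, the idea I'm leaning toward is to treat each table's S3 prefix as the unit of work: before unloading a table, list its prefix and skip it when files are already there. Here is a rough, untested sketch of that idea, assuming boto3 is available and the same schema/table prefix layout as above; the bucket parameter and the already_unloaded helper are hypothetical names of mine, not part of my current script:

import sys
import boto3

s3 = boto3.client("s3")

# Hypothetical helper: True if the table's prefix already contains at least
# one object, i.e. a previous run unloaded (at least part of) this table.
def already_unloaded(bucket, schema, table_name):
    resp = s3.list_objects_v2(Bucket=bucket,
                              Prefix="%s/%s/" % (schema, table_name),
                              MaxKeys=1)
    return resp.get("KeyCount", 0) > 0

def unloadDataResumable(passed_tables, schema, cur, bucket):
    for table in passed_tables:
        i = table[0]
        # Skip tables whose prefix already has files from the earlier run.
        if already_unloaded(bucket, schema, i):
            print("Skipping %s.%s: files already present in S3." % (schema, i))
            continue
        try:
            # Same UNLOAD as above, parameterized on the bucket.
            unload = '''unload ('select * from {0}.{1}')
                        to 's3://{2}/{3}/{4}/'
                        iam_role 'arn:aws:iam::*****:role/***';'''.format(schema, i, bucket, schema, i)
            cur.execute(unload)
            print("Unload of %s.%s submitted." % (schema, i))
        except Exception as e:
            print("Exception: %s" % str(e), file=sys.stderr)
            sys.exit(1)

One caveat I can see with this: a prefix that has files in it could also be a partial unload that was cut off mid-write, so it might be safer to delete that one prefix and redo just that table, or to add the MANIFEST option to the UNLOAD so each fully completed table leaves a manifest file to check for instead of bare data files.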

python database amazon-redshift database-migration fault-tolerance

asked Nov 15 '18 at 0:04 by Praneeth Turlapati (edited Nov 15 '18 at 0:40)