Airflow task with null status

I'm having an issue with Airflow when running it on a 24xlarge machine on EC2.

Note that the parallelism level is 256.

For the past few days, the DAG run has been finishing with status 'failed' for two undetermined reasons:

  1. Some tasks have the status 'upstream_failed', which is not true, because we can clearly see that all the previous steps were successful.
    [screenshot]

  2. Other tasks have the status 'null': they have not started yet, and they cause the DAG run to fail.
    [screenshot]

Note that the logs for both of these tasks are empty.

[screenshot]

Here are the task instance details for these cases:

[screenshot]

Any solutions, please?

  • Operator is also null?
    – mad_
    Nov 15 '18 at 10:44

  • Yes, it is always null.
    – I.Chorfi
    Nov 15 '18 at 17:13

python amazon-s3 airflow airflow-scheduler






asked Nov 15 '18 at 10:15 by I.Chorfi
edited Nov 16 '18 at 17:09 by I.Chorfi

2 Answers
This can happen when the task's state was changed manually (likely through the "Mark Success" option) or forced into a state (as with upstream_failed). In that case the task never receives a hostname value on its record, and it won't have any logs or a PID.
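
One way to look for such records is to query for task instances that have no hostname. The following is only a minimal sketch: it uses an in-memory SQLite table as a simplified stand-in for the Airflow metadata database, and the rows, the dag_id my_dag, and the helper name never_ran are assumptions made up for illustration.

```python
import sqlite3

# Simplified stand-in for the task_instance table; the real table has
# more columns, and these rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE task_instance (
        task_id TEXT, dag_id TEXT, execution_date TEXT,
        state TEXT, hostname TEXT, pid INTEGER
    )
    """
)
conn.executemany(
    "INSERT INTO task_instance VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("extract", "my_dag", "2018-11-14", "success", "ip-10-0-0-1", 4242),
        ("load", "my_dag", "2018-11-14", "upstream_failed", "", None),
        ("report", "my_dag", "2018-11-14", None, "", None),
    ],
)


def never_ran(conn, dag_id):
    """Return (task_id, state) for task instances without a hostname."""
    rows = conn.execute(
        """
        SELECT task_id, state
        FROM task_instance
        WHERE dag_id = ?
          AND (hostname IS NULL OR hostname = '')
        ORDER BY task_id
        """,
        (dag_id,),
    )
    return rows.fetchall()


suspects = never_ran(conn, "my_dag")
print(suspects)
```

A task instance that shows up here with a state such as upstream_failed or NULL was never picked up by a worker, which would be consistent with its empty logs.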







answered Nov 16 '18 at 1:51 by joeb

  • This is weird, because no manual intervention took place.
    – I.Chorfi
    Nov 16 '18 at 17:09

  • The upstream_failed state is applied to tasks that are unable to run because of failed dependencies.
    – joeb
    Nov 17 '18 at 0:49

The other case where I've experienced the second condition (tasks left with a 'null' status) is when the task has changed, and specifically when its operator type has changed.

I'm hoping you already got an answer and were able to move on. I've been stuck on this issue a few times in the last month, so I figured I would document what I ended up doing to solve it.

Example:

  • A task is originally an instance of a SubDagOperator
  • Requirements cause the operator type to change from SubDagOperator to PythonOperator
  • After the change, the PythonOperator task is set to state NULL

As best I can piece together, what's happening is:

  • Airflow introspects the operator associated with each task
  • Each task instance is recorded in the database table task_instance
    • This table has a column called operator
  • When the scheduler re-introspects the code, it looks for the task_instance with the correct operator type; not seeing it, it updates the associated database record(s) to state = 'removed'
  • When the DAG subsequently schedules, airflow

You can see the tasks impacted by this process with the query:

    SELECT *
    FROM task_instance
    WHERE state = 'removed';
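
A minimal sketch of running that query from Python, narrowed to a single DAG, might look like the following. It uses an in-memory SQLite table as a simplified stand-in for the Airflow metadata database; the rows, the dag_id my_pipeline, and the helper names removed_tasks and tasks_with_changed_operator are assumptions made up for illustration.

```python
import sqlite3

# Simplified stand-in for the task_instance table; rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE task_instance (
        task_id TEXT, dag_id TEXT, execution_date TEXT,
        state TEXT, operator TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO task_instance VALUES (?, ?, ?, ?, ?)",
    [
        ("transform", "my_pipeline", "2019-02-25", "removed", "SubDagOperator"),
        ("transform", "my_pipeline", "2019-02-26", None, "PythonOperator"),
        ("extract", "my_pipeline", "2019-02-26", "success", "PythonOperator"),
        ("transform", "other_dag", "2019-02-26", "removed", "BashOperator"),
    ],
)


def removed_tasks(conn, dag_id):
    """Return (task_id, execution_date, operator) for 'removed' rows of one DAG."""
    rows = conn.execute(
        """
        SELECT task_id, execution_date, operator
        FROM task_instance
        WHERE state = 'removed' AND dag_id = ?
        ORDER BY execution_date, task_id
        """,
        (dag_id,),
    )
    return rows.fetchall()


def tasks_with_changed_operator(conn, dag_id):
    """Return task_ids recorded with more than one operator type."""
    rows = conn.execute(
        """
        SELECT task_id
        FROM task_instance
        WHERE dag_id = ?
        GROUP BY task_id
        HAVING COUNT(DISTINCT operator) > 1
        ORDER BY task_id
        """,
        (dag_id,),
    )
    return [task_id for (task_id,) in rows]


removed = removed_tasks(conn, "my_pipeline")
changed = tasks_with_changed_operator(conn, "my_pipeline")
print(removed)
print(changed)
```

The second helper is just a convenience for spotting a task_id whose recorded operator differs between runs, which is the situation described in the example above.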


It looks like there's been work on this issue for Airflow 1.10:

  • https://github.com/aliceabe/incubator-airflow/commit/b6f6c732700d1e53793c96ca74b0e2dc1e10405e

That being said, based on the commits I can find, I'm not 100% sure it would resolve this issue. It seems like the general philosophy is still "when a DAG changes, you should increment / change the DAG name".

I don't love that solution, because it makes it hard to iterate on what is fundamentally one pipeline. The alternative I used was to (partially) follow the recommendations from Astronomer and "blow out" the DAG history. To do that, you need to:

  • Stop the scheduler
  • Delete the DAG's history
    • This should result in the DAG completely disappearing from the web UI
    • If it doesn't completely disappear, a scheduler is still running somewhere
  • Restart the scheduler
    • Note: if you're running the DAG on a schedule, be prepared for it to backfill / catch up / run its latest schedule, because you've removed the history
    • If you don't want it to do this, Astronomer's "Fast Forward a DAG" suggestions could be applied
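
The "delete the history" step could be sketched roughly as below. This is not Astronomer's procedure and not an Airflow API: it is a hypothetical helper, delete_dag_history, run against an in-memory SQLite stand-in with only two simplified tables. A real metadata database has more tables that reference a DAG, so treat the table list as an assumption and try anything like this on a backup first, with the scheduler stopped.

```python
import sqlite3

# Tables assumed to carry a dag_id column in this simplified stand-in.
HISTORY_TABLES = ("task_instance", "dag_run")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_instance (task_id TEXT, dag_id TEXT, state TEXT)")
conn.execute("CREATE TABLE dag_run (dag_id TEXT, execution_date TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO task_instance VALUES (?, ?, ?)",
    [
        ("transform", "my_pipeline", "removed"),
        ("extract", "my_pipeline", "success"),
        ("extract", "other_dag", "success"),
    ],
)
conn.executemany(
    "INSERT INTO dag_run VALUES (?, ?, ?)",
    [
        ("my_pipeline", "2019-02-26", "failed"),
        ("other_dag", "2019-02-26", "success"),
    ],
)


def delete_dag_history(conn, dag_id):
    """Delete all rows for dag_id and return the deleted row count per table."""
    deleted = {}
    with conn:  # commits on success, rolls back if a statement fails
        for table in HISTORY_TABLES:
            # Table names come from the fixed tuple above, never from user input.
            cur = conn.execute(
                "DELETE FROM {} WHERE dag_id = ?".format(table), (dag_id,)
            )
            deleted[table] = cur.rowcount
    return deleted


deleted = delete_dag_history(conn, "my_pipeline")
remaining = conn.execute("SELECT COUNT(*) FROM task_instance").fetchone()[0]
print(deleted, remaining)
```

Wrapping the deletes in one transaction means a failure partway through leaves the history untouched rather than half-deleted.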







answered Feb 27 at 0:45 by Adam Bethke