Pandas Series value_counts working differently for different counts

























For example:



import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.repeat(np.arange(1,7),3), columns=['A'])

df1.A.value_counts(sort=False)
1 3
2 3
3 3
4 3
5 3
6 3
Name: A, dtype: int64



df2 = pd.DataFrame(np.repeat(np.arange(1,7),100), columns=['A'])

df2.A.value_counts(sort=False)
1 100
2 100
3 100
4 100
5 100
6 100
Name: A, dtype: int64



In the examples above, value_counts works perfectly and gives the required result. With larger DataFrames, however, it gives a different output. The A values are already sorted and the counts are all equal, yet the order of the resulting index (the A values) changes after value_counts. Why does it behave correctly for small counts but not for large counts:



df3 = pd.DataFrame(np.repeat(np.arange(1,7),1000), columns=['A'])

df3.A.value_counts(sort=False)
4 1000
1 1000
5 1000
2 1000
6 1000
3 1000
Name: A, dtype: int64


Here I can fix the order with df3.A.value_counts(sort=False).sort_index() or df3.A.value_counts(sort=False).reindex(df3.A.unique()). What I want to know is why it behaves differently for different counts.
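For reference, both workarounds in runnable form (plain pandas, using df3 as defined above):

df3.A.value_counts(sort=False).sort_index()             # reorder by index value
df3.A.value_counts(sort=False).reindex(df3.A.unique())  # reorder by first appearance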



Using:



NumPy version: 1.15.2
Pandas version: 0.23.4









python pandas numpy

asked Nov 14 '18 at 6:22 by Sandeep Kadapa (edited Nov 14 '18 at 6:29)


1 Answer














          This is actually a known problem.



If you browse through the source code:

1. C:\ProgramData\Anaconda3\Lib\site-packages\pandas\core\algorithms.py, line 581, is the original implementation.

2. It calls _value_counts_arraylike for int64 values when bins=None.

3. That function makes the call keys, counts = htable.value_count_int64(values, dropna).

If you then look at the htable implementation, you will conclude that the keys come out in an arbitrary order, determined by how the hash table works.
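As a loose illustration of that point (plain CPython sets, not pandas internals; the pandas table is a separate khash implementation): sets are also hash tables, and their iteration order is likewise an accident of slot placement rather than a guarantee:

print(list({1, 2, 3, 4, 5, 6}))        # typically [1, 2, 3, 4, 5, 6]: small ints hash to themselves
print(list({1000, 2000, 3000, 4000}))  # order depends on table size and collision probing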



It's not a guarantee of ANY kind of ordering. Typically this routine sorts with the largest counts first, and that is almost always what you want.



I suppose they could change this so that sort=False means original ordering. I don't know whether that would actually break anything (and, done internally, it wouldn't be very costly, since the uniques are already known).



The order is changed in build_count_table_object() in pandas/hashtable.pyx: resizing the map relocates entries according to their hashed values. That also appears to explain the dependence on length: the count table is sized up front from the length of the input, so the same six keys can land in different slots for small and large inputs.



          Here is the full discussion
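If the order of first appearance is what you want, one alternative not mentioned in the answer is to count via groupby; its sort=False option keeps the group keys in order of first appearance rather than hash-table order:

import numpy as np
import pandas as pd

df3 = pd.DataFrame(np.repeat(np.arange(1, 7), 1000), columns=['A'])

# groupby(sort=False) yields group keys in order of first appearance,
# so the counts come back in the original order of A:
print(df3.groupby('A', sort=False).size())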






answered Nov 14 '18 at 6:44 by Vivek Kalyanarangan























• If the keys are in an arbitrary order, then ordering shouldn't be guaranteed for any number of counts, right? Yet when the counts are small the order is preserved, and when they are large it isn't. In fact it maintains the order up to 341 repeats per value and fails beyond that.

  – Sandeep Kadapa
  Nov 14 '18 at 7:05






• It looks like there is an update for retaining the original order (see the GitHub page); it appears to take the same approach as reindex(unique(values)). That is the only way, so one has to reindex to preserve the original ordering.

  – pygo
  Nov 14 '18 at 8:16
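The threshold mentioned in the first comment can be probed with a small loop (a quick empirical check, not a guarantee; the break point may differ across pandas versions, and newer versions may keep the order entirely):

import numpy as np
import pandas as pd

for n in range(1, 500):
    s = pd.Series(np.repeat(np.arange(1, 7), n))
    if list(s.value_counts(sort=False).index) != [1, 2, 3, 4, 5, 6]:
        print('order first breaks at', n, 'repeats per value')
        break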










