Sort column in pandas dataframe after rarity of values within groups










1














I have a pandas dataframe of scraped websites with a website identifier, a text and a label of the websites. A small number of websites have two labels, but since I want to train first a single label classifier, I would like to create a version of the data with only one label for every website (I'm aware that this is slightly problematic). The labels in my dataset are unbalanced (with some labels occurring very often and some being very rare). If I delete duplicate website IDs, I would like to delete labels that are very common first. This is how my dataset with several labels looks like:



ID Label Text
1 a some text
1 b other text
1 a data
2 a words
2 c more words
3 a text
3 b short text


My idea was to sort the label column within every website identifier by rarity of the label. For that I would first do value_counts(ascending = True) on the label column, to get a list of all labels sorted by rarity.



to_sort = [c, b, a]


I then would like to use that list to sort within every website ID by rarity. I'm not sure how to do that, though. The result should look like this:



ID Label Text
1 b other text
1 a some text
1 a data
2 c more words
2 a words
3 b short text
3 a text


I then would use df.drop_duplicates(subset = 'ID', keep = 'first'), to keep the label that is the most rare. How can I do the sorting?










share|improve this question


























    1














    I have a pandas dataframe of scraped websites with a website identifier, a text and a label of the websites. A small number of websites have two labels, but since I want to train first a single label classifier, I would like to create a version of the data with only one label for every website (I'm aware that this is slightly problematic). The labels in my dataset are unbalanced (with some labels occurring very often and some being very rare). If I delete duplicate website IDs, I would like to delete labels that are very common first. This is how my dataset with several labels looks like:



    ID Label Text
    1 a some text
    1 b other text
    1 a data
    2 a words
    2 c more words
    3 a text
    3 b short text


    My idea was to sort the label column within every website identifier by rarity of the label. For that I would first do value_counts(ascending = True) on the label column, to get a list of all labels sorted by rarity.



    to_sort = [c, b, a]


    I then would like to use that list to sort within every website ID by rarity. I'm not sure how to do that, though. The result should look like this:



    ID Label Text
    1 b other text
    1 a some text
    1 a data
    2 c more words
    2 a words
    3 b short text
    3 a text


    I then would use df.drop_duplicates(subset = 'ID', keep = 'first'), to keep the label that is the most rare. How can I do the sorting?










    share|improve this question
























      1












      1








      1







      I have a pandas dataframe of scraped websites with a website identifier, a text and a label of the websites. A small number of websites have two labels, but since I want to train first a single label classifier, I would like to create a version of the data with only one label for every website (I'm aware that this is slightly problematic). The labels in my dataset are unbalanced (with some labels occurring very often and some being very rare). If I delete duplicate website IDs, I would like to delete labels that are very common first. This is how my dataset with several labels looks like:



      ID Label Text
      1 a some text
      1 b other text
      1 a data
      2 a words
      2 c more words
      3 a text
      3 b short text


      My idea was to sort the label column within every website identifier by rarity of the label. For that I would first do value_counts(ascending = True) on the label column, to get a list of all labels sorted by rarity.



      to_sort = [c, b, a]


      I then would like to use that list to sort within every website ID by rarity. I'm not sure how to do that, though. The result should look like this:



      ID Label Text
      1 b other text
      1 a some text
      1 a data
      2 c more words
      2 a words
      3 b short text
      3 a text


      I then would use df.drop_duplicates(subset = 'ID', keep = 'first'), to keep the label that is the most rare. How can I do the sorting?










      share|improve this question













      I have a pandas dataframe of scraped websites with a website identifier, a text and a label of the websites. A small number of websites have two labels, but since I want to train first a single label classifier, I would like to create a version of the data with only one label for every website (I'm aware that this is slightly problematic). The labels in my dataset are unbalanced (with some labels occurring very often and some being very rare). If I delete duplicate website IDs, I would like to delete labels that are very common first. This is how my dataset with several labels looks like:



      ID Label Text
      1 a some text
      1 b other text
      1 a data
      2 a words
      2 c more words
      3 a text
      3 b short text


      My idea was to sort the label column within every website identifier by rarity of the label. For that I would first do value_counts(ascending = True) on the label column, to get a list of all labels sorted by rarity.



      to_sort = [c, b, a]


      I then would like to use that list to sort within every website ID by rarity. I'm not sure how to do that, though. The result should look like this:



      ID Label Text
      1 b other text
      1 a some text
      1 a data
      2 c more words
      2 a words
      3 b short text
      3 a text


      I then would use df.drop_duplicates(subset = 'ID', keep = 'first'), to keep the label that is the most rare. How can I do the sorting?







      python pandas






      share|improve this question













      share|improve this question











      share|improve this question




      share|improve this question










      asked Nov 12 at 15:17









      Amelia Bones

      285




      285






















          2 Answers
          2






          active

          oldest

          votes


















          0














          Use ordered categorical, so possible use sort_values:



          to_sort = list('cba')

          df['Label'] = pd.Categorical(df['Label'], ordered=True, categories=to_sort)

          df = df.sort_values(['ID','Label'])
          print (df)
          ID Label Text
          1 1 b other text
          0 1 a some text
          2 1 a data
          4 2 c more words
          3 2 a words
          6 3 b short text
          5 3 a text





          share|improve this answer
















          • 1




            Thanks a lot! Very helpful
            – Amelia Bones
            Nov 12 at 15:32


















          0














          You can achieve your goal by making the Label Column Categorical, then sort by ID and Label . Let's see it in practice.



          import pandas as pd
          df = pd.DataFrame( 'ID': [1,1,1,2,2,3,3], "Label": ["a", "b", "a", "a", "c", "a", "b"],
          'Text': ["some text", "other text","data", "words", "more words", "text", "short text"] )
          df
          ID Label Text
          0 1 a some text
          1 1 b other text
          2 1 a data
          3 2 a words
          4 2 c more words
          5 3 a text
          6 3 b short text


          Define your labels' order by doing :



          to_sort = df.Label.value_counts(ascending = True).index
          to_sort
          Index(['c', 'b', 'a'], dtype='object')


          Then make the Label column Categorical like this :



          df.Label = pd.Categorical(df.Label,categories = to_sort, ordered = True)


          Finally, sort by ID and Label :



          df.sort_values(["ID", "Label"]).reset_index(drop = True)

          ID Label Text
          0 1 b other text
          1 1 a some text
          2 1 a data
          3 2 c more words
          4 2 a words
          5 3 b short text
          6 3 a text





          share|improve this answer




















          • So why you post your answer? OP know df.Label.value_counts(ascending = True).index and .reset_index(drop = True) is only small difference with my answer :(
            – jezrael
            Nov 12 at 15:55










          • I wrote mine independently and only saw yours after posting mine. I avoid, however, to delete it because it is more detailed.
            – Neroksi
            Nov 12 at 16:40










          Your Answer






          StackExchange.ifUsing("editor", function ()
          StackExchange.using("externalEditor", function ()
          StackExchange.using("snippets", function ()
          StackExchange.snippets.init();
          );
          );
          , "code-snippets");

          StackExchange.ready(function()
          var channelOptions =
          tags: "".split(" "),
          id: "1"
          ;
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function()
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled)
          StackExchange.using("snippets", function()
          createEditor();
          );

          else
          createEditor();

          );

          function createEditor()
          StackExchange.prepareEditor(
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: true,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: 10,
          bindNavPrevention: true,
          postfix: "",
          imageUploader:
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          ,
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          );



          );













          draft saved

          draft discarded


















          StackExchange.ready(
          function ()
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53265095%2fsort-column-in-pandas-dataframe-after-rarity-of-values-within-groups%23new-answer', 'question_page');

          );

          Post as a guest















          Required, but never shown

























          2 Answers
          2






          active

          oldest

          votes








          2 Answers
          2






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          0














          Use ordered categorical, so possible use sort_values:



          to_sort = list('cba')

          df['Label'] = pd.Categorical(df['Label'], ordered=True, categories=to_sort)

          df = df.sort_values(['ID','Label'])
          print (df)
          ID Label Text
          1 1 b other text
          0 1 a some text
          2 1 a data
          4 2 c more words
          3 2 a words
          6 3 b short text
          5 3 a text





          share|improve this answer
















          • 1




            Thanks a lot! Very helpful
            – Amelia Bones
            Nov 12 at 15:32















          0














          Use ordered categorical, so possible use sort_values:



          to_sort = list('cba')

          df['Label'] = pd.Categorical(df['Label'], ordered=True, categories=to_sort)

          df = df.sort_values(['ID','Label'])
          print (df)
          ID Label Text
          1 1 b other text
          0 1 a some text
          2 1 a data
          4 2 c more words
          3 2 a words
          6 3 b short text
          5 3 a text





          share|improve this answer
















          • 1




            Thanks a lot! Very helpful
            – Amelia Bones
            Nov 12 at 15:32













          0












          0








          0






          Use ordered categorical, so possible use sort_values:



          to_sort = list('cba')

          df['Label'] = pd.Categorical(df['Label'], ordered=True, categories=to_sort)

          df = df.sort_values(['ID','Label'])
          print (df)
          ID Label Text
          1 1 b other text
          0 1 a some text
          2 1 a data
          4 2 c more words
          3 2 a words
          6 3 b short text
          5 3 a text





          share|improve this answer












          Use ordered categorical, so possible use sort_values:



          to_sort = list('cba')

          df['Label'] = pd.Categorical(df['Label'], ordered=True, categories=to_sort)

          df = df.sort_values(['ID','Label'])
          print (df)
          ID Label Text
          1 1 b other text
          0 1 a some text
          2 1 a data
          4 2 c more words
          3 2 a words
          6 3 b short text
          5 3 a text






          share|improve this answer












          share|improve this answer



          share|improve this answer










          answered Nov 12 at 15:19









          jezrael

          320k22259338




          320k22259338







          • 1




            Thanks a lot! Very helpful
            – Amelia Bones
            Nov 12 at 15:32












          • 1




            Thanks a lot! Very helpful
            – Amelia Bones
            Nov 12 at 15:32







          1




          1




          Thanks a lot! Very helpful
          – Amelia Bones
          Nov 12 at 15:32




          Thanks a lot! Very helpful
          – Amelia Bones
          Nov 12 at 15:32













          0














          You can achieve your goal by making the Label Column Categorical, then sort by ID and Label . Let's see it in practice.



          import pandas as pd
          df = pd.DataFrame( 'ID': [1,1,1,2,2,3,3], "Label": ["a", "b", "a", "a", "c", "a", "b"],
          'Text': ["some text", "other text","data", "words", "more words", "text", "short text"] )
          df
          ID Label Text
          0 1 a some text
          1 1 b other text
          2 1 a data
          3 2 a words
          4 2 c more words
          5 3 a text
          6 3 b short text


          Define your labels' order by doing :



          to_sort = df.Label.value_counts(ascending = True).index
          to_sort
          Index(['c', 'b', 'a'], dtype='object')


          Then make the Label column Categorical like this :



          df.Label = pd.Categorical(df.Label,categories = to_sort, ordered = True)


          Finally, sort by ID and Label :



          df.sort_values(["ID", "Label"]).reset_index(drop = True)

          ID Label Text
          0 1 b other text
          1 1 a some text
          2 1 a data
          3 2 c more words
          4 2 a words
          5 3 b short text
          6 3 a text





          share|improve this answer




















          • So why you post your answer? OP know df.Label.value_counts(ascending = True).index and .reset_index(drop = True) is only small difference with my answer :(
            – jezrael
            Nov 12 at 15:55










          • I wrote mine independently and only saw yours after posting mine. I avoid, however, to delete it because it is more detailed.
            – Neroksi
            Nov 12 at 16:40















          0














          You can achieve your goal by making the Label Column Categorical, then sort by ID and Label . Let's see it in practice.



          import pandas as pd
          df = pd.DataFrame( 'ID': [1,1,1,2,2,3,3], "Label": ["a", "b", "a", "a", "c", "a", "b"],
          'Text': ["some text", "other text","data", "words", "more words", "text", "short text"] )
          df
          ID Label Text
          0 1 a some text
          1 1 b other text
          2 1 a data
          3 2 a words
          4 2 c more words
          5 3 a text
          6 3 b short text


          Define your labels' order by doing :



          to_sort = df.Label.value_counts(ascending = True).index
          to_sort
          Index(['c', 'b', 'a'], dtype='object')


          Then make the Label column Categorical like this :



          df.Label = pd.Categorical(df.Label,categories = to_sort, ordered = True)


          Finally, sort by ID and Label :



          df.sort_values(["ID", "Label"]).reset_index(drop = True)

          ID Label Text
          0 1 b other text
          1 1 a some text
          2 1 a data
          3 2 c more words
          4 2 a words
          5 3 b short text
          6 3 a text





          share|improve this answer




















          • So why you post your answer? OP know df.Label.value_counts(ascending = True).index and .reset_index(drop = True) is only small difference with my answer :(
            – jezrael
            Nov 12 at 15:55










          • I wrote mine independently and only saw yours after posting mine. I avoid, however, to delete it because it is more detailed.
            – Neroksi
            Nov 12 at 16:40













          0












          0








          0






          You can achieve your goal by making the Label Column Categorical, then sort by ID and Label . Let's see it in practice.



          import pandas as pd
          df = pd.DataFrame( 'ID': [1,1,1,2,2,3,3], "Label": ["a", "b", "a", "a", "c", "a", "b"],
          'Text': ["some text", "other text","data", "words", "more words", "text", "short text"] )
          df
          ID Label Text
          0 1 a some text
          1 1 b other text
          2 1 a data
          3 2 a words
          4 2 c more words
          5 3 a text
          6 3 b short text


          Define your labels' order by doing :



          to_sort = df.Label.value_counts(ascending = True).index
          to_sort
          Index(['c', 'b', 'a'], dtype='object')


          Then make the Label column Categorical like this :



          df.Label = pd.Categorical(df.Label,categories = to_sort, ordered = True)


          Finally, sort by ID and Label :



          df.sort_values(["ID", "Label"]).reset_index(drop = True)

          ID Label Text
          0 1 b other text
          1 1 a some text
          2 1 a data
          3 2 c more words
          4 2 a words
          5 3 b short text
          6 3 a text





          share|improve this answer












          You can achieve your goal by making the Label Column Categorical, then sort by ID and Label . Let's see it in practice.



          import pandas as pd
          df = pd.DataFrame( 'ID': [1,1,1,2,2,3,3], "Label": ["a", "b", "a", "a", "c", "a", "b"],
          'Text': ["some text", "other text","data", "words", "more words", "text", "short text"] )
          df
          ID Label Text
          0 1 a some text
          1 1 b other text
          2 1 a data
          3 2 a words
          4 2 c more words
          5 3 a text
          6 3 b short text


          Define your labels' order by doing :



          to_sort = df.Label.value_counts(ascending = True).index
          to_sort
          Index(['c', 'b', 'a'], dtype='object')


          Then make the Label column Categorical like this :



          df.Label = pd.Categorical(df.Label,categories = to_sort, ordered = True)


          Finally, sort by ID and Label :



          df.sort_values(["ID", "Label"]).reset_index(drop = True)

          ID Label Text
          0 1 b other text
          1 1 a some text
          2 1 a data
          3 2 c more words
          4 2 a words
          5 3 b short text
          6 3 a text






          share|improve this answer












          share|improve this answer



          share|improve this answer










          answered Nov 12 at 15:53









          Neroksi

          382111




          382111











          • So why you post your answer? OP know df.Label.value_counts(ascending = True).index and .reset_index(drop = True) is only small difference with my answer :(
            – jezrael
            Nov 12 at 15:55










          • I wrote mine independently and only saw yours after posting mine. I avoid, however, to delete it because it is more detailed.
            – Neroksi
            Nov 12 at 16:40
















          • So why you post your answer? OP know df.Label.value_counts(ascending = True).index and .reset_index(drop = True) is only small difference with my answer :(
            – jezrael
            Nov 12 at 15:55










          • I wrote mine independently and only saw yours after posting mine. I avoid, however, to delete it because it is more detailed.
            – Neroksi
            Nov 12 at 16:40















          So why you post your answer? OP know df.Label.value_counts(ascending = True).index and .reset_index(drop = True) is only small difference with my answer :(
          – jezrael
          Nov 12 at 15:55




          So why you post your answer? OP know df.Label.value_counts(ascending = True).index and .reset_index(drop = True) is only small difference with my answer :(
          – jezrael
          Nov 12 at 15:55












          I wrote mine independently and only saw yours after posting mine. I avoid, however, to delete it because it is more detailed.
          – Neroksi
          Nov 12 at 16:40




          I wrote mine independently and only saw yours after posting mine. I avoid, however, to delete it because it is more detailed.
          – Neroksi
          Nov 12 at 16:40

















          draft saved

          draft discarded
















































          Thanks for contributing an answer to Stack Overflow!


          • Please be sure to answer the question. Provide details and share your research!

          But avoid


          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.

          To learn more, see our tips on writing great answers.





          Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


          Please pay close attention to the following guidance:


          • Please be sure to answer the question. Provide details and share your research!

          But avoid


          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.

          To learn more, see our tips on writing great answers.




          draft saved


          draft discarded














          StackExchange.ready(
          function ()
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53265095%2fsort-column-in-pandas-dataframe-after-rarity-of-values-within-groups%23new-answer', 'question_page');

          );

          Post as a guest















          Required, but never shown





















































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown

































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown







          這個網誌中的熱門文章

          How to read a connectionString WITH PROVIDER in .NET Core?

          Node.js Script on GitHub Pages or Amazon S3

          Museum of Modern and Contemporary Art of Trento and Rovereto