How to map numeric data into categories / bins in Pandas dataframe





















I've just started coding in python, and my general coding skills are fairly rusty :( so please be a bit patient



I have a pandas dataframe:



SamplePandas



It has around 3m rows. There are 3 kinds of age_units: Y, D, W for years, Days & Weeks. Any individual over 1 year old has an age unit of Y and my first grouping I want is <2y old so all I have to test for in Age Units is Y...



I want to create a new column AgeRange and populate with the following ranges:



  • <2

  • 2 - 18

  • 18 - 35

  • 35 - 65

  • 65+

so I wrote a function



def agerange(values):
    for i in values:
        if complete.Age_units == 'Y':
            if complete.Age > 1 AND < 18 return '2-18'
            elif complete.Age > 17 AND < 35 return '18-35'
            elif complete.Age > 34 AND < 65 return '35-65'
            elif complete.Age > 64 return '65+'
        else return '< 2'


I thought if I passed in the dataframe as a whole I would get back what I needed and then could create the column I wanted something like this:



agedetails['age_range'] = ageRange(agedetails)


BUT when I try to run the first code to create the function I get:



 File "<ipython-input-124-cf39c7ce66d9>", line 4
if complete.Age > 1 AND complete.Age < 18 return '2-18'
^
SyntaxError: invalid syntax


Clearly it is not accepting the AND - but I thought I heard in class I could use AND like this? I must be mistaken but then what would be the right way to do this?



So after getting that error, I'm not even sure whether passing in a dataframe like this will throw an error as well. I am guessing probably yes. In which case - how would I make that work as well?



I am looking to learn the best method, but part of the best method for me is keeping it simple even if that means doing things in a couple of steps...

































  • great answer below by @jpp - also, regarding your invalid syntax: AND should be lowercase, and the if condition needs to end with a colon, so it should be if complete.Age > 1 and complete.Age < 18: return '2-18'
    – gyx-hh
    Mar 20 at 10:59
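
The syntax fix described in the comment above can be sketched as a plain per-value function. This is only an illustration: the name `age_range` and its two scalar arguments are assumptions rather than the asker's code, and whole-number ages are assumed. It classifies one value at a time, so on 3m rows it would be much slower than a vectorised approach.

```python
def age_range(age, age_units):
    """Classify a single age; hypothetical helper for illustration only."""
    # Ages recorded in days or weeks belong to individuals under one year old.
    if age_units != 'Y':
        return '<2'
    # Python's boolean operator is lowercase `and`; every condition ends with a colon.
    if age > 1 and age < 18:
        return '2-18'
    elif age > 17 and age < 35:
        return '18-35'
    elif age > 34 and age < 65:
        return '35-65'
    elif age > 64:
        return '65+'
    else:
        return '<2'


print(age_range(17, 'Y'))  # 2-18
print(age_range(40, 'W'))  # <2
```

Each comparison names the variable on both sides of `and`; Python also accepts the chained form `1 < age < 18`.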





















python python-2.7 pandas numpy dataframe














edited Apr 19 at 0:14 by jpp
asked Mar 20 at 10:48 by kiltannen























1 Answer
Accepted answer (score 8)










With Pandas, you should avoid row-wise operations, as these usually involve an inefficient Python-level loop. Here are a couple of alternatives.



Pandas: pd.cut



As @JonClements suggests, you can use pd.cut for this, the benefit here being that your new column becomes a Categorical.



You only need to define your boundaries (including np.inf) and category names, then apply pd.cut to the desired numeric column.



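import numpy as np
import pandas as pd

# df is the DataFrame from the question, with a numeric Age column
bins = [0, 2, 18, 35, 65, np.inf]
names = ['<2', '2-18', '18-35', '35-65', '65+']

# right=False makes each bin include its lower edge, so an age of 18 maps to '18-35'
df['AgeRange'] = pd.cut(df['Age'], bins, labels=names, right=False)

print(df.dtypes)

# Age             int64
# Age_units      object
# AgeRange     category
# dtype: object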
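
Which side of each bin is closed matters for ages that sit exactly on an edge. pd.cut closes bins on the right by default, and its `right=False` option closes them on the left instead. A small sketch with made-up ages (the series and variable names here are illustrative, not from the question) shows the difference:

```python
import numpy as np
import pandas as pd

# Made-up ages sitting on or next to the bin edges.
ages = pd.Series([1, 2, 17, 18, 65])
bins = [0, 2, 18, 35, 65, np.inf]
names = ['<2', '2-18', '18-35', '35-65', '65+']

# Default right=True: intervals are (0, 2], (2, 18], ... so an edge value joins the lower bin.
right_closed = pd.cut(ages, bins, labels=names)

# right=False: intervals are [0, 2), [2, 18), ... so an edge value joins the upper bin.
left_closed = pd.cut(ages, bins, labels=names, right=False)

print(pd.DataFrame({'Age': ages,
                    'right=True': right_closed,
                    'right=False': left_closed}))
```

The ranges listed in the question (2 up to 17 in '2-18', 18 up to 34 in '18-35', and so on) correspond to the `right=False` column.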


NumPy: np.digitize



np.digitize provides another clean solution. The idea is to define your boundaries and names, create a dictionary, then apply np.digitize to your Age column. Finally, use your dictionary to map your category names.



Note that np.digitize assigns a value that falls exactly on a boundary to the bin starting at that boundary (the default right=False), so an age of 18 maps to '18-35'.



import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [99, 53, 71, 84, 84],
                   'Age_units': ['Y', 'Y', 'Y', 'Y', 'Y']})

bins = [0, 2, 18, 35, 65]
names = ['<2', '2-18', '18-35', '35-65', '65+']

# np.digitize returns 1 for the first bin, so number the names from 1
d = dict(enumerate(names, 1))

# look up the label for each bin index
df['AgeRange'] = np.vectorize(d.get)(np.digitize(df['Age'], bins))


Result



   Age Age_units AgeRange
0   99         Y      65+
1   53         Y    35-65
2   71         Y      65+
3   84         Y      65+
4   84         Y      65+
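
Neither snippet looks at Age_units. The question states that only individuals over one year old carry the unit Y, so rows recorded in days or weeks all belong to the youngest group. A minimal sketch of that extra step follows; the sample rows are made up, and `right=False` edges are assumed so that an age of exactly 2 lands in '2-18':

```python
import numpy as np
import pandas as pd

# Made-up sample rows; only the column names come from the question.
df = pd.DataFrame({'Age': [3, 40, 10, 70, 1],
                   'Age_units': ['D', 'W', 'Y', 'Y', 'Y']})

bins = [0, 2, 18, 35, 65, np.inf]
names = ['<2', '2-18', '18-35', '35-65', '65+']

# First bin every row as if Age were measured in years.
df['AgeRange'] = pd.cut(df['Age'], bins, labels=names, right=False)

# Then overwrite rows measured in days or weeks: they are all under one year old.
df.loc[df['Age_units'] != 'Y', 'AgeRange'] = '<2'

print(df)
```

Because '<2' is already one of the categories, the assignment keeps the column as a Categorical.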























  • Or... add float('inf') (or np.inf) to the end of bins, and then use: pd.cut(df.Age, bins, labels=names)... That way you'll get a categorical series instead of a string...
    – Jon Clements
    Mar 20 at 11:02

  • @jpp This is BRILLIANT! Thank you for taking the trouble to provide such a clear and well thought through response, and adding in the bins/ pandas cut method with detail is the perfect icing on the cake. This is the simplest most elegant approach, and I am certainly using it thank you. I had seen somewhere in all the looking I was doing something about Bins - but hadn't figured out how to apply it, and certainly not how easy it would be! Thanks again!
    – kiltannen
    Mar 20 at 20:37


















edited Nov 27 at 19:48 by user3483203
answered Mar 20 at 10:55 by jpp






