Conditionally Select Rows within a Group with Data.Table
I am looking for solutions using data.table. I have a data.table with the following columns:
data <- data.frame(GROUP=c(3,3,4,4,5,6),
YEAR=c(1979,1985,1999,2011,2012,1994),
NAME=c("S","A","J","L","G","A"))
data <- as.data.table(data)
Data.table:
GROUP YEAR NAME
3 1979 Smith
3 1985 Anderson
4 1999 James
4 2011 Liam
5 2012 George
6 1994 Adams
For each group I want to select one row using the following rule:
- If there is a year > 2000, select the row with the minimum year above 2000.
- If there is no year > 2000, select the row with the maximum year.
Desired output:
GROUP YEAR NAME
3 1985 Anderson
4 2011 Liam
5 2012 George
6 1994 Adams
Thanks! I have been struggling with this for a while.
r data.table
edited Nov 12 at 6:01
asked Nov 12 at 4:16
CFB
3 Answers
data.table should be a lot simpler if you subset the special .I row counter:
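library(data.table)
setDT(data)  # convert in place in case data is still a data.frame
data[
  data[
    ,
    if (any(YEAR > 2000))
      # restrict to years above 2000 first, then take the smallest of those
      .I[YEAR > 2000][which.min(YEAR[YEAR > 2000])] else
      .I[which.max(YEAR)],
    by = GROUP
  ]$V1
]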
# GROUP YEAR NAME
#1: 3 1985 A
#2: 4 2011 L
#3: 5 2012 G
#4: 6 1994 A
Thanks to @r2evans for the background info - .I is an integer vector equivalent to seq_len(nrow(x)).
Ref: http://rdrr.io/cran/data.table/man/special-symbols.html
So, all I'm doing here is getting the matching row index for the whole of data for each of the calculations at each by= level. Then using these row indexes to subset data again.
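To make the two steps concrete, here is a small sketch that keeps the intermediate index table visible. The table dt and its extra row (group 4, year 2015, name "M") are made up for illustration and are not part of the question; the extra row is there so that one group has two years above 2000. The names idx, result and above are likewise just for this example.

```r
library(data.table)

# Hypothetical data: the question's rows plus one extra row (4, 2015, "M")
dt <- data.table(GROUP = c(3, 3, 4, 4, 4, 5, 6),
                 YEAR  = c(1979, 1985, 1999, 2015, 2011, 2012, 1994),
                 NAME  = c("S", "A", "J", "M", "L", "G", "A"))

# Step 1: one row number per group. Inside a grouped call, .I holds the
# row numbers of the current group's rows within the full table.
idx <- dt[, {
  above <- YEAR > 2000
  if (any(above)) .I[above][which.min(YEAR[above])] else .I[which.max(YEAR)]
}, by = GROUP]

# Step 2: the unnamed result column is called V1; use it to subset dt
result <- dt[idx$V1]
print(idx)
print(result)
```

Printing idx before subsetting is a quick way to check that each group contributes exactly one row number.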
edited Nov 12 at 6:05
answered Nov 12 at 5:09
thelatemail
I get an error: Error in [.data.frame(data, , if (any(YEAR > 2000)) .I[which.min(2000 - : unused argument (by = GROUP)
Does it work exactly as is for you?
– RAB Nov 12 at 5:22
@user10626943 - the post is tagged data.table so I assumed OP was already working with a data.table - if not, you need to convert first. Have edited.
– thelatemail Nov 12 at 5:24
For late-comers, .I is an integer vector equivalent to seq_len(nrow(x)). Ref: rdrr.io/cran/data.table/man/special-symbols.html (I had to look it up :-)
– r2evans Nov 12 at 5:53
You could also do a couple rolling joins:
res = unique(data[, .(GROUP)])
# get row with YEAR above 2000
res[, w := data[c(.SD, YEAR = 2000), on=.(GROUP, YEAR), roll=-Inf, which=TRUE]]
# if none found, get row with nearest YEAR below
res[is.na(w), w := data[c(.SD, YEAR = 2000), on=.(GROUP, YEAR), roll=Inf, which=TRUE]]
# subset by row numbers
data[res$w]
GROUP YEAR NAME
1: 3 1985 A
2: 4 2011 L
3: 5 2012 G
4: 6 1994 A
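As a rough illustration of what the two roll values do, the sketch below looks up YEAR = 2000 for two groups only. The names dt, lookup, w_next and w_prev are made up for this example. Note that a rolling join also accepts an exact match, so a row with YEAR equal to 2000 would be picked up by the first lookup even though the question's rule is strictly "> 2000".

```r
library(data.table)

# The question's data, rebuilt here so the sketch is self-contained
dt <- data.table(GROUP = c(3, 3, 4, 4, 5, 6),
                 YEAR  = c(1979, 1985, 1999, 2011, 2012, 1994),
                 NAME  = c("S", "A", "J", "L", "G", "A"))

# Hypothetical lookup table: ask about the year 2000 in groups 3 and 4
lookup <- data.table(GROUP = c(3, 4), YEAR = 2000)

# roll = -Inf: take the next observation at or after 2000 (NA if there is none)
w_next <- dt[lookup, on = .(GROUP, YEAR), roll = -Inf, which = TRUE]

# roll = Inf: take the last observation at or before 2000
w_prev <- dt[lookup, on = .(GROUP, YEAR), roll = Inf, which = TRUE]

print(w_next)
print(w_prev)
```

With which = TRUE the join returns row numbers of dt rather than the joined rows, which is why the answer can fill the gaps from the first lookup with the second one and subset once at the end.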
answered Nov 12 at 14:22
Frank
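Using the dplyr package I got your output like this (though it may not be the simplest answer):
library(dplyr)
data <- data.frame(GROUP=c(3,3,4,4,5,6),
                   YEAR=c(1979,1985,1999,2011,2012,1994),
                   NAME=c("S","A","J","L","G","A"))
data %>%
  # latest year per group among years not above 2000
  filter(YEAR <= 2000) %>%
  group_by(GROUP) %>%
  summarise(MAX = max(YEAR)) %>%
  # earliest year per group among years above 2000
  full_join(data %>%
              filter(YEAR > 2000) %>%
              group_by(GROUP) %>%
              summarise(MIN = min(YEAR)),
            by = "GROUP") %>%
  # prefer MIN when the group has a year above 2000, else fall back to MAX
  mutate(YEAR = ifelse(is.na(MIN), MAX, MIN)) %>%
  select(GROUP, YEAR) %>%
  # bring NAME back in and sort by group
  inner_join(data, by = c("GROUP", "YEAR")) %>%
  arrange(GROUP)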
Results:
GROUP YEAR NAME
3 1985 A
4 2011 L
5 2012 G
6 1994 A
EDIT: Sorry, my first answer didn't take into account the min/max conditions. Hope this helps
edited Nov 12 at 4:48
answered Nov 12 at 4:33
RAB
Thanks for the tidyverse solution! And for the formatting pointers.
– CFB Nov 12 at 5:47