Speed up Tidyverse Calculation of Multiple Quantiles





















I have this great little function summarise_posterior (given below) as part of my package driver (available here).



The function is great and super useful. The one problem is that I have been working with larger and larger data sets, and it can be very slow. In short, my question is: is there a tidyverse-esque way of speeding this up while retaining the key flexibility of this function (see the examples in the documentation)?

At least one key speedup could come from computing all the quantiles in a single call rather than calling the quantile function over and over. The latter, which is what is currently implemented, is probably re-sorting the same vectors over and over.



#' Shortcut for summarize variable with quantiles and mean
#'
#' @param data tidy data frame
#' @param var variable name (unquoted) to be summarised
#' @param ... other expressions to pass to summarise
#'
#' @return data.frame
#' @export
#' @details Notation: \code{pX} refers to the \code{X}\% quantile
#' @import dplyr
#' @importFrom stats quantile
#' @importFrom rlang quos quo UQ
#' @examples
#' d <- data.frame("a"=sample(1:10, 50, TRUE),
#'                 "b"=rnorm(50))
#'
#' # Summarize posterior for b over grouping of a and also calculate
#' # minimum of b (in addition to normal statistics returned)
#' d <- dplyr::group_by(d, a)
#' summarise_posterior(d, b, mean.b = mean(b), min=min(b))
summarise_posterior <- function(data, var, ...) {
  qvar <- enquo(var)
  qs <- quos(...)

  data %>%
    summarise(p2.5 = quantile(!!qvar, prob=0.025),
              p25 = quantile(!!qvar, prob=0.25),
              p50 = quantile(!!qvar, prob=0.5),
              mean = mean(!!qvar),
              p75 = quantile(!!qvar, prob=0.75),
              p97.5 = quantile(!!qvar, prob=0.975),
              !!!qs)
}



Rcpp back-end options are also more than welcome.



Thanks!










      r dplyr tidyverse rlang
















asked Nov 11 at 3:03 by jds






















          2 Answers



          accepted










          Here's a solution that makes use of nesting to avoid calling quantile multiple times. Any time you need to store a vector of results inside summarize, simply wrap it inside list. Afterwards, you can unnest these results, pair them up against their names, and use spread to put them in separate columns:



          summarise_posterior2 <- function(data, var, ...) {
            qvar <- ensym(var)
            vq <- c(0.025, 0.25, 0.5, 0.75, 0.975)

            summarise( data, .qq = list(quantile(!!qvar, vq, names=FALSE)),
                       .nms = list(str_c("p", vq*100)), mean = mean(!!qvar), ... ) %>%
              unnest %>% spread( .nms, .qq )
          }
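As a side note, the same "one quantile call per group" idea can be sketched without the unnest/spread step. The version below is only an illustration, not the answerer's code: it assumes dplyr 1.0.0 or later, where an unnamed data frame returned inside summarise() is unpacked into columns, and the name summarise_posterior3 is made up for this example.

```r
library(dplyr)

# Hypothetical name; not part of the original package or of either answer.
summarise_posterior3 <- function(data, var, ...) {
  probs <- c(0.025, 0.25, 0.5, 0.75, 0.975)
  summarise(
    data,
    # One quantile() call per group. The unnamed one-row tibble is
    # unpacked into the columns p2.5, p25, p50, p75 and p97.5.
    tibble::as_tibble_row(
      setNames(quantile({{ var }}, probs, names = FALSE),
               paste0("p", probs * 100))
    ),
    mean = mean({{ var }}),
    ...
  )
}

set.seed(1)
d <- data.frame(a = sample(1:10, 50, TRUE), b = rnorm(50))
res <- d %>% group_by(a) %>% summarise_posterior3(b, min = min(b))
res
```

Note that the quantile columns come out grouped together before mean here, so the column order differs from the original function. Whether this is faster than the nested version has not been measured.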



          This doesn't give you nearly the same speedup as @jay.sf's solution:



          d <- data.frame("a"=sample(1:10, 5e5, TRUE), "b"=rnorm(5e5)) 
          microbenchmark::microbenchmark( f1 = summarise_posterior(d, b, mean.b = mean(b), min=min(b)),
          f2 = summarise_posterior2(d, b, mean.b = mean(b), min=min(b)) )
          # Unit: milliseconds
          # expr min lq mean median uq max neval
          # f1 49.06697 50.81422 60.75100 52.43030 54.17242 200.2961 100
          # f2 29.05209 29.66022 32.32508 30.84492 32.56364 138.9579 100


          but it will work correctly with group_by and inside nested functions (whereas substitute-based solutions will break when nested).



          r1 <- d %>% dplyr::group_by(a) %>% summarise_posterior(b, mean.b = mean(b), min=min(b))
          r2 <- d %>% dplyr::group_by(a) %>% summarise_posterior2(b, mean.b = mean(b), min=min(b))
          all_equal( r1, r2 ) # TRUE


          If you profile the code, you can see where the major hang ups are



          Rprof()
          for( i in 1:100 )
            d %>% dplyr::group_by(a) %>% summarise_posterior2(b, mean.b = mean(b), min=min(b))
          Rprof(NULL)
          summaryRprof()$by.self %>% head
          # self.time self.pct total.time total.pct
          # ".Call" 1.84 49.73 3.18 85.95
          # "sort.int" 0.94 25.41 1.12 30.27
          # "eval" 0.08 2.16 3.64 98.38
          # "tryCatch" 0.08 2.16 1.44 38.92
          # "anyNA" 0.08 2.16 0.08 2.16
          # "structure" 0.04 1.08 0.08 2.16


          The .Call corresponds mainly to the C++ backend of dplyr, while sort.int is the worker behind quantile(). @jay.sf's solution gains a major speedup by decoupling from dplyr, but it also loses the associated flexibility (e.g., integration with group_by). Ultimately, it's up to you to decide which is more important.
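To make the trade-off concrete, here is a rough base-R sketch of what "decoupling from dplyr" while still handling groups might look like. It is an assumption-laden illustration rather than either answerer's code: group_quantiles is a hypothetical helper, it returns a plain matrix instead of a data frame, and it does not accept extra summarise-style expressions.

```r
# Hypothetical helper: per-group quantiles and mean using only base R.
group_quantiles <- function(x, g, probs = c(0.025, 0.25, 0.5, 0.75, 0.975)) {
  # One quantile() call per group returns all requested quantiles at once
  stats_one <- function(v) c(quantile(v, probs, names = FALSE), mean(v))
  out <- t(vapply(split(x, g), stats_one, numeric(length(probs) + 1)))
  colnames(out) <- c(paste0("p", probs * 100), "mean")
  out  # one row per group, row names are the group labels
}

set.seed(1)
d <- data.frame(a = sample(1:10, 5e4, TRUE), b = rnorm(5e4))
res <- group_quantiles(d$b, d$a)
head(res, 3)
```

No timings are claimed for this sketch; it would need to be benchmarked against the dplyr versions on the data of interest.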



















            Why not something like this?



            summarise_posterior2 <- function(data, x, ...) {
              x <- deparse(substitute(x))
              nm <- deparse(substitute(...))
              M <- matrix(unlist(data[, x]), ncol=length(data[, x]))
              qs <- t(sapply(list(...), do.call, list(M)))
              'rownames<-'(cbind(p2.5 = quantile(M, prob=0.025),
                                 p25 = quantile(M, prob=0.25),
                                 p50 = quantile(M, prob=0.5),
                                 mean = mean(M),
                                 p75 = quantile(M, prob=0.75),
                                 p97.5 = quantile(M, prob=0.975), qs), NULL
              )
            }


            > summarise_posterior2(df1, X4, mean=mean, mean=mean, min=min)
            p2.5 p25 p50 mean p75 p97.5 mean mean min
            [1,] 28.2 30 32 32 34 35.8 32 32 28

            # > summarise_posterior(df1, X4, mean.b = mean(X4), min=min(X4))
            # p2.5 p25 p50 mean p75 p97.5 mean.b min
            # 1 28.2 30 32 32 34 35.8 32 28


            Runs six times faster:



            > microbenchmark::microbenchmark(orig.fun=summarise_posterior(df1, X4, max(X4), min(X4)),
            + new.fun=summarise_posterior2(df1, X4, max=max, min=min))
            Unit: microseconds
            expr min lq mean median uq max neval
            orig.fun 4289.541 4324.490 4514.1634 4362.500 4411.225 8928.316 100
            new.fun 716.071 734.694 802.9949 755.867 778.317 4759.439 100


            Data



            df1 <- data.frame(matrix(1:144, 9, 16))
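For comparison, a variant that asks quantile() for all five probabilities in a single call could look roughly like the sketch below. This is not the answerer's function: summarise_posterior_base is a hypothetical name, the column is passed as a string for simplicity, the result is a named vector, and, like the function above, it ignores group_by.

```r
# Hypothetical variant: one quantile() call instead of five.
summarise_posterior_base <- function(data, x, ...) {
  v <- data[[x]]  # column selected by name (a string, unlike the answer above)
  probs <- c(0.025, 0.25, 0.5, 0.75, 0.975)
  q <- quantile(v, probs, names = FALSE)
  # Extra summaries are passed as named functions, e.g. max = max
  extra <- vapply(list(...), function(f) as.numeric(f(v)), numeric(1))
  c(p2.5 = q[1], p25 = q[2], p50 = q[3], mean = mean(v),
    p75 = q[4], p97.5 = q[5], extra)
}

df1 <- data.frame(matrix(1:144, 9, 16))
res <- summarise_posterior_base(df1, "X4", max = max, min = min)
res
```

On df1 the quantiles and mean agree with the output shown above (28.2, 30, 32, 32, 34, 35.8); any speed difference relative to five separate calls is untested here.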

























            • Do you think it would be faster to put all those quantile calls into one? It seems like your approach (while a nice improvement - don't get me wrong) still is doing about 5x extra work due to repeated sorting right?
              – jds
              Nov 11 at 19:55










            • Also will it work with group_by?
              – jds
              Nov 11 at 19:56










Accepted answer (the first answer above): answered Nov 16 at 3:38 by Artem Sokolov






















                    up vote
                    1
                    down vote













                    Why not something like this?



                    summarise_posterior2 <- function(data, x, ...)
                    x <- deparse(substitute(x))
                    nm <- deparse(substitute(...))
                    M <- matrix(unlist(data[, x]), ncol=length(data[, x]))
                    qs <- t(sapply(list(...), do.call, list(M)))
                    'rownames<-'(cbind(p2.5 = quantile(M, prob=0.025),
                    p25 = quantile(M, prob=0.25),
                    p50 = quantile(M, prob=0.5),
                    mean = mean(M),
                    p75 = quantile(M, prob=0.75),
                    p97.5 = quantile(M, prob=0.975), qs), NULL
                    )


                    > summarise_posterior2(df1, X4, mean=mean, mean=mean, min=min)
                    p2.5 p25 p50 mean p75 p97.5 mean mean min
                    [1,] 28.2 30 32 32 34 35.8 32 32 28

                    # > summarise_posterior(df1, X4, mean.b = mean(X4), min=min(X4))
                    # p2.5 p25 p50 mean p75 p97.5 mean.b min
                    # 1 28.2 30 32 32 34 35.8 32 28


                    Runs six times faster:



                    > microbenchmark::microbenchmark(orig.fun=summarise_posterior(df1, X4, max(X4), min(X4)),
                    + new.fun=summarise_posterior2(df1, X4, max=max, min=min))
                    Unit: microseconds
                    expr min lq mean median uq max neval
                    orig.fun 4289.541 4324.490 4514.1634 4362.500 4411.225 8928.316 100
                    new.fun 716.071 734.694 802.9949 755.867 778.317 4759.439 100


                    Data



                    df1 <- data.frame(matrix(1:144, 9, 16))





                    share|improve this answer




















                    • Do you think it would be faster to put all those quantile calls into one? It seems like your approach (while a nice improvement - don't get me wrong) still is doing about 5x extra work due to repeated sorting right?
                      – jds
                      Nov 11 at 19:55










                    • Also will it work with group_by?
                      – jds
                      Nov 11 at 19:56














                    up vote
                    1
                    down vote













                    Why not something like this?



                    summarise_posterior2 <- function(data, x, ...)
                    x <- deparse(substitute(x))
                    nm <- deparse(substitute(...))
                    M <- matrix(unlist(data[, x]), ncol=length(data[, x]))
                    qs <- t(sapply(list(...), do.call, list(M)))
                    'rownames<-'(cbind(p2.5 = quantile(M, prob=0.025),
                    p25 = quantile(M, prob=0.25),
                    p50 = quantile(M, prob=0.5),
                    mean = mean(M),
                    p75 = quantile(M, prob=0.75),
                    p97.5 = quantile(M, prob=0.975), qs), NULL
                    )


                    > summarise_posterior2(df1, X4, mean=mean, mean=mean, min=min)
                    p2.5 p25 p50 mean p75 p97.5 mean mean min
                    [1,] 28.2 30 32 32 34 35.8 32 32 28

                    # > summarise_posterior(df1, X4, mean.b = mean(X4), min=min(X4))
                    # p2.5 p25 p50 mean p75 p97.5 mean.b min
                    # 1 28.2 30 32 32 34 35.8 32 28


                    Runs six times faster:



                    > microbenchmark::microbenchmark(orig.fun=summarise_posterior(df1, X4, max(X4), min(X4)),
                    + new.fun=summarise_posterior2(df1, X4, max=max, min=min))
                    Unit: microseconds
                    expr min lq mean median uq max neval
                    orig.fun 4289.541 4324.490 4514.1634 4362.500 4411.225 8928.316 100
                    new.fun 716.071 734.694 802.9949 755.867 778.317 4759.439 100


                    Data



                    df1 <- data.frame(matrix(1:144, 9, 16))





                    share|improve this answer




















                    • Do you think it would be faster to put all those quantile calls into one? It seems like your approach (while a nice improvement - don't get me wrong) still is doing about 5x extra work due to repeated sorting right?
                      – jds
                      Nov 11 at 19:55










                    • Also will it work with group_by?
                      – jds
                      Nov 11 at 19:56












                    answered Nov 11 at 8:27









                    jay.sf





























                     







