R and Simmer: Performance boost on large data frames

I have my own data frame of actual events/tasks, and I use the simmer R package to simulate how many tasks can be completed when different resources are available. My simulation runs very fast for up to 120,000 rows in the data frame.

rm(list=ls())
library(dplyr)
library(simmer)
library(simmer.plot)

load("task_df.RDATA")

working_hours <- 7.8
productivity <- 0.7
no.employees <- 292

SIM_TIME <- round((working_hours*productivity*60), 0)+1

# Build the resource names employee_1 ... employee_292 in one vectorized call
employees <- paste("employee", seq_len(no.employees), sep = "_")


taskTraj <- trajectory(name = "task simulation") %>%
 simmer::select(resources = employees, policy = "shortest-queue") %>%
 seize_selected(amount = 1) %>%
 timeout_from_attribute("duration") %>%
 release_selected(amount = 1)


arrivals_gen <- simmer() 

# The simmer environment is modified in place, so no reassignment is needed here
for (i in seq_len(no.employees)) arrivals_gen %>%
 add_resource(paste("employee", i, sep="_"), capacity = 1) 
 

ptm <- proc.time()

arrivals_gen <- arrivals_gen %>%
 add_dataframe("Task_", taskTraj, task_df, mon = 2, col_time = "time", time = "absolute", col_priority="priority") %>%
 run(SIM_TIME)

proc.time() - ptm

But my data frame task_df contains 350k rows, and that is the point where my simulation takes much longer.

head(task_df, n = 50)

workload_shift task_id duration priority time
1 20180403 68347632 3 2.502 0
2 20180403 68151881 10 24.478 0
3 20180403 68069718 3 0.724 0
4 20180403 68345621 4 2.226 0
5 20180403 68508858 3 36.062 0
6 20180403 66148996 3 9.421 0
7 20180403 68565066 2 24.478 0
8 20180403 68005344 3 7.910 0
9 20180403 55979902 3 3.732 0
10 20180403 66452138 2 2.502 0
11 20180403 68051869 10 2.226 0
12 20180403 68561364 10 3.584 0
13 20180403 59292591 3 2.138 0
14 20180403 68415657 10 2.853 0
15 20180403 66848400 3 2.290 0
16 20180403 68454851 10 6.167 0
17 20180403 68361846 10 11.688 0
18 20180403 68572723 2 6.259 0
19 20180403 68520328 2 24.478 0
20 20180403 68500955 10 1.855 0
21 20180403 67000753 3 219.751 0
22 20180403 68487613 3 8.131 0
23 20180403 68333674 4 5.263 0
24 20180403 66423486 3 2.290 0
25 20180403 68241616 5 1.470 0
26 20180403 68415001 4 3.584 0
27 20180403 67487967 3 2.636 0
28 20180403 68494771 10 6.259 0
29 20180403 67673981 10 2.226 0
30 20180403 68355727 3 2.613 0
31 20180403 36942995 3 0.590 0
32 20180403 66633446 3 5.968 0
33 20180403 68461510 2 24.478 0
34 20180403 67126138 3 0.357 0
35 20180403 68485682 3 8.131 0
36 20180403 67852953 10 2.290 0
37 20180403 68150106 10 6.259 0
38 20180403 67833053 10 4.114 0
39 20180403 67816673 3 6.259 0
40 20180403 68041431 5 2.502 0
41 20180403 66283761 5 2.502 0
42 20180403 68543314 2 26.302 0
43 20180403 68492843 3 2.290 0
44 20180403 68556960 4 2.853 0
45 20180403 66885335 3 5.975 0
46 20180403 66249231 5 2.636 0
47 20180403 68242565 12 1.470 0
48 20180403 68530355 2 2.290 0
49 20180403 66683717 5 5.705 0
50 20180403 67802538 4 0.864 0

user system elapsed
76.745 0.039 76.717

user system elapsed
608.443 0.270 608.186

Is there a way to speed up my simulation? I use simmer 4.1.0 and Rcpp 1.0.0. Memory doesn't seem to be an issue.
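One way to check whether the run time really grows faster than the number of rows is to time the same workload on subsets of increasing size. The following is only a sketch: `time_by_rows()`, `run_once` and `dummy_run()` are hypothetical names that do not appear in the code above, and `dummy_run()` is a trivial stand-in so that the example runs without simmer.

```r
# Hypothetical helper: time a function on the first n rows of a data frame
# for several values of n and report elapsed seconds per 1000 rows.
time_by_rows <- function(df, sizes, run_once) {
  elapsed <- vapply(sizes, function(n) {
    sub_df <- df[seq_len(n), , drop = FALSE]
    unname(system.time(run_once(sub_df))["elapsed"])
  }, numeric(1))
  data.frame(rows = sizes,
             elapsed = elapsed,
             sec_per_1000_rows = elapsed / sizes * 1000)
}

# Stand-in workload (assumption); replace it with a function that builds
# the simulation environment for the given rows and calls run().
dummy_run <- function(d) invisible(sum(d$duration))

demo_df <- data.frame(duration = runif(10000), time = 0)
timings <- time_by_rows(demo_df, c(1000, 5000, 10000), dummy_run)
print(timings)
```

If the time per 1000 rows stays roughly constant as the row count grows, the scaling is close to linear; a clearly rising value points to super-linear behaviour.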

edited Nov 14 '18 at 13:08

asked Nov 13 '18 at 14:18

MCR90

Based on your code above, I tried dataframes with 100k and 1M observations (with random data) and I see no performance issues (i.e., 1M takes x10 the time of 100k rows, as expected). Could you provide a reproducible example?

– Iñaki Úcar
Nov 14 '18 at 9:29

@IñakiÚcar Thanks for your fast reply. I have updated my code snippet above to give a reproducible example.

– MCR90
Nov 14 '18 at 13:09
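For readers who want to experiment without the original `task_df.RDATA`, a table of the same shape could be generated roughly as follows. This is a sketch under stated assumptions: `make_task_df()` is a hypothetical name, the column names mirror the printed `head()` output, and which columns are stored as integers is a guess rather than something the post states.

```r
# Hypothetical generator for a table shaped like task_df.
# Column types are an assumption: ids, duration and time are stored as
# integers here, priority as double.
make_task_df <- function(n, seed = 1) {
  set.seed(seed)
  data.frame(
    workload_shift = rep(20180403L, n),
    task_id        = sample.int(70000000L, n, replace = TRUE),
    duration       = sample.int(10L, n, replace = TRUE),
    priority       = round(runif(n, 0.3, 40), 3),
    time           = rep(0L, n)
  )
}

small <- make_task_df(1000)
str(small)
```

Calling `make_task_df()` with larger `n` gives tables of any size for timing experiments; the values themselves are random and carry no meaning.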

c++ r simulation

1 Answer

I took your table and simply replicated it to build 100k and 400k datasets, and I confirm the issue: the execution time is not linear.

Internally, attributes are always double, so there are lots of conversions, row by row, which apparently take most of the execution time (!). Try converting your table before feeding it into simmer. Using dplyr,

task_df <- mutate_all(task_df, as.double)

The simulation should be much faster, and the execution time should grow more or less linearly with the number of rows. It's evident why so many casts degrade performance, though I'm not sure why they make the execution time non-linear.

Anyway, in future releases, we may want to apply this automatically, so that the user doesn't have to bother about these performance issues.
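To see which columns would actually be affected, it can help to inspect the storage types before and after the conversion. The sketch below uses base R only; the two-row `task_df` is a stand-in built from the first rows shown in the question, and treating some columns as integer is an assumption made for illustration.

```r
# Small stand-in for task_df; the integer columns are an assumption.
task_df <- data.frame(
  workload_shift = c(20180403L, 20180403L),
  task_id        = c(68347632L, 68151881L),
  duration       = c(3L, 10L),
  priority       = c(2.502, 24.478),
  time           = c(0L, 0L)
)

# Storage type of each column before conversion.
print(vapply(task_df, typeof, character(1)))

# Base R counterpart of mutate_all(task_df, as.double):
# assigning into task_df[] keeps the data frame structure and column names.
task_df[] <- lapply(task_df, as.double)

# Storage type of each column after conversion.
print(vapply(task_df, typeof, character(1)))
```

Any column reported as "integer" in the first printout is one that would need a cast for every row; after the conversion all columns are stored as doubles.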

answered Nov 15 '18 at 13:32

Iñaki Úcar

Thank you! It worked very well and it seems to be linear!

– MCR90
Nov 15 '18 at 16:45
