Ask.Cyberinfrastructure

How do I use DMTCP to create a checkpoint and restart my program?

dmtcp
scheduler
qow

#1

I’m trying to run some code with Rscript, but the job is taking longer than the time allowed by the scheduler.
How can I use DMTCP to restart jobs that surpass the time limit?

Here is a very simplified version of my code:

my_function <- function(curr.seq, weights){
  
  # This function takes some time, which I simulate here with Sys.sleep()
  Sys.sleep(100)
  
  return( rnorm(4) )
}
  

#simulate input parameters
set.seed(12345)
N <- 10000
a.seq <- sapply(1:N, FUN=function(x){paste0(sample(c("A","C","G","T"), 2000, replace=TRUE), collapse="")})
weights <- c( 0.15,0.1,0.6,0.15)

#initialize matrix to be filled with computed values in the loop
result <- matrix(NA, nrow=N, ncol=4)
for ( i in 1:N ) {
  # here I perform a number of intermediate calculations 
  
  # call function that takes relatively long time to finish
  result[i,] <- my_function(a.seq[i],  weights)
  
  # Since the number of sequences this loop needs to go through is very large, 
  # I would like to add a DMTCP checkpoint here. How do I do this?
  
}

CURATOR: jpessin1


#2

Scratchpad answer (it has nothing to do with R, but it shows the basic DMTCP workflow). By default, the rand_vals program below uses a fixed seed and generates the same random number sequence each time it starts:

[renfro@gpunode004(job 159188) dmtcp]$ ls
rand_vals  rand_vals.cpp
[renfro@gpunode004(job 159188) dmtcp]$ ./rand_vals
x was 0, x is now 3499211612
x was 3499211612, x is now 581869302
^C
[renfro@gpunode004(job 159188) dmtcp]$ ./rand_vals
x was 0, x is now 3499211612
x was 3499211612, x is now 581869302
^C
[renfro@gpunode004(job 159188) dmtcp]$

If the program is launched with dmtcp_launch, it can checkpoint its memory at a specified interval (here, 2 seconds). Nothing looks any different on the first launch, since the program still starts from scratch:

[renfro@gpunode004(job 159188) dmtcp]$ dmtcp_launch --interval 2 ./rand_vals
x was 0, x is now 3499211612
x was 3499211612, x is now 581869302
...
x was 3922919429, x is now 949333985
x was 949333985, x is now 2715962298
^C
[renfro@gpunode004(job 159188) dmtcp]$

One difference now is that DMTCP has written the program's memory state to a .dmtcp checkpoint file in the working directory, along with restart scripts:

[renfro@gpunode004(job 159188) dmtcp]$ ls
ckpt_rand_vals_757608755a1fd79a-40000-447e04b492543.dmtcp     rand_vals
dmtcp_restart_script_757608755a1fd79a-40000-447e00e0b99a8.sh  rand_vals.cpp
dmtcp_restart_script.sh

If the program is restarted via dmtcp_restart, it loads the last saved state instead of starting from scratch:

[renfro@gpunode004(job 159188) dmtcp]$ dmtcp_restart --interval 2 ckpt_rand_vals_757608755a1fd79a-40000-447e04b492543.dmtcp
[41415] mtcp_restart.c:589 restorememoryareas:
  error restoring brk: 0
[40000] NOTE at processinfo.cpp:372 in restoreHeap; REASON='Area between saved_break and curr_break not mapped, mapping it now'
     _savedBrk = 6443008
     curBrk = 6467584
x was 949333985, x is now 2715962298
...
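
For the original question, the launch and restart steps above could be combined in a job script, so that each resubmission resumes from the last checkpoint instead of starting over. What follows is only a minimal sketch and has not been tried with R: the function name dmtcp_job_command, the one-hour interval, and my_script.R are hypothetical placeholders, and scheduler directives are left out because they depend on the site. It uses only the dmtcp_launch and dmtcp_restart invocations shown above, and it prints the command rather than running it.

```shell
#!/bin/sh
# Sketch only: print the DMTCP command a batch job should run.
# Assumptions (not from the posts above): the checkpoint directory,
# the interval, and the program command line are placeholders.

# Usage: dmtcp_job_command CKPT_DIR INTERVAL_SECONDS COMMAND [ARGS...]
dmtcp_job_command() {
    ckpt_dir=$1
    interval=$2
    shift 2
    images=""
    # Collect any checkpoint images left behind by an earlier run.
    for f in "$ckpt_dir"/ckpt_*.dmtcp; do
        if [ -e "$f" ]; then
            images="$images $f"
        fi
    done
    if [ -n "$images" ]; then
        # Checkpoint images exist: resume from them.
        echo "dmtcp_restart --interval $interval$images"
    else
        # No images yet: start the program from scratch under DMTCP.
        echo "dmtcp_launch --interval $interval $*"
    fi
}
```

Calling dmtcp_job_command . 3600 Rscript my_script.R in an empty directory prints a dmtcp_launch line; once ckpt_*.dmtcp files are present, it prints a dmtcp_restart line instead. A job script could run the printed command, and the job would then be resubmitted until the program finishes. Paths containing spaces would need more careful quoting than this sketch provides.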