89

When building a model in R, how do you save the model specification so that you can reuse it on new data? Let's say I build a logistic regression on historical data but won't have new observations until next month. What's the best approach?

Things that I have considered:

  • Saving the model object and loading in a new session
  • I know that some models can be exported with PMML, but haven't really seen anything about importing PMML

Simply, I am trying to get a sense of what you do when you need to use your model in a new session.

4
  • Well, you can always "save" a model formula, and provide updated data in data argument... assuming that I understood you correctly...
    – aL3xa
    Feb 25, 2011 at 14:20
  • Hmm, what do you mean by re-use? Predict for the new observations or update the model fit to use the new observations plus the old ones? Feb 25, 2011 at 14:24
  • @Gavin. I want to use the model that I developed to predict new values on data that I do not have yet and might not have for some time.
    – Btibert3
    Feb 25, 2011 at 14:58
  • 1
    @Bitbert3 OK, then the opening section of my answer is what I would do. Saving the model object out to disk is more than acceptable, but it is important to save the R code/script used to generate the model in the first place so that your research/modelling is reproducible. Feb 25, 2011 at 15:09

2 Answers

151

Reusing a model to predict for new observations

If the model is not computationally costly, I tend to document the entire model building process in an R script that I rerun when needed. If a random element is involved in the model fitting, I make sure to set a known random seed.

If the model is computationally costly to fit, then I still use a script as above, but save the model object to an .rda file using save(). I then tend to modify the script so that it loads the saved object if it exists and refits the model if it does not, using a simple if()...else clause wrapped around the relevant parts of the code.

When loading your saved model object, be sure to reload any required packages, although in your case if the logit model were fit via glm() there will not be any additional packages to load beyond R.

Here is an example:

> set.seed(345)
> df <- data.frame(x = rnorm(20))
> df <- transform(df, y = 5 + (2.3 * x) + rnorm(20))
> ## model
> m1 <- lm(y ~ x, data = df)
> ## save this model
> save(m1, file = "my_model1.rda")
> 
> ## a month later, new observations are available: 
> newdf <- data.frame(x = rnorm(20))
> ## load the model
> load("my_model1.rda")
> ## predict for the new `x`s in `newdf`
> predict(m1, newdata = newdf)
        1         2         3         4         5         6 
6.1370366 6.5631503 2.9808845 5.2464261 4.6651015 3.4475255 
        7         8         9        10        11        12 
6.7961764 5.3592901 3.3691800 9.2506653 4.7562096 3.9067537 
       13        14        15        16        17        18 
2.0423691 2.4764664 3.7308918 6.9999064 2.0081902 0.3256407 
       19        20 
5.4247548 2.6906722 

To automate this, I would probably do the following in a script:

## data
df <- data.frame(x = rnorm(20))
df <- transform(df, y = 5 + (2.3 * x) + rnorm(20))

## check if model exists? If not, refit:
if(file.exists("my_model1.rda")) {
    ## load model
    load("my_model1.rda")
} else {
    ## (re)fit the model
    m1 <- lm(y ~ x, data = df)
    ## save it so the next run can load it instead of refitting
    save(m1, file = "my_model1.rda")
}

## predict for new observations
## new observations
newdf <- data.frame(x = rnorm(20))
## predict
predict(m1, newdata = newdf)

Of course, the data generation code would be replaced by code loading your actual data.
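Since the question is about a logistic regression, here is a minimal sketch of the same save-then-predict workflow with glm(). It is an illustration under assumptions, not part of the original example: the data are simulated, the object and file names (hist_df, logit_fit, logit_fit.rds) are made up, and it uses saveRDS()/readRDS() as an alternative to the save()/load() pair shown above.

```r
## Simulated data standing in for the historical observations (assumption)
set.seed(42)
hist_df <- data.frame(x = rnorm(100))
hist_df$y <- rbinom(100, size = 1, prob = plogis(-0.5 + 1.2 * hist_df$x))

## Fit the logistic regression and write the fitted object to disk
logit_fit <- glm(y ~ x, family = binomial, data = hist_df)
model_file <- file.path(tempdir(), "logit_fit.rds")  # hypothetical location
saveRDS(logit_fit, model_file)

## Later, in a new session: read the model back (any object name will do)
restored <- readRDS(model_file)
new_obs <- data.frame(x = c(-1, 0, 1))

## type = "response" returns predicted probabilities rather than log-odds
p_new <- predict(restored, newdata = new_obs, type = "response")
p_new
```

Unlike load(), readRDS() returns the object instead of restoring it under its original name, so the name used in the new session is up to you.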

Updating a previously fitted model with new observations

If you want to refit the model using additional new observations, then update() is a useful function. All it does is refit the model with one or more of the model arguments updated. If you want to include new observations in the data used to fit the model, add the new observations to the data frame passed to argument 'data', and then do the following:

m2 <- update(m1, . ~ ., data = df)

where m1 is the original, saved model fit, . ~ . specifies the changes to the model formula, which in this case means include all existing variables on both the left and right hand sides of ~ (in other words, make no changes to the model formula), and df is the data frame used to fit the original model, expanded to include the newly available observations.

Here is a working example:

> set.seed(123)
> df <- data.frame(x = rnorm(20))
> df <- transform(df, y = 5 + (2.3 * x) + rnorm(20))
> ## model
> m1 <- lm(y ~ x, data = df)
> m1

Call:
lm(formula = y ~ x, data = df)

Coefficients:
(Intercept)            x  
      4.960        2.222  

> 
> ## new observations
> newdf <- data.frame(x = rnorm(20))
> newdf <- transform(newdf, y = 5 + (2.3 * x) + rnorm(20))
> ## add on to df
> df <- rbind(df, newdf)
> 
> ## update model fit
> m2 <- update(m1, . ~ ., data = df)
> m2

Call:
lm(formula = y ~ x, data = df)

Coefficients:
(Intercept)            x  
      4.928        2.187

Others have mentioned formula() in the comments, which extracts the formula from a fitted model:

> formula(m1)
y ~ x
> ## which can be used to set-up a new model call
> ## so an alternative to update() above is:
> m3 <- lm(formula(m1), data = df)

However, if the model fitting involves additional arguments, such as 'family' or 'subset' in more complex model fitting functions, then those arguments have to be specified again by hand when you reuse only the formula. If update() methods are available for your model fitting function (which they are for many common fitting functions, like glm()), update() provides a simpler way to refit a model than extracting and reusing the model formula.
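To make the difference concrete, here is a small sketch with a binomial glm(). The data are simulated and the object names (bdf, extra, g1 to g4) are invented for the illustration. It shows that update() keeps 'family' from the original call, while a call built from formula() alone falls back to the glm() default.

```r
set.seed(1)
bdf <- data.frame(x = rnorm(50))
bdf$y <- rbinom(50, size = 1, prob = plogis(0.3 + 0.8 * bdf$x))
g1 <- glm(y ~ x, family = binomial, data = bdf)

## new observations appended to the original data frame
extra <- data.frame(x = rnorm(50))
extra$y <- rbinom(50, size = 1, prob = plogis(0.3 + 0.8 * extra$x))
bdf <- rbind(bdf, extra)

## update() re-evaluates the original call, so family = binomial is retained
g2 <- update(g1, . ~ ., data = bdf)
family(g2)$family

## reusing only the formula drops 'family'; glm() then uses its default
g3 <- glm(formula(g1), data = bdf)
family(g3)$family

## the family has to be supplied again to reproduce the update() fit
g4 <- glm(formula(g1), family = family(g1), data = bdf)
```

Here g2 and g4 are the same logistic regression, whereas g3 is silently a different (gaussian) model.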

If you intend to do all the modelling and future prediction in R, there doesn't really seem much point in abstracting the model out via PMML or similar.

6
  • 1
    +1 and if you would kindly resist from editing your answers to fit in whatever answer I was preparing... ;-)
    – Joris Meys
    Feb 25, 2011 at 15:22
  • 1
    @Joris ain't precognition a bitch! ;-) +1 for update from me Feb 25, 2011 at 15:38
  • 1
    This is a really great answer. I hope someone curates the SO [r] answers like this one and puts them together as a tutorial.
    – JD Long
    Feb 25, 2011 at 20:12
  • 1
    Excellent answer. Thanks for the examples you have given.
    – nhern121
    Nov 16, 2012 at 21:15
  • 1
    Exactly what I was looking for. I want to do +1000... Thank you
    – Adjeiinfo
    Aug 25, 2015 at 4:45
8

If you keep the same names for the data frame and the variables, you can (at least for lm() and glm()) use the function update() on the saved model:

Df <- data.frame(X=1:10,Y=(1:10)+rnorm(10))

model <- lm(Y~X,data=Df)
model

Df <- rbind(Df,data.frame(X=2:11,Y=(10:1)+rnorm(10)))

update(model)

This is of course without any preparation of the data and so forth. It just reuses the model specifications that were set. Be aware that if you change the contrasts in the meantime, the new model gets updated with the new contrasts, not the old ones.
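The point about contrasts can be seen in a small sketch. The data frame cdf and the models f1 and f2 are made up for the illustration, and the global contrasts option is changed between the two fits on purpose.

```r
## remember the current options so they can be restored afterwards
old_opts <- options(contrasts = c("contr.treatment", "contr.poly"))

cdf <- data.frame(g = factor(rep(c("a", "b", "c"), each = 4)),
                  y = as.numeric(1:12))
f1 <- lm(y ~ g, data = cdf)

## the contrasts are changed "in the meantime"
options(contrasts = c("contr.sum", "contr.poly"))

## update() refits using whatever contrasts are in force now
f2 <- update(f1)

names(coef(f1))
names(coef(f2))

## put the original options back
options(old_opts)
```

The fitted values of f1 and f2 agree, but the coefficients are parameterised differently, so they can no longer be compared one to one.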

So the use of a script is in most cases the better answer. One could include all steps in a convenience function that just takes the dataframe, so you can source the script and then use the function on any new dataset. See also the answer of Gavin for that.
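A convenience function of that kind might look like the following sketch. The function name fit_my_model and the preparation step are hypothetical; in practice the body would hold whatever data preparation and model call your own script uses.

```r
## Hypothetical helper: all preparation and fitting steps live in one function
fit_my_model <- function(data) {
    ## data preparation goes here; as an example, drop incomplete rows
    data <- data[complete.cases(data[, c("X", "Y")]), ]
    lm(Y ~ X, data = data)
}

set.seed(10)
Df <- data.frame(X = 1:10, Y = (1:10) + rnorm(10))
model_old <- fit_my_model(Df)

## a new dataset with the same variable names, any data frame name
Df_new <- rbind(Df, data.frame(X = 11:20, Y = (11:20) + rnorm(10)))
model_new <- fit_my_model(Df_new)
coef(model_new)
```

Because the function only depends on the variable names inside the data frame, it does not matter what the data frame itself is called, which avoids the naming requirement of a bare update(model).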
