All Levels of a Factor in a Model Matrix in R

Question

I have a data.frame consisting of numeric and factor variables as seen below.

testFrame <- data.frame(First=sample(1:10, 20, replace=TRUE),
           Second=sample(1:20, 20, replace=TRUE), Third=sample(1:10, 20, replace=TRUE),
           Fourth=rep(c("Alice","Bob","Charlie","David"), 5),
           Fifth=rep(c("Edward","Frank","Georgia","Hank","Isaac"),4),
           stringsAsFactors=TRUE)  # needed in R >= 4.0.0, where strings are no longer converted to factors by default

I want to build a matrix that assigns dummy variables to the factors and leaves the numeric variables alone.

model.matrix(~ First + Second + Third + Fourth + Fifth, data=testFrame)

As expected, and just as when running lm, this leaves out one level of each factor as the reference level. However, I want to build a matrix with a dummy/indicator variable for every level of every factor. I am building this matrix for glmnet, so I am not worried about multicollinearity.

Is there a way to have model.matrix create the dummy for every level of the factor?

Gavin Simpson · Accepted Answer · 2014-01-06 16:59:48Z

72

(Trying to redeem myself...) In response to Jared's comment on @fabians's answer about automating it: all you need to supply is a named list of contrast matrices. contrasts() takes a factor and returns its contrast matrix; with contrasts = FALSE it returns an identity matrix with one column per level. We can therefore use lapply() to run contrasts() on each factor in the data set, e.g. for the testFrame example provided:

> lapply(testFrame[,4:5], contrasts, contrasts = FALSE)
$Fourth
        Alice Bob Charlie David
Alice       1   0       0     0
Bob         0   1       0     0
Charlie     0   0       1     0
David       0   0       0     1

$Fifth
        Edward Frank Georgia Hank Isaac
Edward       1     0       0    0     0
Frank        0     1       0    0     0
Georgia      0     0       1    0     0
Hank         0     0       0    1     0
Isaac        0     0       0    0     1

This slots nicely into @fabians's answer:

model.matrix(~ ., data=testFrame, 
             contrasts.arg = lapply(testFrame[,4:5], contrasts, contrasts=FALSE))

edited Jan 6, 2014 at 16:59

answered Dec 31, 2010 at 9:26

Gavin Simpson


26

+1. nice. you can automate it even more by replacing 4:5 with sapply(testFrame, is.factor)
– fabians
Dec 31, 2010 at 18:05
Great solution for automation. Between the two of you my question has been answered perfectly, so I'm not sure whose answer should get the mark as the "Accepted Answer." I want you both to get credit.
– Jared
Jan 2, 2011 at 2:48
8

@Jared: @fabians was the answer you were looking for, so he should get the credit - my contribution was just a little bit of sugar on top.
– Gavin Simpson
Jan 2, 2011 at 10:27
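
Putting the answer and the comment together, here is a minimal sketch of the fully automated version. It assumes a cut-down copy of the question's testFrame with an arbitrary seed, and it sets stringsAsFactors = TRUE because R 4.0.0 and later no longer convert strings to factors by default. The names isFactor and fullContrasts are only illustrative.

```r
# Toy data mirroring the question's testFrame (the seed is an arbitrary choice)
set.seed(1)
testFrame <- data.frame(
  First  = sample(1:10, 20, replace = TRUE),
  Fourth = rep(c("Alice", "Bob", "Charlie", "David"), 5),
  Fifth  = rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4),
  stringsAsFactors = TRUE  # assumption: R >= 4.0.0 needs this to get factors
)

# Logical index of the factor columns, instead of hard-coding 4:5
isFactor <- sapply(testFrame, is.factor)

# Named list of identity ("no contrasts") matrices, one per factor
fullContrasts <- lapply(testFrame[, isFactor, drop = FALSE],
                        contrasts, contrasts = FALSE)

# Model matrix with one indicator column per level of every factor
X <- model.matrix(~ ., data = testFrame, contrasts.arg = fullContrasts)
colnames(X)
```

The drop = FALSE guard keeps the subset a data frame even when only one column is a factor, so lapply() still iterates over columns and the list stays named.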


fabians · Accepted Answer · 2010-12-30 09:38:21Z

55

You need to reset the contrasts for the factor variables:

model.matrix(~ Fourth + Fifth, data=testFrame, 
        contrasts.arg=list(Fourth=contrasts(testFrame$Fourth, contrasts=F), 
                Fifth=contrasts(testFrame$Fifth, contrasts=F)))

or, with a little less typing, but without the level names in the column labels (diag() returns a matrix without dimnames):

model.matrix(~ Fourth + Fifth, data=testFrame, 
    contrasts.arg=list(Fourth=diag(nlevels(testFrame$Fourth)), 
            Fifth=diag(nlevels(testFrame$Fifth))))
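
If the level names matter, one possible workaround for the diag() variant is to attach the levels as dimnames before passing the matrix in. This is only a sketch on a single toy factor; idMat is an illustrative name, not something from the answer.

```r
# Single toy factor taken from the question's data
d <- data.frame(Fourth = factor(rep(c("Alice", "Bob", "Charlie", "David"), 5)))

# Identity matrix with the factor levels attached as row and column names
idMat <- diag(nlevels(d$Fourth))
dimnames(idMat) <- list(levels(d$Fourth), levels(d$Fourth))

X <- model.matrix(~ Fourth, data = d, contrasts.arg = list(Fourth = idMat))
colnames(X)
```

With the dimnames in place, the column labels carry the level names, as they do with contrasts(..., contrasts = FALSE).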

answered Dec 30, 2010 at 9:38

fabians


14

That completely worked and I'll take that answer, but if I'm entering in 20 factors is there a way to universally do that for all variables in a frame or am I destined to typing way too much?
– Jared
Dec 31, 2010 at 0:16


Pablo Casas · Accepted Answer · 2016-12-28 18:08:50Z

caret implemented a nice function dummyVars to achieve this with 2 lines:

library(caret)
dmy <- dummyVars(" ~ .", data = testFrame)
testFrame2 <- data.frame(predict(dmy, newdata = testFrame))

Checking the final columns:

colnames(testFrame2)

"First"  "Second"         "Third"          "Fourth.Alice"   "Fourth.Bob"     "Fourth.Charlie" "Fourth.David"   "Fifth.Edward"   "Fifth.Frank"   "Fifth.Georgia"  "Fifth.Hank"     "Fifth.Isaac"

The nicest point here is that you get the original data frame back, with the factor columns replaced by their dummy variables.

More info: http://amunategui.github.io/dummyVar-Walkthrough/

Sagar Jauhari · Accepted Answer · 2013-03-14 02:29:10Z

12

dummyVars from caret could also be used. http://caret.r-forge.r-project.org/preprocess.html

answered Mar 14, 2013 at 2:29

Sagar Jauhari


Seems nice, but doesn't include an intercept and I can't seem to force it to.
– Jared
Mar 14, 2013 at 17:06
2

@jared: It works for me. Example: require(caret); (df <- data.frame(x1=c('a','b'), x2=1:2)); dummies <- dummyVars(x2~ ., data = df); predict(dummies, newdata = df)
– Andrew
Dec 30, 2015 at 23:00
1

@Jared no need for intercept when you have a dummy variable for every level of the factor.
– Will Townes
Mar 30, 2016 at 0:50
1

@Jared: This adds an intercept column: require(caret); (df <- data.frame(x1=c('a','b'), x2=1:2)); dummies <- dummyVars(x2~ ., data = df); predict(dummies, newdata = df); cbind(1, predict(dummies, newdata = df))
– MYaseen208
Nov 10, 2017 at 7:58
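
The point about the intercept being redundant can be checked in base R. The sketch below uses model.matrix() rather than caret, on a single toy factor: the indicator columns of a fully dummy-coded factor add up to the intercept column, so the matrix loses one rank.

```r
# Single toy factor taken from the question's data
d <- data.frame(Fourth = factor(rep(c("Alice", "Bob", "Charlie", "David"), 5)))

# Intercept plus one indicator column per level
X <- model.matrix(~ Fourth, data = d,
                  contrasts.arg = list(Fourth = contrasts(d$Fourth, contrasts = FALSE)))

# Each row has exactly one active indicator, so the dummies sum to the intercept
dummySum <- rowSums(X[, -1])
all(dummySum == X[, "(Intercept)"])

# Rank is one less than the number of columns
c(rank = qr(X)$rank, columns = ncol(X))
```

A penalised fit such as glmnet tolerates this redundancy, which is why the question is not concerned about it; an unpenalised lm would not be able to estimate all of these coefficients.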


user36302 · Accepted Answer · 2014-07-24 18:11:57Z

Ok. Just reading the above and putting it all together. Suppose you wanted the matrix e.g. 'X.factors' that multiplies by your coefficient vector to get your linear predictor. There are still a couple extra steps:

X.factors = 
  model.matrix( ~ ., data=X, contrasts.arg = 
    lapply(data.frame(X[,sapply(data.frame(X), is.factor)]),
                                             contrasts, contrasts = FALSE))

(Note that you need to turn X[*] back into a data frame in case you have only one factor column.)

Then say you get something like this:

attr(X.factors,"assign")
[1]  0  1  **2**  2  **3**  3  3  **4**  4  4  5  6  7  8  9 10 #emphasis added

We want to get rid of the starred (**) reference-level column of each factor:

att = attr(X.factors,"assign")
factor.columns = unique(att[duplicated(att)])
unwanted.columns = match(factor.columns,att)
X.factors = X.factors[,-unwanted.columns]
X.factors = (data.matrix(X.factors))
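
To see what the "assign" trick does, here is a small self-contained run on toy data resembling the question's testFrame (seed and column subset are arbitrary assumptions). It builds the full-dummy matrix, drops the first column of every multi-column term, and compares the result with the default treatment-contrast matrix.

```r
set.seed(1)
X <- data.frame(
  First  = sample(1:10, 20, replace = TRUE),
  Fourth = rep(c("Alice", "Bob", "Charlie", "David"), 5),
  Fifth  = rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4),
  stringsAsFactors = TRUE
)

# Full-dummy matrix: every level of every factor gets a column
full <- model.matrix(~ ., data = X,
  contrasts.arg = lapply(X[, sapply(X, is.factor), drop = FALSE],
                         contrasts, contrasts = FALSE))

# "assign" maps each column to its term; terms that occur more than once are factors
att <- attr(full, "assign")
factor.terms <- unique(att[duplicated(att)])

# Position of the first column of each factor term, i.e. its first level
first.cols <- match(factor.terms, att)
reduced <- full[, -first.cols]

# For comparison: the default matrix with treatment contrasts
default <- model.matrix(~ ., data = X)
identical(colnames(reduced), colnames(default))
```

Note that this shortcut treats any term spanning several columns as a factor, so a numeric term such as poly(x, 2) would need separate handling.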

BTW why is this not built in to base R? It seems like you'd need it every time you run a simulation. — user36302, Jul 24, 2014 at 18:14

shosaco · Accepted Answer · 2019-02-16 09:43:12Z

A tidyverse answer:

library(dplyr)
library(tidyr)
result <- testFrame %>% 
    mutate(one = 1) %>% spread(Fourth, one, fill = 0, sep = "") %>% 
    mutate(one = 1) %>% spread(Fifth, one, fill = 0, sep = "")

yields the desired result (same as @Gavin Simpson's answer):

> head(result, 6)
  First Second Third FourthAlice FourthBob FourthCharlie FourthDavid FifthEdward FifthFrank FifthGeorgia FifthHank FifthIsaac
1     1      5     4           0         0             1           0           0          1            0         0          0
2     1     14    10           0         0             0           1           0          0            1         0          0
3     2      2     9           0         1             0           0           1          0            0         0          0
4     2      5     4           0         0             0           1           0          1            0         0          0
5     2     13     5           0         0             1           0           1          0            0         0          0
6     2     15     7           1         0             0           0           1          0            0         0          0

asdf123 · Accepted Answer · 2016-09-14 01:56:17Z

Using the R package 'CatEncoders'

library(CatEncoders)
testFrame <- data.frame(First=sample(1:10, 20, replace=T),
           Second=sample(1:20, 20, replace=T), Third=sample(1:10, 20, replace=T),
           Fourth=rep(c("Alice","Bob","Charlie","David"), 5),
           Fifth=rep(c("Edward","Frank","Georgia","Hank","Isaac"),4))

fit <- OneHotEncoder.fit(testFrame)

z <- transform(fit,testFrame,sparse=TRUE) # give the sparse output
z <- transform(fit,testFrame,sparse=FALSE) # give the dense output

Mankind_2000 · Accepted Answer · 2018-06-24 07:13:34Z

2

I am currently learning the lasso model with glmnet::cv.glmnet(), model.matrix() and Matrix::sparse.model.matrix() (for high-dimensional matrices, model.matrix() is too slow, as suggested by the author of glmnet).

Just sharing a tidy way to get the same result as @fabians's and @Gavin's answers. @asdf123 also introduced another package, CatEncoders.

> require('useful')
> # always use all levels
> build.x(First ~ Second + Fourth + Fifth, data = testFrame, contrasts = FALSE)
> 
> # just use all levels for Fourth
> build.x(First ~ Second + Fourth + Fifth, data = testFrame, contrasts = c(Fourth = FALSE, Fifth = TRUE))

Source: R for Everyone: Advanced Analytics and Graphics (page 273)
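
For the sparse route mentioned at the start of this answer, the same contrasts.arg idea appears to carry over to Matrix::sparse.model.matrix(). The following is a sketch under that assumption, using toy data resembling the question's testFrame (arbitrary seed); it is not taken from the book or from the useful package.

```r
library(Matrix)  # recommended package, shipped with standard R installations

set.seed(1)
testFrame <- data.frame(
  First  = sample(1:10, 20, replace = TRUE),
  Fourth = rep(c("Alice", "Bob", "Charlie", "David"), 5),
  Fifth  = rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4),
  stringsAsFactors = TRUE
)

# Identity contrasts for every factor column
fullContrasts <- lapply(testFrame[, sapply(testFrame, is.factor), drop = FALSE],
                        contrasts, contrasts = FALSE)

# Sparse model matrix with an indicator column for every level
Xs <- sparse.model.matrix(~ ., data = testFrame, contrasts.arg = fullContrasts)
class(Xs)
dim(Xs)
```

The sparse result can be passed to glmnet functions as the x argument in the same way as a dense matrix.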

edited Jun 24, 2018 at 7:13

Mankind_2000


answered Jan 15, 2017 at 17:59

Rγσ ξηg Lιαη Ημ 雷欧


Thanks for the answer. The funny thing is, the build.x function was written by me and made possible by the answers from @fabians and @gavin! And that's my book! So cool this came full circle. Thanks for reading!
– Jared
Feb 17, 2019 at 6:32


Gregor Thomas · Accepted Answer · 2019-07-11 20:22:17Z

2

model.matrix(~ First + Second + Third + Fourth + Fifth - 1, data=testFrame)

or

model.matrix(~ First + Second + Third + Fourth + Fifth + 0, data=testFrame)

should be the most straightforward approach.

edited Jul 11, 2019 at 20:22

Gregor Thomas


answered Sep 4, 2015 at 8:05

Federico Rotolo


This will work well if there is only one factor, but if there are multiple factors there will still be reference levels omitted.
– Gregor Thomas
Jul 11, 2019 at 20:22
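
The limitation described in the comment is easy to reproduce. In this small sketch with the two factors from the question, removing the intercept gives the first factor all of its levels, while the second factor still loses its reference level.

```r
d <- data.frame(
  Fourth = factor(rep(c("Alice", "Bob", "Charlie", "David"), 5)),
  Fifth  = factor(rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4))
)

# No intercept: Fourth is fully dummy-coded, Fifth still uses treatment contrasts
X <- model.matrix(~ Fourth + Fifth - 1, data = d)
colnames(X)

# "FifthEdward" is absent because Edward remains the reference level of Fifth
"FifthEdward" %in% colnames(X)
```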


Ben2018 · Accepted Answer · 2021-08-11 19:28:22Z

I wrote a package called ModelMatrixModel to improve the functionality of model.matrix(). By default, the ModelMatrixModel() function in the package returns an object containing a sparse matrix with dummy variables for all levels, which is suitable as input to cv.glmnet() in the glmnet package. Importantly, the returned object also stores the transformation parameters, such as the factor level information, which can then be applied to new data. The function can handle most items in an R formula, such as poly() and interactions. It also offers several other options, such as handling invalid factor levels and scaling the output.

#devtools::install_github("xinyongtian/R_ModelMatrixModel")
library(ModelMatrixModel)
testFrame <- data.frame(First=sample(1:10, 20, replace=T),
                        Second=sample(1:20, 20, replace=T), Third=sample(1:10, 20, replace=T),
                        Fourth=rep(c("Alice","Bob","Charlie","David"), 5))
newdata=data.frame(First=sample(1:10, 2, replace=T),
                   Second=sample(1:20, 2, replace=T), Third=sample(1:10, 2, replace=T),
                   Fourth=c("Bob","Charlie"))
mm=ModelMatrixModel(~First+Second+Fourth, data = testFrame)
class(mm)
## [1] "ModelMatrixModel"
class(mm$x) #default output is sparse matrix
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
data.frame(as.matrix(head(mm$x,2)))
##   First Second FourthAlice FourthBob FourthCharlie FourthDavid
## 1     7     17           1         0             0           0
## 2     9      7           0         1             0           0

#apply the same transformation to new data, note the dummy variables for 'Fourth' includes the levels not appearing in new data     
mm_new=predict(mm,newdata)
data.frame(as.matrix(head(mm_new$x,2))) 
##   First Second FourthAlice FourthBob FourthCharlie FourthDavid
## 1     6      3           0         1             0           0
## 2     2     12           0         0             1           0

Paul · Accepted Answer · 2020-03-27 02:15:15Z

You can use tidyverse to achieve this without specifying each column manually.

The trick is to make a "long" dataframe.

Then, munge a few things, and spread it back to wide to create the indicators/dummy variables.

Code:

library(tidyverse)

## add index variable for pivoting
testFrame$id <- 1:nrow(testFrame)

testFrame %>%
    ## pivot to "long" format
    gather(feature, value, -id) %>%
    ## add indicator value
    mutate(indicator=1) %>%
    ## create feature name that unites a feature and its value
    unite(feature, value, col="feature_value", sep="_") %>%
    ## convert to wide format, filling missing values with zero
    spread(feature_value, indicator, fill=0)

The output:

   id Fifth_Edward Fifth_Frank Fifth_Georgia Fifth_Hank Fifth_Isaac First_2 First_3 First_4 ...
1   1            1           0             0          0           0       0       0       0
2   2            0           1             0          0           0       0       0       0
3   3            0           0             1          0           0       0       0       0
4   4            0           0             0          1           0       0       0       0
5   5            0           0             0          0           1       0       0       0
6   6            1           0             0          0           0       0       0       0
7   7            0           1             0          0           0       0       1       0
8   8            0           0             1          0           0       1       0       0
9   9            0           0             0          1           0       0       0       0
10 10            0           0             0          0           1       0       0       0
11 11            1           0             0          0           0       0       0       0
12 12            0           1             0          0           0       0       0       0
...
