Recall the definition and rationale of the RF model
Evaluate prediction errors with OOB errors
Suppose that I number the records in my data: \(1, 2, \ldots, n\).
Say \(n=6\).
I build a bootstrap sample:
\[(2, 1, 1, 3, 4, 3)\]
This sample has left out some records that were present in my original dataset (records 5 and 6 are missing).
Therefore, the fraction of original records that is missing in this FIXED bootstrap sample is
\[2/6=1/3\]
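A quick check in R (a minimal sketch; the sample is hard-coded to match the example above):

```r
s <- c(2, 1, 1, 3, 4, 3)     # the bootstrap sample above
n <- 6
missing <- setdiff(1:n, s)   # records that were never drawn: 5 and 6
length(missing) / n          # the missing fraction: 2/6 = 1/3
```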
If I build another bootstrap sample, this missing fraction, which I call \(MF\), may be different.
Here is another bootstrap sample:
\[(5, 6, 1, 3, 2, 4)\]
What is the value of \(MF\)? (Here, every record appears at least once, so \(MF = 0\).)
As you can see, \(MF\) is a random variable.
Let us ask the question: what is the distribution of this random variable?
I am interested in the chance that \(MF = p\), where \(p\) is a given proportion (missing fraction) in a random bootstrap sample.
OK, I will denote this:
\[f(x)=\mathbb{P}(MF=x)\]
So I am interested, ideally, in the calculation of \(f(x)\).
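Before doing any mathematics, we can approximate \(f(x)\) empirically by drawing many bootstrap samples and tabulating the observed values of \(MF\). Here is a minimal sketch in R (the number of replicates, \(10^5\), is an arbitrary choice):

```r
set.seed(1)                             # for reproducibility
n <- 6
mf <- replicate(1e5, {
  s <- sample(1:n, n, replace = TRUE)   # draw one bootstrap sample
  length(setdiff(1:n, s)) / n           # its missing fraction MF
})
table(mf) / length(mf)                  # empirical estimate of f(x)
```

The empirical average of these \(MF\) values already hints at the expectation we compute next.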
Let us be less demanding. I am only going to ask (the least I can do) what the average, or expectation, of this distribution is.
So I want to calculate \[p_m = \mathbb{E}(MF).\]
This is the expected missing fraction, or proportion of original data records missing on average over many independent draws of bootstrap samples.
Note that when we build a random forest, that is exactly what we do: we draw many (\(ntree\)) independent bootstrap samples.
OK. Let us calculate \(p_m\).
So the question is to characterize when a record is missing.
I formulate this event in the following manner:
Let us take any fixed record. Someone is choosing the record number for us, randomly. For example, it could be ‘record 5’.
I will call this record number \(i\).
I assume that my bootstrap sample has a length \(n\).
When is record \(i\) ABSENT (missing) from the bootstrap sample?
When it is not the first record of the bootstrap sample (bsr), AND it is not the second bsr, AND it is not at position 3 in the bsr, AND so on, AND it is not at position \(n\) in the bsr.
What is the probability of this event, \(A(i)\)?
All draws that choose a record to insert into the next position of the bootstrap sample are independent.
Therefore the probability of the CONJUNCTION of sub-events above is the product of the probabilities of each sub-event \(A(i,k)\): "\(i\) is not in the bsr at position \(k\)".
OK. What is this probability? It is the chance of drawing any record in the data (of \(n\) items) except record \(i\). So this chance is \(1 - 1/n\), as we have one chance in \(n\) to draw record \(i\) from the original data (because all records are uniformly weighted, i.e. equally likely to be drawn).
One more step: by linearity of expectation, the expected missing fraction \(p_m\) equals the probability that any fixed record is absent from the bootstrap sample. Therefore the expected fraction of missing data is:
\[p_m = (1-1/n)(1-1/n)\ldots (1-1/n)=\prod_{k=1}^n \left(1-\frac{1}{n}\right) = \left(1-\frac{1}{n}\right)^n\]
OK. This is complicated. So let me look at the limit of this expression when \(n\) is large; this will eliminate all the indices.
To do this, I take the logarithm:
\[\log(p_m) = n \log(1-1/n)\] But remember that \(\log(1+x)\) is very close to \(x\) when \(x\) is a small number.
Try it yourself, compare \(\log(1+x)\) and \(x\) for \(x=0.1,0.01,0.001\)…
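For example, a quick comparison in R:

```r
x <- c(0.1, 0.01, 0.001)
cbind(x, log_of_1_plus_x = log(1 + x))   # the two columns agree ever more closely
```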
If you wonder why this is the case, just think about the Taylor expansion of the function \(g(x)=\log(1+x)\) near the point \(x=0\).
OK. So, I can approximate my expression (with \(x=1/n\)) and I find:
\[\log(p_m) \approx n \times \left(-\frac{1}{n}\right) = -1\]
Therefore, taking the exponential on both sides:
\[p_m \approx e^{-1}\]
Calculate this and you will find \(e^{-1} \approx 0.368\), roughly one third of the data.
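You can verify both the limit and the quality of the finite-\(n\) approximation numerically; a minimal sketch in R:

```r
exp(-1)                        # 0.3678794...
n <- c(10, 100, 1000, 10000)
(1 - 1/n)^n                    # approaches exp(-1) as n grows
```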
Now let us move to the next idea.
So, while you may say "this is horrible: my bootstrap sample ignores a lot of the data that I have so patiently assembled", we have to make two points.
First, different bootstrap samples will typically ignore different subsets of the original data. Hence, provided your learning technique (model training) does not use a single or too few bootstrap samples, no GIVEN part of your dataset will be SYSTEMATICALLY ignored. In other words, many bootstrap samples together will still cover all your original data records.
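A quick simulation illustrates this point; a minimal sketch in R, where the sizes \(n = 100\) and \(B = 100\) are arbitrary choices:

```r
set.seed(2)
n <- 100; B <- 100                      # arbitrary data size and number of samples
seen <- rep(FALSE, n)                   # has record i appeared in any sample yet?
for (b in 1:B) {
  s <- sample(1:n, n, replace = TRUE)   # one bootstrap sample
  seen[s] <- TRUE
}
sum(!seen)                              # records never drawn: almost surely 0
```

Indeed, the chance that a given record is missed by all \(B\) independent samples is about \((e^{-1})^B = e^{-B}\), which is astronomically small.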
Second, missing records are an opportunity rather than a hindrance. We call them out-of-bag (or OOB) observations.
So, since we know, when we build each tree from a given bootstrap sample, which observations are OOB, why not do a prediction on the fly? The idea here is to say: look, I have not used this record (say \(i=3\)) during the construction/training of the present model (say the tree built on bootstrap replicate number \(b\)), so it is fair to treat it as a 'testing' record: I use all the predictors \(x_i\) of record \(i\), apply my model \(\tau_b\) to make a prediction, and compare the prediction to the truth, so that I have an idea of the prediction error made by my model.
This is almost correct but there is a flaw. We must do the prediction with the TEAM (the forest), and not with a single tree!
Therefore we slightly amend the method. We collect all the trees \(b \in B(i)\) that did not see record \(i\) as a training record, and we make an ensemble prediction over this sub-forest. This is equivalent to applying the random forest to a subset of its trees: the prediction is the response that collects the largest number of votes, as in RF.
So, for each record \(i\), we can check whether this OOB prediction matches the true response \(Y(i)\). By collecting (averaging) these errors over all records, we get a global measure of error for the overall RF model, which is called the OOB error rate.
To summarize, we have described the procedure proposed by Leo Breiman to evaluate the performance of a random forest right at fitting time. The idea is simple and very elegant: a single call to the randomForest fitting function also returns the OOB error rate, without any need for a separate validation set.
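For instance, in R with the randomForest package (a minimal sketch on the built-in iris data; the choice of ntree = 500 is arbitrary):

```r
library(randomForest)

set.seed(3)
fit <- randomForest(Species ~ ., data = iris, ntree = 500)
print(fit)                 # the printout includes the OOB estimate of the error rate
fit$err.rate[500, "OOB"]   # OOB error rate after all 500 trees have voted
```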