Is it ever recommended to use mean/multiple imputation when using tree-based predictive models?
$begingroup$
Every time I build a predictive model and have missing data, I impute categorical variables with something like "UNKNOWN" and numerical variables with some absurd number that will never be seen in practice (even if the variable is unbounded, I can take the exponential of the variable and make the unknown values negative).
The main advantage is that the model knows that the value is missing, which is not the case for, say, mean imputation. I can see that this could be disastrous in linear models or neural networks, but tree-based models handle it really smoothly.
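A minimal sketch of the two strategies contrasted above, using only the Python standard library. The `incomes` and `regions` values and the `SENTINEL` constant are made up for illustration, and the sketch assumes income is never negative, so that -1 is an "impossible" value.

```python
from statistics import mean

# Toy data (hypothetical values); None marks a missing entry.
incomes = [42_000.0, None, 58_000.0, None, 120_000.0]
regions = ["north", None, "south", "north", None]

observed = [x for x in incomes if x is not None]

# Mean imputation: missing rows become indistinguishable from an average earner.
mean_imputed = [mean(observed) if x is None else x for x in incomes]

# Sentinel imputation: an out-of-range value keeps the missing rows separable.
SENTINEL = -1.0  # assumption: real incomes are never negative
sentinel_imputed = [SENTINEL if x is None else x for x in incomes]
regions_imputed = ["UNKNOWN" if r is None else r for r in regions]

# A single threshold below the observed minimum isolates exactly the missing rows,
# which is the kind of split a tree can find on its own.
threshold = (SENTINEL + min(observed)) / 2
recovered_mask = [x <= threshold for x in sentinel_imputed]
print(recovered_mask)
```

After mean imputation no threshold on the income column can tell imputed rows from genuinely average ones, whereas the sentinel column still carries the missingness information.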
I know that there is a great deal of literature on missing data imputation, but when and why would I ever use these methods to handle missing data in predictive (tree-based) models?
missing-data cart boosting data-imputation multiple-imputation
$endgroup$
 $begingroup$
 Imputing a large number for numeric data could be very bad for tree-based models. Think of it this way: if your split is on income, say at 100k, then everyone whose income was missing ends up on the same side of the split as the high-income earners.
 $endgroup$
 – astel
 2 hours ago
 
 
 
 
 
 
 
 
 
 $begingroup$
 The model will be fitted with the imputed values as well, so if those rows differ significantly from people with truly high income, the tree should make a further split separating true high income from fake high (missing) income. If variability inside the tree node is low, then there is not much to worry about.
 $endgroup$
 – gsmafra
 2 hours ago
 
 
 
asked 2 hours ago by gsmafra
 
 
 
 
 
 
1 Answer
$begingroup$
One reason you may not want to use "impute impossible value" methods is that your predictive model then works only conditional on the distribution of missingness remaining unchanged. Thus, if after building your tree model you realize that certain features can be collected more often, you can no longer use the model built with the "impute impossible value" method without retraining it.
In fact, this problem is compounded if the rates of missingness change during the data collection process itself. Then, even immediately after the model is built, it is already "out of date", as the current rates of missingness will differ from the rates when the data were collected.
To illustrate the issue, let's suppose a bank is building a database to help predict whether clients will default on a loan. Early in the data collection process, loan officers have the option to conduct a background investigation, but they almost never do for clients they deem trustworthy. Thus, for the especially trustworthy customers, the background check variable is almost always missing. If you use the "impute impossible value" method, the model learns that having a possible (observed) value for the background check indicates high risk.
If background check rates don't change at all, then this "impute impossible value" method will likely still provide valid predictions. However, let's suppose the bank realizes that background checks are really helpful for assessing risk, so it changes its policy to include background checks for everyone. Then everyone will have a possible value for the background check and, using the "impute impossible value" method, everyone will be flagged as "high risk".
Cross-validation will not catch this issue, as the missingness distribution will be the same in the training and test sets. So even though the "impute impossible value" method may produce pretty results during cross-validation, it will lead to poor predictions upon deployment!
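The bank scenario can be sketched as a small simulation. Everything below is an assumption made for illustration and does not come from a real dataset: the base rates, the check probabilities, the hypothetical helpers `make_clients`, `fit_stump` and `predict`, and the deliberately uninformative check score. A hand-rolled one-split tree stands in for a full tree-based model.

```python
import random

SENTINEL = -1.0  # "impossible" value; real check scores lie in [0, 1)

def make_clients(n, check_everyone, rng):
    """Simulate clients; the feature is a background-check score or SENTINEL."""
    xs, ys = [], []
    for _ in range(n):
        risky = rng.random() < 0.3  # hypothetical base rate
        default = rng.random() < (0.9 if risky else 0.05)
        # Old policy: officers almost only check clients they distrust.
        checked = check_everyone or rng.random() < (0.9 if risky else 0.02)
        score = rng.random()  # deliberately uninformative score
        xs.append(score if checked else SENTINEL)
        ys.append(int(default))
    return xs, ys

def fit_stump(xs, ys, thresholds):
    """One-split tree: pick the threshold with the fewest training errors."""
    best = None
    for t in thresholds:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        # Each leaf predicts its majority class.
        pred_l = int(2 * sum(left) > len(left))
        pred_r = int(2 * sum(right) > len(right))
        errors = sum(y != pred_l for y in left) + sum(y != pred_r for y in right)
        if best is None or errors < best[0]:
            best = (errors, t, pred_l, pred_r)
    return best[1:]

def predict(stump, xs):
    t, pred_l, pred_r = stump
    return [pred_l if x <= t else pred_r for x in xs]

def accuracy(preds, ys):
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)

rng = random.Random(0)
train_x, train_y = make_clients(5000, check_everyone=False, rng=rng)
stump = fit_stump(train_x, train_y, [-0.5] + [i / 10 for i in range(1, 10)])

# Held-out data under the old policy looks fine; the new policy does not.
old_x, old_y = make_clients(5000, check_everyone=False, rng=rng)
new_x, new_y = make_clients(5000, check_everyone=True, rng=rng)
acc_old = accuracy(predict(stump, old_x), old_y)
acc_new = accuracy(predict(stump, new_x), new_y)
flag_rate_new = sum(predict(stump, new_x)) / len(new_x)
print(f"split at {stump[0]}: accuracy old={acc_old:.2f}, new={acc_new:.2f}, "
      f"flagged under new policy={flag_rate_new:.2f}")
```

Because the score carries no signal here, the only thing the split can learn is "observed versus missing", so once every client is checked every client lands in the high-risk leaf.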
Note that you will essentially need to throw away all your data every time your data collection policy changes! Alternatively, if you can correctly impute the missing values and their uncertainty, you can still use the data that was collected under the old policy.
$endgroup$
 
 
 
 
 
 
 $begingroup$
 That's a good point: imputation could be more robust to changes in the way data is missing. I will take your statement on throwing away past data as an exaggeration though - including a time variable and retraining the model should be enough to make it usable again.
 $endgroup$
 – gsmafra
 1 hour ago
 
 
 
 
 
 
 
 
 
 $begingroup$
 @gsmafra: In general, I don't think adding a time variable will fix the problem. For example, in a random forest, the time variable will only be included in 1/3 of the trees, so it won't even be included in the majority of the decision trees in your random forest.
 $endgroup$
 – Cliff AB
 55 mins ago
 
 
 
 
 
 
 
 
 
 $begingroup$
 To be clear, I don't think you should throw out your data... but I'd only advise doing "impossible value imputation" on variables you don't think will be very predictive to start with, or where you're fairly certain that the missingness distribution is stable.
 $endgroup$
 – Cliff AB
 54 mins ago
 
 
 
 
 
answered 2 hours ago by Cliff AB (edited 1 hour ago)
 
 
 
 
 
 