Are over-dispersion tests in GLMs actually *useful*?
The phenomenon of 'over-dispersion' in a GLM arises whenever we use a model that restricts the variance of the response variable, and the data exhibit greater variance than the model restriction allows. This occurs commonly when modelling count data using a Poisson GLM, and it can be diagnosed by well-known tests. If tests show statistically significant evidence of over-dispersion then we usually generalise the model by using a broader family of distributions that frees the variance parameter from the restriction occurring under the original model. In the case of a Poisson GLM it is common to generalise either to a negative-binomial or quasi-Poisson GLM.
This situation is pregnant with an obvious objection. Why start with a Poisson GLM at all? One can start directly with the broader distributional forms, which have a (relatively) free variance parameter, and allow the variance parameter to be fit to the data, ignoring over-dispersion tests completely. In other situations when we are doing data analysis we almost always use distributional forms that allow freedom of at least the first two moments, so why make an exception here?
My Question: Is there any good reason to start with a distribution that fixes the variance (e.g., the Poisson distribution) and then perform an over-dispersion test? How does this procedure compare with skipping this exercise completely and going straight to the more general models (e.g., negative-binomial, quasi-Poisson, etc.)? In other words, why not always use a distribution with a free variance parameter?
overdispersion
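To make the diagnostic in the question concrete, here is a minimal sketch (not part of the original post) of one common over-dispersion check: the Pearson chi-square statistic divided by the residual degrees of freedom, which should be close to 1 when the Poisson variance restriction holds. Everything in it is an assumption chosen for illustration: the intercept-only model, the sample size, the mean of 4, the gamma shape of 2, and the helper names `rpois` and `pearson_dispersion`.

```python
import math
import random


def rpois(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's method (adequate for small lam)."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def pearson_dispersion(y, mu, n_params):
    """Pearson chi-square divided by the residual degrees of freedom."""
    chi2 = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    return chi2 / (len(y) - n_params)


def intercept_only_dispersion(y):
    """Dispersion estimate for an intercept-only Poisson GLM.

    The fitted mean of that model is simply the sample mean.
    """
    mu_hat = sum(y) / len(y)
    return pearson_dispersion(y, [mu_hat] * len(y), n_params=1)


rng = random.Random(1)
n, mean, shape = 2000, 4.0, 2.0  # illustrative values only

# Pure Poisson counts: variance equals the mean.
y_pois = [rpois(rng, mean) for _ in range(n)]

# Gamma-mixed Poisson (negative binomial) counts:
# variance = mean + mean**2 / shape, so the ratio is about 1 + mean / shape.
y_mix = [rpois(rng, rng.gammavariate(shape, mean / shape)) for _ in range(n)]

disp_pois = intercept_only_dispersion(y_pois)
disp_mix = intercept_only_dispersion(y_mix)
print(f"Poisson data:       dispersion = {disp_pois:.2f}")
print(f"gamma-mixed counts: dispersion = {disp_mix:.2f}")
```

With these settings the first ratio should land near 1 and the second well above it. A formal test would compare the Pearson statistic with a chi-square reference distribution, which is omitted here to keep the sketch dependency-free.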
My guess is that, if the underlying distribution truly is Poisson, then the result from the more general GLM will not have the well-known good properties, such as efficiency: the variance of the estimates will be greater than it would have been had the correct model been used. The estimates are probably not even unbiased or MLEs. But that's just my intuition and I could be wrong. I'd be curious what a good answer is.
– mlofton
2 hours ago
In my experience, testing for over-dispersion is (paradoxically) mainly of use when you know (from a knowledge of the data generation process) that over-dispersion can't be present. In this context, testing for over-dispersion tells you whether the linear model is picking up all the signal in the data. If it isn't, then adding more covariates to the model should be considered. If it is, then more covariates cannot help.
– Gordon Smyth
1 hour ago
@GordonSmyth: I think that's a good answer. If you don't want to turn that into its own answer, I'll fold it into mine.
– Cliff AB
1 hour ago
@CliffAB Feel free to incorporate my comment into your answer as I don't have time to compose a full answer myself.
– Gordon Smyth
42 mins ago
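Gordon Smyth's point above, that apparent over-dispersion can signal a missing covariate, can be illustrated with a small simulation. This is a sketch under stated assumptions rather than anything from the thread: the binary covariate `x`, the two group rates, and the helper names are all hypothetical. It uses the fact that a Poisson GLM with a single two-level factor has fitted values equal to the group means, so no fitting routine is needed.

```python
import math
import random


def rpois(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's method (adequate for small lam)."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def pearson_dispersion(y, mu, n_params):
    """Pearson chi-square divided by the residual degrees of freedom."""
    chi2 = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    return chi2 / (len(y) - n_params)


rng = random.Random(2)
n = 2000
x = [i % 2 for i in range(n)]   # hypothetical binary covariate
rate = {0: 2.0, 1: 8.0}         # hypothetical Poisson rate in each group
y = [rpois(rng, rate[xi]) for xi in x]  # exactly Poisson given x

# Model that omits x: every observation gets the overall mean as its fitted value.
overall_mean = sum(y) / n
disp_without_x = pearson_dispersion(y, [overall_mean] * n, n_params=1)

# Model that includes x as a factor: fitted values are the group means.
group_mean = {
    g: sum(yi for yi, xi in zip(y, x) if xi == g) / x.count(g) for g in (0, 1)
}
disp_with_x = pearson_dispersion(y, [group_mean[xi] for xi in x], n_params=2)

print(f"dispersion without x: {disp_without_x:.2f}")
print(f"dispersion with x:    {disp_with_x:.2f}")
```

The counts are Poisson conditional on `x`, yet the model that leaves `x` out should show a dispersion ratio well above 1, while the model that includes it should sit near 1. That is the sense in which the test checks whether the linear predictor has picked up all the signal.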
asked 2 hours ago by Ben
1 Answer
In principle, I actually agree that 99% of the time, it's better to just use the more flexible model. With that said, here are two and a half arguments for why you might not.
(1) Less flexible means more efficient estimates. Given that variance parameters tend to be less stable than mean parameters, assuming a fixed mean-variance relation may stabilize the standard errors.
(2) Model checking. I've worked with physicists who believe, on theoretical grounds, that various measurements can be described by Poisson distributions. If we reject the hypothesis that mean = variance, we have evidence against the Poisson distribution hypothesis. As pointed out in a comment by @GordonSmyth, if you have reason to believe that a given measurement should follow a Poisson distribution and you find evidence of over-dispersion, then you have evidence that you are missing important factors.
(2.5) Proper distribution. While negative binomial regression comes from a valid statistical distribution, it's my understanding that the Quasi-Poisson does not. That means you can't really simulate count data if you believe $\mathrm{Var}[y] = \alpha E[y]$ for $\alpha \neq 1$. That might be annoying for some use cases. Likewise, you can't use probabilities to test for outliers, etc.
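For reference, the mean-variance relations behind point (2.5) can be written side by side. This is a standard summary rather than part of the original answer. The symbols $\mu = E[y]$ and $\theta$ (the negative binomial shape parameter) are notation introduced here, while $\alpha$ is the dispersion factor used above.

$$
\begin{aligned}
\text{Poisson:} \quad & \mathrm{Var}[y] = \mu, \\
\text{Quasi-Poisson:} \quad & \mathrm{Var}[y] = \alpha \mu, \\
\text{Negative binomial:} \quad & \mathrm{Var}[y] = \mu + \mu^2 / \theta .
\end{aligned}
$$

The negative binomial relation follows from a genuine probability model, for example $y \mid \lambda \sim \text{Poisson}(\lambda)$ with $\lambda$ gamma-distributed with mean $\mu$ and shape $\theta$, which is why data can be simulated from it. The quasi-Poisson line, by contrast, specifies only the first two moments.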
On 2.5: There's of course negative binomial and GLMM with random effects that don't have that limitation.
– Björn
24 mins ago
@Björn: that's why it's only half an argument; it only applies to Quasi-Likelihood methods. As far as I know, there are no likelihood-based methods for under-dispersion, even though this can be analyzed with a Quasi-Likelihood model.
– Cliff AB
19 mins ago
answered 2 hours ago (edited 16 mins ago) by Cliff AB