Why is PCA a sub-optimal special case of mutual information maximisation?
$begingroup$
"A common theme in linear compression and feature extraction is to map a high dimensional vector $x$ to a lower dimensional vector $y=Wx$ such that the information in the vector $x$ is maximally preserved in $y$. Often PCA is applied for this purpose. However, the optimal setting for $W$ is in general not given by the widely used PCA. Actually, PCA is a sub-optimal special case of mutual information maximisation."
Can anyone elaborate on why PCA is a sub-optimal special case of mutual information maximisation?
probability optimization information-theory
$endgroup$
$begingroup$
The context of the quote is likely important here. Could you please link to the reference you draw this statement from?
$endgroup$
– stochasticboy321
Dec 3 '18 at 1:00
$begingroup$
@stochasticboy321 Thanks for the comment. It is from the paper "IM algorithm: a variational approach to Information Maximization".
$endgroup$
– SoManyProb_for_a_broken_heart.
Dec 9 '18 at 22:13
asked Dec 1 '18 at 15:53
SoManyProb_for_a_broken_heart.
1 Answer
$begingroup$
An easy example to explain: suppose you have two sets of functions, both heavily corrupted by high-energy noise, and you want to find which parts, subsets, or linear combinations of them correspond to each other the most.
PCA on its own selects the subspace along which the data has the highest L2 norm (the largest variance, once the data is centered). If the noise has a higher L2 norm than the functions of interest, PCA will select the noise rather than those functions. Yet independently sampled, uncorrelated noise has very low mutual information with just about anything deterministic of interest.
We will therefore do better with a method that focuses less on the norm of the signal itself and more on some statistical correspondence, for example cross-correlation or covariance.
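A minimal numerical sketch of this effect, under assumptions that are not part of the original setup: two hypothetical two-column views `A` and `B` share a unit-variance signal `s` in column 0 and carry independent noise with three times the standard deviation in column 1. The variable names, noise levels, and the use of a plain cross-covariance SVD as the "statistical correspondence" method are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Assumed toy setup: a unit-variance signal s shared by both views, plus
# independent noise with three times the standard deviation in each view.
s = rng.normal(0.0, 1.0, n)
noise_a = rng.normal(0.0, 3.0, n)
noise_b = rng.normal(0.0, 3.0, n)

# Column 0 carries the shared signal, column 1 carries view-specific noise.
A = np.column_stack([s, noise_a])
B = np.column_stack([s, noise_b])
A = A - A.mean(axis=0)
B = B - B.mean(axis=0)

# PCA on view A: eigenvector of the covariance with the largest eigenvalue.
cov_a = A.T @ A / (n - 1)
_, eigvecs = np.linalg.eigh(cov_a)  # eigh returns eigenvalues in ascending order
pca_dir = eigvecs[:, -1]

# Cross-covariance between the two views: leading left singular vector.
cross_cov = A.T @ B / (n - 1)
U, _, _ = np.linalg.svd(cross_cov)
xcov_dir = U[:, 0]


def abs_corr(u, v):
    """Absolute Pearson correlation between two 1-D arrays."""
    return abs(np.corrcoef(u, v)[0, 1])


# How well does each 1-D projection of view A track the shared signal?
pca_corr = abs_corr(A @ pca_dir, s)
xcov_corr = abs_corr(A @ xcov_dir, s)

print("PCA direction:             ", np.round(pca_dir, 3))
print("cross-covariance direction:", np.round(xcov_dir, 3))
print("|corr| with signal (PCA, cross-cov):", pca_corr, xcov_corr)
```

In this setup the PCA direction aligns with the high-variance noise column, while the cross-covariance direction aligns with the shared signal, because the noise in one view is independent of everything in the other view.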
$endgroup$
$begingroup$
Thank you very much for the reply. Could you elaborate on why "independently sampled uncorrelated noise will have very low mutual information"? Does that mean that if I have a high-energy noise and a low-energy signal, the mutual information between them would be low? If so, how can the mutual information term be used to help optimize $W$? Could you write it as an arg max formula? I really appreciate the explanation.
$endgroup$
– SoManyProb_for_a_broken_heart.
Dec 10 '18 at 21:58
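One way the requested arg max could be written, as a sketch under assumptions that the thread does not state: a noisy linear encoder $y = Wx + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2 I_k)$ independent of $x$, and $W \in \mathbb{R}^{k \times d}$ constrained to have orthonormal rows. The symbols $\epsilon$, $\sigma^2$, $k$, $d$, and $\Sigma_x$ (the covariance of $x$) are introduced here for illustration only.

$$
W^\star = \arg\max_{W:\; WW^\top = I_k} I(x; y), \qquad I(x; y) = H(y) - H(y \mid x) = H(y) - H(\epsilon).
$$

If $x$ is additionally assumed to be Gaussian, the mutual information has the closed form

$$
I(x; y) = \tfrac{1}{2} \log\det\!\left(I_k + \sigma^{-2}\, W \Sigma_x W^\top\right),
$$

which is maximised when the rows of $W$ span the $k$ leading eigenvectors of $\Sigma_x$, that is, the PCA subspace. For non-Gaussian $x$ this closed form does not hold in general, so the maximiser of $I(x; y)$ need not coincide with PCA. As for the noise: if a noise variable $n$ is independent of a signal $s$, then $p(n, s) = p(n)\,p(s)$ and

$$
I(n; s) = \iint p(n, s) \log \frac{p(n, s)}{p(n)\,p(s)} \, dn \, ds = 0,
$$

regardless of how much energy the noise carries.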
answered Dec 1 '18 at 16:11
mathreadler