How to measure correlation between two nonlinear time-series datasets
I have two datasets with millions of y-values. (They are in chronological order, so the x-values have been omitted; they're merely an index.)
>>> print(f'{y_one[0:16]}...{y_one[-1]}')
[0, 1, 2, 3, 2, 1, 0, 1, 2, 3, 4, 5, 6, 5, 6, 5]...302
>>> print(f'{y_two[0:16]}...{y_two[-1]}')
[0, -1, 0, 1, 2, 1, 2, 1, 2, 3, 2, 1, 2, 3, 2, 1]...88
Both datasets are generated at random, and by default it should be assumed that they are correlated with each other no more often than chance would allow. (They're both essentially random walks: each value adds one to, or subtracts one from, the previous y-value.)
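(If you want something concrete to experiment with, here is a minimal sketch that produces walks of the same shape; it assumes independent ±1 steps and numpy, and only stands in for my real data source.)

import numpy as np

def make_walk(n, seed=None):
    # Each step is +1 or -1 with equal probability; the cumulative sum is the walk.
    rng = np.random.default_rng(seed)
    steps = rng.choice([-1, 1], size=n)
    return np.concatenate(([0], np.cumsum(steps)))  # start at 0, like the samples above

y_one = make_walk(1_000_000, seed=0)
y_two = make_walk(1_000_000, seed=1)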
>>> plt.plot(y_one[:80000])
>>> plt.show()
>>> plt.plot(y_two[:80000])
>>> plt.show()
My task is to analyze these two lines and determine whether they are correlated more than chance would allow. If they are correlated more than random coin flips would suggest, then something has happened to make them correlated and I need to trigger an event in response.
The thing is... I don't know where to start. As I've been reading up on correlation detection I have, of course, learned about the Pearson correlation coefficient. I've also learned about its main weakness: it assumes a roughly linear (or at least monotonic) relationship in order to accurately detect a correlation.
As you can see, with both of these datasets being (supposedly) random walks, they aren't typically monotonic. The plots above show only the first 80k data points, and they're pretty much all over the place.
So Pearson alone isn't sufficient, especially because the series might be somewhat correlated for 10k observations (or some other stretch) and then drop their correlation altogether. That 10k stretch of correlation inside millions of data points means nothing to a single global Pearson coefficient; it gets lost in the noise. But if there are more of those correlated 'moments' than there should be, I need to know about it - that's really what I'm looking for.
It's almost as if I need a 'calculus of Pearson correlation': sliding windows of every size, slid over both datasets, with a Pearson coefficient calculated for every possible window length and position. If I did that and recorded my findings, I could weight each Pearson value by its window size and average them to get an overall score. But with millions of observations, doing that explicitly is too computationally intensive: with 1 million observations there are 999,999 window sizes (the smallest of size 2, the largest of size 1 million), which works out to roughly 500,000,000,000 Pearson correlation coefficients.
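For what it's worth, here is the kind of thing I've been sketching so far: a rolling Pearson over a few window sizes with pandas. Each pass is linear in the length of the series, so a handful of sizes is cheap; the specific window sizes below are just placeholders, and y_one/y_two are assumed to be the full arrays from above.

import numpy as np
import pandas as pd

def rolling_pearson(a, b, window):
    # Pearson correlation over every contiguous window of the given size,
    # computed in a single linear pass by pandas.
    return pd.Series(a).rolling(window).corr(pd.Series(b))

for w in (100, 1_000, 10_000):          # placeholder window sizes
    r = rolling_pearson(y_one, y_two, w)
    print(w, np.nanmax(np.abs(r.values)))

The trouble is that even independent random walks produce stretches of very high rolling correlation, so I still don't know how to turn these numbers into 'more than chance would allow'.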
All of this is compounded by the fact that the vast majority of what I'm dealing with is noise. But I know there has to be a good way to measure exactly how correlated two datasets are, sophisticated enough to work under these constraints; I just don't know what it is or where to look. I'm trying to measure the slightest deviations from chance (of course there's a threshold of detail below which I don't care, but it's pretty close to chance).
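One way I can imagine quantifying 'more than chance' is a Monte Carlo null: compute whatever summary statistic I settle on (say, the maximum absolute rolling correlation from the sketch above) on many pairs of freshly simulated, genuinely independent walks, and see where my real pair falls in that distribution. A rough sketch; the window size and number of trials are arbitrary choices of mine:

import numpy as np
import pandas as pd

def max_rolling_corr(a, b, window=1_000):
    r = pd.Series(a).rolling(window).corr(pd.Series(b))
    return np.nanmax(np.abs(r.values))

def simulate_null(n, trials=200, window=1_000, seed=0):
    # Distribution of the statistic for pairs of independent +/-1 random walks.
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(trials):
        a = np.cumsum(rng.choice([-1, 1], size=n))
        b = np.cumsum(rng.choice([-1, 1], size=n))
        out.append(max_rolling_corr(a, b, window))
    return np.array(out)

observed = max_rolling_corr(y_one, y_two)
null = simulate_null(len(y_one))
p_value = (null >= observed).mean()   # fraction of chance runs at least this extreme

But I'm not sure this is the right statistic, or whether brute simulation is even feasible at this scale.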
Has anyone dealt with a problem like this before? What do you suggest? Am I looking in the wrong place? Perhaps, instead of straight correlation, I should fit these data points to a function (a polynomial curve) and then compare the two functions. What do you think?
Any advice would be appreciated, thanks!
P.S. I'm working in Python, if you couldn't tell, but I'd be happy to get advice in any language for how to do this computation.
python r correlation
asked Nov 15 '18 at 7:26 by Legit Stack
I have voted to migrate this to Cross Validated, since it seems to be more a question about statistics than about R code. – Rui Barradas, Nov 15 '18 at 7:31
@RuiBarradas I've already asked this question over there... no answer. – Legit Stack, Nov 15 '18 at 7:32
The answer to this post might help you. – b-fg, Nov 15 '18 at 8:22
There's not much sense in calculating the correlation between two random walks, as per this question. If you are interested in the relation between two seemingly random walks, shouldn't that relation remain after taking the first difference of each series? – AkselA, Nov 15 '18 at 8:37
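For concreteness, here is what I understand AkselA's 'take the first difference of each series' suggestion to look like; a minimal sketch, assuming y_one and y_two are the full series as arrays, with scipy's pearsonr as one standard way to get a correlation plus a p-value:

import numpy as np
from scipy import stats

# First differences recover the +/-1 step sequences, which should be i.i.d.
# if the walks are truly random -- Pearson is well behaved on these.
d_one = np.diff(np.asarray(y_one))
d_two = np.diff(np.asarray(y_two))

r, p = stats.pearsonr(d_one, d_two)
print(f'r={r:.4f}, p={p:.3g}')   # for independent walks, r should hover near 0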