how to measure correlation between two nonlinear timeseries datasets










I have two datasets with millions of y-values. (They are in chronological order so the X values have been omitted as they become merely an index.)



>>> print(f'{y_one[0:16]}...{y_one[-1]}')
[0, 1, 2, 3, 2, 1, 0, 1, 2, 3, 4, 5, 6, 5, 6, 5]...302

>>> print(f'{y_two[0:16]}...{y_two[-1]}')
[0, -1, 0, 1, 2, 1, 2, 1, 2, 3, 2, 1, 2, 3, 2, 1]...88


Both datasets are generated at random, and it should be assumed that they are only correlated with each other as often as chance would allow. (They're both essentially random walks, each step either adding one to or subtracting one from the previous y value.)
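For reference, here is a minimal sketch of the kind of data I'm describing; my real series come from elsewhere, so the generator below is purely illustrative:

>>> import numpy as np
>>> # Purely illustrative: two independent +/-1 random walks shaped like my data.
>>> rng = np.random.default_rng(0)
>>> n = 1_000_000
>>> y_one = np.cumsum(rng.choice([-1, 1], size=n))
>>> y_two = np.cumsum(rng.choice([-1, 1], size=n))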



>>> import matplotlib.pyplot as plt
>>> plt.plot(y_one[:80000])
>>> plt.show()

[plot of the first 80,000 points of y_one]



>>> plt.plot(y_two[:80000])
>>> plt.show()

[plot of the first 80,000 points of y_two]



My task is to analyze these two lines and determine whether they are correlated more than chance would allow. If they are correlated more often than random coin flips would suggest, then something has happened to make them correlated, and I need to trigger an event in response.



The thing is... I don't know where to start. As I've been reading up on correlation detection I have, of course, learned about the Pearson correlation coefficient, and about its weaknesses: it only reliably detects a correlation when the relationship is roughly linear, or at least monotonic.
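For what it's worth, computing that single overall coefficient is the easy part (a sketch, assuming scipy and the y_one/y_two arrays above):

>>> from scipy.stats import pearsonr
>>> # One global Pearson r over the full series; this is what I know how to do so far.
>>> r, p = pearsonr(y_one, y_two)
>>> print(r, p)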



As you can see, with both of these datasets being (supposedly) random walks, they aren't typically monotonic. The plots above show only the first 80k data points, and they're pretty much all over the board.



So Pearson alone isn't sufficient, especially because the series might be somewhat correlated for 10k observations (or some other stretch) and then drop that correlation altogether. A 10k stretch of correlation inside millions of data points means almost nothing to a single Pearson coefficient; it gets lost in the noise. But if there are more of those correlation 'moments' than there should be, I need to know about it. That's essentially what I'm looking for.



It's almost as if I need a 'calculus of Pearson correlation': sliding windows of every size, slid over the datasets, computing a Pearson coefficient for every possible run of x observations. If I did that and recorded my findings, I could weight each Pearson value by its window size and then average them to get an overall score. But with millions of observations, doing that explicitly is too computationally intensive: with 1 million observations there are 999,999 window sizes (the smallest of size 2, the largest of size 1 million), which works out to roughly 500,000,000,000 windowed Pearson coefficients.
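To make the sliding-window idea concrete, here is what it looks like for a single window size using pandas (a sketch; y_one/y_two as above). The blow-up comes from wanting every window size, not just one:

>>> import pandas as pd
>>> # Rolling Pearson correlation for one fixed window size w.
>>> w = 10_000
>>> s1, s2 = pd.Series(y_one), pd.Series(y_two)
>>> rolling_r = s1.rolling(w).corr(s2)   # Pearson r over each w-point window
>>> print(rolling_r.abs().max())         # strongest local (anti)correlation observed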



All of this is compounded by the fact that the vast majority of what I'm dealing with is noise. I'm convinced there must be a good way to measure the correlation between two datasets that is sophisticated enough to work within these constraints, but I don't know what it is or where to look. I'm trying to measure the slightest deviations from chance (of course there's a threshold of detail below which I don't care, but it's pretty close to chance).
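To be explicit about what I mean by 'deviations from chance': roughly, I'd like to compare whatever statistic I end up with against the same statistic computed on data where any real coupling has been destroyed, something like the crude shuffle baseline below. This is only a rough sketch; I'm aware shuffling also destroys the walk's autocorrelation, which is part of why I don't trust it:

>>> import numpy as np
>>> from scipy.stats import pearsonr
>>> # Compare the observed statistic against a crude shuffled-data null.
>>> # Caveat: permuting a random walk breaks its autocorrelation, so this
>>> # null is almost certainly too loose; it only illustrates the idea.
>>> obs, _ = pearsonr(y_one, y_two)
>>> rng = np.random.default_rng(1)
>>> null = [pearsonr(rng.permutation(y_one), y_two)[0] for _ in range(100)]
>>> print(abs(obs), np.percentile(np.abs(null), 95))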



Has anyone dealt with a problem like this before? What do you suggest? Am I looking in the wrong place? Perhaps instead of straight correlation I should fit each dataset to a function (a polynomial curve, say) and then compare the two functions. What do you think?



Any advice would be appreciated, thanks!



P.S. I'm working in Python, if you couldn't tell, but I'd be happy to get advice in any language on how to do this computation.










Tags: python, r, correlation






asked Nov 15 '18 at 7:26 by Legit Stack
  • I have voted to migrate this to Cross Validated, since it seems to be more a question about statistics than about R code. – Rui Barradas, Nov 15 '18 at 7:31

  • @RuiBarradas I've already asked this question over there... no answer. – Legit Stack, Nov 15 '18 at 7:32

  • The answer to this post might help you. – b-fg, Nov 15 '18 at 8:22

  • There's not much sense in calculating the correlation between two random walks, as per this question. If you are interested in the relation between two seemingly random walks, shouldn't that relation remain after taking the first difference of each series? – AkselA, Nov 15 '18 at 8:37
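(For reference, a minimal sketch of the first-differencing idea from the last comment, assuming numpy/scipy and the y_one/y_two arrays above: correlate the per-step increments rather than the cumulative walks.)

>>> import numpy as np
>>> from scipy.stats import pearsonr
>>> # Correlate the steps instead of the walks themselves.
>>> d_one = np.diff(y_one)
>>> d_two = np.diff(y_two)
>>> print(pearsonr(d_one, d_two))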












