What is a simple way to check if two git repositories are unrelated?












For example, let's assume we cloned the following repositories:



  • https://github.com/spring-petclinic/spring-framework-petclinic

  • https://github.com/spring-projects/spring-petclinic

How can I check that one doesn't share history with the other?



How can I check whether one shares only part of its history with the other? How can I browse the common DAG and view the differences?



Note: Git allows shallow clones, and repository histories can diverge over time...































  • Before you go dumpster diving, is there any way that you could ask the maintainers of the two projects if they have anything in common? I mean, if one is a fork of the other, I would expect this to be known.
    – Tim Biegeleisen
    Nov 11 at 10:52











  • You could compare the SHA-1 of the first commit of each repository. If it's not the same, they are not related. If it's the same, they are (or were) related, though they may have diverged a lot since, which is something you would need to find a way to evaluate...
    – Philippe
    Nov 11 at 12:05










  • Here, we could see that they are related...
    – Philippe
    Nov 11 at 12:12










  • @Philippe As I wrote, one repository can be a shallow copy of another, so the first commit is not a reliable source.
    – gavenkoa
    Nov 11 at 13:16















git






asked Nov 11 at 10:44









gavenkoa

















3 Answers













You can clone the first one, then add the second one as an additional remote:



git clone https://github.com/spring-petclinic/spring-framework-petclinic
cd spring-framework-petclinic
git remote add other https://github.com/spring-projects/spring-petclinic
git fetch --all


Then you can browse both DAGs:



git log --graph --all --oneline --decorate


And see whether they have any common history by looking at the merge base of the two trunks:



git merge-base origin/master other/master
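
If you want a yes/no answer rather than a hash to eyeball, the exit status of git merge-base can be tested directly. This is only a minimal sketch: share_history is a made-up helper name, and it assumes both refs already exist in the current repository (for instance after the fetch shown earlier). Note that it only compares the two refs you pass in, so other branches could still share history.

```shell
# share_history is a hypothetical helper name, not a git command.
# Usage: share_history <ref1> <ref2>, run inside a repository that has both refs.
share_history() {
    # git merge-base exits non-zero when it finds no common ancestor
    # (it also fails if one of the refs cannot be resolved).
    if base=$(git merge-base "$1" "$2" 2>/dev/null); then
        echo "related: common ancestor $base"
    else
        echo "no common ancestor found"
        return 1
    fi
}

# Example call with the remote-tracking branches used above:
# share_history origin/master other/master
```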



























  • How can I clean repository after git fetch --all other?
    – gavenkoa
    Nov 11 at 13:19






  • You can simply remove the remote with git remote rm other; this removes its references. If you want to clean up the objects now, before they are gc'ed at some point in the future, you can force a gc: git -c gc.reflogExpire=now gc --prune=all (WARNING: this will remove ALL stale objects)
    – zigarn
    Nov 12 at 10:10






























It's easy to prove that two repositories are related: if they contain matching commits (commits with the same hash IDs and contents, although "same hash IDs" is usually sufficient; see footnote 1), they are related.



As you note, it's much harder to prove that they are unrelated unless both are complete (non-shallow, non-single-branch) clones. If both are complete, yet neither has any commit in common with the other, the two repositories are unrelated.
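
One part of "complete" can be checked mechanically. The sketch below assumes Git 2.15 or newer, where git rev-parse accepts --is-shallow-repository; is_not_shallow is a made-up helper name. It does not detect single-branch clones, so treat a positive result as necessary rather than sufficient.

```shell
# is_not_shallow is a hypothetical helper name, not a git command.
# Usage: is_not_shallow <path-to-repository>
is_not_shallow() {
    # rev-parse prints "false" for a full clone and "true" for a shallow one
    # (assumes Git 2.15+, which introduced --is-shallow-repository).
    [ "$(git -C "$1" rev-parse --is-shallow-repository)" = "false" ]
}

# Example:
# is_not_shallow spring-petclinic && echo "no shallow boundary"
```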



If you have both repositories and have verified that both are complete, simply enumerate all the commit hash IDs in both repositories and look for common IDs. If common IDs exist, the two are probably related. To enumerate all the commit hash IDs, run git rev-list --all (redirect the output to a file, or to a program that reads both sets of output and checks for common hash IDs).



See footnote 1 for eliminating "probably", but the TL;DR is that for now, any two identical IDs means shared history.




Footnote 1: Given a uniform hash function h(k) whose range has size r = |{h(k)}|, the probability of an accidental collision of hashes for two distinct keys k1, k2 is p = 1/r. The probability of uniqueness is the complement of this: p̄ = 1 - (1/r). Generalizing to n keys and using the two-term Taylor expansion e^x ≈ 1 + x for x ≪ 1, we get p̄ ≈ e^(-n(n-1)/(2r)), as long as r is reasonably large.
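
For anyone who wants the intermediate steps, the generalization to n keys can be written out as below. This is the standard birthday-problem argument, filled in here as a sketch rather than quoted from anywhere: the i-th new key must avoid the i hashes already taken, and each factor is replaced by its exponential approximation.

```latex
\begin{align*}
\bar{p} &= \prod_{i=1}^{n-1}\left(1 - \frac{i}{r}\right)
  \approx \prod_{i=1}^{n-1} e^{-i/r}
  = \exp\!\left(-\frac{1}{r}\sum_{i=1}^{n-1} i\right)
  = e^{-n(n-1)/(2r)}, \\
p &= 1 - \bar{p} \approx 1 - e^{-n(n-1)/(2r)}.
\end{align*}
```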



Git's hash function is currently SHA-1, which has a pretty uniform distribution and has r = 2^160 = 1461501637330902918203684832716283019655932542976. That is certainly large enough, which means we can use the approximation.



Hence, if you sum up the total number of hashes n and plug it in to the formula:



r=1461501637330902918203684832716283019655932542976
1 - exp(-n*(n-1)/(2*r))


(remember, we want p, not p-bar),
you get the probability of a collision. To check for an actual collision, of course, you can just compare the underlying actual objects: if the hashes match, compare the objects directly to detect a collision. But it's extremely unlikely in the first place. If we take two repositories that, together, contain ten million commits, we compute:



$ bc -l
r=2^160
n=10*1000*1000
scale=100
1 - e(-n*(n-1)/(2*r))
.0000000000000000000000000000000000342113848680412753525884397196522
895097282878872708411144841034243


which as you can see is still pretty tiny. It's not until we get to:



n=10*1000*1000*1000*1000*1000*1000*1000


(ten sextillion objects, using the short scale notation)



that we find:



1 - e(-n*(n-1)/(2*r))
.0000342108030863093209851036344159518189002166758764416221121344549
079733424124497666779807175655625


a noticeable chance of accidental collision, at about 0.0034%. At 100 sextillion objects we are up to a 0.34% chance:



n=100*1000*1000*1000*1000*1000*1000*1000
1 - e(-n*(n-1)/(2*r))
.0034152934013810288444649336362559390942558976421984312395097770719
923433072593638116228277476790795


and by 1 septillion objects we're running some serious risks:



1 - e(-n*(n-1)/(2*r))
.2897326871923714506502211457721853341644126909116947422293621066225
555385326652788789421475224989232


Fortunately, well before then, we've run out of disk space. :-) Also, the Git guys are thinking of moving to one of the SHA-256 hashes, which will raise r to 2^256, which helps out our denominator.



(I'm using bc above, in which ^ is exponentiation and the -l library adds e(x) to compute e^x.)


























  • Great! But you forgot to add a practical example: the output of git rev-list --all from each repository can be sorted and passed to comm -12 repo1.revs repo2.revs ))
    – gavenkoa
    Nov 12 at 0:00






  • Well, that assumes bash, sort, and comm. I figured I'd leave that part as an exercise, in case you did not have those available. :-)
    – torek
    Nov 12 at 0:18






























To find any common revisions:



comm -12 <(cd repo1; git rev-list --all | sort) <(cd repo2; git rev-list --all | sort)


In my case both repositories share commits:



bash# (cd spring-petclinic; git rev-list --all) | wc -l
670

bash# (cd spring-framework-petclinic; git rev-list --all) | wc -l
571

bash# comm -12 <(cd spring-framework-petclinic/; git rev-list --all | sort) <(cd spring-petclinic; git rev-list --all | sort) | wc -l
427


















































    edited Nov 12 at 10:13

























    answered Nov 11 at 11:21









    zigarn












    • How can I clean repository after git fetch --all other?
      – gavenkoa
      Nov 11 at 13:19






    • 1




      You can simply remove the remote: git remote rm other, this will remove the references, if you want to clean the object before they are gc'ed in the future, you can force a gc: git -c gc.reflogExpire=now gc --prune=all (WARNING: this will remove ALL stale objects)
      – zigarn
      Nov 12 at 10:10
















    • How can I clean repository after git fetch --all other?
      – gavenkoa
      Nov 11 at 13:19






    • 1




      You can simply remove the remote: git remote rm other, this will remove the references, if you want to clean the object before they are gc'ed in the future, you can force a gc: git -c gc.reflogExpire=now gc --prune=all (WARNING: this will remove ALL stale objects)
      – zigarn
      Nov 12 at 10:10















    How can I clean repository after git fetch --all other?
    – gavenkoa
    Nov 11 at 13:19




    How can I clean repository after git fetch --all other?
    – gavenkoa
    Nov 11 at 13:19




    1




    1




    You can simply remove the remote: git remote rm other, this will remove the references, if you want to clean the object before they are gc'ed in the future, you can force a gc: git -c gc.reflogExpire=now gc --prune=all (WARNING: this will remove ALL stale objects)
    – zigarn
    Nov 12 at 10:10




    You can simply remove the remote: git remote rm other, this will remove the references, if you want to clean the object before they are gc'ed in the future, you can force a gc: git -c gc.reflogExpire=now gc --prune=all (WARNING: this will remove ALL stale objects)
    – zigarn
    Nov 12 at 10:10












    up vote
    1
    down vote













    It's easy to prove that two repositories are related: if they contain matching commits—commits with the same hash IDs and contents, although "same hash IDs" is usually sufficient1—they are related.



    As you note, it's much harder to prove that they are un-related unless they are both complete (non-shallow, non-single-branch) clones. If both are complete, yet neither has any commit in common with the other, the two repositories are unrelated.



    If you have both repositories and verify that both are complete, simply enumerate all the commit hash IDs in both repositories and look for common IDs. If common IDs exist, the two are probably related. To enumerate all the commit hash IDs, run git rev-list --all (redirect output to file or to program that reads both sets of outputs and checks for common hash IDs).



    See footnote 1 for eliminating "probably", but the TL;DR is that for now, any two identical IDs means shared history.




    1Given a uniform hash function h(k) whose range is r = |{h(k)|, the probability of an accidental collision of hashes for two distinct keys k1, k2 is p = 1/r. The probability of uniqueness is the complement of this is p̄ = 1 - (1/r). Generalizing to n keys and using a two-term Taylor expansion of ex ≈ 1 + x for x ≪ 1, we get p̄ ≈ e(-n(n-1)) / 2r, as long as r is reasonably large.



    Git's hash function is currently SHA1, which has a pretty uniform distribution and has r = 2160 = 1461501637330902918203684832716283019655932542976. This satisfies our formula, which means we can use the approximation.



    Hence, if you sum up the total number of hashes n and plug it in to the formula:



    r=1461501637330902918203684832716283019655932542976
    1 - exp(-n*(n-1)/(2*r))


    (remember, we want p, not p-bar),
    you get the probability of a collision. To check for an actual collision, of course, you can just compare the underlying actual objects: if the hashes match, compare the objects directly to detect a collision. But it's extremely unlikely in the first place. If we take two repositories that, together, contain ten million commits, we compute:



    $ bc -l
    r=2^160
    n=10*1000*1000
    scale=100
    1 - e(-n*(n-1)/(2*r))
    .0000000000000000000000000000000000342113848680412753525884397196522
    895097282878872708411144841034243


    which as you can see is still pretty tiny. It's not until we get to:



    n=10*1000*1000*1000*1000*1000*1000*1000


    (ten sextillion objects, using the short scale notation)



    that we find:



    1 - e(-n*(n-1)/(2*r))
    .0000342108030863093209851036344159518189002166758764416221121344549
    079733424124497666779807175655625


    a noticeable chance of accidental collision, at about 0.0035%. At 100 sextillion objects we are up to a 0.35% chance:



    n=100*1000*1000*1000*1000*1000*1000*1000
    1 - e(-n*(n-1)/(2*r))
    .0034152934013810288444649336362559390942558976421984312395097770719
    923433072593638116228277476790795


    and by 1 septillion objects we're running some serious risks:



    1 - e(-n*(n-1)/(2*r))
    .2897326871923714506502211457721853341644126909116947422293621066225
    555385326652788789421475224989232


    Fortunately, well before then, we've run out of disk space. :-) Also, the Git guys are thinking of moving to one of the SHA-256 hashes, which will raise r to 2256, which helps out our denominator.



    (I'm using bc above, in which ^ is exponentiation and the -l library adds e(x) to compute ex.)






    share|improve this answer




















    • Great! But you forget to add practical examples, like both git rev-list --all can be sorted and passed to comm -12 repo1.revs repo2.revs ))
      – gavenkoa
      Nov 12 at 0:00






    • 1




      Well, that assumes bash, sort, and comm. I figured I'd leave that part as an exercise, in case you did not have those available. :-)
      – torek
      Nov 12 at 0:18














    up vote
    1
    down vote













    It's easy to prove that two repositories are related: if they contain matching commits—commits with the same hash IDs and contents, although "same hash IDs" is usually sufficient1—they are related.



    As you note, it's much harder to prove that they are un-related unless they are both complete (non-shallow, non-single-branch) clones. If both are complete, yet neither has any commit in common with the other, the two repositories are unrelated.



    If you have both repositories and verify that both are complete, simply enumerate all the commit hash IDs in both repositories and look for common IDs. If common IDs exist, the two are probably related. To enumerate all the commit hash IDs, run git rev-list --all (redirect output to file or to program that reads both sets of outputs and checks for common hash IDs).



    See footnote 1 for eliminating "probably", but the TL;DR is that for now, any two identical IDs means shared history.




    1Given a uniform hash function h(k) whose range is r = |{h(k)|, the probability of an accidental collision of hashes for two distinct keys k1, k2 is p = 1/r. The probability of uniqueness is the complement of this is p̄ = 1 - (1/r). Generalizing to n keys and using a two-term Taylor expansion of ex ≈ 1 + x for x ≪ 1, we get p̄ ≈ e(-n(n-1)) / 2r, as long as r is reasonably large.



    Git's hash function is currently SHA1, which has a pretty uniform distribution and has r = 2160 = 1461501637330902918203684832716283019655932542976. This satisfies our formula, which means we can use the approximation.



    Hence, if you sum up the total number of hashes n and plug it in to the formula:



    r=1461501637330902918203684832716283019655932542976
    1 - exp(-n*(n-1)/(2*r))


    (remember, we want p, not p-bar),
    you get the probability of a collision. To check for an actual collision, of course, you can just compare the underlying actual objects: if the hashes match, compare the objects directly to detect a collision. But it's extremely unlikely in the first place. If we take two repositories that, together, contain ten million commits, we compute:



    $ bc -l
    r=2^160
    n=10*1000*1000
    scale=100
    1 - e(-n*(n-1)/(2*r))
    .0000000000000000000000000000000000342113848680412753525884397196522
    895097282878872708411144841034243


    which as you can see is still pretty tiny. It's not until we get to:



    n=10*1000*1000*1000*1000*1000*1000*1000


    (ten sextillion objects, using the short scale notation)



    that we find:



    1 - e(-n*(n-1)/(2*r))
    .0000342108030863093209851036344159518189002166758764416221121344549
    079733424124497666779807175655625


    a noticeable chance of accidental collision, at about 0.0035%. At 100 sextillion objects we are up to a 0.35% chance:



    n=100*1000*1000*1000*1000*1000*1000*1000
    1 - e(-n*(n-1)/(2*r))
    .0034152934013810288444649336362559390942558976421984312395097770719
    923433072593638116228277476790795


    and by 1 septillion objects we're running some serious risks:



    1 - e(-n*(n-1)/(2*r))
    .2897326871923714506502211457721853341644126909116947422293621066225
    555385326652788789421475224989232


    Fortunately, well before then, we've run out of disk space. :-) Also, the Git guys are thinking of moving to one of the SHA-256 hashes, which will raise r to 2256, which helps out our denominator.



    (I'm using bc above, in which ^ is exponentiation and the -l library adds e(x) to compute ex.)






    share|improve this answer




















    • Great! But you forget to add practical examples, like both git rev-list --all can be sorted and passed to comm -12 repo1.revs repo2.revs ))
      – gavenkoa
      Nov 12 at 0:00






    • 1




      Well, that assumes bash, sort, and comm. I figured I'd leave that part as an exercise, in case you did not have those available. :-)
      – torek
      Nov 12 at 0:18












    up vote
    1
    down vote










    up vote
    1
    down vote









    It's easy to prove that two repositories are related: if they contain matching commits—commits with the same hash IDs and contents, although "same hash IDs" is usually sufficient1—they are related.



    As you note, it's much harder to prove that they are un-related unless they are both complete (non-shallow, non-single-branch) clones. If both are complete, yet neither has any commit in common with the other, the two repositories are unrelated.



    If you have both repositories and verify that both are complete, simply enumerate all the commit hash IDs in both repositories and look for common IDs. If common IDs exist, the two are probably related. To enumerate all the commit hash IDs, run git rev-list --all (redirect output to file or to program that reads both sets of outputs and checks for common hash IDs).



    See footnote 1 for eliminating "probably", but the TL;DR is that for now, any two identical IDs means shared history.




    1Given a uniform hash function h(k) whose range is r = |{h(k)|, the probability of an accidental collision of hashes for two distinct keys k1, k2 is p = 1/r. The probability of uniqueness is the complement of this is p̄ = 1 - (1/r). Generalizing to n keys and using a two-term Taylor expansion of ex ≈ 1 + x for x ≪ 1, we get p̄ ≈ e(-n(n-1)) / 2r, as long as r is reasonably large.



    Git's hash function is currently SHA1, which has a pretty uniform distribution and has r = 2160 = 1461501637330902918203684832716283019655932542976. This satisfies our formula, which means we can use the approximation.



    Hence, if you sum up the total number of hashes n and plug it in to the formula:



    r=1461501637330902918203684832716283019655932542976
    1 - exp(-n*(n-1)/(2*r))


    (remember, we want p, not p-bar),
    you get the probability of a collision. To check for an actual collision, of course, you can just compare the underlying actual objects: if the hashes match, compare the objects directly to detect a collision. But it's extremely unlikely in the first place. If we take two repositories that, together, contain ten million commits, we compute:



    $ bc -l
    r=2^160
    n=10*1000*1000
    scale=100
    1 - e(-n*(n-1)/(2*r))
    .0000000000000000000000000000000000342113848680412753525884397196522
    895097282878872708411144841034243


    which as you can see is still pretty tiny. It's not until we get to:



    n=10*1000*1000*1000*1000*1000*1000*1000


    (ten sextillion objects, using the short scale notation)



    that we find:



    1 - e(-n*(n-1)/(2*r))
    .0000342108030863093209851036344159518189002166758764416221121344549
    079733424124497666779807175655625


    a noticeable chance of accidental collision, at about 0.0035%. At 100 sextillion objects we are up to a 0.35% chance:



    n=100*1000*1000*1000*1000*1000*1000*1000
    1 - e(-n*(n-1)/(2*r))
    .0034152934013810288444649336362559390942558976421984312395097770719
    923433072593638116228277476790795


    and by 1 septillion objects we're running some serious risks:



    1 - e(-n*(n-1)/(2*r))
    .2897326871923714506502211457721853341644126909116947422293621066225
    555385326652788789421475224989232


    Fortunately, well before then, we've run out of disk space. :-) Also, the Git guys are thinking of moving to one of the SHA-256 hashes, which will raise r to 2256, which helps out our denominator.



    (I'm using bc above, in which ^ is exponentiation and the -l library adds e(x) to compute ex.)






    share|improve this answer





























    answered Nov 11 at 21:34









    torek












    • Great! But you forgot to add a practical example: the output of git rev-list --all from each repository can be sorted, saved, and passed to comm -12 repo1.revs repo2.revs ))
      – gavenkoa
      Nov 12 at 0:00






    • Well, that assumes bash, sort, and comm. I figured I'd leave that part as an exercise, in case you did not have those available. :-)
      – torek
      Nov 12 at 0:18







































    To find any common revisions:



    comm -12 <(cd repo1; git rev-list --all | sort) <(cd repo2; git rev-list --all | sort)


    In my case both repositories share commits:



    bash# (cd spring-petclinic; git rev-list --all) | wc -l
    670

    bash# (cd spring-framework-petclinic; git rev-list --all) | wc -l
    571

    bash# comm -12 <(cd spring-framework-petclinic/; git rev-list --all | sort) <(cd spring-petclinic; git rev-list --all | sort) | wc -l
    427
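
    An empty result from comm only shows that the repositories are unrelated if both clones are complete, so it may be worth checking for shallowness first. A rough sketch, assuming Git 2.15 or later (for git rev-parse --is-shallow-repository) plus sort and comm; the function names are hypothetical helpers, not Git commands:

```shell
# Print every commit ID reachable from any ref of the repository at $1,
# sorted so the output can be fed to comm.
list_commits() {
    git -C "$1" rev-list --all | sort
}

# Succeed (exit status 0) if the repository at $1 is a shallow clone.
is_shallow() {
    [ "$(git -C "$1" rev-parse --is-shallow-repository)" = "true" ]
}

# Print the IDs that appear in both sorted ID files $1 and $2.
common_ids() {
    comm -12 "$1" "$2"
}

# Possible usage (directory names are placeholders):
#   is_shallow repo1 && echo "repo1 is shallow: an empty result proves nothing"
#   list_commits repo1 > repo1.revs
#   list_commits repo2 > repo2.revs
#   common_ids repo1.revs repo2.revs | wc -l
```

    Writing the sorted lists to files instead of using process substitution keeps the comparison usable in shells other than bash. Note that a single-branch clone is not reported as shallow, so that case still has to be checked separately.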















        answered Nov 12 at 0:02









        gavenkoa



























