What is a simple way to check if two git repositories are unrelated?
For example, let's assume we cloned the following repositories:
- https://github.com/spring-petclinic/spring-framework-petclinic
- https://github.com/spring-projects/spring-petclinic
How can I check that one doesn't share history with the other?
How can I check that one shares partial history with the other? How can I browse the common DAG and view the differences?
Note: Git allows shallow clones, and repository histories can diverge over time...
git
Before you go dumpster diving, is there any way that you could ask the maintainers of the two projects if they have anything in common? I mean, if one is a fork of the other, I would expect this to be known.
– Tim Biegeleisen
Nov 11 at 10:52
You could compare the SHA-1 of the first commit of each repository. If they are not the same, the repositories are not related. If they are the same, the repositories are or were related (though they may since have diverged a lot, which you would have to find a way to evaluate).
– Philippe
Nov 11 at 12:05
Here, we could see that they are related...
– Philippe
Nov 11 at 12:12
@Philippe As I wrote, one repository can be a shallow copy of the other, so this is not a reliable check.
– gavenkoa
Nov 11 at 13:16
asked Nov 11 at 10:44
gavenkoa
3 Answers
You can clone the first one, then add the second one as an additional remote:
git clone https://github.com/spring-petclinic/spring-framework-petclinic
cd spring-framework-petclinic
git remote add other https://github.com/spring-projects/spring-petclinic
git fetch --all
Then you can browse both DAGs:
git log --graph --all --oneline --decorate
And see whether they have any common history by looking at the merge base of the two trunks:
git merge-base origin/master other/master
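The merge-base check above can be wrapped in a couple of small helpers. This is only a sketch: the function names related_refs and divergence are made up for illustration, and it assumes both histories have already been fetched into the same repository as described above.

```shell
#!/bin/sh
# Hypothetical helpers; the function names are made up for this sketch.

# related_refs REF_A REF_B
# Succeeds (exit 0) when the two refs share at least one ancestor.
# git merge-base exits non-zero when it finds no common ancestor; note that
# it also exits non-zero for an unknown ref, so check the ref names first.
related_refs() {
    git merge-base "$1" "$2" >/dev/null 2>&1
}

# divergence REF_A REF_B
# Prints two tab-separated counts: commits reachable only from REF_A,
# then commits reachable only from REF_B.
divergence() {
    git rev-list --left-right --count "$1...$2"
}
```

For example, related_refs origin/master other/master && echo related reports whether the two trunks share history, and divergence origin/master other/master shows how far each side has moved on its own. The branch names are the ones used in this answer; your remotes may use different ones.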
How can I clean up the repository after git fetch --all other?
– gavenkoa
Nov 11 at 13:19
You can simply remove the remote with git remote rm other, which removes the references. If you want to clean up the objects before they are gc'ed in the future, you can force a gc: git -c gc.reflogExpire=now gc --prune=all
(WARNING: this will remove ALL stale objects)
– zigarn
Nov 12 at 10:10
It's easy to prove that two repositories are related: if they contain matching commits (commits with the same hash IDs and contents, although "same hash IDs" is usually sufficient [1]), they are related.
As you note, it's much harder to prove that they are unrelated unless they are both complete (non-shallow, non-single-branch) clones. If both are complete, yet neither has any commit in common with the other, the two repositories are unrelated.
If you have both repositories and verify that both are complete, simply enumerate all the commit hash IDs in both repositories and look for common IDs. If common IDs exist, the two are probably related. To enumerate all the commit hash IDs, run git rev-list --all (redirect the output to a file, or to a program that reads both sets of output and checks for common hash IDs).
See footnote 1 for eliminating "probably", but the TL;DR is that for now, any two identical IDs means shared history.
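One way to do the "verify that both are complete" step for the shallow case is sketched below. The function name is_complete_clone is made up for illustration. It relies on git rev-parse --is-shallow-repository, which as far as I know needs Git 2.15 or newer; on older versions the function simply reports "not complete". It does not detect single-branch clones, which you would have to check separately (for example by looking at the fetch refspec of the remote).

```shell
#!/bin/sh
# Hypothetical helper; the name is made up for this sketch.

# is_complete_clone REPO_DIR
# Succeeds (exit 0) only when Git reports that the repository is not shallow.
# Any other answer, including an error or an unsupported option on an old
# Git, is treated as "not known to be complete".
is_complete_clone() {
    [ "$(git -C "$1" rev-parse --is-shallow-repository 2>/dev/null)" = "false" ]
}
```

Usage would be something like is_complete_clone spring-petclinic || echo "shallow or unknown", run once per repository before comparing the hash lists.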
[1] Given a uniform hash function $h(k)$ whose range has size $r = |\{h(k)\}|$, the probability of an accidental collision of hashes for two distinct keys $k_1$, $k_2$ is $p = 1/r$. The probability of uniqueness, the complement of this, is $\bar{p} = 1 - 1/r$. Generalizing to $n$ keys and using the two-term Taylor expansion $e^x \approx 1 + x$ for $x \ll 1$, we get $\bar{p} \approx e^{-n(n-1)/(2r)}$, as long as $r$ is reasonably large.
Git's hash function is currently SHA-1, which has a pretty uniform distribution and has $r = 2^{160} = 1461501637330902918203684832716283019655932542976$. This is large enough to satisfy our formula, which means we can use the approximation.
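For completeness, the generalization to $n$ keys can be written out as follows. This is the standard birthday-problem argument, sketched under the same uniformity assumption: the $i$-th new key must avoid the $i$ hashes already taken, and each factor is replaced by its exponential approximation.

```latex
\[
\bar{p}
  = \prod_{i=1}^{n-1}\left(1 - \frac{i}{r}\right)
  \approx \prod_{i=1}^{n-1} e^{-i/r}
  = \exp\!\left(-\frac{1}{r}\sum_{i=1}^{n-1} i\right)
  = e^{-n(n-1)/(2r)},
\qquad
p = 1 - \bar{p} \approx 1 - e^{-n(n-1)/(2r)} .
\]
```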
Hence, if you sum up the total number of hashes n and plug it into the formula:
r=1461501637330902918203684832716283019655932542976
1 - exp(-n*(n-1)/(2*r))
(remember, we want p, not p-bar),
you get the probability of a collision. To check for an actual collision, of course, you can just compare the underlying actual objects: if the hashes match, compare the objects directly to detect a collision. But it's extremely unlikely in the first place. If we take two repositories that, together, contain ten million commits, we compute:
$ bc -l
r=2^160
n=10*1000*1000
scale=100
1 - e(-n*(n-1)/(2*r))
.0000000000000000000000000000000000342113848680412753525884397196522\
895097282878872708411144841034243
which as you can see is still pretty tiny. It's not until we get to:
n=10*1000*1000*1000*1000*1000*1000*1000
(ten sextillion objects, using the short scale notation)
that we find:
1 - e(-n*(n-1)/(2*r))
.0000342108030863093209851036344159518189002166758764416221121344549\
079733424124497666779807175655625
a noticeable chance of accidental collision, at about 0.0034%. At 100 sextillion objects we are up to a 0.34% chance:
n=100*1000*1000*1000*1000*1000*1000*1000
1 - e(-n*(n-1)/(2*r))
.0034152934013810288444649336362559390942558976421984312395097770719\
923433072593638116228277476790795
and by 1 septillion objects we're running some serious risks:
n=1000*1000*1000*1000*1000*1000*1000*1000
1 - e(-n*(n-1)/(2*r))
.2897326871923714506502211457721853341644126909116947422293621066225\
555385326652788789421475224989232
Fortunately, well before then, we've run out of disk space. :-) Also, the Git guys are thinking of moving to one of the SHA-256 hashes, which will raise r to $2^{256}$, which helps out our denominator.
(I'm using bc above, in which ^ is exponentiation and the -l library adds e(x) to compute $e^x$.)
Great! But you forgot to add a practical example: the output of git rev-list --all from both repositories can be sorted and passed to comm -12 repo1.revs repo2.revs :)
– gavenkoa
Nov 12 at 0:00
Well, that assumes bash, sort, and comm. I figured I'd leave that part as an exercise, in case you did not have those available. :-)
– torek
Nov 12 at 0:18
To find any common revisions:
comm -12 <(cd repo1; git rev-list --all | sort) <(cd repo2; git rev-list --all | sort)
In my case both repositories share commits:
bash# (cd spring-petclinic; git rev-list --all) | wc -l
670
bash# (cd spring-framework-petclinic; git rev-list --all) | wc -l
571
bash# comm -12 <(cd spring-framework-petclinic/; git rev-list --all | sort) <(cd spring-petclinic; git rev-list --all | sort) | wc -l
427
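The one-liner above needs bash for the <( ... ) process substitution. A variant that works in a plain sh could look like the sketch below. The function name common_commits is made up for illustration, and it assumes that mktemp, sort and comm are available (mktemp is not strictly POSIX, but it is very common).

```shell
#!/bin/sh
# Hypothetical helper; the name is made up for this sketch.

# common_commits REPO_A REPO_B
# Prints the commit hash IDs that are present in both repositories, one per
# line. Prints nothing when the two repositories have no commit in common.
common_commits() {
    a=$(mktemp) && b=$(mktemp) || return 1
    # comm needs both inputs sorted in the same order.
    git -C "$1" rev-list --all | sort >"$a"
    git -C "$2" rev-list --all | sort >"$b"
    comm -12 "$a" "$b"
    rm -f "$a" "$b"
}
```

For example, common_commits spring-framework-petclinic spring-petclinic | wc -l gives the number of shared commits. An empty result only proves that the repositories are unrelated when both are complete clones.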
edited Nov 12 at 10:13
answered Nov 11 at 11:21
zigarn
answered Nov 11 at 21:34
torek
answered Nov 12 at 0:02
gavenkoa