14th September 2015
0. Introduction
In the last post on BIP101 I discussed some flaws (as I saw them) in the description of the BIP101 implementation. I thought (and still think) that little consideration had gone into determining what the confidence intervals for the true proportion of BIP101 voters might be at activation, given that activation is based on identifying 750 of 1000 consecutive blocks as voting for BIP101 from the block's version number.
My assumptions were:
1. The true proportion of voters was constant (has plateaued)
2. Each individual test for BIP101 activation would result in a binomially distributed random variable, and
3. Each separate test would be a Bernoulli trial with a constant probability so that the number of blocks until activation would be a geometrically distributed random variable.
However, as pointed out by this redditor and also later on by this commenter, each activation test is not independent of the last but shares all but one of its blocks with the previous test, so as a group the test results cannot be independent, identically distributed binomial random variables.
This is a silly mistake due to me not fully thinking through the problem. In the past I've explained the same thing to bitcointalk members, so I really don't have any excuse.
So, how could we expect this to affect the results? Since each test is significantly correlated with the last, my guess was that the probabilities would be similar to those I obtained, only with estimates of blocks until activation significantly higher - tens, hundreds or even thousands of times higher, i.e. activation would be much less likely.
1. Probability that true proportion of voters is much less than 750 / 1000
I remember reading something about probability when variables are correlated, so I'm fairly sure an analytical method exists, but I didn't have the time to pursue it and used numerical methods (Monte Carlo simulations) instead.
- Generate one hundred million binomial random variables where size = 1 and true proportion is some predefined number from 0.65 to 0.8
- Calculate the rolling sums for the entire series, with a window of 1000 and a step of 1.
- Record the sequence number of the first sum to equal 750 (or to exceed 750, which can only happen if it's the very first sum in the series), and the true proportion at which it occurred. If no sum reaches 750, record NA.
- Append data to a file.
- Repeat many times for each true proportion.
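Since the post was produced with R, a minimal sketch of what one simulation run might look like is given below; the function name, the vectorised rolling sum and the example write-out are my own assumptions rather than the code actually used.

    # Hedged sketch of one simulation run, following the steps above.
    # simulate_run() is an illustrative name; the post's actual code isn't shown.
    simulate_run <- function(p, n = 1e8, window = 1000, threshold = 750) {
      votes   <- rbinom(n, size = 1, prob = p)                        # one vote indicator per simulated block
      csum    <- cumsum(votes)                                        # cumulative vote count
      rolling <- csum[window:n] - c(0L, csum[seq_len(n - window)])    # rolling sum, window 1000, step 1
      which(rolling >= threshold)[1]                                  # first window with >= 750 votes; NA if none
    }

    # Repeat for each true proportion and append the results to a file, e.g.:
    # for (p in seq(0.65, 0.8, by = 0.001)) write(c(p, simulate_run(p)), "results.txt", append = TRUE)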
The extra length of the rolling sum series - one hundred million blocks, as opposed to the more likely 25,000 to 50,000 - allows even improbable activation events to be observed.
For the true proportion I used a series of values from 0.65 to 0.8 in increments of 0.001, and I repeated the process ten thousand times for each proportion. The results are illustrated in the 2D (hex-binned) histogram below:
As the true proportion of simulated BIP101 voters increases towards - and then past - 0.75, the median number of sequential simulated blocks until activation approaches 1. This was interesting to me, because it suggested that the data may still be geometrically distributed and could be approximated using an exponential distribution.
Note that for any given run the estimate of blocks until activation will be significantly affected by the starting value since the rolling sums are not independent. However, the average probability over many simulated runs can be calculated.
If the number of blocks until activation is exponentially distributed, the probability of activation occurring before some number of blocks can be calculated directly, without reproducing the simulated data.
Comparison to independent, identically distributed binomial random variables
A quick side track here: how does the mean number of blocks until activation for a particular true proportion compare to the results from the last post? How much less likely are BIP101 activations at low true proportions, given that the tests are not independent binomially distributed data? It turns out that BIP101 activations would be about 50 to 250 times less likely than if they were independent tests.
I found a neat generalised linear model between the true proportion and the BIP101 simulated number of blocks until activation divided by the binomial estimate (exponential family, with exponentially distributed residuals):
binomial prob / rolling sum prob ~ exp(34.44*p^2 - 8.83*p - 6.51)
p = true proportion of simulated BIP 101 voters
The linear model is shown below, along with the mean data points:
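The post doesn't show the model call itself; as a rough sketch of how such a fit might be made in R, assuming a data frame sims with columns p and ratio, and using a log-link Gamma family as a stand-in for exponentially distributed residuals:

    # Hedged sketch only: 'sims', 'p' and 'ratio' are assumed names, and the
    # Gamma(log) family is my choice of stand-in for an exponential-family fit.
    fit <- glm(ratio ~ poly(p, 2, raw = TRUE), family = Gamma(link = "log"), data = sims)
    coef(fit)   # compare with the quoted coefficients 34.44, -8.83 and -6.51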
A generalised linear model to describe the change in probability with respect to true proportion
One more quick side track here. As explained above, being able to approximate the data as exponentially distributed makes the activation probability calculable without needing access to the simulated data. To make this possible, one other thing is necessary - a mean function for blocks until activation as a function of the true proportion. Again, I found a neat generalised linear model (exponential family, with exponentially distributed residuals):
blocks until activation = exp(2060.3*p^2 - 3101.9*p + 1174.5)
p = true proportion of simulated BIP 101 voters
The plot below shows this relationship. Again, the linear model is shown compared to mean data points, rather than the data on which the model is based.
Probability functions
After dividing by the mean, the Anderson-Darling goodness of fit test showed the bulk of the data was exponentially distributed (95% confidence). When the mean value of blocks until activation was less than 100, the approximation by an exponential distribution began to fail, due to the geometric probability being much too high to be well approximated in this way. Missing data also prevented an estimation of goodness of fit.
Only data from true proportions of 0.683 to 0.732 inclusive had sufficiently large mean values and no missing values.
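The post doesn't say which implementation of the Anderson-Darling test was used; one way such a check could be run in R is via the goftest package (an assumption on my part), along these lines:

    # Hedged sketch: 'blocks' (the simulated blocks-until-activation values for one
    # true proportion) and the use of the goftest package are assumptions, not taken from the post.
    library(goftest)
    scaled <- blocks / mean(blocks)            # divide by the mean, as described above
    ad.test(scaled, null = "pexp", rate = 1)   # Anderson-Darling test against Exp(1)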
When the data was divided by the generalised linear model values instead of the mean, the data could still be fitted by an exponential distribution with 95% confidence, so I used the model mean function for the probability function:
Pr = 1 - exp(-x*lambda)
and a quantile function:
x = -log(1 - Pr)/lambda
Pr = Pr(blocks until activation <= x | true proportion = p)
x = number of sequential blocks until activation
lambda = 1 / exp(2060.3*p^2 - 3101.9*p + 1174.5)
p = true proportion of simulated BIP 101 voters
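Written out as a small R sketch (the function names are mine; the coefficients are those of the model above):

    # Rate, probability and quantile functions implied by the exponential approximation.
    lambda_fn    <- function(p) 1 / exp(2060.3*p^2 - 3101.9*p + 1174.5)   # 1 / mean blocks until activation
    p_activation <- function(x, p)    1 - exp(-x * lambda_fn(p))          # Pr(activation within x blocks)
    q_activation <- function(prob, p) -log(1 - prob) / lambda_fn(p)       # blocks for a given probability

    # e.g. q_activation(0.5, 0.73) gives the median number of blocks until activation at p = 0.73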
These results seem to fit nicely with the data:
Focussing on slightly more realistic intervals, say a week's worth of blocks up to a couple of years' worth of blocks (1000 to 100,000 blocks):
2. Probability that true proportion of voters = 0.5 when fake + actual votes = 750 / 1000 blocks
The same simulation method as above was used, except that each random variable is now a sequence of three digits (two zeros and a one), with the position of the one indicating which of the three groups the block belongs to (fake BIP 101 voters, actual BIP 101 voters, or blocks without the BIP101 version number), and the rolling sum is calculated for each group separately. The true proportion of actual voters was kept constant at 0.5, and the fake voter proportion varied from 0.16 to more than 0.3 (non-voters made up the difference from 1.0).
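As a rough sketch of how one run of this modified simulation might look in R (the group encoding and names are mine, and only the combined fake-plus-actual rolling sum is shown, since that is what the 750-of-1000 check uses):

    # Hedged sketch of one run of the three-group simulation described above.
    simulate_fake_run <- function(p_fake, p_actual = 0.5, n = 1e8,
                                  window = 1000, threshold = 750) {
      # assign each simulated block to one of three groups
      group <- sample(c("fake", "actual", "none"), size = n, replace = TRUE,
                      prob = c(p_fake, p_actual, 1 - p_fake - p_actual))
      votes   <- as.integer(group != "none")                          # fake and actual votes both count towards 750
      csum    <- cumsum(votes)
      rolling <- csum[window:n] - c(0L, csum[seq_len(n - window)])    # rolling sum of combined votes
      which(rolling >= threshold)[1]                                  # first window with >= 750 votes; NA if none
    }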
Again, after dividing by the mean number of blocks until activation for a particular true proportion, the data between fake voter proportion = 0.183 and 0.231 could be fit to an exponential distribution (confidence 95%).
Generalised linear model
As in section 1, I found a neat generalised linear model of the mean blocks until activation as a function of the fake voter true proportion, with similarly well-behaved residuals.
In this case:
blocks until activation = exp(2070.5*p^2 - 1046.3*p + 139.1)
p = true proportion of simulated fake BIP 101 voters
The resulting probability and quantile functions:
Pr = 1 - exp(-x*lambda)
and a quantile function:
x = -log(1 - Pr)/lambda
Pr = Pr(blocks until activation <= x | true proportion = p)
x = number of sequential blocks until activation
lambda = 1 / exp(2070.5*p^2 - 1046.3*p + 139.1)
p = true proportion of simulated fake BIP 101 voters
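These are the same functional forms as in section 1; in the R sketch above only the rate function changes (again, the names are illustrative):

    # Section 2 rate function; the probability and quantile forms from the earlier
    # sketch apply with lambda_fake(p) in place of lambda_fn(p).
    lambda_fake <- function(p) 1 / exp(2070.5*p^2 - 1046.3*p + 139.1)

    # e.g. Pr(premature activation within 15,000 blocks at fake proportion p):
    # 1 - exp(-15000 * lambda_fake(p))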
It bears mentioning that the linear models have no explanatory value; they're just useful if we want to calculate probabilities. A quadratic equation will have either a maximum or a minimum, so it can't describe data outside the predefined range - the number of blocks until activation will tend to 1 as true proportions increase, although more than two hundred times more slowly than for binomial iid random variables.
Edit: I've changed my mind on this. The "number of blocks before activation" variable would be 1 / CDF if the data were binomial iid rvs. The moving sum results seem to be at least related to the binomial iid rv case, and since binomial iid rvs can be approximated using the normal distribution, which is proportional to exp(-x^2), there seems to be some basis in fact for using the exponential quadratic linear model.
3. Summary
Although the last analysis was flawed and this analysis shows a reduced risk of fake voters causing premature activation, the difference is not as significant as I thought it might be. For example, 50% of the time premature activation could be caused before 15,000 consecutive blocks by a fake voter proportion of 0.21, rather than the 0.19 from the previous, incorrect estimates.
Although this is not a significant source of worry, it's interesting that - as far as I know - no one had attempted to determine how "loose" the supermajority definition was. "Common sense" is a poor method of working with random variables. If an answer is unknown, and that answer could be important, I think it's best to make an effort to find that answer, and not just assume it will be ok.
organofcorti.blogspot.com is a reader-supported blog:
1QC2KE4GZ4SZ8AnpwVT483D2E97SLHTGCG
Created using R and various packages, especially dplyr, data.table, and ggplot2.
Find a typo or spelling error? Email me with the details at organofcorti@organofcorti.org and if you're the first to email me I'll pay you 0.01 btc per ten errors.
Please refer to the most recent blog post for current rates or rule changes.
I'm terrible at proofreading, so some of these posts may be worth quite a bit to the keen reader.
Exceptions:
- Errors in text repeated across multiple posts: I will only pay for the most recent errors rather than for every single occurrence.
- Errors in chart texts: since I don't keep the data that generated the charts, I can't fix them, and so I can't pay for them. Still, they would be nice to know about!
I write in British English.
Comments are switched off until the current spam storm ends.