Neighbourhood Pool Watch: Blockchain voting and a comment response

2nd November 2015

Related posts:

BIP101 implementation flaws
Supermajority and BIP101 activation
Blockchain voting
A comment from Vitalik Buterin

0. Introduction

Recently, a comment from Vitalik Buterin (originator of Ethereum smart contracts) on the Blockchain voting post made it clear that there were still some questions about the problems surrounding the current blockchain voting methods; some have answers that I haven't explicitly mentioned and some that I hadn't explained sufficiently well.

You can read the full comment via the link provided above, but I'll summarise them as follows (apologies to Vitalik if I've missed or misinterpreted a point):

1. How do simulation results compare for similar sample sizes?
I generated more extreme error (minimum true proportion from a given observed or sample proportion and maximum observed proportion from a given true proportion) data in the same way as for the last post, but included a 1000 and 2000 sample exponential moving average (EMA) and rolling sum, as well as 1000 samples per 1008 blocks, 1000 sample per 2016 and 2000 samples per 2016 blocks. In the cases of the proportion tests per retarget (8 tests in 16128 blocks) and per half retarget (16 tests in 16128 blocks), the fewer the tests and the greater the sample, the less extreme the errors. The distribution of the rolling sum and EMA were clearly worse than for the per retarget and half retarget trials.

2. What contributes most to maximal voting error - number of blocks or number of trials?

For the trials without overlapping test blocks (that is, not EMA or moving sum methods) we can generate probability distributions without resorting to simulations, and these show more clearly the inter-relationship between number of trials, sample size, and error.

The first plot shows how probability density changes with changing number of trials and sample size;
The second plot shows the change in median and 95% confidence intervals, although only for the distribution of the minimum true proportion for an arbitrary observed sample proportion (they also look like creepy disembodied smiles, ugh).
The final plot is a heat map of the error inter-relationship between the number of trials and sample size.

These visualisations are useful in helping us understand the change in the extremal error distribution. It seems clear to me that:

The error distribution standard deviation increases as sample size reduces.
The error distribution median increases as sample size reduces and as the number of trials increases.

There is no simple linear relationship between the two, and if you want to keep the 95% confidence interval of error between some values, you'll need to do the calculations to find the optimal number of trials and sample size. However, it can be done, which cannot be said about EMA and rolling sum methods.

3. What is an independent random variable and how do moving averages break the independence assumption?

I'm, feeling lazy so I'm just going to quote Wikipedia:

In probability theory, two events are independent, statistically independent, or stochastically independent[1] if the occurrence of one does not affect the probability of the other. Similarly, two random variables are independent if the realization of one does not affect the probability distribution of the other.

This means that rolling sums and exponential moving averages are not independant random variables, since the variable at block n+1 can be calculated using the value of block n+1 and the variable from block n. This does not mean these methods are not valid, but it does mean that statistical inference tests can't be used.

Simply put, because unless you go to the trouble of running simulations before voting you can't be sure what the possible values for the underlying true proportions could be, whereas for the independent trials methods that don't re-use random variables (for example per retarget proportion test), numerical and analytical methods of analysis are well known and well understood.

4. "Memorylessness" of moving averages compared to independent trials?
I think the memoryless comment is just a confusion in terms. Independant trials are completely memoryless because nothing from a one trial affects the distribution of a different trial, earlier or later. The trials have no "memory". However exponential moving averages and moving sums are "memoryfull" - the current trial can be calculated using just the most recent block and the last random variable.

Lack of memorylessness means that EMA and moving sums drag their history about with them, and is why they cannot generate independent random variables.

5. Summary

Lack of independence means that rolling sum and EMA voting methods cannot be easily investigated prior to the vote. Confidence intervals for the underlying true proportions of voters cannot easily be determined ahead of time.
Statistical inference tests are easily applicable to independent trials methods, are repeatable, and do not require simulations.
The inter-relationship between sample size, number of trials and extreme error distribution is complex, but generally:

Increasing number of trials increases median extremal error and decreases standard deviation of the extremal error distribution.
Increasing sample size decreases the standard deviation of the extremal error distribution and decreases the median extremal error.

Appendix: Calculating the minimum true proportion for a given sample proportion or maximum sample proportion for a given true proportion.

Order statistics of a uniform distribution can be calculated using the beta distribution:

distribution of the nth order statistic ~ beta(n, n + 1 - k)

where n = sample size and k is the order of the statistic, then for a uniform distribution:

distribution of maxima ~ beta(n, 1)
distribution of minima ~ beta(1, n)

How can this result be used for distributions other than the uniform distribution? All probability distributions can be considered as transforms of the uniform distribution, so in order to calculate the cumulative probability density for the extrema of a continuous probability distribution:

Calculate probability quantiles using the beta distribution quantile function. For example, the 0.025 and 0.975 (95% confidence interval) limits for the probabilities of the maximum and the minimum are (0.9963179 and 0.9999747), and (0.00002531 and 0.0036820), respectively.
The probabilities are used to calculate the continuous distribution quantiles. The binomial distribution is not continuous, but with a large sample size or probability this is not important.
Distribution of maximum sample size: Using the results above, for a sample size of 1000 and a true proportion of 0.75 the 95% confidence interval of the maximum sample is between 768 and 804 blocks, or an observed proportion of 0.768 to 0.804.
Distribution of minimum true proportion: For this, we also need to use some method to calculate the distribution of the true proportion given a particular observed proportion. There are a number of ways to do this, but I use Jeffrey's interval, which uses the beta distribution to estimate confidence intervals for a true proportions. Using this and the probabilities above, the 95% confidence interval for the minimum true proportion for an observed proportion of 750 block of 1000, is 0.6919049 0.7120895.

Organofcorti lives in the blockchain!

organofcorti.blogspot.com is a reader supported blog:

1QC2KE4GZ4SZ8AnpwVT483D2E97SLHTGCG

Created using R and various packages, especially dplyr, data.table, and ggplot2.

Find a typo or spelling error? Email me with the details at organofcorti@organofcorti.org and if you're the first to email me I'll pay you 0.01 btc per ten errors.

Please refer to the most recent blog post for current rates or rule changes.

I'm terrible at proofreading, so some of these posts may be worth quite a bit to the keen reader.

Exceptions:

Errors in text repeated across multiple posts: I will only pay for the most recent errors rather every single occurrence.
Errors in chart texts: Since I can't fix the chart texts (since I don't keep the data that generated them) I can't pay for them. Still, they would be nice to know about.

I write in British English.

Neighbourhood Pool Watch

Pages

Sunday, 8 November 2015

Blockchain voting and a comment response

No comments:

Post a Comment