0. Introduction
A couple of days ago I came across this interesting bitcointalk post. Forum member jlcooke made available a spreadsheet of the amount of bitcoin held by each investor at Just-Dice.com. The dataset is interesting, indicating a large number of small investors holding a small portion of the overall investment and a small number of large investors.
This sort of data often shows a significant point of inverse equality - where x% of one variable is responsible for (100 - x)% of another variable. In this case, for the dataset from the 27th of October 2013, 11.9% of investors own 88.1% of the total invested at Just-Dice.com.
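One way to locate such a point is to sort the balances from largest to smallest and walk down the list until the share of investors and the share of holdings sum to roughly 100%. The following is a minimal sketch of that idea only. The function name `inverse_equality_point` and the balances are made up for illustration and are not the Just-Dice.com data.

```python
def inverse_equality_point(holdings):
    """Find the split where the top p of investors hold about 1 - p of the total.

    Returns (investor_share, holdings_share) as fractions between 0 and 1.
    """
    ordered = sorted(holdings, reverse=True)  # largest investors first
    total = sum(ordered)
    n = len(ordered)
    running = 0.0
    best = None
    for k, amount in enumerate(ordered, start=1):
        running += amount
        investor_share = k / n
        holdings_share = running / total
        # At the point of interest the two shares sum to (almost) exactly 1.
        gap = abs(investor_share + holdings_share - 1.0)
        if best is None or gap < best[0]:
            best = (gap, investor_share, holdings_share)
    return best[1], best[2]


# Made-up balances in BTC, for illustration only.
balances = [100.0, 50.0, 5.0, 2.0, 1.0, 1.0, 0.5, 0.3, 0.1, 0.1]
investor_share, holdings_share = inverse_equality_point(balances)
print(f"{investor_share:.1%} of investors hold {holdings_share:.1%} of the total")
```

With a finite list of accounts the two shares will rarely sum to exactly 100%, so the sketch returns the closest split.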
1. Fitting the data to a distribution
In my experience, random variables that exhibit this sort of inequality (monetary ownership, hashrate ownership, file sizes on a network) are usually distributed as either Pareto or log-normal random variables. I used maximum likelihood estimates (MLEs) to fit Pareto and log-normal distributions to the data and plotted the results against the empirical complementary (upper tail) CDF (CCDF) of the 27th October data, on a log-log scale to show the different upper tails more clearly.
Generally, a Pareto distributed random variable will have a straight tail on a log-log CCDF, and a log normally distributed random variable will sooner or later curve downward.
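For readers who want to try this kind of fit themselves, here is a rough sketch using the textbook closed-form MLEs for the two distributions. It is not the code used for the charts in this post. The function names are my own, and the Pareto fit assumes the scale parameter is the smallest observation.

```python
import math


def fit_pareto(xs):
    """MLE for a Pareto distribution: returns (x_min, alpha)."""
    x_min = min(xs)  # assumed scale parameter: the smallest observation
    alpha = len(xs) / sum(math.log(x / x_min) for x in xs)
    return x_min, alpha


def fit_lognormal(xs):
    """MLE for a log-normal distribution: returns (mu, sigma) of log(x)."""
    logs = [math.log(x) for x in xs]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma


def pareto_ccdf(x, x_min, alpha):
    # log(CCDF) = alpha * (log(x_min) - log(x)): a straight line on log-log axes.
    return 1.0 if x < x_min else (x_min / x) ** alpha


def lognormal_ccdf(x, mu, sigma):
    # Upper tail of a normal distribution evaluated at log(x).
    return 0.5 * math.erfc((math.log(x) - mu) / (sigma * math.sqrt(2.0)))


def empirical_ccdf(xs):
    """Return (x, fraction of observations ranked above x) pairs."""
    ordered = sorted(xs)
    n = len(ordered)
    return [(x, (n - i) / n) for i, x in enumerate(ordered, start=1)]


# Made-up positive balances in BTC, for illustration only.
balances = [0.01, 0.05, 0.1, 0.2, 0.5, 1.0, 2.0, 10.0, 50.0, 400.0]
x_min, alpha = fit_pareto(balances)
mu, sigma = fit_lognormal(balances)
for x, tail in empirical_ccdf(balances):
    print(x, tail, pareto_ccdf(x, x_min, alpha), lognormal_ccdf(x, mu, sigma))
```

Plotting the three columns against x on log-log axes gives the kind of comparison described above. The sketch assumes strictly positive balances that are not all identical.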
Even without using the Anderson Darling test it's clear that neither the Pareto nor the log-normal model fits the tails very well. The next step for these sorts of random variables is to plot the histogram of the log of the data: a log normally distributed random variable will appear normally distributed, and a Pareto distributed random variable will have a higher kurtosis than a normal distribution and is often quite skewed.
However, for the Just-Dice.com investor data neither seems to be the case - the histogram appears to be that of a normally distributed random variable with a bite taken out of one side. The histogram is clearly at least bimodal, meaning that there are probably at least two groups of differently distributed random variables mixed into one group.
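A log histogram of this kind can be sketched in a few lines. The example below is an illustration only: it assumes base-10 logs and equal-width bins, the helper name `log_histogram` is hypothetical, and the balances are made up.

```python
import math


def log_histogram(values, bins=10):
    """Bin log10(value) into equal-width bins; returns (edges, counts)."""
    logs = [math.log10(v) for v in values if v > 0]  # zero balances have no log
    lo, hi = min(logs), max(logs)
    width = (hi - lo) / bins or 1.0  # avoid a zero width if all values are equal
    counts = [0] * bins
    for v in logs:
        index = min(int((v - lo) / width), bins - 1)  # the maximum goes in the last bin
        counts[index] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts


# Made-up balances in BTC, for illustration only.
balances = [0.001, 0.002, 0.004, 0.01, 0.01, 0.02, 0.5, 3.0, 8.0, 10.0, 12.0, 40.0]
edges, counts = log_histogram(balances, bins=6)
for left, count in zip(edges, counts):
    print(f"10^{left:6.2f} | " + "#" * count)
```

Two separate clumps of `#` marks with a gap between them are the text-mode equivalent of the bimodal shape described above.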
2. Fitting the data to a mix of multiple distributions
The method outlined in this link can help determine the nature of the mixture. However, this method is heavily dependent on the kernel and bandwidth used for the density estimation. Even with a bandwidth chosen by biased cross validation, I still ended up with at least four different normally distributed random variables in the mixture, which seemed a bit too much like overfitting.
The Bayesian information criterion (BIC) or the Akaike information criterion (AIC) can be used as a criterion for model selection, and mclust (an R package) uses the BIC. Based on this criterion, a mixture of two normally distributed random variables was selected as the least likely to overfit. The chart below illustrates the mixture models compared to the density histogram of the log of investor bitcoin holdings, for the four dates available.
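To make the selection step concrete, here is a small stand-in for what a package like mclust does internally. It is a plain expectation-maximisation (EM) fit of one and two normal components, compared by BIC. It is not mclust and not the code behind the charts here. The data is synthetic, the starting values are a crude assumption of mine, and the BIC is written in the "lower is better" form. Packages differ on the sign convention, so check which direction counts as better before comparing numbers.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_one_normal(xs):
    """MLE for a single normal; returns (mu, sigma, log_likelihood)."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    loglik = sum(math.log(normal_pdf(x, mu, sigma)) for x in xs)
    return mu, sigma, loglik


def mixture_pdf(x, lam, mus, sigmas):
    return (lam * normal_pdf(x, mus[0], sigmas[0])
            + (1.0 - lam) * normal_pdf(x, mus[1], sigmas[1]))


def fit_two_normals(xs, iterations=200):
    """EM for a two-component normal mixture.

    Returns (lam, mus, sigmas, log_likelihood); lam is the weight of component 0.
    """
    xs = sorted(xs)
    n = len(xs)
    half = n // 2
    _, sigma_all, _ = fit_one_normal(xs)
    # Crude start: means of the lower and upper halves, with a common spread.
    mus = [sum(xs[:half]) / half, sum(xs[half:]) / (n - half)]
    sigmas = [sigma_all, sigma_all]
    lam = 0.5
    for _ in range(iterations):
        # E step: probability that each point belongs to component 0.
        resp = []
        for x in xs:
            a = lam * normal_pdf(x, mus[0], sigmas[0])
            resp.append(a / max(mixture_pdf(x, lam, mus, sigmas), 1e-300))
        # M step: weighted means, spreads and mixing proportion.
        weights = [resp, [1.0 - r for r in resp]]
        for j in (0, 1):
            total = max(sum(weights[j]), 1e-12)
            mus[j] = sum(w * x for w, x in zip(weights[j], xs)) / total
            var = sum(w * (x - mus[j]) ** 2 for w, x in zip(weights[j], xs)) / total
            sigmas[j] = max(math.sqrt(var), 1e-6)  # guard against a collapsing component
        lam = sum(resp) / n
    loglik = sum(math.log(max(mixture_pdf(x, lam, mus, sigmas), 1e-300)) for x in xs)
    return lam, mus, sigmas, loglik


def bic(loglik, n_params, n_obs):
    """BIC in the 'lower is better' convention."""
    return n_params * math.log(n_obs) - 2.0 * loglik


# Synthetic "log holdings": two made-up groups, for illustration only.
random.seed(1)
log_holdings = ([random.gauss(-2.0, 0.5) for _ in range(300)]
                + [random.gauss(3.0, 0.7) for _ in range(200)])
_, _, loglik1 = fit_one_normal(log_holdings)
lam, mus, sigmas, loglik2 = fit_two_normals(log_holdings)
print("BIC, one normal :", bic(loglik1, 2, len(log_holdings)))   # mean, sd
print("BIC, two normals:", bic(loglik2, 5, len(log_holdings)))   # 2 means, 2 sds, 1 weight
print("lambda, means, sds:", lam, mus, sigmas)
```

The same comparison extends to three or more components by adding three parameters per extra component. The penalty term is what stops the criterion from always preferring the larger model.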
It should be noted that this does not prove that there are two main groups of investors (small and large), just that the log of the investment data can be modelled well if that is the case. If it is indeed the case, the plots below suggest that over time the average of both groups is increasing, the spread (standard deviation) of the small investors is also increasing but the spread of the large investors is decreasing. The density proportion (lambda) of the small investors is also increasing. If this data is followed over time, it may become apparent whether the bulk of these changes are due to new investors (or current investors adding to their invested amount) or due to the slow appreciation of the investments themselves (in proportion to the amount invested).
3. A tail of two distributions
So, to return to the tail analysis from section one, below is the same empirical upper tail CDF compared to the mixture model upper tail CDF. The fit is quite good.
and for all available data to date:
Finally, the fit of the mixture model to the data can be estimated by creating probability functions for the mixture models and using the Anderson Darling test. The resulting test statistics gave p values ranging from 0.3645 to 0.5426 for the various weekly datasets, meaning that the null hypothesis "Variables from the mixture model and the data belong to the same probability distribution" cannot be rejected.
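As a rough illustration of what "creating probability functions for the mixture models" can look like, the sketch below builds the CDF and upper tail CDF of a mixture of log-normals and computes the Anderson Darling A^2 statistic of a sample against it. The mixture parameters and balances are hypothetical, natural logs are assumed, and the function names are my own. Turning A^2 into a p value needs the test's reference distribution, which is not reproduced here.

```python
import math


def mixture_cdf(x, components):
    """P(X <= x) for a mixture of log-normals.

    components is a list of (weight, mu, sigma) tuples, with mu and sigma on
    the natural-log scale and weights summing to 1.
    """
    log_x = math.log(x)
    return sum(
        w * 0.5 * math.erfc(-(log_x - mu) / (sigma * math.sqrt(2.0)))
        for w, mu, sigma in components
    )


def mixture_ccdf(x, components):
    """Upper tail P(X > x) of the same mixture."""
    return 1.0 - mixture_cdf(x, components)


def anderson_darling(sample, cdf):
    """A^2 statistic of sample against a fully specified continuous cdf."""
    n = len(sample)
    # Clamp away from 0 and 1 so the logarithms stay finite.
    u = [min(max(cdf(x), 1e-12), 1.0 - 1e-12) for x in sorted(sample)]
    s = sum(
        (2 * i - 1) * (math.log(u[i - 1]) + math.log(1.0 - u[n - i]))
        for i in range(1, n + 1)
    )
    return -n - s / n


# Hypothetical mixture parameters and made-up balances, for illustration only.
components = [(0.6, -2.0, 0.5), (0.4, 3.0, 0.7)]
balances = [0.05, 0.1, 0.12, 0.2, 0.3, 9.0, 15.0, 22.0, 40.0]
a2 = anderson_darling(balances, lambda x: mixture_cdf(x, components))
print("upper tail at 10 BTC:", mixture_ccdf(10.0, components))
print("A^2:", a2)
```

Small values of A^2 indicate that the sample sits close to the model CDF, with extra weight given to disagreement in the tails.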
4. Summary
- The amounts of investment attributable to each investment account can be modelled as a mixture of two log normally distributed random variables. The mixture parameters are detailed in the table below: