Tuesday, July 15, 2014
Wikipedia
I have just completed the Wikipedia page for the Imprecise Dirichlet process, https://en.wikipedia.org/wiki/Imprecise_Dirichlet_process. Any useful contribution or modification is welcome.
Wednesday, May 28, 2014
Comparing climbing performance
This post shows how to use the IDP statistical package to compare sport performance.
As a case study, I have considered (just for fun) the comparison of my climbing performance in two consecutive editions (2013 and 2014) of the "Tre Valli Bresciane" cycling race.
The following table reports my ascent times on 6 different climbs in the 2013 and 2014 editions of the race.
Climb | 2013 | 2014 |
---|---|---|
A | 29m16s | 29m14s |
B | 29m03s | 29m02s |
C | 23m28s | 23m01s |
D | 48m04s | 45m56s |
E | 28m51s | 30m43s |
F | 15m09s | 14m53s |
A classical non-parametric statistical hypothesis test used when comparing two matched samples is the Wilcoxon signed-rank test. Our goal is to employ this test to assess whether my climbing performance in 2013 was worse (larger ascent times) than in 2014. Therefore, we are going to perform a one-sided test. In R, this test can be performed by means of the function wilcox.test:
T14 <- c(29*60+14, 29*60+02, 23*60+01, 45*60+56, 30*60+43, 14*60+53)
T13 <- c(29*60+16, 29*60+03, 23*60+28, 48*60+04, 28*60+51, 15*60+09)
wilcox.test(T13, T14, alternative="greater", paired=TRUE)
while, in Matlab, this test can be performed by means of the function signrank:
T14=[29*60+14, 29*60+02, 23*60+01, 45*60+56, 30*60+43, 14*60+53];
T13=[29*60+16, 29*60+03, 23*60+28, 48*60+04, 28*60+51, 15*60+09];
[p,h]=signrank(T13,T14,'tail','right')
Note that the choice of seconds as the unit is not essential: the test only uses the signs and ranks of the paired differences, so any common time unit gives the same result.
As a result (in both cases) we obtain the p-value p = 0.156, and since p > 0.05 (0.05 is the default significance level), the null hypothesis cannot be rejected (h = 0).
Therefore, the difference is declared not significant.
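As a cross-check of the p-value above, here is a minimal sketch in plain Python (not part of the IDP package) that computes the exact one-sided signed-rank p-value by enumerating all sign patterns. The helper names `to_seconds` and `signed_rank_exact_p` are made up for this illustration, and the sketch assumes there are no zero differences and no ties among the absolute differences, which holds for this data set.

```python
from itertools import combinations


def to_seconds(minutes, seconds):
    """Convert a time given as minutes and seconds to seconds."""
    return 60 * minutes + seconds


# Ascent times from the table above, climbs A to F.
t13 = [to_seconds(29, 16), to_seconds(29, 3), to_seconds(23, 28),
       to_seconds(48, 4), to_seconds(28, 51), to_seconds(15, 9)]
t14 = [to_seconds(29, 14), to_seconds(29, 2), to_seconds(23, 1),
       to_seconds(45, 56), to_seconds(30, 43), to_seconds(14, 53)]


def signed_rank_exact_p(x, y):
    """Exact one-sided p-value for the alternative 'x tends to be larger than y'.

    Assumes no zero differences and no tied absolute differences.
    """
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    # Rank the absolute differences from 1 (smallest) to n (largest).
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    # W+ is the sum of the ranks attached to positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Under the null hypothesis every rank is positive with probability 1/2,
    # so each of the 2^n subsets of ranks is equally likely.
    count = 0
    for k in range(n + 1):
        for subset in combinations(range(1, n + 1), k):
            if sum(subset) >= w_plus:
                count += 1
    return w_plus, count / 2 ** n


w_plus, p_value = signed_rank_exact_p(t13, t14)
print(w_plus, round(p_value, 3))  # prints: 16 0.156
```

Only climb E has a negative difference, and it receives rank 5, so W+ = 21 - 5 = 16; ten of the 64 sign patterns give a statistic at least that large, hence p = 10/64.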
Now let us perform the same test using the Imprecise Dirichlet Process (IDP).
The details of the test can be found in this [paper] and to run the test you need to download and run the code [here]. In R, this test can be performed as follows:
isignrank.test(T14,T13,"greater")
while in Matlab:
[prob,h]=isignrank(T13,T14,'tail','right','alpha',0.05);
The result is shown (in both cases) in the figure below. The main differences with respect to the classical Wilcoxon signed-rank test are: (i) the test is Bayesian and, thus, it returns the posterior probability of the hypothesis "T13 is larger than T14"; (ii) the test is imprecise, which means that it actually returns the lower and upper probabilities of the hypothesis "T13 is larger than T14".
The lower and upper probabilities are obtained by considering the set of all possible probability base measures for the Dirichlet process. This means that the test is also robust to the choice of the probability base measure.
Looking at the figure, it can be observed that, since the upper (and, thus, the lower) probability is less than 0.95, we cannot say that "T13 is larger than T14" with posterior probability equal to 1-alpha=0.95.
The IDP based test and the Wilcoxon signed-rank test agree in this case.
However, the IDP gives us additional information: the posterior probabilities. In fact, since the lower probability is about 0.75, we can actually declare that "T13 is larger than T14" with posterior probability 1-alpha=0.75.
Therefore, we can say that my performance in 2014 was better than in 2013, with reliability (posterior probability) of 75%.
Note that the decision at level 1 − alpha = 0.95 is prior independent, i.e., it does not change with the choice of the prior base measure of the Dirichlet process, because both the lower and the upper probability lie on the same side of 0.95. If instead the lower probability were below 1 − alpha and the upper probability above it, the result would be prior dependent: the evidence from the observations would not be enough to declare that the probability of the hypothesis being true is either larger or smaller than the desired value 1 − alpha, and more measurements would be necessary to take a decision.
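The way the lower and upper probabilities are read against the threshold 1 − alpha can be summarized in a few lines. This is only a sketch of the decision logic described in this post, not the code of the IDP package; the function name `idp_decision`, the returned labels, and the upper probabilities used in the examples are assumptions (the post only states that the upper probability is below 0.95 and that the lower one is about 0.75).

```python
def idp_decision(lower, upper, alpha=0.05):
    """Classify an imprecise test result against the threshold 1 - alpha.

    lower, upper: lower and upper posterior probabilities of the hypothesis.
    Returns one of three illustrative labels (names chosen for this sketch).
    """
    if not 0.0 <= lower <= upper <= 1.0:
        raise ValueError("expected 0 <= lower <= upper <= 1")
    threshold = 1 - alpha
    if lower > threshold:
        # Every prior base measure gives a probability above the threshold.
        return "supported"
    if upper < threshold:
        # Every prior base measure gives a probability below the threshold.
        return "not supported"
    # The threshold falls between the bounds: the answer depends on the prior.
    return "indeterminate"


# Lower probability of about 0.75 as in the post; 0.90 is a hypothetical upper value.
print(idp_decision(0.75, 0.90, alpha=0.05))  # prints: not supported
print(idp_decision(0.75, 0.90, alpha=0.30))  # prints: supported
# A purely hypothetical prior-dependent case.
print(idp_decision(0.90, 0.97, alpha=0.05))  # prints: indeterminate
```

The first two labels are prior independent, since the whole interval lies on one side of the threshold; only the third calls for more measurements.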
Friday, May 9, 2014
Battle for White House 2012
Battle for White House 2012 - 2 weeks before election
The statistical analysis has been performed by using the most recent (2 weeks before election) polling data from realclearpolitics. The dataset can be downloaded here, while the Matlab code can be downloaded here. The minimum sample size is around 500 people. The analysis employs an imprecise probability robust Bayesian approach in which robustness is evaluated with respect to the following swing scenarios:
- Best for Romney: in each state the preference of c=2 people among the n polled is changed from Obama to Romney.
- Best for Obama: in each state the preference of c=2 people among the n polled is changed from Romney to Obama.
From the histogram, it can be noticed that there is high uncertainty. Because of this uncertainty, the contribution of the prior to the final result is crucial. In fact, it is enough that in each state the votes of two electors among the n sampled change from Obama to Romney (they represent less than 0.4% of the total number of polled voters) for Romney's chance of winning to reach 50%.
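To make the two swing scenarios concrete, here is a minimal Python sketch of how the poll counts of a single state are perturbed. It is not the Matlab code linked above; the function name `swing_scenarios` and the poll counts 255 and 245 are hypothetical, chosen only so that n matches the minimum sample size of about 500 quoted in the post.

```python
def swing_scenarios(obama, romney, c=2):
    """Return the poll counts under the two extreme swing scenarios.

    obama, romney: number of polled people preferring each candidate.
    c: number of preferences moved from one candidate to the other.
    """
    if c > min(obama, romney):
        raise ValueError("cannot move more preferences than a candidate has")
    best_for_romney = (obama - c, romney + c)  # c people switch Obama -> Romney
    best_for_obama = (obama + c, romney - c)   # c people switch Romney -> Obama
    return best_for_romney, best_for_obama


# Hypothetical state poll with n = 500 respondents.
obama, romney = 255, 245
n = obama + romney
c = 2

best_for_romney, best_for_obama = swing_scenarios(obama, romney, c)
share_moved = c / n

print(best_for_romney)       # prints: (253, 247)
print(best_for_obama)        # prints: (257, 243)
print(f"{share_moved:.1%}")  # prints: 0.4%
```

With larger samples the share of moved preferences shrinks further, so the perturbation applied in each state is at most about 0.4% of its respondents.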
The electoral maps for the "Best for Romney" and "Best for Obama" cases, respectively, are reported hereafter.
The maps show that, based on the polled data, the critical States are Ohio and Iowa. If Romney wins in these two states, his chance of winning goes from 20% to 50%.