Monday, August 29, 2016

Combining poll data from different sources using covariance intersection

In a previous post, we saw how to analyze the polls for a single state using poll data from KTNV/Rasmussen.
Here we are going to see how to combine polls from different sources.

Let us consider the Nevada polls again.

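| # | Poll | Date | Sample | MoE | Clinton (D) | Trump (R) | Johnson (L) | Spread |
|---|------|------|--------|-----|-------------|-----------|-------------|--------|
| 0 | RCP Average | 7/7 - 8/5 | -- | -- | 43 | 40.7 | 6.3 | Clinton +2.3 |
| 1 | CBS News/YouGov* | 8/2 - 8/5 | 993 LV | 4.6 | 43 | 41.0 | 4.0 | Clinton +2 |
| 2 | KTNV/Rasmussen | 7/29 - 7/31 | 750 LV | 4.0 | 41 | 40.0 | 10.0 | Clinton +1 |
| 3 | Monmouth | 7/7 - 7/10 | 408 LV | 4.9 | 45 | 41.0 | 5.0 | Clinton +4 |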


Instead of taking a simple average of the polls, as RCP (RealClearPolitics) does, we use covariance intersection. Covariance intersection is an algorithm for combining two or more data sources when the correlation between them is unknown.

Let \(\hat{a}\) denote a vector of observations (e.g., [43, 41, 16] from CBS News/YouGov, i.e., the Clinton share, the Trump share, and the remaining share) and \(\hat{b}\) another vector of observations (e.g., [41, 40, 19] from KTNV/Rasmussen). \(A\) denotes the uncertainty (covariance) of the poll \(\hat{a}\), which we assume to be equal to \(1/\text{sample size}\) (e.g., 1/993 for CBS News/YouGov), and \(B\) denotes the uncertainty of the poll \(\hat{b}\) (e.g., 1/750 for KTNV/Rasmussen). The inverses \(A^{-1}\) and \(B^{-1}\), i.e., the sample sizes, therefore measure how reliable each poll is.

Given a weight \(\omega \in [0,1]\), covariance intersection provides a formula to combine them:


$$ C^{-1}=\omega A^{-1}+(1-\omega )B^{-1}\,, $$
$$ \hat{c} =C\left(\omega A^{-1}\hat{a}+(1-\omega )B^{-1}\hat{b}\right)\,. $$
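The two-source formula can be sketched in a few lines of Python. This is only an illustration under the assumptions of this post: the covariances \(A\) and \(B\) are scalars equal to 1/sample size, the function name `covariance_intersection` is hypothetical, and the weight \(\omega = 0.5\) is an arbitrary choice.

```python
def covariance_intersection(a_hat, A, b_hat, B, omega):
    """Fuse two estimates with scalar covariances A and B (assumed here
    to be 1/sample size) using a weight omega in [0, 1]."""
    if not 0.0 <= omega <= 1.0:
        raise ValueError("omega must lie in [0, 1]")
    # Inverse covariance of the fused estimate: C^{-1} = w/A + (1 - w)/B.
    C_inv = omega / A + (1.0 - omega) / B
    C = 1.0 / C_inv
    # Fused estimate: c = C * (w * a / A + (1 - w) * b / B), component-wise.
    c_hat = [C * (omega * a / A + (1.0 - omega) * b / B)
             for a, b in zip(a_hat, b_hat)]
    return C, c_hat


# CBS News/YouGov (993 LV) and KTNV/Rasmussen (750 LV), equal weights.
C, c_hat = covariance_intersection([43, 41, 16], 1 / 993,
                                   [41, 40, 19], 1 / 750, omega=0.5)
print(1 / C, c_hat)
```

With \(\omega = 1\) the function returns the first poll unchanged, and with \(\omega = 0\) the second one, which is a quick sanity check on the formula.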

This formula can be extended to an arbitrary number of sources, with weights that sum to one. For instance, for the three polls in the previous table, using uniform weights \(\omega_1=1/3,\omega_2=1/3,\omega_3=1/3\), we get

$$ C^{-1}=\omega_1 \cdot 993+\omega_2 \cdot 750+\omega_3 \cdot 408=717\,, $$
$$ \hat{c} =C\left(\omega_1 \cdot 993\,[43,41,16]+\omega_2 \cdot 750\,[41,40,19]+\omega_3 \cdot 408\,[45,41,14]\right)\,. $$

The final result is

$$
C^{-1}=717,  ~~~\hat{c}=[42.68,   40.65,   16.67]
$$
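The numbers above can be reproduced with a short Python sketch. The function name `fuse_polls` is hypothetical, and the code assumes, as in this post, that the inverse covariance of each poll is simply its sample size.

```python
def fuse_polls(estimates, sample_sizes, weights):
    """Covariance intersection for several polls, assuming the inverse
    covariance of each poll equals its sample size."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to one")
    # C^{-1} = sum_i w_i * n_i
    C_inv = sum(w * n for w, n in zip(weights, sample_sizes))
    # c = C * sum_i w_i * n_i * x_i, computed for each component k.
    dims = len(estimates[0])
    c_hat = [
        sum(w * n * x[k] for w, n, x in zip(weights, sample_sizes, estimates)) / C_inv
        for k in range(dims)
    ]
    return C_inv, c_hat


# [Clinton, Trump, remaining share] for the three Nevada polls in the table.
polls = [[43, 41, 16], [41, 40, 19], [45, 41, 14]]
sizes = [993, 750, 408]

C_inv, c_hat = fuse_polls(polls, sizes, [1 / 3, 1 / 3, 1 / 3])
print(round(C_inv), [round(v, 2) for v in c_hat])
# prints: 717 [42.68, 40.65, 16.67]
```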

It can be observed that by using \(\omega_1=1/3,\omega_2=1/3,\omega_3=1/3\) the combined poll \(\hat{c}\) reduces to the average of the input polls weighted by the sample size. However, it is possible to choose other values of the weights, see for instance here.
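To see why uniform weights give the sample-size-weighted average, here is a short derivation. The symbols \(N\) (number of polls), \(n_i\) (sample size of poll \(i\)) and \(\hat{x}_i\) (its vector of observations) are notation introduced only for this sketch, with the inverse covariance of poll \(i\) assumed equal to \(n_i\) as above. Setting \(\omega_i = 1/N\) gives

$$ C^{-1}=\sum_{i=1}^{N}\frac{1}{N}\,n_i\,, \qquad \hat{c}=C\sum_{i=1}^{N}\frac{1}{N}\,n_i\,\hat{x}_i=\frac{\sum_{i=1}^{N} n_i\,\hat{x}_i}{\sum_{i=1}^{N} n_i}\,. $$

The factor \(1/N\) cancels between \(C\) and the sum, so only the sample sizes remain as weights. For the Clinton share in the example, this is \((993\cdot 43+750\cdot 41+408\cdot 45)/2151\approx 42.68\).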
