Contingency tables are used to examine the relationship between subjects' scores on two qualitative or categorical variables. They show the actual and expected distribution of cases in a cross-tabulated (pivoted) format for the two variables. Here is an example of the actual and expected distribution of cases over the Gender column (on rows) and the MaritalStatus column (on columns) of the dbo.vTargetMail view from the AdventureWorksDW2012 demo database:
Actual

| Gender | Married | Single | Total |
|--------|---------|--------|-------|
| F      | 4745    | 4388   | 9133  |
| M      | 5266    | 4085   | 9351  |
| Total  | 10011   | 8473   | 18484 |

Expected

| Gender | Married | Single | Total |
|--------|---------|--------|-------|
| F      | 4946    | 4187   | 9133  |
| M      | 5065    | 4286   | 9351  |
| Total  | 10011   | 8473   | 18484 |
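The expected counts follow directly from the independence assumption: each cell's expected frequency is its row total times its column total, divided by the grand total. A minimal Python sketch (purely illustrative, not part of the SQL solution) that reproduces the Expected table from the Actual one:

```python
# Observed counts from the Actual table above: (Gender, MaritalStatus) -> count.
observed = {
    ("F", "Married"): 4745, ("F", "Single"): 4388,
    ("M", "Married"): 5266, ("M", "Single"): 4085,
}

# Marginal (row and column) totals.
row_totals, col_totals = {}, {}
for (row, col), count in observed.items():
    row_totals[row] = row_totals.get(row, 0) + count
    col_totals[col] = col_totals.get(col, 0) + count
grand_total = sum(observed.values())  # 18484

# Expected frequency under independence: row total * column total / grand total.
expected = {
    (row, col): round(row_totals[row] * col_totals[col] / grand_total)
    for (row, col) in observed
}

print(expected)  # matches the Expected table: F/Married = 4946, M/Single = 4286, ...
```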
If the columns are not contingent on the rows, then the row and column frequencies are independent. The test of whether the columns are contingent on the rows is called the chi-square test of independence. The null hypothesis is that there is no relationship between the row and column frequencies; therefore, there should be no difference between the observed and expected frequencies. Contingency tables are the basis for the chi-square test. However, even without the test, you might notice a relationship between two discrete variables just by looking at the contingency table.
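To make the test concrete, the chi-square statistic sums (observed − expected)² / expected over all cells; for a 2×2 table (one degree of freedom) the standard 5% critical value is 3.841. A short illustrative Python sketch for the Gender × MaritalStatus table above:

```python
# Observed counts and marginal totals from the tables above.
observed = [4745, 4388, 5266, 4085]  # F/Married, F/Single, M/Married, M/Single
row = {"F": 9133, "M": 9351}
col = {"Married": 10011, "Single": 8473}
grand = 18484

# Unrounded expected counts under independence.
expected = [
    row["F"] * col["Married"] / grand, row["F"] * col["Single"] / grand,
    row["M"] * col["Married"] / grand, row["M"] * col["Single"] / grand,
]

# Chi-square statistic: sum of (O - E)^2 / E over all cells.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(round(chi_square, 2))  # well above the 3.841 critical value
```

With a statistic this far above the critical value, the null hypothesis of independence between Gender and MaritalStatus would be rejected.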
Contingency tables are very simple to interpret, and therefore they are a very popular tool in statistics and data mining. Wouldn't it be nice to be able to create a contingency table like the one in the following screenshot, which checks the association between the NumberCarsOwned and BikeBuyer columns from dbo.vTargetMail, using SQL Server tools?
Of course, I would not be writing this blog post if I did not have a solution. The screenshot above is actually a SQL Server 2012 Reporting Services (SSRS) report. The whole story starts with the following query.
WITH
ObservedCombination_CTE AS
(
-- Observed frequency for each combination of the two analyzed columns
SELECT NumberCarsOwned AS OnRows,
       BikeBuyer AS OnCols,
       COUNT(*) AS ObservedCombination
FROM dbo.vTargetMail
GROUP BY NumberCarsOwned, BikeBuyer
)
SELECT OnRows, OnCols, ObservedCombination
-- Row, column, and grand totals via window aggregates
,SUM(ObservedCombination) OVER (PARTITION BY OnRows) AS ObservedOnRows
,SUM(ObservedCombination) OVER (PARTITION BY OnCols) AS ObservedOnCols
,SUM(ObservedCombination) OVER () AS ObservedTotal
-- Expected frequency under independence: row total * column total / grand total
,CAST(ROUND(SUM(1.0 * ObservedCombination) OVER (PARTITION BY OnRows)
          * SUM(1.0 * ObservedCombination) OVER (PARTITION BY OnCols)
          / SUM(1.0 * ObservedCombination) OVER (), 0) AS INT) AS ExpectedCombination
-- Cell frequency as a percentage of the grand total
,CAST(ROUND(100.0 * ObservedCombination /
            SUM(ObservedCombination) OVER (), 0) AS INT) AS PctTotal
-- Text histogram: one asterisk per percentage point
,REPLICATE('*',
           CAST(ROUND(100.0 * ObservedCombination /
                      SUM(ObservedCombination) OVER (), 0) AS INT)) AS Histogram
FROM ObservedCombination_CTE
ORDER BY OnRows, OnCols;
Note that the query uses the window functions that are new in SQL Server 2012. The result of this query is not pivoted yet; however, it contains all of the data needed.
| OnRows | OnCols | ObservedCombination | ObservedOnRows | ObservedOnCols | ObservedTotal | ExpectedCombination | PctTotal | Histogram |
|--------|--------|---------------------|----------------|----------------|---------------|---------------------|----------|-----------|
| 0 | 0 | 1551 | 4238 | 9352 | 18484 | 2144 | 8  | ******** |
| 0 | 1 | 2687 | 4238 | 9132 | 18484 | 2094 | 15 | *************** |
| 1 | 0 | 2187 | 4883 | 9352 | 18484 | 2471 | 12 | ************ |
| 1 | 1 | 2696 | 4883 | 9132 | 18484 | 2412 | 15 | *************** |
| 2 | 0 | 3868 | 6457 | 9352 | 18484 | 3267 | 21 | ********************* |
| 2 | 1 | 2589 | 6457 | 9132 | 18484 | 3190 | 14 | ************** |
| 3 | 0 | 951  | 1645 | 9352 | 18484 | 832  | 5  | ***** |
| 3 | 1 | 694  | 1645 | 9132 | 18484 | 813  | 4  | **** |
| 4 | 0 | 795  | 1261 | 9352 | 18484 | 638  | 4  | **** |
| 4 | 1 | 466  | 1261 | 9132 | 18484 | 623  | 3  | *** |
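What the window aggregates compute can be sketched in plain Python, starting from the grouped counts in the ObservedCombination column (values taken from the result above):

```python
# Grouped counts, as produced by the CTE: (OnRows, OnCols) -> ObservedCombination.
observed = {
    (0, 0): 1551, (0, 1): 2687, (1, 0): 2187, (1, 1): 2696,
    (2, 0): 3868, (2, 1): 2589, (3, 0): 951, (3, 1): 694,
    (4, 0): 795, (4, 1): 466,
}

# SUM(...) OVER (PARTITION BY OnRows / OnCols) and SUM(...) OVER ().
on_rows = {r: sum(n for (rr, _), n in observed.items() if rr == r)
           for r, _ in observed}
on_cols = {c: sum(n for (_, cc), n in observed.items() if cc == c)
           for _, c in observed}
total = sum(observed.values())

for (r, c), obs in sorted(observed.items()):
    expected = round(on_rows[r] * on_cols[c] / total)  # ExpectedCombination
    pct = round(100 * obs / total)                     # PctTotal
    print(r, c, obs, on_rows[r], on_cols[c], total, expected, pct, "*" * pct)
```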
The next part is to create the report. Use the Matrix (OK, a Tablix with row and column groups) control. The trick is what to put in the cells of the matrix. The following screenshot shows the expression used for the most detailed cell of the matrix.
Similar expressions are used for the other cells with data. The expression for the column total, i.e. for the cell at the intersection of an [OnCols] column and the Total row, is
=Sum(Fields!ObservedCombination.Value) & Chr(10) & Chr(13) &
Sum(Fields!ExpectedCombination.Value) & Chr(10) & Chr(13) &
Round(Sum(100 * Fields!ObservedCombination.Value) /
Sum(Fields!ObservedOnRows.Value), 2) & "%" & Chr(10) & Chr(13) &
Round(Sum(100 * Fields!ObservedCombination.Value) /
Sum(Fields!ObservedCombination.Value), 2) & "%" & Chr(10) & Chr(13) &
Round(Sum(100 * Fields!ObservedCombination.Value) /
Sum(Fields!ObservedOnRows.Value), 2) & "%" & Chr(10) & Chr(13)
The expression for the row totals is
=Sum(Fields!ObservedCombination.Value) & Chr(10) & Chr(13) &
Sum(Fields!ExpectedCombination.Value) & Chr(10) & Chr(13) &
Round(Sum(100 * Fields!ObservedCombination.Value) /
Sum(Fields!ObservedCombination.Value), 2) & "%" & Chr(10) & Chr(13) &
Round(Sum(100 * Fields!ObservedCombination.Value) /
Sum(Fields!ObservedOnCols.Value), 2) & "%" & Chr(10) & Chr(13) &
Round(Sum(100 * Fields!ObservedCombination.Value) /
Sum(Fields!ObservedOnCols.Value), 2) & "%" & Chr(10) & Chr(13)
The expression for the grand total cell is
=Sum(Fields!ObservedCombination.Value) & Chr(10) & Chr(13) &
Sum(Fields!ExpectedCombination.Value) & Chr(10) & Chr(13) &
Round(Sum(100 * Fields!ObservedCombination.Value) /
Sum(Fields!ObservedCombination.Value), 2) & "%" & Chr(10) & Chr(13) &
Round(Sum(100 * Fields!ObservedCombination.Value) /
Sum(Fields!ObservedCombination.Value), 2) & "%" & Chr(10) & Chr(13) &
Round(Sum(100 * Fields!ObservedCombination.Value) /
Sum(Fields!ObservedCombination.Value), 2) & "%" & Chr(10) & Chr(13)
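The total-cell expressions differ only in their denominators because the SSRS Sum() aggregates run over the scope of the enclosing group. A quick illustrative Python sketch for the OnRows = 0 row-total cell, using the result-set numbers above, shows why the percentages come out the way they do:

```python
# Row-total cell for OnRows = 0: in SSRS the Sum() aggregates run over the
# row group, i.e. over the detail cells (0, 0) and (0, 1) of the result set.
observed_combination = [1551, 2687]  # ObservedCombination for OnCols = 0, 1
observed_on_cols = [9352, 9132]      # ObservedOnCols for OnCols = 0, 1

row_total = sum(observed_combination)  # 4238
grand = sum(observed_on_cols)          # 18484: the two column totals together

# Row percent: the row total is 100% of itself.
row_pct = round(100 * row_total / row_total, 2)
# Column and total percent coincide here: the row total's share of the grand total.
col_pct = round(100 * row_total / grand, 2)

print(row_pct, col_pct)
```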
The expression for the row header is
=Fields!OnRows.Value & " Observed Frequency" & Chr(10) & Chr(13) &
" Expected Frequency" & Chr(10) & Chr(13) &
" Row Percent" & Chr(10) & Chr(13) &
" Column Percent" & Chr(10) & Chr(13) &
" Total Percent" & Chr(10) & Chr(13) &
Chr(10) & Chr(13) &
" Histogram"
The expression for the value of the linear pointer of the gauge is
= Round(Sum(100 * Fields!ObservedCombination.Value) /
Sum(Fields!ObservedTotal.Value), 2)
And the expression for the max value for the gauge scale is
=Max(Fields!PctTotal.Value, "ContTable")
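As a sanity check on the gauge expressions: the pointer value for a cell is just its share of the grand total, and the scale maximum is the largest PctTotal in the dataset. An illustrative Python check for the (OnRows = 2, OnCols = 0) cell, using the numbers from the result set above:

```python
# Gauge pointer: Round(100 * ObservedCombination / ObservedTotal, 2)
pointer = round(100 * 3868 / 18484, 2)

# Gauge scale maximum: Max(PctTotal) over the whole dataset.
scale_max = max([8, 15, 12, 15, 21, 14, 5, 4, 4, 3])

print(pointer, scale_max)  # the pointer lands just under the scale maximum
```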
All together, this looks like a lot of work; however, it is not that bad. The source query is not too complicated, and you can easily change the columns and the table used for the analysis, as their names appear only once, in the common table expression. The report does not refer to the source names anywhere, so it is simple to copy it and use it for a new report with a contingency table for two different columns. You just need to change the report header and the text in the top-left cell of the matrix, which shows the columns used for the report.