
What the gips is based on

The package is based on the article [1], where the mathematics behind the package is laid out precisely and all the theorems are proven.

In this vignette, we give a gentle introduction: we point out the most important results of that work from the user’s point of view and show examples of those results in the gips package.

As mentioned in the abstract, the outline of the paper is to “derive the distribution of the maximum likelihood estimate of the covariance parameter $\Sigma$ (…)” and then to “perform Bayesian model selection in the class of complete Gaussian models invariant by the action of a subgroup of the symmetric group (…)”. Those ideas are implemented in the gips package.

Alternative reference

The theory derived in [1] covers general group invariance, while in this package, we only consider invariance under cyclic groups. This allows for massive simplifications. The simplified version of the theory is set out accessibly in [2].

Basic definitions

Let $V=\{1,\ldots,p\}$ be a finite index set, and for every $i\in \{1, \dots, n\}$, let $Z^{(i)}=(Z_1^{(i)},\ldots, Z_p^{(i)})^\top$ be a multivariate random variable following a centered Gaussian model $\mathrm{N}_p(0,\Sigma)$, where $Z^{(1)},\ldots, Z^{(n)}$ is an i.i.d. (independent and identically distributed) sample from this distribution. Denote the whole sample by $Z = (Z^{(1)},\ldots, Z^{(n)})$.

Let $\mathfrak{S}_p$ denote the symmetric group on $V$, that is, the set of all permutations of $\{1,\ldots,p\}$ with function composition as the group operation. Let $\Gamma$ be an arbitrary subgroup of $\mathfrak{S}_p$. The model $\mathrm{N}_p(0,\Sigma)$ is said to be invariant under the action of $\Gamma$ if for all $g\in \Gamma$, $g\cdot\Sigma\cdot g^\top=\Sigma$ (here, we identify a permutation $g$ with its permutation matrix).

For a subgroup $\Gamma \subset \mathfrak{S}_p$, we define the colored space, i.e., the space of symmetric matrices invariant under $\Gamma$,

$$\mathcal{Z}_{\Gamma} := \{S \in \mathrm{Sym}(p;\mathbb{R})\colon S_{i,j} = S_{\sigma(i),\sigma(j)} \text{ for all }\sigma \in \Gamma \text{ and all } i,j\in V\},$$

and the colored cone of positive definite matrices valued in $\mathcal{Z}_{\Gamma}$,

$$\mathcal{P}_{\Gamma} := \mathcal{Z}_{\Gamma} \cap \mathrm{Sym}^+(p;\mathbb{R}).$$

Block Decomposition - [1], Theorem 1

The main theoretical result (Theorem 1 in [1]) states that, given a permutation subgroup $\Gamma$, there exists an orthogonal matrix $U_\Gamma$ such that every symmetric matrix $S\in\mathcal{Z}_\Gamma$ can be transformed into block-diagonal form.

The exact form of the blocks depends on so-called structure constants $(k_i,d_i,r_i)_{i=1}^L$. It is worth pointing out that $k_i = d_i$ for a cyclic group $\Gamma = \left<\sigma\right>$ and that gips searches within cyclic subgroups only.
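For instance, one can inspect these constants for a given permutation; a minimal sketch, assuming the exported helper get_structure_constants() (used here with the gips_perm() constructor that appears throughout this vignette):

library(gips)

# Structure constants (k_i, d_i, r_i)_{i=1}^L of the cyclic group <(1,2,3,4,5,6)>:
get_structure_constants(gips_perm("(1,2,3,4,5,6)", 6))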

Examples

p <- 6
S <- matrix(c(
  1.1, 0.9, 0.8, 0.7, 0.8, 0.9,
  0.9, 1.1, 0.9, 0.8, 0.7, 0.8,
  0.8, 0.9, 1.1, 0.9, 0.8, 0.7,
  0.7, 0.8, 0.9, 1.1, 0.9, 0.8,
  0.8, 0.7, 0.8, 0.9, 1.1, 0.9,
  0.9, 0.8, 0.7, 0.8, 0.9, 1.1
), nrow = p)

S is a symmetric matrix invariant under the group $\Gamma = \left<(1,2,3,4,5,6)\right>$.

g_perm <- gips_perm("(1,2,3,4,5,6)", p)
U_Gamma <- prepare_orthogonal_matrix(g_perm)

block_decomposition <- t(U_Gamma) %*% S %*% U_Gamma
round(block_decomposition, 5)
#>      [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,]  5.2  0.0  0.0  0.0  0.0  0.0
#> [2,]  0.0  0.5  0.0  0.0  0.0  0.0
#> [3,]  0.0  0.0  0.5  0.0  0.0  0.0
#> [4,]  0.0  0.0  0.0  0.1  0.0  0.0
#> [5,]  0.0  0.0  0.0  0.0  0.1  0.0
#> [6,]  0.0  0.0  0.0  0.0  0.0  0.2

The transformed matrix is in the block-diagonal form of [1], Theorem 1. Entries outside the blocks are equal to 0. Notice that, for example, the entry [2,3] belongs to a block even though it equals 0: it is part of the block-diagonal structure and merely happens to take the value 0.

The result was rounded to 5 decimal places to hide the inaccuracies of floating-point arithmetic.
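We can also verify the invariance claim for S directly from the definition $g\cdot\Sigma\cdot g^\top=\Sigma$; a hedged base-R check (the permutation matrix is built by hand, without gips):

sigma <- c(2, 3, 4, 5, 6, 1)   # the cycle (1,2,3,4,5,6) as a map i -> sigma(i)
P <- diag(p)[sigma, ]          # the permutation matrix of sigma
all.equal(P %*% S %*% t(P), S) # invariance; should be TRUE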

Let’s look at another example:

p <- 6
S <- matrix(c(
  1.2, 0.9, 0.9, 0.4, 0.2, 0.1,
  0.9, 1.2, 0.9, 0.1, 0.4, 0.2,
  0.9, 0.9, 1.2, 0.2, 0.1, 0.4,
  0.4, 0.1, 0.2, 1.2, 0.9, 0.9,
  0.2, 0.4, 0.1, 0.9, 1.2, 0.9,
  0.1, 0.2, 0.4, 0.9, 0.9, 1.2
), nrow = p)

Now, S is a symmetric matrix invariant under the group $\Gamma = \left<(1,2,3)(4,5,6)\right>$.

g_perm <- gips_perm("(1,2,3)(4,5,6)", p)
U_Gamma <- prepare_orthogonal_matrix(g_perm)

block_decomposition <- t(U_Gamma) %*% S %*% U_Gamma
round(block_decomposition, 5)
#>      [,1] [,2]   [,3]    [,4]    [,5]   [,6]
#> [1,]  3.0  0.7 0.0000  0.0000  0.0000 0.0000
#> [2,]  0.7  3.0 0.0000  0.0000  0.0000 0.0000
#> [3,]  0.0  0.0 0.3000  0.0000  0.2500 0.0866
#> [4,]  0.0  0.0 0.0000  0.3000 -0.0866 0.2500
#> [5,]  0.0  0.0 0.2500 -0.0866  0.3000 0.0000
#> [6,]  0.0  0.0 0.0866  0.2500  0.0000 0.3000

Again, this result is in accordance with [1], Theorem 1. Notice the zeros in block_decomposition: block_decomposition[i, j] = 0 for all $i\in\{1,2\}$ and $j\in\{3,4,5,6\}$.

Project Matrix - [1], Eq. (6)

One can also take any symmetric square matrix S and find its orthogonal projection onto $\mathcal{Z}_{\Gamma}$, the space of matrices invariant under the given permutation:

$$\pi_\Gamma(S) := \frac{1}{|\Gamma|}\sum_{\sigma\in\Gamma}\sigma\cdot S\cdot\sigma^\top$$

The projected matrix is an element of the colored space, $\pi_\Gamma(S)\in\mathcal{Z}_{\Gamma}$, which means:

$$\pi_\Gamma(S)[i,j] = \pi_\Gamma(S)[\sigma(i),\sigma(j)] \text{ for all } i,j\in \{1, \dots, p\} \text{ and all }\sigma\in\Gamma.$$

In particular, it has some identical entries.

Trivial case

Note that for $\Gamma = \{\text{id}\} = \{(1)(2)\cdots(p)\}$ we have $\pi_{\{\text{id}\}}(S) = S$.

In this case, no additional assumptions are made; thus, the standard covariance estimator is the best we can do.
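A quick sanity check of the trivial case; a hedged sketch, assuming the identity permutation can be written as "()" (the notation gips itself uses when printing):

S_any <- matrix(rnorm(36), 6, 6)
S_any <- S_any + t(S_any)  # an arbitrary symmetric matrix
all.equal(project_matrix(S_any, perm = "()"), S_any)  # should be TRUE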

Notation

We will abbreviate the notation: when $\Gamma = \left< c \right>$ is the cyclic group generated by a permutation $c$, we will write $\pi_{c}(S) := \pi_{\Gamma}(S) = \pi_{\left< c \right>}(S)$.

Example

Let S be any symmetric square matrix:

round(S, 2)
#>        [,1]   [,2]   [,3]   [,4]   [,5]   [,6]
#> [1,] 137.51 -16.21  10.03   0.16 -24.35 -17.42
#> [2,] -16.21  34.08 -10.62  15.93  12.23  -2.74
#> [3,]  10.03 -10.62  35.47   3.10  -3.81  -9.60
#> [4,]   0.16  15.93   3.10  26.74   7.71 -13.51
#> [5,] -24.35  12.23  -3.81   7.71  26.00  -7.24
#> [6,] -17.42  -2.74  -9.60 -13.51  -7.24  16.77

One can project this matrix, for example, onto $\Gamma = \left< \text{perm} \right> = \left<(1,2)(3,4,5,6)\right>$:

S_projected <- project_matrix(S, perm = "(1,2)(3,4,5,6)")
round(S_projected, 2)
#>        [,1]   [,2]  [,3]  [,4]  [,5]  [,6]
#> [1,]  85.80 -16.21 -0.28 -3.91 -0.28 -3.91
#> [2,] -16.21  85.80 -3.91 -0.28 -3.91 -0.28
#> [3,]  -0.28  -3.91 26.25 -1.51 -8.66 -1.51
#> [4,]  -3.91  -0.28 -1.51 26.25 -1.51 -8.66
#> [5,]  -0.28  -3.91 -8.66 -1.51 26.25 -1.51
#> [6,]  -3.91  -0.28 -1.51 -8.66 -1.51 26.25

Notice that the S_projected matrix has identical entries, in accordance with the equation from the beginning of this section. For example, S_projected[1,1] = S_projected[2,2].
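As a cross-check, one can reproduce project_matrix() directly from the definition of $\pi_\Gamma$ by averaging over the four elements of $\left<(1,2)(3,4,5,6)\right>$; a minimal base-R sketch, assuming S and S_projected from this example are still in scope:

sigma <- c(2, 1, 4, 5, 6, 3)  # (1,2)(3,4,5,6) as a map i -> sigma(i)
P_sigma <- diag(6)[sigma, ]   # the permutation matrix of sigma
P <- diag(6)
projection <- matrix(0, 6, 6)
for (k in 1:4) {              # the group <sigma> has lcm(2, 4) = 4 elements
  P <- P %*% P_sigma          # runs through sigma^1, ..., sigma^4 = id
  projection <- projection + P %*% S %*% t(P)
}
projection <- projection / 4
all.equal(projection, S_projected)  # should be TRUE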

$C_\sigma$ and n0

It is a well-known fact that, without additional assumptions, the Maximum Likelihood Estimator (MLE) of the covariance matrix in the Gaussian model exists if and only if $n \ge p$. However, under the additional assumption that the covariance matrix is invariant under a permutation $\sigma$, the sample size $n$ required for the MLE to exist can be lower than $p$. It is equal to the number of cycles of $\sigma$, denoted hereafter by $C_\sigma$.

For example, if the permutation $\sigma = (1,2,3,4,5,6)$ is discovered by the find_MAP() function, then it consists of a single cycle, $C_\sigma = 1$, and a single observation is enough to estimate the covariance matrix with project_matrix(). If the permutation $\sigma = (1,2)(3,4,5,6)$ is discovered, then $C_\sigma = 2$, and 2 observations are enough.
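The following hypothetical illustration (not one of the original examples) shows this with a single observation: the projected estimate is positive definite almost surely, even though the raw matrix has rank 1:

set.seed(42)              # an arbitrary seed, for reproducibility
z <- rnorm(6)             # a single observation, n = 1 < p = 6
S_one <- tcrossprod(z)    # the rank-1 matrix z %*% t(z)
S_one_projected <- project_matrix(S_one, perm = "(1,2,3,4,5,6)")
min(eigen(S_one_projected)$values) > 0  # TRUE almost surely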

To get this $C_\sigma$ number in gips, one can call summary() on the appropriate gips object:

g1 <- gips(S, n, perm = "(1,2,3,4,5,6)", was_mean_estimated = FALSE)
summary(g1)$n0
#> [1] 1
g2 <- gips(S, n, perm = "(1,2)(3,4,5,6)", was_mean_estimated = FALSE)
summary(g2)$n0
#> [1] 2

This quantity is called n0 rather than $C_\sigma$ because it is increased by 1 when the mean was estimated:

S <- cov(Z)
g1 <- gips(S, n, perm = "(1,2,3,4,5,6)", was_mean_estimated = TRUE)
summary(g1)$n0
#> [1] 2
g2 <- gips(S, n, perm = "(1,2)(3,4,5,6)", was_mean_estimated = TRUE)
summary(g2)$n0
#> [1] 3

Bayesian model selection

When one has the data matrix Z, one would like to know whether it has a hidden structure of dependencies between features. Luckily, the paper demonstrates a way to find it.

General workflow

  1. Choose the prior distribution on $\Gamma$ and $\Sigma$.
  2. Calculate the posterior distribution (up to a normalizing constant) with formula (30) of [1].
  3. Use the Metropolis-Hastings algorithm to find the permutation with the biggest value of the posterior probability $\mathbb{P}(\Gamma \mid Z)$; see the sketch after this list.
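A minimal sketch of this workflow in gips, assuming a covariance estimate S and number_of_observations are already in scope (the argument names max_iter and optimizer = "Metropolis_Hastings" are those documented in ?find_MAP):

g <- gips(S, number_of_observations, was_mean_estimated = FALSE)  # step 1: default priors
g_map <- find_MAP(g,
  max_iter = 1000,                   # steps 2-3: Metropolis-Hastings on the
  optimizer = "Metropolis_Hastings"  # unnormalized posterior from formula (30)
)
g_map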

Details on the prior distribution

We consider the following prior distribution on $\Gamma$ and $K=\Sigma^{-1}$:

  1. $\Gamma$ is uniformly distributed on the set of all cyclic subgroups of $\mathfrak{S}_p$.
  2. Given $\Gamma$, $K$ follows the Diaconis-Ylvisaker conjugate prior distribution with parameters $\delta$ (a real number, $\delta > 1$) and $D$ (a symmetric, positive definite matrix of the same size as S); see [1], Sec. 3.4.

Footnote: Actually, for $\Gamma = \{\text{id}\}$, any $\delta > 0$ is theoretically valid. In gips, we want the prior to be defined for all cyclic groups $\Gamma$, so we restrict to $\delta > 1$. Refer to [1].

gips technical details

In gips, $\delta$ is named delta, and $D$ is named D_matrix. By default, they are set to 3 and diag(d, p), respectively, where d = mean(diag(S)). However, it is worth running the procedure for several D_matrix parameters of the form d * diag(p) with a positive constant d. A small d (compared to the data) favors small structures; a large d will “forget” the data.
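A hedged sketch of such an experiment, assuming S, p, and number_of_observations are in scope (delta and D_matrix are the documented argument names of gips()):

d <- mean(diag(S))  # the scale used by the default D_matrix
g_small_d <- gips(S, number_of_observations,
  delta = 3, D_matrix = 0.1 * d * diag(p)  # small d: favors small structures
)
g_large_d <- gips(S, number_of_observations,
  delta = 3, D_matrix = 10 * d * diag(p)   # large d: "forgets" the data
)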

One can calculate the logarithm of formula (30) with the function log_posteriori_of_gips().
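For example, one can compare two candidate permutations by their log posteriori values; a sketch, again assuming "()" denotes the identity permutation:

g_id <- gips(S, number_of_observations, perm = "()", was_mean_estimated = FALSE)
g_cycle <- gips(S, number_of_observations, perm = "(1,2,3,4,5,6)", was_mean_estimated = FALSE)

# A positive difference means the 6-cycle model is a posteriori more likely:
log_posteriori_of_gips(g_cycle) - log_posteriori_of_gips(g_id)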

Interpretation

When all the assumptions are met, formula (30) assigns a number to each permutation’s cyclic group. The bigger this value, the more likely it is that the data was drawn from the corresponding model.

When one finds the permutation $c_{\text{map}}$ whose cyclic group maximizes (30),

$$c_{\text{map}} = \operatorname{arg\,max}_{c\in\mathfrak{S}_p} \mathbb{P}\left(\Gamma=\left<c\right> \mid Z^{(1)},\ldots,Z^{(n)}\right),$$

one can reasonably assume the data $Z$ was drawn from the model $\mathrm{N}_p(0,\pi_{c_{\text{map}}}(S))$,

where $S = \frac{1}{n} \sum_{i=1}^n Z^{(i)} {Z^{(i)}}^\top$.

In such a case, we call $c_{\text{map}}$ the Maximum A Posteriori (MAP) estimator.

Finding the MAP Estimator

The space of all permutations is enormous for bigger $p$ (in our experiments, $p\ge 10$ is already too big). In such a big space, estimating the MAP is more reasonable than calculating it precisely.

The Metropolis-Hastings algorithm suggested by the authors of [1] is a natural way to do it. For a discussion of it and of the other options available in gips, see vignette("Optimizers", package="gips") or its pkgdown page.

Example

Let’s say we have this data, Z. It has dimension $p=6$ and only 4 observations. Let’s assume Z was drawn from the normal distribution with mean $(0,0,0,0,0,0)$. We want to estimate the covariance matrix:

dim(Z)
#> [1] 4 6
number_of_observations <- nrow(Z) # 4
p <- ncol(Z) # 6

# Calculate the covariance matrix from the data (assume the mean is 0):
S <- (t(Z) %*% Z) / number_of_observations

# Make the gips object out of data:
g <- gips(S, number_of_observations, was_mean_estimated = FALSE)

g_map <- find_MAP(g, optimizer = "brute_force")
#> ================================================================================
print(g_map)
#> The permutation (1,2,3,4,5,6):
#>  - was found after 362 posteriori calculations;
#>  - is 133.158 times more likely than the () permutation.

S_projected <- project_matrix(S, g_map)

We see the posterior probability [1, (30)] has the biggest value for the permutation $(1,2,3,4,5,6)$. It was over 100 times bigger than for the trivial permutation $\text{id} = (1)(2)\ldots(p)$. We interpret this as follows: under the assumptions (centered Gaussian data), it is over 100 times more reasonable to assume the data Z was drawn from the model $\mathrm{N}_p(0,\text{S\_projected})$ than from the model $\mathrm{N}_p(0,S)$.

Information Criterion - AIC and BIC

One may be interested in Akaike’s An Information Criterion (AIC) or Schwarz’s Bayesian Information Criterion (BIC) of the found model. Both are defined in terms of the log-likelihood:

$$\log L\left(\Sigma; Z^{(1)},\ldots,Z^{(n)}\right) = \sum_{i=1}^n \left(- \frac{p}{2}\log (2\pi) - \frac{1}{2}\log\det \Sigma - \frac12 {Z^{(i)}}^\top \Sigma^{-1} Z^{(i)}\right) =$$

$$- \frac{np}{2}\log (2\pi) - \frac{n}{2}\log\det \Sigma - \frac{n}{2}\mathrm{tr}(\Sigma^{-1} S),$$

where $S = \frac{1}{n} \sum_{i=1}^n Z^{(i)} {Z^{(i)}}^\top$.

The MLE of $\Sigma$ in a model invariant under $\Gamma=\left<c\right>$ is $\hat{\Sigma} = \pi_{c}(S)$. Further, for every $c$ we have $\mathrm{tr}(\pi_{c}(S)^{-1} \cdot S) = p$, so:

$$\log L\left(\pi_{c}(S); Z^{(1)},\ldots,Z^{(n)}\right) = - \frac{np}{2}\log (2\pi) - \frac{n}{2}\log\det \pi_{c}(S) - \frac{np}{2},$$

which can be calculated by logLik.gips().
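As a hedged cross-check of the closed form above, using g_map, S, p, and number_of_observations from the earlier example (where the mean was not estimated):

n <- number_of_observations
S_proj <- project_matrix(S, g_map)
sum(diag(solve(S_proj) %*% S))  # the trace identity: equals p, up to rounding

manual_logLik <- -n * p / 2 * log(2 * pi) - n / 2 * log(det(S_proj)) - n * p / 2
all.equal(as.numeric(logLik(g_map)), manual_logLik)  # expected TRUE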

Then AIC and BIC are defined by:

$$\mathrm{AIC} = 2 \cdot (\dim M) - 2 \log L(\pi_{c}(S)), \qquad \mathrm{BIC} = (\log n) \cdot (\dim M) - 2 \log L(\pi_{c}(S)),$$

where $\dim M$ is the dimension of the model, i.e., the number of free parameters of $\mathcal{Z}_{\left<c\right>}$.

A smaller value of the criterion for a given model indicates a better fit.

Those can be calculated by AIC.gips() and BIC.gips().

Estimated mean

When the mean was estimated, we have $S = \frac{1}{n-1} \sum_{i=1}^n (Z^{(i)} - \bar{Z}) (Z^{(i)} - \bar{Z})^\top$, where $\bar{Z} = \frac{1}{n} \sum_{i=1}^n Z^{(i)}$. Then in $\log L$ we use $n-1$ instead of $n$. The definitions of AIC and BIC stay the same.

Example

Consider an example similar to the one in the Bayesian model selection section:

Let’s say we have this data, Z. It has dimension $p=6$ and 7 observations. Let’s assume Z was drawn from the normal distribution with mean $(0,0,0,0,0,0)$. We want to estimate the covariance matrix:

dim(Z)
#> [1] 7 6
number_of_observations <- nrow(Z) # 7
p <- ncol(Z) # 6

S <- (t(Z) %*% Z) / number_of_observations

g <- gips(S, number_of_observations, was_mean_estimated = FALSE)
g_map <- find_MAP(g, optimizer = "brute_force")
#> ================================================================================
AIC(g)
#> [1] 64.19906
AIC(g_map) # this is smaller, so this is better
#> [1] 62.99751

BIC(g)
#> [1] 63.06318
BIC(g_map) # this is smaller, so this is better
#> [1] 62.78115

We consider g_map the better model in terms of both the AIC and the BIC.

References

[1] Piotr Graczyk, Hideyuki Ishi, Bartosz Kołodziejek, Hélène Massam. “Model selection in the space of Gaussian models invariant by symmetry.” The Annals of Statistics, 50(3), 1747-1774, June 2022. DOI: 10.1214/22-AOS2174.

[2] Adam Chojecki, Paweł Morgen, Bartosz Kołodziejek. “Learning permutation symmetries with gips in R.” arXiv:2307.00790.