# Blog: On the drawbacks of FedAvg and the benefits of Posterior Averaging

### A visual study of FedAvg and the Posterior in Federated Learning

| Author | Date | Course | Reading time |
|---|---|---|---|
| Michael Feil | July 14th 2021 | Federated Learning in Healthcare, TU Munich | 20 min |

## Table of Contents:

- **Part 0**: Introduction
- **Part 1**: Create non-iid client distributions
- **Part 2**: Apply gradient descent for one client
  - 2.1: Right on track: gradient descent ftw!
  - 2.2: Taking the detour: an intro to gradient descent in Python
  - 2.3: Gradient descent for both clients and the global distribution
- **Part 3**: FedAvg with non-iid client data
  - 3.1: Taking the detour: a programming intro to FedAvg
  - 3.2: Back on track: Federated Averaging
  - 3.3: How to mess with Federated Averaging
- **Part 4**: Our salvation: Posterior Averaging?
  - 4.1: FedPA: Beyond the simplified implementation
- **Part 5**: Summary
- **Part 6**: References

Hint: For interactivity, it is best to open this blog post on Binder. Simply click on the Binder logo:

**Part 0**: Introduction

Think about the following scenario:

You have personal images on your smartphone. So do your friends (of course they do, we live in 2021!). As a machine learning engineer, your goal is to create a model that does a good job classifying the images on your phone and automatically sorts them into folders, e.g. “cats” and “beergarden munich”. You also plan to help some of your friends sort their images.

In the past (or rather, before 2017), your proposal would have been that every one of your friends uploads their images to your cloud-based server. All the images would then be aggregated, and you would train and test an ML model. There was just one drawback: none of your friends wanted to share their sensitive image data! But now you have the answer: **Federated Learning (FL)**!

The introduction of **Federated Averaging (FedAvg)** by McMahan et al. [2] in 2017 now enables us to train a model without uploading any images to any kind of central server. Isn’t that great!?! Let’s look at the images below.

As shown in Figure 1 above, in the case of **cross-device** FL with FedAvg, our cloud-based central server sends an initial state of the model parameters to thousands of mobile devices at once. The clients then update their version of the model parameters using **stochastic gradient descent (SGD)**. By doing more SGD steps per **communication round**, we train more between synchronizations and therefore improve the **communication efficiency**. The updated model parameters are communicated back to the central server, where they are aggregated using **weighted averaging**. The process then starts into the next communication round with the slightly improved model parameters.
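The server-side aggregation step can be sketched in a few lines of Python (a minimal sketch for intuition, not McMahan et al.'s reference implementation; `fedavg_aggregate` is a hypothetical helper name):

```
import numpy as np

def fedavg_aggregate(client_params, client_num_samples):
    # weight each client by its share of the total training data
    weights = np.asarray(client_num_samples, dtype=float)
    weights /= weights.sum()
    # stack to shape (num_clients, num_params) and average per parameter
    return weights @ np.stack(client_params)

# two clients returning updated parameters after local SGD
theta_1, theta_2 = np.array([1.0, 2.0]), np.array([3.0, 4.0])
global_theta = fedavg_aggregate([theta_1, theta_2], [100, 300])
print(global_theta)  # [2.5 3.5]
```

The client with three times the data contributes three times the weight to the average, which is exactly the weighted averaging that FedAvg performs on the server.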

### Executive Summary

In the following sections, we will construct a suitable heterogeneous data distribution for the clients in Part 1, then code and visualize gradient descent and derive the global posterior in Part 2. In Part 3, we will run into some unexpected hiccups with FedAvg [2] and solve them by introducing FedPA [1] in Part 4.

```
# import some utilities
from computation_fedavg_gd import grad_descent, partial_derivative
from computation_fedavg_gd import MultivariateGaussian
from computation_fedavg_gd import PDFsManipulate
# import the usual suspects (excluding pandas as pd)
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
```

**Part 1**: Create non-iid client distributions

First step: for our scenario from the introduction, we model two phones (clients) by selecting two suitable client distributions. The distributions represent the target function on each client, which we try to optimize. For this notebook, we work with the following assumptions:

- the model has just 2 tunable weights $\theta$,
- there are 2 clients,
- the $\theta$ over each target function $J$ has a (multivariate) Gaussian distribution ($\mu$, $\Sigma$),
- since the data distributions of our clients differ, we assume a **non-iid** (not independent and identically distributed) $\theta$, i.e. we pick different parameters $\mu$, $\Sigma$ for each client.

Further reading: The Multivariate Gaussian is defined in `computation_fedavg_gd.py` according to this formula:

\(\mathcal{N}_{\boldsymbol{\theta}}(\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{\sqrt{\operatorname{det}(2 \pi \boldsymbol{\Sigma})}} \exp \left[-\frac{1}{2}(\boldsymbol{\theta}-\boldsymbol{\mu})^{T} \boldsymbol{\Sigma}^{-1}(\boldsymbol{\theta}-\boldsymbol{\mu})\right]\)

```
# If you are running this notebook blog interactively on binder.org,
# you can come back after reading through everything and change the parameters here.
mu1, sigma1 = np.array([0, 1]), np.array([[1.0, -0.9], [-0.5, 1.5]])
mu2, sigma2 = np.array([0, -1.2]), np.array([[1.5, 0.6], [0.9, 1.0]])
multivariate1 = MultivariateGaussian(mu=mu1, sigma=sigma1)
multivariate2 = MultivariateGaussian(mu=mu2, sigma=sigma2)
```

```
# evaluate our target functions over a sampling grid
from plot_fedavg_gd import SAMPLE_GRID_VECTOR
# sample over a 2D grid from -3.5 to 3.5
J1 = multivariate1.evaluate(theta_vec=SAMPLE_GRID_VECTOR)
J2 = multivariate2.evaluate(theta_vec=SAMPLE_GRID_VECTOR)
print(f"evaluated Gaussian over a 2D sampling grid {SAMPLE_GRID_VECTOR.shape}, giving a target surface {J1.shape}")
```

```
evaluated Gaussian over a 2D sampling grid (71, 71, 2), giving a target surface (71, 71)
```
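For intuition, a sampling grid with this shape could be constructed as follows (a sketch assuming 71 samples per axis over [-3.5, 3.5]; the actual `SAMPLE_GRID_VECTOR` is defined in `plot_fedavg_gd.py`, which is not shown here):

```
import numpy as np

# 71 evenly spaced samples per axis, covering theta_0 and theta_1 in [-3.5, 3.5]
axis = np.linspace(-3.5, 3.5, 71)
theta_0, theta_1 = np.meshgrid(axis, axis)
# stack the two coordinate grids into one (71, 71, 2) vector field
grid = np.stack([theta_0, theta_1], axis=-1)
print(grid.shape)  # (71, 71, 2)
```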

Let’s create some fancy 3D plots for our two client distributions. Note how the two distributions have different optima in our non-iid setting.

```
%matplotlib notebook
# let's visualize our distributions
def client_function_draw(eval_sample, cmap=matplotlib.cm.Greens):
    # adjust some 3D settings
    fig = plt.figure()
    ax = fig.gca(projection="3d")
    ax.view_init(elev=20.0, azim=-125)
    ax.dist = 7.5
    ax.set_zticks(np.linspace(0, 0.3, 7))
    # labels
    ax.set_xlabel(r"${\theta}_0$")
    ax.set_ylabel(r"${\theta}_1$")
    ax.set_zlabel(r"J(${\theta}_0$, ${\theta}_1$)")
    # plot the target function as a 3D wireframe
    ax.plot_wireframe(
        SAMPLE_GRID_VECTOR[:, :, 0],
        SAMPLE_GRID_VECTOR[:, :, 1],
        eval_sample,
        rstride=3,
        cstride=3,
        linewidth=0.5,
        antialiased=True,
        zorder=1,
    )
    # make a 2D contour plot in cmap color on the bottom
    ax.contourf(
        SAMPLE_GRID_VECTOR[:, :, 0],
        SAMPLE_GRID_VECTOR[:, :, 1],
        eval_sample,
        zdir="z",
        offset=-np.max(eval_sample) / 20,
        cmap=cmap,
        zorder=0,
    )
    return ax

print("**client 1: **")
ax = client_function_draw(J1)
print("**client 2: **")
ax = client_function_draw(J2, cmap="Oranges")
```

```
**client 1: **
```

```
**client 2: **
```

**Part 2**: Apply gradient descent for one client

Second step: awesome, we have the parameter spaces for our two image recognition models. Let’s do some training, with each client on its own, using only its local data. If you don’t know how gradient descent is implemented here, a quick detour to 2.2 might help! :)

## 2.1 Right on track: gradient descent ftw!

Let’s see what this looks like with gradient descent applied.

```
# let's start with some random values for theta; come back later and tune them, if you like
THETA = np.array([-2.0, 0.1])
```

```
%matplotlib notebook
# apply gradient descent
history_client1 = grad_descent(
    function_descent=multivariate1.evaluate,
    theta=THETA,
    eps=1e-07,  # stop condition
    nb_max_iter=1000,  # max iterations
    verbose=1,
)
ax = client_function_draw(J1)
# plot every 10th step of the gradient descent path
ax.scatter(
    history_client1[:, 0][::10],
    history_client1[:, 1][::10],
    history_client1[:, 2][::10],
    label="GD descent",
    c="red",
    lw=5,
    zorder=100,
)
fig = ax.legend()
```

```
iter: 0, theta: [-1.9, gradient: 3.313123563351253e-05
iter: 50, theta: [-1.6, gradient: 0.00018967341849882979
iter: 100, theta: [-0.9, gradient: 0.0025225717318137636
iter: 150, theta: [-0.2, gradient: 0.00034660933191360543
iter: 200, theta: [-0.0, gradient: 1.3755523300240657e-05
iter: 250, theta: [-0.0, gradient: 3.125739587883647e-06
iter: 300, theta: [-0.0, gradient: 8.610496716188187e-07
iter: 350, theta: [-0.0, gradient: 2.3839203919240326e-07
stop condition reached after 384 iterations
```

Awesome. In the next cell, we also visualize the second client together with the first. This gives us an overview of the shifted distributions in our non-iid setting.

```
from plot_fedavg_gd import mpl_multivariate_3d_gd

print("**client 1**")
# settings for plotting client 1
fig = mpl_multivariate_3d_gd(
    name_labels=["GD client 1"],
    colors=["green"],
    functions=[multivariate1.evaluate],
    target_function=[J1],
    cmap_target=["Greens"],
    label_target=["client 1 target"],
    theta=THETA,
)
print("**client 2**")
fig = mpl_multivariate_3d_gd(
    name_labels=["GD client 2"],
    colors=["orange"],
    functions=[multivariate2.evaluate],
    target_function=[J2],
    cmap_target=["Oranges"],
    label_target=["client 2 target"],
    theta=THETA,
)
```

```
**client 1**
create 3D plot using SGD on [('GD client 1', 'green')] over distribution [('client 1 target', 'Greens')]
```

```
**client 2**
create 3D plot using SGD on [('GD client 2', 'orange')] over distribution [('client 2 target', 'Oranges')]
```

As you can see, each client’s problem by itself is solved. In our case, however, the challenge is to find the global optimum of a global model. By searching for one single global model, we have more data available to fit the model to. We therefore ‘hope’ to generalize better to new clients in the future, or to get a better overall performance on our existing clients.

## 2.2 Taking the detour: an intro to gradient descent in Python

If you are not in a hurry, you can see in this chapter how gradient descent has been implemented.

We optimize one parameter at a time (line search):

- for all $\theta_i$ in the vector $\theta$ do: \(\theta_{i, new} \leftarrow \theta_{i} - \alpha_{lr} \frac{\partial J}{\partial \theta_{i}}\)

GD: line search for n iterations over one function

```
# Usage of line-search: optimize one parameter at a time
partial_derivative??
```

```
# Evaluating gradient descent for one function:
grad_descent??
```
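If you are reading a static version of this post, here is a minimal sketch of what those two helpers could look like (simplified; the actual implementation in `computation_fedavg_gd.py` additionally records the descent history and supports `verbose` logging):

```
import numpy as np

def partial_derivative(f, theta, i, h=1e-5):
    # central finite-difference estimate of df/dtheta_i
    up, down = theta.copy(), theta.copy()
    up[i] += h
    down[i] -= h
    return (f(up) - f(down)) / (2 * h)

def grad_descent(f, theta, lr=0.1, eps=1e-7, nb_max_iter=1000):
    # minimize f by updating one parameter at a time, as described in 2.2
    theta = np.asarray(theta, dtype=float).copy()
    for _ in range(nb_max_iter):
        grad = np.array([partial_derivative(f, theta, i) for i in range(len(theta))])
        theta -= lr * grad
        if np.linalg.norm(grad) < eps:  # stop condition
            break
    return theta

# sanity check on a quadratic bowl with its minimum at (1, -2)
bowl = lambda t: (t[0] - 1) ** 2 + (t[1] + 2) ** 2
print(grad_descent(bowl, np.array([-2.0, 0.1])))
```

The loop estimates each partial derivative numerically, so it works on any black-box target function, including the `multivariate1.evaluate` used above.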

## 2.3 Deriving the global posterior distribution

Let’s add the two target functions of client 1 and client 2 together! We act as if we could aggregate our clients’ images and train a central model without Federated Learning.

We assume **a uniform prior** for the weight distributions of our clients. That way, our global posterior distribution does not depend on any client priors and exactly decomposes into a product of local posteriors [1]: \(\mathbb{P}_{global}(\boldsymbol{\theta} \mid D) \propto \prod_{i=1}^{N} \mathbb{P}\left(\boldsymbol{\theta} \mid D_{i}\right)\)

We further assumed two Gaussian distributions on our clients. Luckily, we can look up in the Matrix Cookbook [5] (Section 8.1.8) how to calculate the global posterior from a product of Gaussians:

**Posterior: A product of Gaussian densities**

Let $\mathcal{N}_{\mathbf{x}}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ denote a density of $\mathbf{x}$, then \(c_{c}\, \mathcal{N}_{\mathbf{x}, global}\left(\boldsymbol{\mu}_{c}, \boldsymbol{\Sigma}_{c}\right) = \mathcal{N}_{\mathbf{x}, client1}\left(\boldsymbol{\mu}_{1}, \boldsymbol{\Sigma}_{1}\right) \cdot \mathcal{N}_{\mathbf{x}, client2}\left(\boldsymbol{\mu}_{2}, \boldsymbol{\Sigma}_{2}\right)\)

Awesome, the product is again a Gaussian. All we need to do is compute the normalization constant $c_{c}$, plus the $\boldsymbol{\mu}_{c}$ and $\boldsymbol{\Sigma}_{c}$ of the new distribution [5]:

\[\boldsymbol{\Sigma}_{c} = \left(\boldsymbol{\Sigma}_{1}^{-1} + \boldsymbol{\Sigma}_{2}^{-1}\right)^{-1}, \qquad \boldsymbol{\mu}_{c} = \boldsymbol{\Sigma}_{c}\left(\boldsymbol{\Sigma}_{1}^{-1}\boldsymbol{\mu}_{1} + \boldsymbol{\Sigma}_{2}^{-1}\boldsymbol{\mu}_{2}\right)\]

The normalization constant is optionally calculated using the following equation:

\[c_{c}=\mathcal{N}_{x = \boldsymbol{\mu}_{1}}\left(\boldsymbol{\mu}_{2}, \boldsymbol{\Sigma}_{1}+\boldsymbol{\Sigma}_{2}\right) = \frac{1}{\sqrt{\operatorname{det}\left(2 \pi\left(\boldsymbol{\Sigma}_{1}+\boldsymbol{\Sigma}_{2}\right)\right)}} \exp \left[-\frac{1}{2}\left(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}\right)^{T}\left(\boldsymbol{\Sigma}_{1}+\boldsymbol{\Sigma}_{2}\right)^{-1}\left(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}\right)\right]\]

**Posterior: Numeric calculation by multiplication**

But we could also do it numerically, by just calculating: \(\mathbb{P}_{global} = \frac{multivariate1.evaluate \cdot multivariate2.evaluate}{c_{normalize}}\)
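As a quick sanity check, the normalization constant $c_{c}$ from the formula above can be evaluated directly with NumPy, using this notebook's example parameters:

```
import numpy as np

mu1, sigma1 = np.array([0.0, 1.0]), np.array([[1.0, -0.9], [-0.5, 1.5]])
mu2, sigma2 = np.array([0.0, -1.2]), np.array([[1.5, 0.6], [0.9, 1.0]])

# c_c = N_{x=mu1}(mu2, Sigma1 + Sigma2), i.e. one Gaussian density evaluation
S = sigma1 + sigma2
diff = mu1 - mu2
c_normalize = np.exp(-0.5 * diff @ np.linalg.inv(S) @ diff) / np.sqrt(np.linalg.det(2 * np.pi * S))
print(c_normalize)  # roughly 0.024
```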

```
# Numeric calculation of our Posterior
# basically computes (multivariate1 * multivariate2) / norm
f_posterior_num = PDFsManipulate(functions=[multivariate1.evaluate, multivariate2.evaluate], method = np.multiply)
# evaluate on SAMPLE_GRID
J = f_posterior_num.evaluate(theta_vec = SAMPLE_GRID_VECTOR)
```

```
# Exact calculation of our Posterior, using the equations above.
inv_sigma1 = np.linalg.inv(sigma1)
inv_sigma2 = np.linalg.inv(sigma2)
sigma_global_posterior = np.linalg.inv(inv_sigma1 + inv_sigma2)
mu_global_posterior = sigma_global_posterior @ (inv_sigma1 @ mu1 + inv_sigma2 @ mu2)
# using the math above
f_posterior_exact = MultivariateGaussian(
mu=mu_global_posterior,
sigma=sigma_global_posterior
)
# evaluate on SAMPLE_GRID
J_exact = f_posterior_exact.evaluate(theta_vec = SAMPLE_GRID_VECTOR)
```

```
print("argmax J_exact", np.where(J_exact==J_exact.max()),J_exact.max())
print("argmax J_num", np.where(J==J.max()),J.max())
ax = client_function_draw(J_exact, cmap="Purples")
# ax = client_function_draw(J, cmap="Purples")
```

```
argmax J_exact (array([33], dtype=int64), array([42], dtype=int64)) 0.39885406077586116
argmax J_num (array([32], dtype=int64), array([41], dtype=int64)) 0.3949362773624632
```