class: center, middle

# Google's Weight Agnostic Neural Networks and Beyond

### Stefan Magnuson / @styrmis

### Bath Machine Learning Group

---

# Overview

### This is the work of:

- [Adam Gaier](https://scholar.google.com/citations?user=GGyARB8AAAAJ&hl=en) & [David Ha](http://blog.otoro.net/)
- of the Google Brain team in Tokyo
- See https://weightagnostic.github.io for more details and interactive demos
- All results, images etc. are from their work unless otherwise specified

---

# About the speaker

- My field of research was the application of Evolutionary Algorithms to the training of Neural Networks (EANN)

--

- My specialisation was the role that crossover might play in the evolution of NNs in particular

--

- The algorithm we will look at today uses a crossover operator designed to mitigate the effects of the problem that I studied

--

- At the time (2010), Deep Learning and gradient descent-based methods did not enjoy the attention and real-world usage that they do today

--

- I'll briefly discuss Evolutionary Algorithms as both an alternative and a companion to current approaches

---

# Overview

- High-level introduction of the work
- Some background on the evolution of neural networks
- Suggestions for possible future work

---

# Key Argument

## Which is more fundamental to a neural network: the weights, or the structure?

--

- This work argues that **structure > weights** because:
  - Structure brings inductive bias to a task, e.g. convolution in image processing

--

- Well-structured networks may perform well even with randomised shared weights

--

- The authors liken this to precocial species of animal, whose young possess certain abilities (related to survival) from birth

---

# On the inductive bias of architectures

- Convolutional networks are especially well suited to image processing

--

- Recent work found that even randomly-initialised CNNs can be effective for tasks such as super-resolution, inpainting and style transfer

--

- Other work shows that randomly-initialised LSTMs with a learned linear output layer can successfully perform time series prediction (see the sketch on the next slide)

--

- Most (all?) architectures commonly in use today have been invented by humans rather than discovered by machines
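
---

# Random features + a learned linear readout (sketch)

A minimal sketch of the idea behind the previous slide's LSTM result: freeze a randomly-initialised recurrent network and train only a linear output layer. This uses a plain tanh reservoir rather than an LSTM, and the sizes, scaling and ridge regression are my own assumptions, not the cited work's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 1, 200

# Fixed, random input and recurrent weights (never trained)
W_in = rng.normal(scale=0.5, size=(n_hidden, n_in))
W_rec = rng.normal(scale=1.0, size=(n_hidden, n_hidden))
W_rec *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_rec)))  # keep the dynamics stable

def run_reservoir(u):
    """Collect hidden states for a 1-D input sequence u of shape [T]."""
    h = np.zeros(n_hidden)
    states = []
    for u_t in u:
        h = np.tanh(W_in @ np.array([u_t]) + W_rec @ h)
        states.append(h.copy())
    return np.array(states)  # shape [T, n_hidden]

# Toy task: one-step-ahead prediction of a sine wave
t = np.linspace(0, 20 * np.pi, 2000)
u, y = np.sin(t[:-1]), np.sin(t[1:])

H = run_reservoir(u)
# Train only the linear output layer (ridge regression)
W_out = np.linalg.solve(H.T @ H + 1e-6 * np.eye(n_hidden), H.T @ y)
print("train MSE:", float(np.mean((H @ W_out - y) ** 2)))
```
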

---

# Precocious WANNs: Bipedal Walker

<center>
  <img src="https://1.bp.blogspot.com/-tiQpdUlaRgw/XWfqggww6GI/AAAAAAAAEmE/42XO7B9zsu4728qp3M-39JG-F9pbSf4OwCLcBGAs/s1600/unnamed.png" height="300" />
</center>

- **Left:** a hand-engineered, fully-connected deep NN with 2760 weight connections

--

- **Right:** a weight-agnostic NN with 44 connections that can perform the same bipedal walker task

--

- The WANN solves the task even when all the weights are identical and the shared weight is randomly sampled

---

# Precocious WANNs: Car Racing

<img src="https://1.bp.blogspot.com/-NO5EzLBUQ0c/XWQh8_lue8I/AAAAAAAAElY/0rXtjS0xcyQUazG0IAVvXaKaH3xReCecgCLcBGAs/s1600/image5.png" width="40%" />
<img src="https://1.bp.blogspot.com/-3vvnLpWV10o/XWQh-j8uNQI/AAAAAAAAElo/vBefmx2UOfYyAuZFmD-U1dY1KJYD7V1fwCLcBGAs/s1600/image9.gif" height="80%" />

---

# Precocious WANNs: MNIST

<img src="https://storage.googleapis.com/quickdraw-models/sketchRNN/wann/png/mnist_cover.png" width="100%" />

- A conventional network with random initialisation will get ~10% accuracy on MNIST

--

- This architecture achieves much better than chance accuracy (> 80%) with random weights

--

- Without any weight training, the accuracy increases to > 90% when an ensemble of networks is formed (each with different weights)

---

# Precocious WANNs: Cart-pole swing up

<img src="https://storage.googleapis.com/quickdraw-models/sketchRNN/wann/png/swingup_bottom.png" width="100%" />

---

# Cart-pole swing up (champion)

<center>
  <img src="https://storage.googleapis.com/quickdraw-models/sketchRNN/wann/png/champ_swingup.png" width="70%" />
</center>

---

# How WANNs are built & trained

- Network structure and weights are discovered using an Evolutionary Algorithm, **NEAT**

--

- The input and output layers are fixed; hidden layers are formed over generations

--

- Relatively few constraints; recurrent connections are allowed

--

- **Networks are evaluated with a range of shared weight values**

--

- Hundreds or thousands of networks compete and share genetic material

--

- The algorithm aims to increase the complexity of networks slowly over time

--

- Given two networks with equal performance, the simpler network is more likely to proceed to the next generation

--

- Weights can then (optionally) be trained using other methods

---

# How WANNs are built & trained

<img src="https://storage.googleapis.com/quickdraw-models/sketchRNN/wann/png/schematic.png" width="100%" />

---

# How WANNs are built & trained

<img src="https://1.bp.blogspot.com/-X7Xap9Nz4wQ/XWQf26ALohI/AAAAAAAAEj8/mkpzsO2C3pk10fjJCZbSMTYG469uxYpgwCLcBGAs/s1600/image2.png" width="100%" />

- Possible activation functions: linear, step, sin, cosine, Gaussian, tanh, sigmoid, inverse, absolute value, ReLU

---

# NeuroEvolution of Augmenting Topologies (NEAT)

- First published in [Evolutionary Computation](http://www-mitpress.mit.edu/catalog/item/default.asp?ttype=4&tid=25) 10(2):99-127, 2002

--

- Starts with a 'minimal' network

--

- Problem dependent, but generally:
  - No hidden nodes to start
  - Start with some/most inputs disconnected

--

- Employs mutation to increase the complexity of the network in small steps

--

- Employs what was then a novel method of crossover to increase the operator's rate of success

--

- Employs speciation to avoid early convergence

--

- The main modification in this work appears to be the evaluation of networks against multiple shared weight values, to steer the search towards weight-agnostic topologies (sketched on the next slide)
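
---

# Shared-weight evaluation (sketch)

A minimal sketch of the evaluation step that steers the search towards weight-agnostic topologies. `rollout(network, weight)` and `network.n_connections` are hypothetical stand-ins for running one episode with every connection set to `weight` and for counting connections; the weight values and the mean/max aggregation are illustrative assumptions rather than the authors' exact scheme.

```python
import numpy as np

SHARED_WEIGHTS = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]  # assumed sweep over a (-2, 2) range

def evaluate_topology(network, rollout, trials_per_weight=3):
    """Score a candidate topology by how well it performs across shared weight values."""
    scores = []
    for w in SHARED_WEIGHTS:
        returns = [rollout(network, w) for _ in range(trials_per_weight)]
        scores.append(np.mean(returns))
    return {
        "mean_score": float(np.mean(scores)),    # weight-agnostic performance
        "max_score": float(np.max(scores)),      # best single shared weight
        "n_connections": network.n_connections,  # used to favour simpler networks
    }

def rank_key(result):
    # Prefer a higher mean score; break ties in favour of fewer connections.
    return (result["mean_score"], -result["n_connections"])
```
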

---

# Recap & Results

--

- The key element of the work is weight agnosticism

--

- NEAT's role is to discover such weight-agnostic architectures

--

- We will now compare their results with some baseline results

---

# Baseline Results (MNIST)

| ANN                | Test Accuracy     |
|:-------------------|:------------------|
| Linear Regression  | 91.6%<sup>1</sup> |
| Two-Layer CNN      | 99.3%<sup>2</sup> |

- [1] [Gradient-based learning applied to document recognition](http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf)
- [2] [keras/examples/mnist_cnn.py](https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py)

---

# Their Results vs. Baseline (MNIST)

| WANN              | Test Accuracy |
|:------------------|:--------------|
| Random Weight     | 82.0% ± 18.7% |
| Ensemble Weights  | 91.6%         |
| Tuned Weight      | 91.9%         |
| Trained Weights   | 94.2%         |

<br />

| ANN                | Test Accuracy |
|:-------------------|:--------------|
| Linear Regression  | 91.6%         |
| Two-Layer CNN      | 99.3%         |

<br />
<br />

*Tuned weight* refers to the highest-performing shared weight in the range `(-2, 2)`

---

# Building ensembles using a single network

- If we have a single successful network which:
  - displays a strong architectural bias towards the task
  - and can perform the task with multiple different (shared, untrained) weight values

--

- Then we can instantiate several networks by initialising copies with different weight values

--

- These multiple networks can then be combined in an ensemble to improve accuracy (see the appendix slide at the end for a sketch)

--

- There are some parallels with Bayesian or Variance NNs, which sample weights from a distribution where e.g. the variance is what is learned

---

# Possible Future Work / Ideas

- Will we observe this approach (re)discovering convolution in an image processing task?

--

- Focus on the discovery of small structures which may be combined

--

- Deeper exploration of what it means for ensembles of WANNs to perform well despite randomisation of weights

--

- Repeat with HyperNEAT, which "can evolve neural networks with millions of connections and exploit geometric regularities in the task domain"

--

- Exploration of relative training time vs. the state of the art

---

# Recent Related Work (not exhaustive)

- [Evolving the Topology of Large Scale Deep Neural Networks](http://www.human-competitive.org/sites/default/files/assuncao-paper-a.pdf), Assunção et al. (2018)
  - Evolves CNNs with "state of the art" performance using topologies that are "unlikely to be designed by hand"
- [Evolving Deep Neural Networks](http://www.human-competitive.org/sites/default/files/miikkulainen-neural-paper.pdf), Miikkulainen et al. (2017)
  - Introduces CoDeepNEAT, which optimises the topology, components and hyper-parameters of Deep Neural Networks
  - Expects that increases in available computing power will reduce the need for human input into network design

---

class: center, middle

# Thank you

## Questions & Discussion
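
---

# Appendix: ensembling a single WANN (sketch)

A minimal sketch of the ensemble idea from the "Building ensembles using a single network" slide: one fixed topology is instantiated with several shared weight values and the predictions are averaged. `predict_logits(weight, x)` is a hypothetical stand-in for a forward pass of the fixed WANN topology with every connection set to `weight`; the softmax averaging is an illustrative choice, not necessarily the authors' exact scheme.

```python
import numpy as np

def ensemble_predict(predict_logits, shared_weights, x):
    """Average class probabilities over copies of one topology, each with a different shared weight."""
    probs = []
    for w in shared_weights:
        logits = predict_logits(w, x)                    # shape [batch, n_classes]
        e = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs.append(e / e.sum(axis=-1, keepdims=True))  # softmax per copy
    return np.mean(probs, axis=0).argmax(axis=-1)        # vote by averaged probabilities
```
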