class: center, middle

# Google's Weight Agnostic Neural Networks and Beyond

### Stefan Magnuson / @styrmis

### Bath Machine Learning Group

---

# Overview

### This is the work of:

- [Adam Gaier](https://scholar.google.com/citations?user=GGyARB8AAAAJ&hl=en) & [David Ha](http://blog.otoro.net/)
- of the Google Brain team in Tokyo
- See https://weightagnostic.github.io for more details and interactive demos
- All results, images etc. are from their work unless otherwise specified

---

# About the speaker

- My field of research was the application of Evolutionary Algorithms to the training of Neural Networks (EANN)

--

- My specialisation was the role that crossover might play in the evolution of NNs in particular

--

- The algorithm we will look at today uses a crossover operator designed to mitigate the effects of the problem that I studied

--

- At the time (2010), Deep Learning and gradient descent-based methods did not enjoy the attention and real-world usage that they do today

--

- I'll briefly discuss Evolutionary Algorithms as both an alternative and a companion to current approaches

---

# Overview

- High-level introduction of the work
- Some background on the evolution of neural networks
- Suggestions for possible future work

---

# Key Argument

## Which is more fundamental to a neural network: the weights, or the structure?

--

- This work argues that **structure > weights** because:
  - Structure brings inductive bias to a task, e.g. convolution in image processing

--

- Well-structured networks may perform well even with randomised shared weights

--

- The authors liken this to precocial species of animal, whose young possess certain abilities (related to survival) from birth

---

# On the inductive bias of architectures

- Convolutional networks are especially well suited to image processing

--

- Recent work found that even randomly-initialised CNNs can be effective for tasks such as super-resolution, inpainting and style transfer

--

- Other work shows that randomly-initialised LSTMs with a learned linear output layer can successfully perform time series prediction (see the sketch on the next slide)

--

- Most (all?) architectures commonly in use today have been invented by humans rather than discovered by machines
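
---

# Random features + a learned linear readout (sketch)

A minimal sketch of the idea behind the previous slide's LSTM result: freeze a randomly-initialised recurrent network and train only a linear output layer. This uses a plain tanh reservoir rather than an LSTM, and the sizes, scaling and ridge regression are my own assumptions, not the cited work's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 1, 200

# Fixed, random input and recurrent weights (never trained)
W_in = rng.normal(scale=0.5, size=(n_hidden, n_in))
W_rec = rng.normal(scale=1.0, size=(n_hidden, n_hidden))
W_rec *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_rec)))  # keep the dynamics stable

def run_reservoir(u):
    """Collect hidden states for a 1-D input sequence u of shape [T]."""
    h = np.zeros(n_hidden)
    states = []
    for u_t in u:
        h = np.tanh(W_in @ np.array([u_t]) + W_rec @ h)
        states.append(h.copy())
    return np.array(states)  # shape [T, n_hidden]

# Toy task: one-step-ahead prediction of a sine wave
t = np.linspace(0, 20 * np.pi, 2000)
u, y = np.sin(t[:-1]), np.sin(t[1:])

H = run_reservoir(u)
# Train only the linear output layer (ridge regression)
W_out = np.linalg.solve(H.T @ H + 1e-6 * np.eye(n_hidden), H.T @ y)
print("train MSE:", float(np.mean((H @ W_out - y) ** 2)))
```
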

---

# Precocious WANNs: Bipedal Walker

<center>
  <img src="https://1.bp.blogspot.com/-tiQpdUlaRgw/XWfqggww6GI/AAAAAAAAEmE/42XO7B9zsu4728qp3M-39JG-F9pbSf4OwCLcBGAs/s1600/unnamed.png" height="300" />
</center>

- **Left:** a hand-engineered, fully-connected deep NN with 2760 weight connections

--

- **Right:** a weight-agnostic NN with 44 connections that can perform the same bipedal walker task

--

- The WANN solves the task even when all the weights are identical and the shared weight is randomly sampled

---

# Precocious WANNs: Car Racing

<img src="https://1.bp.blogspot.com/-NO5EzLBUQ0c/XWQh8_lue8I/AAAAAAAAElY/0rXtjS0xcyQUazG0IAVvXaKaH3xReCecgCLcBGAs/s1600/image5.png" width="40%" />
<img src="https://1.bp.blogspot.com/-3vvnLpWV10o/XWQh-j8uNQI/AAAAAAAAElo/vBefmx2UOfYyAuZFmD-U1dY1KJYD7V1fwCLcBGAs/s1600/image9.gif" height="80%" />

---

# Precocious WANNs: MNIST

<img src="https://storage.googleapis.com/quickdraw-models/sketchRNN/wann/png/mnist_cover.png" width="100%" />

- A conventional network with random initialisation will get ~10% accuracy on MNIST

--

- This architecture achieves much better than chance accuracy (> 80%) with random weights

--

- Without any weight training, the accuracy increases to > 90% when an ensemble of networks is formed (each with different weights)

---

# Precocious WANNs: Cart-pole swing up

<img src="https://storage.googleapis.com/quickdraw-models/sketchRNN/wann/png/swingup_bottom.png" width="100%" />

---

# Cart-pole swing up (champion)

<center>
  <img src="https://storage.googleapis.com/quickdraw-models/sketchRNN/wann/png/champ_swingup.png" width="70%" />
</center>

---

# How WANNs are built & trained

- Network structure and weights are discovered using an Evolutionary Algorithm, **NEAT**

--

- The input and output layers are fixed; hidden layers are formed over generations

--

- Relatively few constraints; recurrent connections are allowed

--

- **Networks are evaluated with a range of shared weight values**

--

- Hundreds or thousands of networks compete and share genetic material

--

- The algorithm aims to increase the complexity of networks slowly over time

--

- Given two networks with equal performance, the simpler network is more likely to proceed to the next generation

--

- Weights can then (optionally) be trained using other methods

---

# How WANNs are built & trained

<img src="https://storage.googleapis.com/quickdraw-models/sketchRNN/wann/png/schematic.png" width="100%" />

---

# How WANNs are built & trained

<img src="https://1.bp.blogspot.com/-X7Xap9Nz4wQ/XWQf26ALohI/AAAAAAAAEj8/mkpzsO2C3pk10fjJCZbSMTYG469uxYpgwCLcBGAs/s1600/image2.png" width="100%" />

- Possible activation functions: linear, step, sin, cosine, Gaussian, tanh, sigmoid, inverse, absolute value, ReLU

---

# NeuroEvolution of Augmenting Topologies (NEAT)

- First published in [Evolutionary Computation](http://www-mitpress.mit.edu/catalog/item/default.asp?ttype=4&tid=25) 10(2):99-127, 2002

--

- Starts with a 'minimal' network

--

- Problem dependent, but generally:
  - No hidden nodes to start
  - Start with some/most inputs disconnected

--

- Employs mutation to increase the complexity of the network in small steps

--

- Employs what was then a novel method of crossover to increase the operator's rate of success

--

- Employs speciation to avoid early convergence

--

- The main modification in this work appears to be the evaluation of networks against multiple shared weight values, to steer the search towards weight-agnostic topologies (sketched on the next slide)
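
---

# Shared-weight evaluation (sketch)

A minimal sketch of the evaluation step that steers the search towards weight-agnostic topologies. `rollout(network, weight)` and `network.n_connections` are hypothetical stand-ins for running one episode with every connection set to `weight` and for counting connections; the weight values and the mean/max aggregation are illustrative assumptions rather than the authors' exact scheme.

```python
import numpy as np

SHARED_WEIGHTS = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]  # assumed sweep over a (-2, 2) range

def evaluate_topology(network, rollout, trials_per_weight=3):
    """Score a candidate topology by how well it performs across shared weight values."""
    scores = []
    for w in SHARED_WEIGHTS:
        returns = [rollout(network, w) for _ in range(trials_per_weight)]
        scores.append(np.mean(returns))
    return {
        "mean_score": float(np.mean(scores)),    # weight-agnostic performance
        "max_score": float(np.max(scores)),      # best single shared weight
        "n_connections": network.n_connections,  # used to favour simpler networks
    }

def rank_key(result):
    # Prefer a higher mean score; break ties in favour of fewer connections.
    return (result["mean_score"], -result["n_connections"])
```
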

---

# Recap & Results

--

- The key element of the work is weight agnosticism

--

- NEAT's role is to discover such weight-agnostic architectures

--

- We will now compare their results with some baseline results

---

# Baseline Results (MNIST)

| ANN                | Test Accuracy     |
|:-------------------|:------------------|
| Linear Regression  | 91.6%<sup>1</sup> |
| Two-Layer CNN      | 99.3%<sup>2</sup> |

- [1] [Gradient-based learning applied to document recognition](http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf)
- [2] [keras/examples/mnist_cnn.py](https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py)

---

# Their Results vs. Baseline (MNIST)

| WANN              | Test Accuracy |
|:------------------|:--------------|
| Random Weight     | 82.0% ± 18.7% |
| Ensemble Weights  | 91.6%         |
| Tuned Weight      | 91.9%         |
| Trained Weights   | 94.2%         |

<br />

| ANN                | Test Accuracy |
|:-------------------|:--------------|
| Linear Regression  | 91.6%         |
| Two-Layer CNN      | 99.3%         |

<br />
<br />

*Tuned weight* refers to the highest-performing shared weight in the range `(-2, 2)`

---

# Building ensembles using a single network

- If we have a single successful network which:
  - displays a strong architectural bias towards the task
  - and can perform the task with multiple different (shared, untrained) weight values

--

- Then we can instantiate several networks by initialising copies with different weight values

--

- These multiple networks can then be combined in an ensemble to improve accuracy (see the appendix slide at the end for a sketch)

--

- There are some parallels with Bayesian or Variance NNs, which sample weights from a distribution where e.g. the variance is what is learned

---

# Possible Future Work / Ideas

- Will we observe this approach (re)discovering convolution in an image processing task?

--

- Focus on the discovery of small structures which may be combined

--

- Deeper exploration of what it means for ensembles of WANNs to perform well despite randomisation of weights

--

- Repeat with HyperNEAT, which "can evolve neural networks with millions of connections and exploit geometric regularities in the task domain"

--

- Exploration of relative training time vs. the state of the art

---

# Recent Related Work (not exhaustive)

- [Evolving the Topology of Large Scale Deep Neural Networks](http://www.human-competitive.org/sites/default/files/assuncao-paper-a.pdf), Assunção et al. (2018)
  - Evolves CNNs with "state of the art" performance using topologies that are "unlikely to be designed by hand"
- [Evolving Deep Neural Networks](http://www.human-competitive.org/sites/default/files/miikkulainen-neural-paper.pdf), Miikkulainen et al. (2017)
  - Introduces CoDeepNEAT, which optimises the topology, components and hyper-parameters of Deep Neural Networks
  - Expects that increases in available computing power will reduce the need for human input into network design

---

class: center, middle

# Thank you

## Questions & Discussion
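
---

# Appendix: ensembling a single WANN (sketch)

A minimal sketch of the ensemble idea from the "Building ensembles using a single network" slide: one fixed topology is instantiated with several shared weight values and the predictions are averaged. `predict_logits(weight, x)` is a hypothetical stand-in for a forward pass of the fixed WANN topology with every connection set to `weight`; the softmax averaging is an illustrative choice, not necessarily the authors' exact scheme.

```python
import numpy as np

def ensemble_predict(predict_logits, shared_weights, x):
    """Average class probabilities over copies of one topology, each with a different shared weight."""
    probs = []
    for w in shared_weights:
        logits = predict_logits(w, x)                    # shape [batch, n_classes]
        e = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs.append(e / e.sum(axis=-1, keepdims=True))  # softmax per copy
    return np.mean(probs, axis=0).argmax(axis=-1)        # vote by averaged probabilities
```
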