
Parallel Neural Networks
Joe Bradish
Background
• Deep Neural Networks (DNNs) have become one of the leading technologies in artificial intelligence and machine learning
• Used extensively by major corporations – Google, Facebook
• Very expensive to train
• Large datasets – often in the terabytes
• The larger the dataset, the more accurately the network can model the underlying classification function
• The limiting factor has almost always been computational power, but we are starting to reach levels that can solve previously impossible problems
Quick Neural Net Basics
• A set of layers of neurons connected by weighted links, which play the role of synapses in the network
• Each neuron has a set of inputs (see the single-neuron sketch after this list)
  – Either inputs into the entire network
  – Or outputs from previous neurons, usually from the previous layer
• The underlying algorithm of the network can vary greatly
  – Multilayer feedforward
  – Feedback network
  – Self-organizing maps
    – Map from a high dimension to a lower one in a single layer
  – Sparse Distributed Memory
    – Two-layer feedforward, associative memory
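
A minimal sketch of what a single neuron computes, assuming a tanh activation; the specific weights, bias, and inputs are illustrative values, not taken from the slides:

```python
import numpy as np

inputs = np.array([0.5, -1.0, 2.0])    # network inputs, or outputs of the previous layer
weights = np.array([0.1, 0.4, -0.3])   # one weight per input (the "synapse" strengths)
bias = 0.05                            # illustrative bias term

# the neuron: a weighted sum of its inputs passed through an activation function
output = np.tanh(weights @ inputs + bias)
print(output)
```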
Learning / Training
• Everything is reliant on the weights
  – They determine the importance of each signal, which is essential to the network output
• The training cycle adjusts the weights (a minimal weight-update sketch follows below)
  – By far the most critical step of a successful neural network
  – No training = useless network
• Network topology is also key to training, and especially to parallelization
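
A minimal sketch of a training cycle adjusting the weights, here assuming plain batch gradient descent on a squared-error loss for a single linear neuron; the synthetic data, learning rate, and step count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
y = X @ np.array([2.0, -1.0, 0.5])   # the underlying function the network should model

w = np.zeros(3)                      # untrained weights: a useless network
for step in range(200):
    error = X @ w - y
    grad = X.T @ error / len(y)      # how much each weight contributes to the error
    w -= 0.1 * grad                  # the training cycle adjusts the weights
print(w)                             # ends up close to [2, -1, 0.5]
```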
Where can we use parallelization?
• Typical structure of a neural network training loop:
  – For each training session
    – For each training example in the session
      – For each layer in the network
        – For each neuron in the layer
          – For all the weights of the neuron
            – For all the bits of the weight value
• This implies the following levels of parallelization (sketched in code below):
  – Training session parallelism
  – Training example parallelism
  – Layer parallelism
  – Neuron parallelism
  – Weight parallelism
  – Bit parallelism
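
The nested loops above, written out as a toy serial sketch with comments marking where each level of parallelism could be applied; the layer sizes, random data, and tanh activation are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
examples = [rng.normal(size=4) for _ in range(8)]             # toy training set
network = [rng.normal(size=(4, 3)), rng.normal(size=(3, 2))]  # two weight matrices

for session in range(2):                                  # training session parallelism
    for x in examples:                                    # training example parallelism
        activation = x
        for layer_weights in network:                     # layer parallelism (pipelining)
            outputs = []
            for neuron_weights in layer_weights.T:        # neuron parallelism
                total = 0.0
                for w, xi in zip(neuron_weights, activation):  # weight parallelism
                    total += w * xi                       # bit parallelism lives inside the multiply
                outputs.append(np.tanh(total))
            activation = np.array(outputs)
```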
Example – Network-Level Parallelism
• Notice that there are many different neural networks, some of which feed into each other
• The outputs are sent to different machines and then aggregated once more
Example – Neuron-Level Parallelism
• Each neuron is assigned a specific controlling entity on the communication network
• Each computer is responsible for forwarding the weights to the hub so that the computer controlling the next layer can feed them into the neural network
• Uses a broadcast system (see the sketch below)
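
A hedged sketch of this neuron-level scheme using mpi4py, where each rank plays the controlling entity for one neuron and forwards its output to every other machine (an allgather standing in for the hub/broadcast system); the layer size, random weights, and tanh activation are illustrative assumptions, and forwarding outputs rather than raw weights is my own interpretation:

```python
# run with e.g.: mpiexec -n 8 python neuron_parallel.py   (file name is illustrative)
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

n_in = 8
rng = np.random.default_rng(rank)
my_weights = rng.normal(size=n_in)     # this neuron's incoming weights

x = np.ones(n_in)                      # previous layer's output, known to every machine
my_output = float(np.tanh(my_weights @ x))

# every neuron forwards its value, so each rank ends up with the whole layer's output
layer_output = np.array(comm.allgather(my_output))
if rank == 0:
    print(layer_output)
```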
Parallelism by Node
• Used to parallelize serial backpropagation
• Usually implemented as a series of matrix-vector operations
• Achieved using all-to-all broadcasts
• Each node (in a cluster) is responsible for a subset of the network
• Uses a master broadcaster
Parallelization by Node Cont.

• Forward propagation is straightforward (a forward-step sketch follows below):
  1. Master broadcasts the previous layer's output vector
  2. Each process computes its subset of the current layer's output vector
  3. Master gathers from all processes and prepares the vector for the next broadcast
• Backward propagation is more complicated:
  1. Master scatters the error vector to the current layer
  2. Each process computes the weight change for its subset
  3. Each process computes its error vector for the previous layer
  4. Each process sends its contribution to the error vector to the master
  5. Master sums the contributions and prepares the previous layer's error vector for broadcast
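
A hedged sketch of the forward step in this node-parallel scheme using mpi4py; the layer sizes, the column-wise split of the weight matrix across ranks, and the tanh activation are illustrative assumptions, and the backward step would chain the scatter, local compute, and gather/sum calls in the same style:

```python
# run with e.g.: mpiexec -n 4 python node_parallel.py   (assumes n_out divides evenly)
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n_in, n_out = 8, 16                                # toy layer: 8 inputs -> 16 neurons
rng = np.random.default_rng(rank)
local_W = rng.normal(size=(n_in, n_out // size))   # this rank's subset of the layer

# 1. Master broadcasts the previous layer's output vector
x = np.ones(n_in) if rank == 0 else None
x = comm.bcast(x, root=0)

# 2. Each process computes its subset of the current layer's output vector
local_out = np.tanh(x @ local_W)

# 3. Master gathers the pieces and prepares the vector for the next broadcast
pieces = comm.gather(local_out, root=0)
if rank == 0:
    layer_output = np.concatenate(pieces)
    print(layer_output.shape)                      # (16,)
```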
Results of Node Parallelization
• MPI used for communication between nodes
• 32-machine cluster of Intel Pentium II
• Up to 16.36x speedup with 32 processes
Results of Node Parallelization Cont.
Parallelism by training example
• Each process determines the weight change on a disjoint subset of the training population
• Uses a master-slave style topology
• Changes are aggregated and then applied to the neural network after each epoch (one full pass over the training set)
• Low levels of synchronization needed
• Only requires two additional steps (see the sketch below)
• Very simple to implement
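
A hedged sketch of exemplar parallelism with mpi4py in the master-slave style described above: each rank computes weight changes on its disjoint subset, and the two extra steps are the aggregation at the master and the broadcast of the updated weights after each epoch; the tiny linear model, synthetic data, and learning rate are illustrative assumptions:

```python
# run with e.g.: mpiexec -n 4 python exemplar_parallel.py   (file name is illustrative)
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

rng = np.random.default_rng(rank)
X_local = rng.normal(size=(100, 4))                   # this rank's disjoint subset
y_local = X_local @ np.array([1.0, -2.0, 0.5, 3.0])   # toy target function

w = np.zeros(4)                                       # every rank starts with identical weights
for epoch in range(50):
    # local weight change computed on this rank's examples only
    grad_local = X_local.T @ (X_local @ w - y_local) / len(y_local)
    # extra step 1: master aggregates the changes from all workers
    grad_sum = np.zeros_like(grad_local)
    comm.Reduce(grad_local, grad_sum, op=MPI.SUM, root=0)
    if rank == 0:
        w -= 0.1 * (grad_sum / size)                  # apply once per epoch
    # extra step 2: master sends the updated weights back out
    w = comm.bcast(w, root=0)

if rank == 0:
    print(w)   # approaches [1, -2, 0.5, 3]
```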
Speedups using Exemplar Parallelization
Max speedup with 32 processes – 16.66x
Conclusion
• Many different strategies for parallelization
• Strategy depends on the shape, size, and type of training data
  – Node parallelism excels at small datasets and on-line learning
  – Exemplar parallelism gives the best performance on large training datasets
• Different topologies will perform radically differently when using the same parallelization strategy
• On-going research
  – GPUs have become very prevalent, due to their ability to perform matrix operations in parallel
    – Sometimes it is harder to link multiple GPUs
  – Large clusters of weaker machines have also become prevalent, due to reduced cost
    – Amazon, Google, and Microsoft offer commercial products for scalable neural networks on their clouds
Questions?