Parallel Neural Networks
Joe Bradish

Background
• Deep Neural Networks (DNNs) have become one of the leading technologies in artificial intelligence and machine learning
• Used extensively by major corporations – Google, Facebook
• Very expensive to train
• Large datasets – often in the terabytes
• The larger the dataset, the more accurately the network can model the underlying classification function
• The limiting factor has almost always been computational power, but we are starting to reach levels that can solve previously impossible problems

Quick Neural Net Basics
• A set of layers of neurons, which represent synapses in the network
• Each neuron has a set of inputs
  – Either inputs into the entire network
  – Or outputs from previous neurons, usually from the previous layer
• The underlying algorithm of the network can vary greatly
  – Multilayer feedforward
  – Feedback network
  – Self-organizing maps – map from a high dimension to a lower one in a single layer
  – Sparse Distributed Memory – two-layer feedforward, associative memory

Learning / Training
• Everything is reliant on the weights
  – They determine the importance of each signal, which is essential to the network output
• The training cycle adjusts the weights
• By far the most critical step of a successful neural network – no training = useless network
• Network topology is also key to training, and especially to parallelization

Where can we use parallelization?
Typical structure of a neural network:
  For each training session
    For each training example in the session
      For each layer in the network
        For each neuron in the layer
          For all the weights of the neuron
            For all the bits of the weight value
This implies the following levels of parallelization:
• Training session parallelism
• Training example parallelism
• Layer parallelism
• Neuron parallelism
• Weight parallelism
• Bit parallelism

Example – Network Level Parallelism
• Notice that there are many different neural networks, some of which feed into each other
• The outputs are sent to different machines and then aggregated once more

Example – Neuron Level Parallelism
• Each neuron is assigned a specific controlling entity on the communication network
• Each computer is responsible for forwarding the weights to the hub so that the computer controlling the next layer can feed them into the neural network
• Uses a broadcast system

Parallelism by Node
• Used to parallelize serial backpropagation, which is usually implemented as a series of matrix-vector operations
• Achieved using all-to-all broadcasts
• Each node (on a cluster) is responsible for a subset of the network
• Uses a master broadcaster

Parallelization by Node Cont.
• Forward propagation is straightforward; backward propagation is more complicated (a sketch follows the results below)

Forward propagation:
1. Master broadcasts the previous layer's output vector
2. Each process computes its subset of the current layer's output vector
3. Master gathers the results from all processes and prepares the vector for the next broadcast

Backward propagation:
1. Master scatters the error vector to the current layer
2. Each process computes the weight change for its subset
3. Each process computes its contribution to the error vector for the previous layer
4. Each process sends its contribution to the error vector to the master
5. Master sums the contributions and prepares the previous layer's error vector for broadcast

Results of Node Parallelization
• MPI used for communication between nodes
• 32-machine cluster of Intel Pentium II
• Up to 16.36x speedup with 32 processes

Results of Node Parallelization Cont.
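To make the node-level scheme concrete, below is a minimal sketch of one forward and one backward step for a single layer, assuming mpi4py and NumPy are available. The layer sizes, learning rate, and names such as local_W and prev_out are illustrative choices, not part of the original presentation.

```python
# Node (neuron-subset) parallelism for one layer: each MPI rank owns a block
# of the layer's neurons (its rows of the weight matrix).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

n_in, lr = 8, 0.1
n_out = 4 * nprocs                                  # each rank owns n_out // nprocs neurons
rng = np.random.default_rng(rank)
local_W = 0.1 * rng.standard_normal((n_out // nprocs, n_in))  # this rank's weight rows

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# --- forward propagation ---------------------------------------------------
# 1. master broadcasts the previous layer's output vector
prev_out = comm.bcast(np.ones(n_in) if rank == 0 else None, root=0)
# 2. each process computes its subset of the current layer's output vector
local_out = sigmoid(local_W @ prev_out)
# 3. master gathers the pieces and assembles the full output vector
pieces = comm.gather(local_out, root=0)
if rank == 0:
    layer_out = np.concatenate(pieces)
    print("layer output:", layer_out)

# --- backward propagation --------------------------------------------------
# 1. master scatters the current layer's error vector, one chunk per rank
err = np.full(n_out, 0.05) if rank == 0 else None
local_err = comm.scatter(np.split(err, nprocs) if rank == 0 else None, root=0)
# 2. each process computes the weight change for its neuron subset
local_delta = local_err * local_out * (1.0 - local_out)     # sigmoid derivative
# 3./4. each process computes and sends its contribution to the previous
#       layer's error vector; 5. the master sums the contributions
prev_err = comm.reduce(local_W.T @ local_delta, op=MPI.SUM, root=0)
local_W -= lr * np.outer(local_delta, prev_out)             # apply local weight change
if rank == 0:
    print("previous layer error vector:", prev_err)
```

Run with, for example, `mpiexec -n 4 python node_parallel.py` (a hypothetical file name); each rank holds a contiguous block of the layer's neurons, mirroring the broadcast/scatter/gather pattern in the steps above.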
Parallelism by Training Example
• Each process determines the weight change on a disjoint subset of the training population
• Uses a master-slave style topology
• Changes are aggregated and then applied to the neural network after each epoch (set of training)
• Low levels of synchronization needed – only requires two additional steps
• Very simple to implement (see the sketch at the end of this document)

Speedups using Exemplar Parallelization
• Max speedup with 32 processes – 16.66x

Conclusion
• Many different strategies for parallelization
• The best strategy depends on the shape, size, and type of the training data
  – Node parallelism excels at small datasets and on-line learning
  – Exemplar parallelism gives the best performance on large training datasets
• Different topologies will perform radically differently when using the same parallelization strategy

On-going Research
• GPUs have become very prevalent, due to their ability to perform matrix operations in parallel
  – Sometimes it is harder to link multiple GPUs
• Large clusters of weaker machines have also become prevalent, due to reduced cost
• Amazon, Google, and Microsoft offer commercial products for scalable neural networks on their clouds

Questions?
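For completeness, here is a minimal sketch of the exemplar (training-example) parallelism described above, again assuming mpi4py and NumPy. The tiny least-squares model, dataset sizes, and learning rate are placeholders standing in for a real network and its backpropagated weight changes.

```python
# Exemplar parallelism: each rank computes weight changes on its own disjoint
# subset of the training data; the master aggregates and applies them per epoch.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

# The master holds the full training set and splits it into disjoint subsets.
if rank == 0:
    rng = np.random.default_rng(0)
    X = rng.standard_normal((nprocs * 100, 3))
    y = rng.standard_normal(nprocs * 100)
    chunks = list(zip(np.array_split(X, nprocs), np.array_split(y, nprocs)))
else:
    chunks = None
local_X, local_y = comm.scatter(chunks, root=0)

w, lr = np.zeros(3), 0.01
for epoch in range(10):
    # Every process starts the epoch with the same weights.
    w = comm.bcast(w, root=0)
    # Each slave accumulates the weight change over its examples (squared loss).
    local_grad = local_X.T @ (local_X @ w - local_y)
    # Changes are aggregated at the master and applied once per epoch.
    total_grad = comm.reduce(local_grad, op=MPI.SUM, root=0)
    if rank == 0:
        w -= lr * total_grad / (nprocs * 100)

if rank == 0:
    print("trained weights:", w)
```

Launched with `mpiexec -n <processes> python exemplar.py` (again a hypothetical file name), every rank trains on its own disjoint chunk and only the per-epoch reduce and broadcast require synchronization, which is the low-synchronization, master-slave pattern the slides describe.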