How do FP32 cores compare to tensor cores?
Where an FP32 core can do 2 FLOP/cycle (one fused multiply-add), a tensor core on the RTX Titan can do 128 FLOP/cycle. So about a 64x speedup for matrix multiply-add.
How to utilize tensor cores in PyTorch?
Use 16-bit precision, or better, automatic mixed precision (search for “mixed precision training”).
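A minimal sketch of the standard mixed-precision recipe with PyTorch’s AMP utilities (the model, shapes, and learning rate here are placeholders); on a GPU with tensor cores, the matrix operations inside the autocast region run in FP16:

```python
import torch

# AMP sketch: matmuls inside the autocast region run in FP16 on CUDA
# (BF16 on CPU); the GradScaler guards FP16 gradients against underflow.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(64, 10).to(device)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

x = torch.randn(32, 64, device=device)
y = torch.randint(0, 10, (32,), device=device)

with torch.autocast(device_type=device,
                    dtype=torch.float16 if device == "cuda" else torch.bfloat16):
    loss = torch.nn.functional.cross_entropy(model(x), y)

scaler.scale(loss).backward()  # scale the loss before backward
scaler.step(opt)               # unscale gradients, then opt.step()
scaler.update()                # adjust the scale factor for next step
```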
Why are many variables/hyperparams set to powers of 2?
Because the underlying hardware operates on power-of-two sized units, these sizes are the most efficient: they fully use the available compute, and no zero padding is needed for matrix multiply-add, for instance.
How do Google’s TPUs stack up against Nvidia’s tensor cores?
Google advertises 420 TFLOPS, Nvidia 130 TFLOPS.
What are important differences between the needs of GPUs for gaming and for deep learning?
- Deep learning requires more memory than gaming, to store weights, activations, and the intermediate results needed for back-propagation.
- Deep learning requires higher memory bandwidth than gaming, as many operations (e.g. activations) are so fast that the bottleneck is the data-transfer speed, not the computation speed.
What are important software frameworks?
These are important frameworks, with the most important in bold.
- Caffe2 (Facebook)
- **PyTorch** (Facebook)
- **TensorFlow** (Google)
- JAX (Google)
- PaddlePaddle (Baidu)
- MXNet (Amazon)
- CNTK (Microsoft)
- Chainer (Preferred Networks, Japan); discontinued
What are the three important concepts in PyTorch?
- Tensors. Arrays for the GPU.
- Autograd. Computes gradients for you.
- Module. Class for network layers.
How does PyTorch’s autograd work?
- All computations involving tensors created with the flag “requires_grad=True” are recorded in a computation graph.
- When the “backward” method is called on the scalar loss, the graph is traversed in reverse order to compute all gradients, which are stored in each tensor’s “grad” property.
- These “grad” properties can then be used to update all tensors that have them. Usually the gradient, multiplied by a scalar learning rate, is subtracted from the current value of each tensor.
- When updating tensors via “grad”, remember to do so inside a “with torch.no_grad():” block (to disable computation-graph building) and to zero all the gradients afterwards.
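The steps above in a minimal sketch (the toy tensors and the learning rate 0.01 are arbitrary):

```python
import torch

# One hand-written gradient-descent step using autograd.
w = torch.randn(3, requires_grad=True)
x = torch.tensor([1.0, 2.0, 3.0])

loss = ((w * x).sum() - 1.0) ** 2  # forward pass records the graph
loss.backward()                    # fills w.grad by reverse traversal

with torch.no_grad():              # don't record the update itself
    w -= 0.01 * w.grad             # gradient scaled by the learning rate
w.grad.zero_()                     # zero gradients before the next step
```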
Why would you define an autograd.Function in PyTorch?
A class inheriting from “autograd.Function” implements a function that appears as a single node in the computation graph PyTorch builds. One could implement the same function with standard Python primitives instead, but then every primitive adds its own node to the graph, which can lead to numerical-instability errors such as NaNs. In practice, however, using “autograd.Function” is not very common.
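As an illustration, a hand-written ReLU that shows up as a single graph node (this mirrors the classic example from the PyTorch tutorials):

```python
import torch

class MyReLU(torch.autograd.Function):
    """ReLU as one graph node: forward and backward defined by hand."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)   # stash inputs needed for backward
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_out):
        x, = ctx.saved_tensors
        return grad_out * (x > 0).float()  # gradient is 1 where x > 0

x = torch.tensor([-1.0, 2.0], requires_grad=True)
MyReLU.apply(x).sum().backward()
# x.grad is tensor([0., 1.])
```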
What is the NN module in PyTorch?
The NN module is an object-oriented way of building networks; it makes building models easier because standard layers, loss functions, and the tracking of parameters with their gradients are all already implemented.
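For example, a small model built from nn building blocks (the layer sizes here are arbitrary):

```python
import torch
from torch import nn

# Standard layers and losses come ready-made; parameters (and later
# their gradients) are tracked automatically.
model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Linear(8, 2),
)

x = torch.randn(5, 4)
out = model(x)  # shape (5, 2)
loss = nn.functional.mse_loss(out, torch.zeros(5, 2))
loss.backward()  # every parameter in `model` now has a .grad
```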
What does the optimizer module in PyTorch do?
It updates all the parameters of a model for you, by looping over them and applying an update rule (e.g. SGD) with the learning rate.
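A typical training-step sketch with torch.optim (the model and data are placeholders):

```python
import torch

model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(8, 4), torch.randn(8, 1)
w_before = model.weight.detach().clone()  # kept to show the update happened

loss = torch.nn.functional.mse_loss(model(x), y)
opt.zero_grad()   # clear gradients from the previous step
loss.backward()   # compute fresh gradients
opt.step()        # loops over parameters: p <- p - lr * p.grad
```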
How to create static computation graphs in PyTorch?
Use “torch.jit.script”, either as a decorator or by calling it on a function or nn.Module.
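A small sketch (the function and values are made up); scripting compiles the Python control flow into the static graph rather than freezing one execution path:

```python
import torch

@torch.jit.script
def scale_and_clip(x: torch.Tensor, lo: float, hi: float) -> torch.Tensor:
    # The `if` is preserved in the compiled graph, not fixed at compile time.
    if x.mean() > 0:
        x = x * 2.0
    return x.clamp(lo, hi)

out = scale_and_clip(torch.tensor([0.5, -0.2]), -1.0, 1.0)
# out is tensor([1.0, -0.4]); the scripted function can also be
# serialized with torch.jit.save for execution outside Python.
```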
Why would one use a dynamic or static computation graph?
- Dynamic graphs allow for more flexibility, as control flow can be used during runtime.
- Static graphs do not change and can therefore be optimized and also serialized. This allows for faster execution, potentially in different engines that do not use Python.
- Dynamic graphs are usually easier to debug.