PyTorch Doc
Modules
- read A Simple Custom Module
- “Note that the module itself is callable, and that calling it invokes its `forward()` function. This name is in reference to the concepts of ‘forward pass’ and ‘backward pass’, which apply to each module.”
  - “The ‘forward pass’ is responsible for applying the computation represented by the module to the given input(s) (as shown in the above snippet).”
  - “The ‘backward pass’ computes gradients of module outputs with respect to its inputs, which can be used for ‘training’ parameters through gradient descent methods.”
  - “PyTorch’s autograd system automatically takes care of this backward pass computation, so it is not required to manually implement a `backward()` function for each module.”
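- A minimal sketch in the spirit of the docs’ A Simple Custom Module example (the layer sizes and names below are my own choices): only `forward()` is written by hand, and calling the module instance invokes it.

  ```python
  import torch
  from torch import nn

  class MyLinear(nn.Module):
      def __init__(self, in_features, out_features):
          super().__init__()
          # Parameters are registered on the module and tracked by autograd.
          self.weight = nn.Parameter(torch.randn(in_features, out_features))
          self.bias = nn.Parameter(torch.randn(out_features))

      def forward(self, input):
          # Only the forward pass is defined; autograd derives the backward pass.
          return (input @ self.weight) + self.bias

  m = MyLinear(4, 3)
  sample_input = torch.randn(4)
  print(m(sample_input))  # calling the module invokes m.forward(sample_input)
  ```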
How does PyTorch create a computational graph?
Tensors
- “On its own, a `Tensor` is just like a numpy `ndarray`. A data structure that lets you do fast linear algebra operations. If you want PyTorch to create a graph corresponding to these operations, you will have to set the `requires_grad` attribute of the `Tensor` to `True`.”
- “`requires_grad` is contagious. It means that when a `Tensor` is created by operating on other `Tensor`s, the `requires_grad` of the resultant `Tensor` would be set `True` given at least one of the tensors used for creation has its `requires_grad` set to `True`.”
- “Each `Tensor` has […] an attribute called `grad_fn`, which refers to the mathematical operator that creates the variable [i.e., for example, if the variable `d` is defined via `d = w3*b + w4*c`, then the `grad_fn` of `d` is the addition operator `+`]. If `requires_grad` is set to False, `grad_fn` would be `None`.” (can be tested with `print("The grad fn for a is", a.grad_fn)`!) (read this again more carefully in the post!)
- “One can use the member function `is_leaf` to determine whether a variable is a leaf `Tensor` or not.” (note: `is_leaf` is an attribute, not a method)
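- A short sketch of the three points above (the tensor names `a`, `b`, `c` are my own):

  ```python
  import torch

  a = torch.randn(3)                        # requires_grad defaults to False
  b = torch.randn(3, requires_grad=True)    # ask autograd to track operations involving b

  c = a * b                                 # requires_grad is contagious: c.requires_grad is True
  print("The grad fn for a is", a.grad_fn)  # None, since a does not require grad
  print("The grad fn for c is", c.grad_fn)  # a multiplication backward node
  print(b.is_leaf, c.is_leaf)               # True False: b was created by the user, c by an operation
  ```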
`torch.autograd.Function` class
- “This class has two important member functions we need to look at”:
  - `forward`
    - “simply computes the output using its inputs”
  - `backward`
    - “takes the incoming gradient coming from the part of the network in front of it. As you can see, the gradient to be backpropagated from a function $f$ is basically the gradient that is backpropagated to $f$ from the layers in front of it multiplied by the local gradient of the output of $f$ with respect to its inputs. This is exactly what the `backward` function does.” (read this again more carefully!)
    - Let’s again understand with our example of \(d = f(w_3b, w_4c)\) (see the sketch after this list)
      - $d$ is our `Tensor` here. Its `grad_fn` is `<ThAddBackward>`. This is basically the addition operation, since the function that creates $d$ adds its inputs.
      - The `forward` function of its `grad_fn` receives the inputs $w_3b$ and $w_4c$ and adds them. This value is basically stored in $d$.
      - The `backward` function of the `<ThAddBackward>` basically takes the incoming gradient from the further layers as the input. This is basically $\frac{\partial{L}}{\partial{d}}$ coming along the edge leading from $L$ to $d$. This gradient is also the gradient of $L$ w.r.t. $d$ and is stored in the `grad` attribute of `d`. It can be accessed by calling `d.grad`.
      - It then computes the local gradients $\frac{\partial{d}}{\partial{w_4c}}$ and $\frac{\partial{d}}{\partial{w_3b}}$.
      - Then the `backward` function multiplies the incoming gradient with the locally computed gradients respectively and “sends” the gradients to its inputs by invoking the `backward` method of the `grad_fn` of its inputs.
      - For example, the `backward` function of `<ThAddBackward>` associated with $d$ invokes the `backward` function of the `grad_fn` of $w_4c$ (here, $w_4c$ is an intermediate `Tensor`, and its `grad_fn` is `<ThMulBackward>`). At the time of invocation of the `backward` function, the gradient $\frac{\partial{L}}{\partial{d}} \cdot \frac{\partial{d}}{\partial{w_4c}}$ is passed as the input.
      - Now, for the variable $w_4c$, $\frac{\partial{L}}{\partial{d}} \cdot \frac{\partial{d}}{\partial{w_4c}}$ becomes the incoming gradient, like $\frac{\partial{L}}{\partial{d}}$ was for $d$ in step 3, and the process repeats.
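- A runnable sketch of the steps above for `d = w3*b + w4*c`. The definitions of `b`, `c` and the loss `L` are assumptions chosen only so that `backward()` has something to propagate; newer PyTorch versions print `<AddBackward0>` / `<MulBackward0>` instead of `<ThAddBackward>` / `<ThMulBackward>`.

  ```python
  import torch

  w3 = torch.randn(1, requires_grad=True)
  w4 = torch.randn(1, requires_grad=True)
  b = torch.randn(1, requires_grad=True)   # assumed inputs
  c = torch.randn(1, requires_grad=True)

  d = w3 * b + w4 * c
  print(d.grad_fn)        # the addition node that created d

  d.retain_grad()         # d is a non-leaf, so dL/dd is only kept in d.grad if we ask for it
  L = (10 - d).sum()      # assumed scalar loss
  L.backward()            # each node multiplies the incoming gradient by its local gradients

  print(d.grad)           # dL/dd = -1
  print(w4.grad, -c)      # dL/dw4 = dL/dd * dd/d(w4*c) * d(w4*c)/dw4 = (-1) * 1 * c = -c
  ```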
How are PyTorch’s graphs different from TensorFlow graphs?
- PyTorch creates something called a Dynamic Computation Graph, which means that the graph is generated on the fly.
- in contrast to the Static Computation Graphs used by TensorFlow, where the graph is declared before running the program
- Until the `forward` function of a Variable is called, there exists no node for the `Tensor` (its `grad_fn`) in the graph.

  ```python
  a = torch.randn((3,3), requires_grad = True)   # No graph yet, as a is a leaf
  w1 = torch.randn((3,3), requires_grad = True)  # Same logic as above
  b = w1*a                                       # Graph with node `mulBackward` is created.
  ```
- The graph is created as a result of the `forward` function of many Tensors being invoked. Only then are the buffers for the non-leaf nodes and intermediate values (used for computing gradients later) allocated for the graph. When you call `backward`, as the gradients are computed, these buffers (for non-leaf variables) are essentially freed, and the graph is destroyed (in a sense, you can't backpropagate through it, since the buffers holding the values needed to compute the gradients are gone).
- The next time you call `forward` on the same set of tensors, the leaf node buffers from the previous run will be shared, while the non-leaf node buffers will be created again. (read this section in the post again!) (see the sketch below)
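- A small sketch of this buffer-freeing behaviour (variable names are my own): calling `backward` a second time on the same graph fails unless `retain_graph=True` is passed on the first call.

  ```python
  import torch

  w1 = torch.randn(3, 3, requires_grad=True)
  a = torch.randn(3, 3, requires_grad=True)

  b = (w1 * a).sum()
  b.backward()                   # gradients computed, non-leaf buffers freed, graph destroyed
  try:
      b.backward()               # second pass over the destroyed graph raises a RuntimeError
  except RuntimeError as err:
      print("second backward failed:", err)

  c = (w1 * a).sum()             # a fresh forward pass rebuilds the graph
  c.backward(retain_graph=True)  # keep the buffers around ...
  c.backward()                   # ... so a second backward works (gradients accumulate in w1.grad, a.grad)
  ```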
Weight files
- source:
- There are no differences between the extensions that were listed: `.pt`, `.pth`, `.pwf`. One can use whatever extension (s)he wants. So, if you’re using `torch.save()` for saving models, then it by default uses python pickle (`pickle_module=pickle`) to save the objects and some metadata. Thus, you have the liberty to choose the extension you want, as long as it doesn’t cause collisions with any other standardized extensions.
- Having said that, it is however not recommended to use the `.pth` extension when checkpointing models because it collides with Python path (`.pth`) configuration files. Because of this, I myself use `.pth.tar` or `.pt` but not `.pth`, or any other extensions.
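- A minimal checkpointing sketch using the `.pt` extension recommended above (the model and file name are my own choices); `torch.save()` pickles the state dict regardless of which extension you pick.

  ```python
  import torch
  from torch import nn

  model = nn.Linear(4, 2)
  torch.save(model.state_dict(), "checkpoint.pt")         # pickled tensors + metadata

  restored = nn.Linear(4, 2)
  restored.load_state_dict(torch.load("checkpoint.pt"))   # the extension doesn't matter to torch.load
  ```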