PyTorch Doc
Modules
- read A Simple Custom Module
- “Note that the module itself is callable, and that calling it invokes its `forward()` function. This name is in reference to the concepts of “forward pass” and “backward pass”, which apply to each module.
  - The “forward pass” is responsible for applying the computation represented by the module to the given input(s) (as shown in the above snippet).
  - The “backward pass” computes gradients of module outputs with respect to its inputs, which can be used for “training” parameters through gradient descent methods.
  - PyTorch’s autograd system automatically takes care of this backward pass computation, so it is not required to manually implement a `backward()` function for each module.”
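A minimal sketch of what the docs describe (the module name and layer sizes are my own illustrative choices, not from the docs): defining only `forward()`, then calling the module instance directly, which routes through `forward()`, while autograd supplies the backward pass.

```python
import torch
from torch import nn

class MyLinear(nn.Module):
    """A simple custom module: y = x @ W^T + b."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        # The "forward pass": the computation represented by the module.
        return x @ self.weight.t() + self.bias

m = MyLinear(4, 2)
x = torch.randn(3, 4)
out = m(x)            # calling the module invokes m.forward(x)
out.sum().backward()  # autograd handles the backward pass; no backward() written by hand
print(m.weight.grad.shape)  # torch.Size([2, 4])
```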
How does PyTorch create a computational graph?
Tensors
- “On its own, `Tensor` is just like a numpy `ndarray`. A data structure that lets you do fast linear algebra operations. If you want PyTorch to create a graph corresponding to these operations, you will have to set the `requires_grad` attribute of the `Tensor` to True.”
- “`requires_grad` is contagious. It means that when a `Tensor` is created by operating on other `Tensor`s, the `requires_grad` of the resultant `Tensor` is set to True if at least one of the tensors used for its creation has `requires_grad` set to True.”
- “Each `Tensor` has […] an attribute called `grad_fn`, which refers to the mathematical operator that creates the variable [i.e., for example, if the variable `d` is defined via `d = w3*b + w4*c`, then the `grad_fn` of `d` is the addition operator `+`]. If `requires_grad` is set to False, `grad_fn` would be `None`.” (You can test this with `print("The grad fn for a is", a.grad_fn)`!) (Read this again more carefully in the post!)
- “One can use the member function `is_leaf` to determine whether a variable is a leaf `Tensor` or not.”
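A small sketch to check these claims interactively (the variable names are just for illustration):

```python
import torch

a = torch.randn(3, 3)                      # requires_grad defaults to False
b = torch.randn(3, 3, requires_grad=True)  # ask autograd to track this tensor

c = a * 2   # built only from non-tracked tensors -> not tracked
d = a + b   # requires_grad is "contagious": one tracked input is enough

print(c.requires_grad, c.grad_fn)  # False None
print(d.requires_grad, d.grad_fn)  # True <AddBackward0 object at ...>

# Leaf tensors are the ones created directly by the user, not by an operation.
print(a.is_leaf, b.is_leaf, d.is_leaf)  # True True False
```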
torch.autograd.Function class
- “This class has two important member functions we need to look at.”:
  - `forward` - “simply computes the output using its inputs”
  - `backward` - “takes the incoming gradient coming from the part of the network in front of it. As you can see, the gradient to be backpropagated from a function $f$ is basically the gradient that is backpropagated to $f$ from the layers in front of it, multiplied by the local gradient of the output of $f$ with respect to its inputs. This is exactly what the `backward` function does.” (Read this again more carefully!)
- Let’s again understand this with our example of \(d = f(w_3b, w_4c)\) (see the sketch after this list):
  - $d$ is our `Tensor` here. Its `grad_fn` is `<ThAddBackward>`. This is basically the addition operation, since the function that creates $d$ adds its inputs.
  - The `forward` function of its `grad_fn` receives the inputs $w_3b$ and $w_4c$ and adds them. This value is stored in $d$.
  - The `backward` function of `<ThAddBackward>` takes the incoming gradient from the further layers as its input. This is $\frac{\partial L}{\partial d}$, coming along the edge leading from $L$ to $d$. This gradient is also the gradient of $L$ w.r.t. $d$ and is stored in the `grad` attribute of $d$. It can be accessed by calling `d.grad`.
  - It then computes the local gradients $\frac{\partial d}{\partial (w_4c)}$ and $\frac{\partial d}{\partial (w_3b)}$.
  - Then the `backward` function multiplies the incoming gradient with the locally computed gradients respectively, and “sends” the gradients to its inputs by invoking the `backward` method of the `grad_fn` of those inputs.
  - For example, the `backward` function of `<ThAddBackward>` associated with $d$ invokes the `backward` function of the `grad_fn` of $w_4c$. (Here, $w_4c$ is an intermediate Tensor, and its `grad_fn` is `<ThMulBackward>`.) At the time of invocation of the `backward` function, the gradient $\frac{\partial L}{\partial d} \cdot \frac{\partial d}{\partial (w_4c)}$ is passed as the input.
  - Now, for the variable $w_4c$, $\frac{\partial L}{\partial d} \cdot \frac{\partial d}{\partial (w_4c)}$ becomes the incoming gradient, just like $\frac{\partial L}{\partial d}$ was for $d$ above, and the process repeats.
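A runnable sketch of this example (the loss $L$ here is just a placeholder sum of squares, and `retain_grad()` is my addition so the gradient of the non-leaf tensor `d` can be inspected):

```python
import torch

w3 = torch.randn(1, requires_grad=True)
w4 = torch.randn(1, requires_grad=True)
b = torch.randn(1)
c = torch.randn(1)

d = w3 * b + w4 * c  # d.grad_fn is an addition node
d.retain_grad()      # keep dL/dd on the non-leaf tensor d for inspection
L = (d ** 2).sum()   # placeholder "loss", so there is something in front of d

print(d.grad_fn)     # e.g. <AddBackward0 object at ...>

L.backward()         # autograd walks the graph backwards from L

print(d.grad)            # dL/dd, the incoming gradient at d
print(w3.grad, w4.grad)  # dL/dd * dd/dw3 = dL/dd * b, and dL/dd * c
```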
How are PyTorch’s graphs different from TensorFlow graphs?
- PyTorch creates something called a Dynamic Computation Graph, which means that the graph is generated on the fly.
- in contrast to the Static Computation Graphs used by TensorFlow where the graph is declared before running the program
- Until the `forward` function of a Variable is called, there exists no node for the `Tensor` (its `grad_fn`) in the graph.

  ```python
  a = torch.randn((3, 3), requires_grad=True)   # no graph yet, as a is a leaf
  w1 = torch.randn((3, 3), requires_grad=True)  # same logic as above
  b = w1 * a                                    # graph with node `MulBackward` is created
  ```

- The graph is created as a result of the `forward` function of many Tensors being invoked. Only then are the buffers for the non-leaf nodes of the graph and the intermediate values (used for computing gradients later) allocated. When you call `backward`, as the gradients are computed, these buffers (for non-leaf variables) are essentially freed and the graph is destroyed (in a sense, you can't backpropagate through it, since the buffers holding the values needed to compute the gradients are gone).
- The next time you call `forward` on the same set of tensors, the leaf node buffers from the previous run will be shared, while the buffers for the non-leaf nodes will be created again. (Read that section in the post again!)
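A small sketch of this buffer-freeing behaviour (the tensors are my own toy example): calling `backward()` a second time on the same graph raises an error, unless the graph is rebuilt by a new forward pass or the first call used `retain_graph=True`.

```python
import torch

a = torch.randn(3, 3, requires_grad=True)
w1 = torch.randn(3, 3, requires_grad=True)

loss = (w1 * a).sum()  # graph is built during the forward computation
loss.backward()        # gradients computed; intermediate buffers freed, graph destroyed

try:
    loss.backward()    # second backward through the same (destroyed) graph
except RuntimeError as e:
    print("Second backward failed:", e)

# Re-running the forward pass rebuilds the graph, so backward works again.
(w1 * a).sum().backward()

# Alternatively, retain_graph=True keeps the buffers alive for another backward call.
c = (w1 * a).sum()
c.backward(retain_graph=True)
c.backward()
```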
Weight files
- source:
- There are no differences between the extensions that were listed: `.pt`, `.pth`, `.pwf`. One can use whatever extension (s)he wants. So, if you’re using `torch.save()` for saving models, then it by default uses Python pickle (`pickle_module=pickle`) to save the objects and some metadata. Thus, you have the liberty to choose the extension you want, as long as it doesn’t cause collisions with any other standardized extensions.
- Having said that, it is however not recommended to use the `.pth` extension when checkpointing models, because it collides with Python path (`.pth`) configuration files. Because of this, I myself use `.pth.tar` or `.pt`, but not `.pth` or any other extensions.
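A minimal sketch of the save/load pattern this refers to (the model and filename are just illustrative):

```python
import torch
from torch import nn

model = nn.Linear(4, 2)

# torch.save pickles the object; the file extension is just a convention.
torch.save(model.state_dict(), "checkpoint.pt")

# Load the weights back into a model with the same architecture.
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load("checkpoint.pt"))
restored.eval()
```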