Miscellaneous Modules
June 27, 2017
Miscellaneous modules and criterions :
- `MaskZero` : zeroes the `output` and `gradOutput` rows of the decorated module for commensurate `input` rows which are tensors of zeros (version 1), or for `zeroMask` elements which are 1 (version 2);
- `LookupTableMaskZero` : extends `nn.LookupTable` to support zero indexes for padding. Zero indexes are forwarded as tensors of zeros;
- `MaskZeroCriterion` : zeroes the `gradInput` and `loss` rows of the decorated criterion for commensurate `input` rows which are tensors of zeros (version 1), or for `zeroMask` elements which are 1 (version 2);
- `ReverseSequence` : reverses the order of elements in a sequence (table or tensor);
- `ReverseUnreverse` : used internally by `nn.BiSequencer` for decorating the `bwdRNN`;
- `SpatialGlimpse` : takes a foveated glimpse of an image at a given location;
- `NCEModule` : optimized replacement for a `Linear` + `SoftMax` using noise-contrastive estimation;
- `NCECriterion` : criterion exclusively used with `NCEModule`;
- `VariableLength` : decorates a `Sequencer` to accept and produce a table of variable length inputs and outputs;
- dpnn `Module` extensions : the `Module` interface has been further extended with methods that facilitate stochastic gradient descent, like `updateGradParameters` (for momentum learning), `weightDecay`, `maxParamNorm` (for regularization), and so on.
MaskZero
This module implements zero-masking.
Zero-masking zeroes specific rows/samples of a module's output and gradInput states.
Zero-masking is used for efficiently processing variable length sequences.
mz = nn.MaskZero(module, [v1, maskinput, maskoutput])
This module zeroes the `output` and `gradOutput` rows of the decorated module where:
- the commensurate row of the `input` is a tensor of zeros (version 1, with `v1=true`); or
- the commensurate element of the `zeroMask` tensor is 1 (version 2, with `v1=false`, the default).
Version 2 (the default) requires that `setZeroMask(zeroMask)`
be called beforehand. The `zeroMask` must be a `torch.ByteTensor` or `torch.CudaByteTensor` of size `batchsize`.
At a given time-step `t`, a sample `i` is masked when:
- `input[i]` is a row of zeros (version 1), where `input` is a batched time-step; or
- `zeroMask[{t,i}] = 1` (version 2).
When a sample time-step is masked, the hidden state is effectively reset (that is, forgotten) for the next non-masked time-step. In other words, it is possible to separate unrelated sequences with a masked element.
When `maskoutput=true` (the default), `output` and `gradOutput` are zero-masked.
When `maskinput=true` (not the default), `input` and `gradInput` are zero-masked.
Zero-masking only supports batch mode.
Caveat: `MaskZero` does not guarantee that the `output` and `gradOutput` tensors of the internal modules
of the decorated module will be zeroed.
`MaskZero` only affects the immediate `gradOutput` and `output` of the module that it encapsulates.
However, for most modules, the gradient update for that time-step will be zero because
backpropagating a gradient of zeros will typically yield zeros all the way to the input.
In this respect, `AbstractRecurrent` instances should not be encapsulated inside a `MaskZero`,
as gradients can flow between different time-steps internally.
Instead, call the `AbstractRecurrent.maskZero` method
to encapsulate the internal `stepmodule`.
See the noise-contrastive-estimate.lua script for an example implementation of version 2 zero-masking. See the simple-bisequencer-network-variable.lua script for an example implementation of version 1 zero-masking.
setZeroMask(zeroMask)
Set the zeroMask of the MaskZero module (required for version 2 forwards).
For example,
batchsize = 3
inputsize, outputsize = 2, 1
-- an nn.Linear module decorated with MaskZero (version 2)
module = nn.MaskZero(nn.Linear(inputsize, outputsize))
-- zero-mask the second sample/row
zeroMask = torch.ByteTensor(batchsize):zero()
zeroMask[2] = 1
module:setZeroMask(zeroMask)
-- forward
input = torch.randn(batchsize, inputsize)
output = module:forward(input)
print(output)
0.6597
0.0000
0.8170
[torch.DoubleTensor of size 3x1]
The output is indeed zeroed for the second sample (zeroMask[2] = 1).
The gradInput would also be zeroed in the same way because the gradOutput would be zeroed:
gradOutput = torch.randn(batchsize, outputsize)
gradInput = module:backward(input, gradOutput)
print(gradInput)
0.8187 0.0534
0.0000 0.0000
0.1742 0.0114
[torch.DoubleTensor of size 3x2]
For Container modules, a call to setZeroMask() is propagated to all component modules that expect a zeroMask.
When zeroMask=false, the zero-masking is disabled.
LookupTableMaskZero
This module extends nn.LookupTable to support zero indexes. Zero indexes are forwarded as zero tensors.
lt = nn.LookupTableMaskZero(nIndex, nOutput)
The output Tensor will have each row zeroed when the commensurate row of the input is a zero index.
This lookup table makes it possible to pad sequences with different lengths in the same batch with zero vectors.
Note that this module ignores version 2 zero-masking, and therefore expects inputs to be zeros where needed.
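For example, a minimal sketch of the padding behavior described above (the table and embedding sizes are arbitrary):

```lua
require 'rnn'

-- a lookup table over 10 indexes producing 3-dimensional embeddings
lt = nn.LookupTableMaskZero(10, 3)

-- index 0 marks padding; indexes 1 to 10 are regular entries
input = torch.LongTensor{1, 2, 0}
output = lt:forward(input)
print(output) -- the third row is a zero vector
```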
MaskZeroCriterion
This criterion ignores samples (rows in the input and target tensors)
where the zeroMask ByteTensor passed to MaskZeroCriterion:setZeroMask(zeroMask) is 1.
This criterion only supports batch-mode.
batchsize = 3
zeroMask = torch.ByteTensor(batchsize):zero()
zeroMask[2] = 1 -- the 2nd sample in the batch is ignored
criterion = nn.MSECriterion() -- any batch-mode criterion can be decorated
mzc = nn.MaskZeroCriterion(criterion)
mzc:setZeroMask(zeroMask)
input, target = torch.randn(batchsize, 4), torch.randn(batchsize, 4)
loss = mzc:forward(input, target)
gradInput = mzc:backward(input, target)
assert(gradInput[2]:sum() == 0)
In the above example, the second row of the gradInput Tensor is zero.
This is because the commensurate row in the zeroMask is a one.
The call to forward also disregards the second sample in measuring the loss.
This decorator makes it possible to pad sequences with different lengths in the same batch with zero vectors.
ReverseSequence
module = nn.ReverseSequence()
Reverses the order of elements in a sequence table or a tensor.
Example using table:
print(module:forward{1,2,3,4})
{4,3,2,1}
Example using tensor:
print(module:forward(torch.Tensor({1,2,3,4})))
4
3
2
1
[torch.DoubleTensor of size 4]
ReverseUnreverse
ru = nn.ReverseUnreverse(sequencer)
This module is used internally by the [BiSequencer](sequencer.md#rnn.BiSequencer) module.
The `ReverseUnreverse` module decorates a `sequencer` module like [SeqLSTM](sequencer.md#rnn.SeqLSTM) or [Sequencer](sequencer.md#rnn.Sequencer).
The `sequencer` module is expected to implement the [AbstractSequencer](sequencer.md#rnn.AbstractSequencer) interface.
When calling `forward`, the `seqlen x batchsize [x ...]` `input` tensor is reversed using [ReverseSequence](sequencer.md#rnn.ReverseSequence).
Then the `input` sequences are forwarded (in reverse order) through the `sequencer`.
The resulting `sequencer.output` sequences are reversed with respect to the `input`.
Before being returned to the caller, these are unreversed using another `ReverseSequence`.
SpatialGlimpse
Ref. A. [Recurrent Model for Visual Attention](http://papers.nips.cc/paper/5542-recurrent-models-of-visual-attention.pdf)
module = nn.SpatialGlimpse(size, depth, scale)
A glimpse is the concatenation of down-scaled cropped images of
increasing scale around a given location in a given image.
The `input` is a pair of Tensors: `{image, location}`.
`location` contains the (y,x) coordinates of the center of the different scales
of patches to be cropped from `image`.
Coordinates are between (-1,-1) (top-left) and (1,1) (bottom-right).
The `output` is a batch of glimpses taken in `image` at location (y,x).
`size` can be either a scalar, which specifies the `width = height` of glimpses,
or a table of `{height, width}` to support rectangular glimpses.
`depth` is the number of patches to crop per glimpse (one patch per depth).
`scale` determines the `size(t) = scale * size(t-1)` of successive cropped patches.
So basically, this module can be used to focus the attention of the model
on a region of the input image.
It is commonly used with the RecurrentAttention
module (see this example).
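As a sketch of the interface described above (the image and location values are arbitrary):

```lua
require 'rnn'

-- 3 patches of increasing scale, each rescaled to 8x8
glimpse = nn.SpatialGlimpse(8, 3, 2)

image = torch.randn(1, 3, 32, 32)  -- batch of one 3x32x32 image
location = torch.Tensor{{0, 0}}    -- (y,x) = (0,0) is the center of the image
-- the depth patches are concatenated along the channel dimension
output = glimpse:forward{image, location}
```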
NCEModule
Ref. A RNNLM training with NCE for Speech Recognition
ncem = nn.NCEModule(inputSize, outputSize, k, unigrams, [Z])
When used in conjunction with NCECriterion,
the NCEModule implements noise-contrastive estimation.
The point of NCE is to speed up computation for large `Linear` + `SoftMax` layers.
Computing a forward/backward for `Linear(inputSize, outputSize)` for a large `outputSize` can be very expensive.
This is common when implementing language models with vocabularies of a million words.
In such cases, NCE can be an efficient alternative to computing the full Linear + SoftMax during training and
cross-validation.
The inputSize and outputSize are the same as for the Linear module.
The number of noise samples to be drawn per example is k. A value of 25 should work well.
Increasing it will yield better results, while a smaller value will be more efficient to process.
The unigrams is a tensor of size outputSize that contains the frequencies or probability distribution over classes.
It is used to sample noise samples via a fast implementation of torch.multinomial.
The Z is the normalization constant of the approximated SoftMax.
The default is math.exp(9) as specified in Ref. A.
For inference, or measuring perplexity, the full Linear + SoftMax will need to
be computed. The NCEModule can do this by switching on the following :
ncem:evaluate()
ncem.normalized = true
Furthermore, to simulate Linear + LogSoftMax instead, one need only add the following to the above:
ncem.logsoftmax = true
An example is provided via the rnn package.
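Construction might look like the following sketch, where the vocabulary size and unigram frequencies are made up for illustration:

```lua
require 'rnn'

inputsize, vocabsize, k = 200, 10000, 25
-- unigram frequencies over the vocabulary (need not sum to one)
unigrams = torch.Tensor(vocabsize):uniform(0, 1)

ncem = nn.NCEModule(inputsize, vocabsize, k, unigrams)
ncec = nn.NCECriterion() -- used together with the NCEModule
```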
NCECriterion
ncec = nn.NCECriterion()
This criterion only works with an NCEModule on the output layer. Together, they implement noise-contrastive estimation.
VariableLength
vlrnn = nn.VariableLength(seqrnn, [lastOnly])
This module decorates a seqrnn to accept and produce a table of variable length inputs and outputs.
The `seqrnn` can be any module that accepts and produces a zero-masked sequence as input and output.
These include Sequencer, SeqLSTM, SeqGRU, and so on and so forth.
For example:
maxLength, hiddenSize, batchSize = 10, 4, 3
-- dummy variable length input
input = {}
for i=1,batchSize do
-- each sample is a variable length sequence
input[i] = torch.randn(torch.random(1,maxLength), hiddenSize)
end
-- create zero-masked LSTM (note calls to maskZero())
seqrnn = nn.Sequential()
:add(nn.SeqLSTM(hiddenSize, hiddenSize):maskZero())
:add(nn.Dropout(0.5))
:add(nn.SeqLSTM(hiddenSize, hiddenSize):maskZero())
-- decorate with variable length module
vlrnn = nn.VariableLength(seqrnn)
output = vlrnn:forward(input)
print(output)
{
1 : DoubleTensor - size: 7x4
2 : DoubleTensor - size: 3x4
3 : DoubleTensor - size: 2x4
}
By default lastOnly is false. When true, vlrnn only produces the last step of each variable-length sequence.
These last-steps are output as a tensor:
vlrnn.lastOnly = true
output = vlrnn:forward(input)
print(output)
-1.3430 0.1397 -0.1736 0.6332
-1.0903 0.2746 -0.3415 -0.2061
0.7934 1.1306 0.8104 1.9069
[torch.DoubleTensor of size 3x4]
The module doesn't support CUDA.
Module
The Module interface has been further extended with methods that facilitate stochastic gradient descent like updateGradParameters (for momentum learning), weightDecay, maxParamNorm (for regularization), and so on.
Module.dpnn_parameters
A table that specifies the name of parameter attributes.
Defaults to {'weight', 'bias'}, which is a static variable (i.e. table exists in class namespace).
Sub-classes can define their own table statically.
Module.dpnn_gradParameters
A table that specifies the name of gradient w.r.t. parameter attributes.
Defaults to {'gradWeight', 'gradBias'}, which is a static variable (i.e. table exists in class namespace).
Sub-classes can define their own table statically.
[self] Module:type(type_str)
This function converts all the parameters of a module to the given type_str.
The type_str can be one of the types defined for torch.Tensor
like torch.DoubleTensor, torch.FloatTensor and torch.CudaTensor.
Unlike the type method
defined in nn, this one was overridden to
maintain the sharing of storage
among Tensors. This is especially useful for cloned modules that share parameters and gradParameters.
[clone] Module:sharedClone([shareParams, shareGradParams])
Similar to clone.
Yet when shareParams = true (the default), the cloned module will share the parameters
with the original module.
Furthermore, when shareGradParams = true (the default), the cloned module will share
the gradients w.r.t. parameters with the original module.
This is equivalent to :
clone = mlp:clone()
clone:share(mlp, 'weight', 'bias', 'gradWeight', 'gradBias')
yet it is much more efficient, especially for modules with lots of parameters, as these
Tensors aren't needlessly copied during the clone.
This is particularly useful for RNNs which require efficient copies with shared parameters and gradient w.r.t. parameters for each time-step.
Module:maxParamNorm([maxOutNorm, maxInNorm])
This method implements a hard constraint on the upper bound of the norm of output and/or input neuron weights
(Hinton et al. 2012, p. 2) .
In a weight matrix, this is a constraint on rows (maxOutNorm) and/or columns (maxInNorm), respectively.
It has a regularization effect analogous to weightDecay, but with hyper-parameters that are easier to optimize.
Assumes that parameters are arranged (output dim x ... x input dim).
Only affects parameters with more than one dimension.
The method should normally be called after updateParameters.
It uses the C/CUDA optimized torch.renorm function.
Hint : maxOutNorm = 2 usually does the trick.
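In a training loop, the call order described above might look like this sketch (`mlp` and `learningRate` are assumed to be defined elsewhere):

```lua
-- after the backward pass, apply the parameter update...
mlp:updateParameters(learningRate)
-- ...then constrain each output neuron's weight norm to at most 2
mlp:maxParamNorm(2)
```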
[momGradParams] Module:momentumGradParameters()
Returns a table of Tensors (momGradParams). For each element in the
table, a corresponding parameter (params) and gradient w.r.t. parameters
(gradParams) is returned by a call to parameters.
This method is used internally by updateGradParameters.
Module:updateGradParameters(momFactor [, momDamp, momNesterov])
Applies classic momentum or Nesterov momentum (Sutskever, Martens et al, 2013) to parameter gradients.
Each parameter Tensor (params) has a corresponding Tensor of the same size for gradients w.r.t. parameters (gradParams).
When using momentum learning, another Tensor is added for each parameter Tensor (momGradParams).
This method should be called before updateParameters
as it affects the gradients w.r.t. parameters.
Classic momentum is computed as follows :
momGradParams = momFactor*momGradParams + (1-momDamp)*gradParams
gradParams = momGradParams
where momDamp has a default value of momFactor.
Nesterov momentum (momNesterov = true) is computed as follows (the first line is the same as classic momentum):
momGradParams = momFactor*momGradParams + (1-momDamp)*gradParams
gradParams = gradParams + momFactor*momGradParams
The default is to use classic momentum (momNesterov = false).
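The call order matters; a sketch of one training iteration (`mlp`, `input`, `gradOutput` and `learningRate` are assumed to be defined elsewhere):

```lua
mlp:zeroGradParameters()
mlp:backward(input, gradOutput)
mlp:updateGradParameters(0.9) -- classic momentum with momFactor = 0.9
mlp:updateParameters(learningRate)
```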
Module:weightDecay(wdFactor [, wdMinDim])
Decays the weights of the parameterized model.
Implements an L2 norm loss on parameters with dimensions greater or equal to wdMinDim (default is 2).
The resulting gradients are stored into the corresponding gradients w.r.t. parameters.
As such, this method should be called before updateParameters.
Module:gradParamClip(cutoffNorm [, moduleLocal])
Implements a constraint on the norm of gradients w.r.t. parameters (Pascanu et al. 2012).
When moduleLocal = false (the default), the norm is calculated globally to Module for which this is called.
So if you call it on an MLP, the norm is computed on the concatenation of all parameter Tensors.
When moduleLocal = true, the norm constraint is applied
to the norm of all parameters in each component (non-container) module.
This method is useful for preventing exploding gradients in recurrent neural networks.
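A sketch of where clipping fits in a training iteration (`rnnModel`, `input`, `gradOutput` and `learningRate` are assumed to be defined elsewhere):

```lua
rnnModel:backward(input, gradOutput)
-- renormalize gradParams if their global norm exceeds 5
rnnModel:gradParamClip(5)
rnnModel:updateParameters(learningRate)
```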