Miscellaneous Modules
June 27, 2017
Miscellaneous modules and criterions :
- `MaskZero` : zeroes the `output` and `gradOutput` rows of the decorated module for commensurate `input` rows which are tensors of zeros (version 1), or for `zeroMask` elements which are 1 (version 2);
- `LookupTableMaskZero` : extends `nn.LookupTable` to support zero indexes for padding. Zero indexes are forwarded as tensors of zeros;
- `MaskZeroCriterion` : zeroes the `gradInput` and `loss` rows of the decorated criterion for commensurate `input` rows which are tensors of zeros (version 1), or for `zeroMask` elements which are 1 (version 2);
- `ReverseSequence` : reverses the order of elements in a sequence (table or tensor);
- `ReverseUnreverse` : used internally by `nn.BiSequencer` for decorating the `bwdRNN`;
- `SpatialGlimpse` : takes a foveated glimpse of an image at a given location;
- `NCEModule` : optimized replacement for a `Linear` + `SoftMax` using noise-contrastive estimation;
- `NCECriterion` : criterion exclusively used with `NCEModule`;
- `VariableLength` : decorates a `Sequencer` to accept and produce a table of variable length inputs and outputs;
- dpnn `Module` extensions : the `Module` interface has been further extended with methods that facilitate stochastic gradient descent, like `updateGradParameters` (for momentum learning), `weightDecay`, `maxParamNorm` (for regularization), and so on.
MaskZero
This module implements zero-masking.
Zero-masking zeroes specific rows/samples of a module's output and gradInput states.
Zero-masking is used for efficiently processing variable length sequences.
mz = nn.MaskZero(module, [v1, maskinput, maskoutput])
This module zeroes the `output` and `gradOutput` rows of the decorated module where:
- the commensurate row of the `input` is a tensor of zeros (version 1, with `v1=true`); or
- the commensurate element of the `zeroMask` tensor is 1 (version 2, with `v1=false`, the default).
Version 2 (the default) requires that `setZeroMask(zeroMask)`
be called beforehand. The `zeroMask` must be a `torch.ByteTensor` or `torch.CudaByteTensor` of size `batchsize`.
At a given time-step `t`, a sample `i` is masked when:
- `input[i]` is a row of zeros (version 1), where `input` is a batched time-step; or
- `zeroMask[{t,i}] = 1` (version 2).
When a sample time-step is masked, the hidden state is effectively reset (that is, forgotten) for the next non-masked time-step. In other words, it is possible to separate unrelated sequences with a masked element.
When `maskoutput=true` (the default), `output` and `gradOutput` are zero-masked.
When `maskinput=true` (not the default), `input` and `gradInput` are zero-masked.
Zero-masking only supports batch mode.
Caveat: `MaskZero` does not guarantee that the `output` and `gradOutput` tensors of the internal modules
of the decorated module will be zeroed.
`MaskZero` only affects the immediate `gradOutput` and `output` of the module that it encapsulates.
However, for most modules, the gradient update for that time-step will be zero because
backpropagating a gradient of zeros will typically yield zeros all the way to the input.
In this respect, `AbstractRecurrent` instances should not be encapsulated inside a `MaskZero`,
as gradients can flow between different time-steps internally.
Instead, call the `AbstractRecurrent.maskZero` method
to encapsulate the internal `stepmodule`.
See the noise-contrastive-estimate.lua script for an example implementation of version 2 zero-masking. See the simple-bisequencer-network-variable.lua script for an example implementation of version 1 zero-masking.
setZeroMask(zeroMask)
Set the zeroMask of the MaskZero module (required for version 2 forwards).
For example,
batchsize = 3
inputsize, outputsize = 2, 1
-- an nn.Linear module decorated with MaskZero (version 2)
module = nn.MaskZero(nn.Linear(inputsize, outputsize))
-- zero-mask the second sample/row
zeroMask = torch.ByteTensor(batchsize):zero()
zeroMask[2] = 1
module:setZeroMask(zeroMask)
-- forward
input = torch.randn(batchsize, inputsize)
output = module:forward(input)
print(output)
0.6597
0.0000
0.8170
[torch.DoubleTensor of size 3x1]
The output is indeed zeroed for the second sample (zeroMask[2] = 1).
The gradInput would also be zeroed in the same way because the gradOutput would be zeroed:
gradOutput = torch.randn(batchsize, outputsize)
gradInput = module:backward(input, gradOutput)
print(gradInput)
0.8187 0.0534
0.0000 0.0000
0.1742 0.0114
[torch.DoubleTensor of size 3x2]
For Container modules, a call to setZeroMask() is propagated to all component modules that expect a zeroMask.
When zeroMask=false, the zero-masking is disabled.
LookupTableMaskZero
This module extends nn.LookupTable to support zero indexes. Zero indexes are forwarded as zero tensors.
lt = nn.LookupTableMaskZero(nIndex, nOutput)
The output Tensor will have each row zeroed when the commensurate row of the input is a zero index.
This lookup table makes it possible to pad sequences with different lengths in the same batch with zero vectors.
Note that this module ignores version 2 zero-masking, and therefore expects inputs to be zeros where needed.
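For example, a minimal sketch of the padding behavior described above (the table and embedding sizes are arbitrary):

```lua
require 'rnn'

-- a lookup table over 10 indexes producing 3-dimensional embeddings
lt = nn.LookupTableMaskZero(10, 3)

-- index 0 marks padding; indexes 1 to 10 are regular entries
input = torch.LongTensor{1, 2, 0}
output = lt:forward(input)
print(output) -- the third row is a zero vector
```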
MaskZeroCriterion
This criterion ignores samples (rows in the input and target tensors)
where the zeroMask ByteTensor passed to MaskZeroCriterion:setZeroMask(zeroMask) is 1.
This criterion only supports batch-mode.
batchsize = 3
zeroMask = torch.ByteTensor(batchsize):zero()
zeroMask[2] = 1 -- the 2nd sample in the batch is ignored
criterion = nn.MSECriterion() -- any batch-mode criterion can be decorated
mzc = nn.MaskZeroCriterion(criterion)
mzc:setZeroMask(zeroMask)
input, target = torch.randn(batchsize, 4), torch.randn(batchsize, 4)
loss = mzc:forward(input, target)
gradInput = mzc:backward(input, target)
assert(gradInput[2]:sum() == 0)
In the above example, the second row of the gradInput Tensor is zero.
This is because the commensurate row in the zeroMask is a one.
The call to forward also disregards the second sample in measuring the loss.
This decorator makes it possible to pad sequences with different lengths in the same batch with zero vectors.
ReverseSequence
module = nn.ReverseSequence()
Reverses the order of elements in a sequence table or a tensor.
Example using table:
print(module:forward{1,2,3,4})
{4,3,2,1}
Example using tensor:
print(module:forward(torch.Tensor({1,2,3,4})))
4
3
2
1
[torch.DoubleTensor of size 4]
ReverseUnreverse
ru = nn.ReverseUnreverse(sequencer)
This module is used internally by the [BiSequencer](sequencer.md#rnn.BiSequencer) module.
The `ReverseUnreverse` module decorates a `sequencer` module like [SeqLSTM](sequencer.md#rnn.SeqLSTM) or [Sequencer](sequencer.md#rnn.Sequencer).
The `sequencer` module is expected to implement the [AbstractSequencer](sequencer.md#rnn.AbstractSequencer) interface.
When calling `forward`, the `seqlen x batchsize [x ...]` `input` tensor is reversed using [ReverseSequence](sequencer.md#rnn.ReverseSequence).
Then the `input` sequences are forwarded (in reverse order) through the `sequencer`.
The resulting `sequencer.output` sequences are reversed with respect to the `input`.
Before being returned to the caller, these are unreversed using another `ReverseSequence`.
SpatialGlimpse
Ref. A. [Recurrent Model for Visual Attention](http://papers.nips.cc/paper/5542-recurrent-models-of-visual-attention.pdf)
module = nn.SpatialGlimpse(size, depth, scale)
A glimpse is the concatenation of down-scaled cropped images of
increasing scale around a given location in a given image.
The `input` is a pair of Tensors: `{image, location}`.
`location` contains the (y,x) coordinates of the center of the different scales
of patches to be cropped from `image`.
Coordinates are between (-1,-1) (top-left) and (1,1) (bottom-right).
The `output` is a batch of glimpses taken in `image` at location (y,x).
`size` can be either a scalar, which specifies the `width = height` of glimpses,
or a table of `{height, width}` to support rectangular glimpses.
`depth` is the number of patches to crop per glimpse (one patch per depth).
`scale` determines the `size(t) = scale * size(t-1)` of successive cropped patches.
So basically, this module can be used to focus the attention of the model
on a region of the input image.
It is commonly used with the RecurrentAttention
module (see this example).
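As a sketch of the interface described above (the image and location values are arbitrary):

```lua
require 'rnn'

-- 3 patches of increasing scale, each rescaled to 8x8
glimpse = nn.SpatialGlimpse(8, 3, 2)

image = torch.randn(1, 3, 32, 32)  -- batch of one 3x32x32 image
location = torch.Tensor{{0, 0}}    -- (y,x) = (0,0) is the center of the image
-- the depth patches are concatenated along the channel dimension
output = glimpse:forward{image, location}
```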
NCEModule
Ref. A RNNLM training with NCE for Speech Recognition
ncem = nn.NCEModule(inputSize, outputSize, k, unigrams, [Z])
When used in conjunction with NCECriterion,
the NCEModule implements noise-contrastive estimation.
The point of NCE is to speed up computation for large `Linear` + `SoftMax` layers.
Computing a forward/backward for `Linear(inputSize, outputSize)` for a large `outputSize` can be very expensive.
This is common when implementing language models with vocabularies of a million words.
In such cases, NCE can be an efficient alternative to computing the full Linear + SoftMax during training and
cross-validation.
The inputSize and outputSize are the same as for the Linear module.
The number of noise samples to be drawn per example is k. A value of 25 should work well.
Increasing it will yield better results, while a smaller value will be more efficient to process.
The unigrams is a tensor of size outputSize that contains the frequencies or probability distribution over classes.
It is used to sample noise samples via a fast implementation of torch.multinomial.
The Z is the normalization constant of the approximated SoftMax.
The default is math.exp(9) as specified in Ref. A.
For inference, or measuring perplexity, the full Linear + SoftMax will need to
be computed. The NCEModule can do this by switching on the following :
ncem:evaluate()
ncem.normalized = true
Furthermore, to simulate Linear + LogSoftMax instead, one need only add the following to the above:
ncem.logsoftmax = true
An example is provided via the rnn package.
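Construction might look like the following sketch, where the vocabulary size and unigram frequencies are made up for illustration:

```lua
require 'rnn'

inputsize, vocabsize, k = 200, 10000, 25
-- unigram frequencies over the vocabulary (need not sum to one)
unigrams = torch.Tensor(vocabsize):uniform(0, 1)

ncem = nn.NCEModule(inputsize, vocabsize, k, unigrams)
ncec = nn.NCECriterion() -- used together with the NCEModule
```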
NCECriterion
ncec = nn.NCECriterion()
This criterion only works with an NCEModule on the output layer. Together, they implement noise-contrastive estimation.
VariableLength
vlrnn = nn.VariableLength(seqrnn, [lastOnly])
This module decorates a seqrnn to accept and produce a table of variable length inputs and outputs.
The `seqrnn` can be any module that accepts and produces a zero-masked sequence as input and output.
These include Sequencer, SeqLSTM, SeqGRU, and so on and so forth.
For example:
maxLength, hiddenSize, batchSize = 10, 4, 3
-- dummy variable length input
input = {}
for i=1,batchSize do
-- each sample is a variable length sequence
input[i] = torch.randn(torch.random(1,maxLength), hiddenSize)
end
-- create zero-masked LSTM (note calls to maskZero())
seqrnn = nn.Sequential()
:add(nn.SeqLSTM(hiddenSize, hiddenSize):maskZero())
:add(nn.Dropout(0.5))
:add(nn.SeqLSTM(hiddenSize, hiddenSize):maskZero())
-- decorate with variable length module
vlrnn = nn.VariableLength(seqrnn)
output = vlrnn:forward(input)
print(output)
{
1 : DoubleTensor - size: 7x4
2 : DoubleTensor - size: 3x4
3 : DoubleTensor - size: 2x4
}
By default lastOnly is false. When true, vlrnn only produces the last step of each variable-length sequence.
These last-steps are output as a tensor:
vlrnn.lastOnly = true
output = vlrnn:forward(input)
print(output)
-1.3430 0.1397 -0.1736 0.6332
-1.0903 0.2746 -0.3415 -0.2061
0.7934 1.1306 0.8104 1.9069
[torch.DoubleTensor of size 3x4]
The module doesn't support CUDA.
Module
The Module interface has been further extended with methods that facilitate stochastic gradient descent like updateGradParameters (for momentum learning), weightDecay, maxParamNorm (for regularization), and so on.
Module.dpnn_parameters
A table that specifies the name of parameter attributes.
Defaults to {'weight', 'bias'}, which is a static variable (i.e. table exists in class namespace).
Sub-classes can define their own table statically.
Module.dpnn_gradParameters
A table that specifies the name of gradient w.r.t. parameter attributes.
Defaults to {'gradWeight', 'gradBias'}, which is a static variable (i.e. table exists in class namespace).
Sub-classes can define their own table statically.
[self] Module:type(type_str)
This function converts all the parameters of a module to the given type_str.
The type_str can be one of the types defined for torch.Tensor
like torch.DoubleTensor, torch.FloatTensor and torch.CudaTensor.
Unlike the type method
defined in nn, this one was overridden to
maintain the sharing of storage
among Tensors. This is especially useful for cloned modules that share parameters and gradParameters.
[clone] Module:sharedClone([shareParams, shareGradParams])
Similar to clone.
Yet when shareParams = true (the default), the cloned module will share the parameters
with the original module.
Furthermore, when shareGradParams = true (the default), the cloned module will share
the gradients w.r.t. parameters with the original module.
This is equivalent to :
clone = mlp:clone()
clone:share(mlp, 'weight', 'bias', 'gradWeight', 'gradBias')
yet it is much more efficient, especially for modules with lots of parameters, as these
Tensors aren't needlessly copied during the clone.
This is particularly useful for RNNs which require efficient copies with shared parameters and gradient w.r.t. parameters for each time-step.
Module:maxParamNorm([maxOutNorm, maxInNorm])
This method implements a hard constraint on the upper bound of the norm of output and/or input neuron weights
(Hinton et al. 2012, p. 2) .
In a weight matrix, this is a constraint on rows (maxOutNorm) and/or columns (maxInNorm), respectively.
It has a regularization effect analogous to weightDecay, but with hyper-parameters that are easier to optimize.
Assumes that parameters are arranged (output dim x ... x input dim).
Only affects parameters with more than one dimension.
The method should normally be called after updateParameters.
It uses the C/CUDA optimized torch.renorm function.
Hint : maxOutNorm = 2 usually does the trick.
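In a training loop, the call order described above might look like this sketch (`mlp` and `learningRate` are assumed to be defined elsewhere):

```lua
-- after the backward pass, apply the parameter update...
mlp:updateParameters(learningRate)
-- ...then constrain each output neuron's weight norm to at most 2
mlp:maxParamNorm(2)
```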
[momGradParams] Module:momentumGradParameters()
Returns a table of Tensors (momGradParams). For each element in the
table, a corresponding parameter (params) and gradient w.r.t. parameters
(gradParams) is returned by a call to parameters.
This method is used internally by updateGradParameters.
Module:updateGradParameters(momFactor [, momDamp, momNesterov])
Applies classic momentum or Nesterov momentum (Sutskever, Martens et al, 2013) to parameter gradients.
Each parameter Tensor (params) has a corresponding Tensor of the same size for gradients w.r.t. parameters (gradParams).
When using momentum learning, another Tensor is added for each parameter Tensor (momGradParams).
This method should be called before updateParameters
as it affects the gradients w.r.t. parameters.
Classic momentum is computed as follows :
momGradParams = momFactor*momGradParams + (1-momDamp)*gradParams
gradParams = momGradParams
where momDamp has a default value of momFactor.
Nesterov momentum (momNesterov = true) is computed as follows (the first line is the same as classic momentum):
momGradParams = momFactor*momGradParams + (1-momDamp)*gradParams
gradParams = gradParams + momFactor*momGradParams
The default is to use classic momentum (momNesterov = false).
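The call order matters; a sketch of one training iteration (`mlp`, `input`, `gradOutput` and `learningRate` are assumed to be defined elsewhere):

```lua
mlp:zeroGradParameters()
mlp:backward(input, gradOutput)
mlp:updateGradParameters(0.9) -- classic momentum with momFactor = 0.9
mlp:updateParameters(learningRate)
```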
Module:weightDecay(wdFactor [, wdMinDim])
Decays the weights of the parameterized model.
Implements an L2 norm loss on parameters with dimensions greater or equal to wdMinDim (default is 2).
The resulting gradients are stored into the corresponding gradients w.r.t. parameters.
As such, this method should be called before updateParameters.
Module:gradParamClip(cutoffNorm [, moduleLocal])
Implements a constraint on the norm of gradients w.r.t. parameters (Pascanu et al. 2012).
When moduleLocal = false (the default), the norm is calculated globally to Module for which this is called.
So if you call it on an MLP, the norm is computed on the concatenation of all parameter Tensors.
When moduleLocal = true, the norm constraint is applied
to the norm of all parameters in each component (non-container) module.
This method is useful for preventing exploding gradients in recurrent neural networks.
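A sketch of where clipping fits in a training iteration (`rnnModel`, `input`, `gradOutput` and `learningRate` are assumed to be defined elsewhere):

```lua
rnnModel:backward(input, gradOutput)
-- renormalize gradParams if their global norm exceeds 5
rnnModel:gradParamClip(5)
rnnModel:updateParameters(learningRate)
```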