Full networks¶
WGAN_GP¶

class numpy_ml.neural_nets.models.WGAN_GP(g_hidden=512, init='he_uniform', optimizer='RMSProp(lr=0.0001)', debug=False)[source]¶
A Wasserstein generative adversarial network (WGAN) architecture with gradient penalty (GP).
Notes
In contrast to a regular WGAN, WGAN-GP uses a gradient penalty on the critic rather than weight clipping to encourage the 1-Lipschitz constraint:

\[ | \text{Critic}(\mathbf{x}_1) - \text{Critic}(\mathbf{x}_2) | \leq | \mathbf{x}_1 - \mathbf{x}_2 | \ \ \ \ \forall \mathbf{x}_1, \mathbf{x}_2\]

In other words, the critic must have input gradients with a norm of at most 1 under the \(\mathbf{X}_{real}\) and \(\mathbf{X}_{fake}\) data distributions.
To enforce this constraint, WGAN-GP penalizes the model if the critic's input gradient norm moves away from a target norm of 1. See WGAN_GPLoss for more details.
In contrast to a standard WGAN, WGAN-GP avoids using BatchNorm in the critic, as correlation between samples in a batch can impact the stability of the gradient penalty.
WGAN-GP architecture:

    X_real ------------------------|
                                   |--> [Critic] --> Y_out
    Z --> [Generator] --> X_fake --|
where [Generator] is

    FC1 -> ReLU -> FC2 -> ReLU -> FC3 -> ReLU -> FC4
and [Critic] is

    FC1 -> ReLU -> FC2 -> ReLU -> FC3 -> ReLU -> FC4
and

\[Z \sim \mathcal{N}(0, 1)\]
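The penalty term itself is straightforward to state in numpy. Below is a minimal sketch of the two-sided penalty from Gulrajani et al. (2017) on random interpolates between real and fake samples; critic_grad is a hypothetical callable returning the critic's input gradients, and lambda_=10 is the paper's setting, not a documented default of this class.

    import numpy as np

    def gradient_penalty(X_real, X_fake, critic_grad, lambda_=10.0):
        eps = np.random.rand(X_real.shape[0], 1)      # one eps per example
        X_interp = eps * X_real + (1 - eps) * X_fake  # random interpolates
        dC = critic_grad(X_interp)                    # dCritic / dX_interp
        norms = np.linalg.norm(dC.reshape(len(dC), -1), axis=1)
        return lambda_ * np.mean((norms - 1) ** 2)    # two-sided penalty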
Parameters:
 g_hidden (int) – The number of units in the critic and generator hidden layers. Default is 512.
 init (str) – The weight initialization strategy. Valid entries are {'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform', 'std_normal', 'trunc_normal'}. Default is 'he_uniform'.
 optimizer (str, Optimizer object, or None) – The optimization strategy to use when performing gradient updates. If None, use the SGD optimizer with default parameters. Default is "RMSProp(lr=0.0001)".
 debug (bool) – Whether to store additional intermediate output within self.derived_variables. Default is False.
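A hedged construction sketch using the documented defaults; the optimizer is given as a spec string, exactly as in the signature above.

    from numpy_ml.neural_nets.models import WGAN_GP

    gan = WGAN_GP(
        g_hidden=512, init="he_uniform", optimizer="RMSProp(lr=0.0001)", debug=False
    )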

forward(X, module, retain_derived=True)[source]¶
Perform the forward pass for either the generator or the critic.
Parameters:
 X (ndarray of shape (batchsize, *)) – Input data.
 module ({'C' or 'G'}) – Whether to perform the forward pass for the critic ('C') or for the generator ('G').
 retain_derived (bool) – Whether to retain the variables calculated during the forward pass for use later during backprop. If False, this suggests the layer will not be expected to backprop through wrt. this input. Default is True.
Returns:
 out (ndarray of shape (batchsize, *)) – The output of the final layer of the module.
 Xs (dict) – A dictionary with layer ids as keys and values corresponding to the input to each intermediate layer during the forward pass. Useful during debugging.

backward(grad, module, retain_grads=True)[source]¶
Perform the backward pass for either the generator or the critic.
Parameters:
 grad (ndarray of shape (batchsize, *) or list of arrays) – Gradient of the loss with respect to the module output(s).
 module ({'C' or 'G'}) – Whether to perform the backward pass for the critic ('C') or for the generator ('G').
 retain_grads (bool) – Whether to include the intermediate parameter gradients computed during the backward pass in the final parameter update. Default is True.
Returns:
 out (ndarray of shape (batchsize, *)) – The gradient of the loss with respect to the module input.
 dXs (dict) – A dictionary with layer ids as keys and values corresponding to the input to each intermediate layer during the backward pass. Useful during debugging.

update_critic(X_real)[source]¶
Compute parameter gradients for the critic on a single minibatch.
Parameters:
 X_real (ndarray of shape (batchsize, n_feats)) – Input data.
Returns:
 C_loss (float) – The critic loss on the current data.

update_generator(X_shape)[source]¶
Compute parameter gradients for the generator on a single minibatch.
Parameters:
 X_shape (tuple of (batchsize, n_feats)) – Shape for the input batch.
Returns:
 G_loss (float) – The generator loss on the fake data (generated during the critic update).

update(module, module_loss=None)[source]¶
Perform gradient updates and flush gradients upon completion.
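Taken together, these three methods suggest the usual WGAN-GP alternation of several critic steps per generator step. A hedged sketch, assuming update_critic / update_generator only accumulate gradients and update() applies and flushes them, and that internal state normally prepared by fit (e.g. the penalty coefficient) is already in place:

    import numpy as np
    from numpy_ml.neural_nets.models import WGAN_GP

    gan = WGAN_GP()
    X_real = np.random.randn(1000, 64)  # placeholder (n_ex, n_feats) data
    batchsize, c_updates_per_epoch = 128, 5

    for _ in range(c_updates_per_epoch):
        idx = np.random.choice(len(X_real), batchsize, replace=False)
        C_loss = gan.update_critic(X_real[idx])  # critic gradients on one minibatch
        gan.update("C", C_loss)                  # apply and flush them

    G_loss = gan.update_generator((batchsize, X_real.shape[1]))
    gan.update("G", G_loss)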

fit(X_real, lambda_, n_steps=1000, batchsize=128, c_updates_per_epoch=5, verbose=True)[source]¶
Fit WGAN_GP on a training dataset.
Parameters:
 X_real (ndarray of shape (n_ex, n_feats)) – Training dataset.
 lambda_ (float) – Gradient penalty coefficient for the critic loss.
 n_steps (int) – The maximum number of generator updates to perform. Default is 1000.
 batchsize (int) – Number of examples to use in each training minibatch. Default is 128.
 c_updates_per_epoch (int) – The number of critic updates to perform at each generator update. Default is 5.
 verbose (bool) – Print loss values after each update. If False, only print loss every 100 steps. Default is True.
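A hedged end-to-end sketch; X_real is random placeholder data, and lambda_=10 follows the common WGAN-GP setting from the literature, not a documented default.

    import numpy as np
    from numpy_ml.neural_nets.models import WGAN_GP

    X_real = np.random.randn(1000, 64)  # placeholder (n_ex, n_feats) data
    gan = WGAN_GP(g_hidden=512)
    gan.fit(X_real, lambda_=10.0, n_steps=1000, batchsize=128,
            c_updates_per_epoch=5, verbose=False)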
BernoulliVAE¶

class numpy_ml.neural_nets.models.BernoulliVAE(T=5, latent_dim=256, enc_conv1_pad=0, enc_conv2_pad=0, enc_conv1_out_ch=32, enc_conv2_out_ch=64, enc_conv1_stride=1, enc_pool1_stride=2, enc_conv2_stride=1, enc_pool2_stride=1, enc_conv1_kernel_shape=(5, 5), enc_pool1_kernel_shape=(2, 2), enc_conv2_kernel_shape=(5, 5), enc_pool2_kernel_shape=(2, 2), optimizer='RMSProp(lr=0.0001)', init='glorot_uniform')[source]¶
A variational autoencoder (VAE) with a 2D convolutional encoder and Bernoulli input and output units.
Notes
The VAE architecture is:

                        |--> t_mean -----|
    X --> [Encoder] --->|                |--> [Sampler] --> [Decoder] --> X_recon
                        |--> t_log_var --|
where [Encoder] is

    Conv1 -> ReLU -> MaxPool1 -> Conv2 -> ReLU -> MaxPool2 -> Flatten -> FC1 -> ReLU -> FC2

and [Decoder] is

    FC1 -> FC2 -> Sigmoid
and [Sampler] draws a sample from the distribution

\[\mathcal{N}(\text{t_mean}, \exp \left\{\text{t_log_var}\right\} I)\]

using the reparameterization trick.
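A minimal sketch of that reparameterization step: draw eps ~ N(0, I), then scale and shift it so the sample is a differentiable function of t_mean and t_log_var. This mirrors the math above, not the library's internal code.

    import numpy as np

    def sample_latent(t_mean, t_log_var):
        eps = np.random.randn(*t_mean.shape)           # eps ~ N(0, I)
        return t_mean + np.exp(0.5 * t_log_var) * eps  # std = exp(t_log_var / 2)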
Parameters:  T (int) – The dimension of the variational parameter t. Default is 5.
 enc_conv1_pad (int) – The padding for the first convolutional layer of the encoder. Default is 0.
 enc_conv1_stride (int) – The stride for the first convolutional layer of the encoder. Default is 1.
 enc_conv1_out_ch (int) – The number of output channels for the first convolutional layer of the encoder. Default is 32.
 enc_conv1_kernel_shape (tuple) – The number of rows and columns in each filter of the first convolutional layer of the encoder. Default is (5, 5).
 enc_pool1_kernel_shape (tuple) – The number of rows and columns in the receptive field of the first max pool layer of the encoder. Default is (2, 2).
 enc_pool1_stride (int) – The stride for the first MaxPool layer of the encoder. Default is 2.
 enc_conv2_pad (int) – The padding for the second convolutional layer of the encoder. Default is 0.
 enc_conv2_out_ch (int) – The number of output channels for the second convolutional layer of the encoder. Default is 64.
 enc_conv2_kernel_shape (tuple) – The number of rows and columns in each filter of the second convolutional layer of the encoder. Default is (5, 5).
 enc_conv2_stride (int) – The stride for the second convolutional layer of the encoder. Default is 1.
 enc_pool2_stride (int) – The stride for the second MaxPool layer of the encoder. Default is 1.
 enc_pool2_kernel_shape (tuple) – The number of rows and columns in the receptive field of the second max pool layer of the encoder. Default is (2, 2).
 latent_dim (int) – The dimension of the output for the first FC layer of the encoder. Default is 256.
 optimizer (str, Optimizer object, or None) – The optimization strategy to use when performing gradient updates. If None, use the SGD optimizer with default parameters. Default is "RMSProp(lr=0.0001)".
 init (str) – The weight initialization strategy. Valid entries are {'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform', 'std_normal', 'trunc_normal'}. Default is 'glorot_uniform'.

fit(X_train, n_epochs=20, batchsize=128, verbose=True)[source]¶
Fit the VAE to a training dataset.
Parameters:
 X_train (ndarray of shape (n_ex, in_rows, in_cols, in_ch)) – The input volume.
 n_epochs (int) – The maximum number of training epochs to run. Default is 20.
 batchsize (int) – The desired number of examples in each training batch. Default is 128.
 verbose (bool) – Print batch information during training. Default is True.
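A hedged usage sketch; X_train here is random binary placeholder data in the (n_ex, in_rows, in_cols, in_ch) layout given above.

    import numpy as np
    from numpy_ml.neural_nets.models import BernoulliVAE

    X_train = (np.random.rand(500, 28, 28, 1) > 0.5).astype(float)
    vae = BernoulliVAE(T=5, latent_dim=256)
    vae.fit(X_train, n_epochs=20, batchsize=128, verbose=True)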
Word2Vec¶

class numpy_ml.neural_nets.models.Word2Vec(context_len=5, min_count=None, skip_gram=False, max_tokens=None, embedding_dim=300, filter_stopwords=True, noise_dist_power=0.75, init='glorot_uniform', num_negative_samples=64, optimizer='SGD(lr=0.1)')[source]¶
A word2vec model supporting both continuous bag of words (CBOW) and skip-gram architectures, with training via noise contrastive estimation.
Parameters:  context_len (int) – The number of words to the left and right of the current word to use as context during training. Larger values result in more training examples and thus can lead to higher accuracy at the expense of additional training time. Default is 5.
 min_count (int or None) – Minimum number of times a token must occur in order to be included in vocab. If None, include all tokens from corpus_fp in vocab. Default is None.
 skip_gram (bool) – Whether to train the skip-gram or CBOW model. The skip-gram model is trained to predict the words in the surrounding context, words[i - context:i] and words[i + 1:i + 1 + context], given the target word i. Default is False.
 max_tokens (int or None) – Only add the first max_tokens most frequent tokens that occur more than min_count to the vocabulary. If None, add all tokens that occur more than min_count. Default is None.
 embedding_dim (int) – The number of dimensions in the final word embeddings. Default is 300.
 filter_stopwords (bool) – Whether to remove stopwords before encoding the words in the corpus. Default is True.
 noise_dist_power (float) – The power the unigram count is raised to when computing the noise distribution for negative sampling. A value of 0 corresponds to a uniform distribution over tokens, and a value of 1 corresponds to a distribution proportional to the token unigram counts; see the sketch after this parameter list. Default is 0.75.
 init ({'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'}) – The weight initialization strategy. Default is ‘glorot_uniform’.
 num_negative_samples (int) – The number of negative samples to draw from the noise distribution for each positive training sample. If 0, use the hierarchical softmax formulation of the model instead. Default is 64.
 optimizer (str, Optimizer object, or None) – The optimization strategy to use when performing gradient updates within the update method. If None, use the SGD optimizer with default parameters. Default is "SGD(lr=0.1)".
Variables:
 parameters (dict)
 hyperparameters (dict)
 derived_variables (dict)
 gradients (dict)
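A hedged sketch of the noise distribution controlled by noise_dist_power: unigram counts raised to a power, then renormalized. counts is a hypothetical array of per-token unigram counts, not a library attribute.

    import numpy as np

    def noise_distribution(counts, noise_dist_power=0.75):
        p = counts ** noise_dist_power
        return p / p.sum()  # power=0 -> uniform; power=1 -> unigram frequencies

    print(noise_distribution(np.array([100.0, 10.0, 1.0])))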
Notes
The word2vec model is outlined in [1].
CBOW architecture:

    w_{t-R}   --|
    w_{t-R+1} --|
    ...         |--> Average --> Embedding layer --> [NCE Layer / HSoftmax] --> P(w_t | w_{...})
    w_{t+R-1} --|
    w_{t+R}   --|
Skip-gram architecture:

                                                           |--> P(w_{t-R} | w_t)
                                                           |--> P(w_{t-R+1} | w_t)
    w_t --> Embedding layer --> [NCE Layer / HSoftmax] --->|    ...
                                                           |--> P(w_{t+R-1} | w_t)
                                                           |--> P(w_{t+R} | w_t)
where \(w_{i}\) is the one-hot representation of the word at position i within a sentence in the corpus and R is the length of the context window on either side of the target word.
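A hedged sketch of how (input, prediction-target) pairs could be formed under the two architectures diagrammed above; words is a hypothetical list of token IDs and R is the context window length. This illustrates the objectives, not the library's minibatcher.

    def training_pairs(words, R, skip_gram):
        for i, w in enumerate(words):
            context = words[max(0, i - R):i] + words[i + 1:i + 1 + R]
            if skip_gram:
                for c in context:  # skip-gram: predict each context word from w_t
                    yield [w], c
            else:                  # CBOW: predict w_t from the full context
                yield context, w

    pairs = list(training_pairs([0, 1, 2, 3, 4], R=2, skip_gram=True))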
References
[1] Mikolov et al. (2013). “Distributed representations of words and phrases and their compositionality,” Proceedings of the 26th International Conference on Neural Information Processing Systems. https://arxiv.org/pdf/1310.4546.pdf 
forward(X, targets, retain_derived=True)[source]¶
Evaluate the network on a single minibatch.
Parameters:
 X (ndarray of shape (n_ex, n_in)) – Layer input, representing a minibatch of n_ex examples, each consisting of n_in integer word indices.
 targets (ndarray of shape (n_ex,)) – Target word index for each example in the minibatch.
 retain_derived (bool) – Whether to retain the variables calculated during the forward pass for use later during backprop. If False, this suggests the layer will not be expected to backprop through wrt. this input. Default is True.
Returns:
 loss (float) – The loss associated with the current minibatch.
 y_pred (ndarray of shape (n_ex,)) – The conditional probabilities of the words in targets given the corresponding example / context in X.

get_embedding(word_ids)[source]¶
Retrieve the embeddings for a collection of word IDs.
Parameters:
 word_ids (ndarray of shape (M,)) – An array of word IDs to retrieve embeddings for.
Returns:
 embeddings (ndarray of shape (M, n_out)) – The embedding vectors for each of the M word IDs.
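A hedged usage fragment, assuming model is an already-fitted Word2Vec instance; the word IDs are arbitrary illustrative values.

    import numpy as np

    word_ids = np.array([1, 7, 42])
    vecs = model.get_embedding(word_ids)  # shape (3, embedding_dim)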

minibatcher(corpus_fps, encoding)[source]¶
A minibatch generator for skip-gram and CBOW models.
Parameters:
 corpus_fps (str or list of strs) – The filepath / list of filepaths to the document(s) to be encoded. Each document is expected to be encoded as a newline-separated string of text, with adjacent tokens separated by a whitespace character.
 encoding (str) – Specifies the text encoding for the corpus. This value is passed directly to Python's open builtin. Common entries are either 'utf-8' (no header byte) or 'utf-8-sig' (header byte).
Yields:
 X (list of length batchsize or ndarray of shape (batchsize, n_in)) – The context IDs for a minibatch of batchsize examples. If self.skip_gram is False, X will be a ragged list consisting of batchsize variable-length lists. If self.skip_gram is True, all sublists will be of the same length (n_in) and X will be returned as an ndarray of shape (batchsize, n_in).
 target (ndarray of shape (batchsize, 1)) – The target IDs associated with each example in X.

fit(corpus_fps, encoding='utf-8-sig', n_epochs=20, batchsize=128, verbose=True)[source]¶
Learn word2vec embeddings for the documents in corpus_fps.
Parameters:
 corpus_fps (str or list of strs) – The filepath / list of filepaths to the document(s) to be encoded. Each document is expected to be encoded as a newline-separated string of text, with adjacent tokens separated by a whitespace character.
 encoding (str) – Specifies the text encoding for the corpus. Common entries are either 'utf-8' (no header byte) or 'utf-8-sig' (header byte). Default is 'utf-8-sig'.
 n_epochs (int) – The maximum number of training epochs to run. Default is 20.
 batchsize (int) – The desired number of examples in each training batch. Default is 128.
 verbose (bool) – Print batch information during training. Default is True.
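A hedged end-to-end sketch; "./corpus.txt" stands in for a hypothetical newline-separated, whitespace-tokenized text file, per the docstring above.

    from numpy_ml.neural_nets.models import Word2Vec

    model = Word2Vec(context_len=5, embedding_dim=300, skip_gram=True)
    model.fit("./corpus.txt", encoding="utf-8-sig", n_epochs=20, batchsize=128)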