It has been quite a challenge but I now have a working first attempt at Octave code for autoencoder pre-training of weights for a Conditional Restricted Boltzmann Machine, which is shown in the code box below.

```
% PRE-TRAINING FOR THE VISABLE TO HIDDEN AND THE VISIBLE TO VISIBLE WEIGHTS %%%%
% First create a training set and target set for the pre-training
dAuto_Encode_targets = batchdata ; dAuto_Encode_training_data = [] ;
% loop to create the dAuto_Encode_training_data ( n1 == "order" of the gaussian layer of crbm )
for ii = 1 : n1
dAuto_Encode_training_data = [ dAuto_Encode_training_data shift( batchdata , ii ) ] ;
end
% now delete the first n1 rows due to circular shift induced mismatch of data and targets
dAuto_Encode_targets( 1 : n1 , : ) = [] ; dAuto_Encode_training_data( 1 : n1 , : ) = [] ;
% DO RBM PRE-TRAINING FOR THE BOTTOM UP DIRECTED WEIGHTS W %%%%%%%%%%%%%%%%%%%%%
% use rbm trained initial weights instead of using random initialisation for weights
% Doing this because we are not using regularisation in the autoencoder pre-training
epochs = 5000 ;
hidden_layer_size = 2 * size( dAuto_Encode_targets , 2 ) ;
w_weights = gaussian_rbm( dAuto_Encode_targets , minibatch , epochs , hidden_layer_size ) ;
A_weights = gaussian_rbm( dAuto_Encode_training_data , minibatch , epochs , size( dAuto_Encode_targets , 2 ) ) ;
B_weights = gaussian_rbm( dAuto_Encode_training_data , minibatch , epochs , hidden_layer_size ) ;
% create weight update matrices
A_weights_update = zeros( size( A_weights ) ) ;
B_weights_update = zeros( size( B_weights ) ) ;
w_weights_update = zeros( size( w_weights ) ) ;
% for adagrad
historical_A = zeros( size( A_weights ) ) ;
historical_B = zeros( size( B_weights ) ) ;
historical_w = zeros( size( w_weights ) ) ;
% set some training parameters
n = size( dAuto_Encode_training_data , 1 ) ; % number of training examples in dAuto_Encode_training_data
input_layer_size = size( dAuto_Encode_training_data , 2 ) ;
fudge_factor = 1e-6 ; % for numerical stability for adagrad
learning_rate = 0.1 ; % will be changed to 0.01 after 50 iters through epoch loop
mom = 0 ; % will be changed to 0.9 after 50 iters through epoch loop
noise = 0.5 ;
epochs = 1000 ;
cost = zeros( epochs , 1 ) ;
lowest_cost = inf ;
% Stochastic Gradient Descent training over dAuto_Encode_training_data
for iter = 1 : epochs
% change momentum and learning_rate after 50 iters
if iter == 50
mom = 0.9 ;
learning_rate = 0.01 ;
end
index = randperm( n ) ; % randomise the order of training examples
for training_example = 1 : n
% Select data for this training batch
tmp_X = dAuto_Encode_training_data( index( training_example ) , : ) ;
tmp_T = dAuto_Encode_targets( index( training_example ) , : ) ;
% Randomly black out some of the input training data
tmp_X( rand( size( tmp_X ) ) < noise ) = 0 ;
% feedforward tmp_X through B_weights and get sigmoid e.g ret = 1.0 ./ ( 1.0 + exp(-input) )
tmp_X_through_sigmoid = 1.0 ./ ( 1.0 .+ exp( - B_weights * tmp_X' ) ) ;
% Randomly black out some of tmp_X_through_sigmoid for dropout training
tmp_X_through_sigmoid( rand( size( tmp_X_through_sigmoid ) ) < noise ) = 0 ;
% feedforward tmp_X through A_weights and add to tmp_X_through_sigmoid * w_weights for linear output layer
final_output_layer = ( tmp_X * A_weights' ) .+ ( tmp_X_through_sigmoid' * w_weights ) ;
% now do backpropagation
% this is the derivative of weights for the linear final_output_layer
delta_out = ( tmp_T - final_output_layer ) ;
% NOTE! gradient of sigmoid function g = sigmoid(z) .* ( 1.0 .- sigmoid(z) )
sig_grad = tmp_X_through_sigmoid .* ( 1 .- tmp_X_through_sigmoid ) ;
% backpropagation only through the w_weights that are connected to tmp_X_through_sigmoid
delta_hidden = ( w_weights * delta_out' ) .* sig_grad ;
% apply deltas from backpropagation with adagrad for the weight updates
historical_A = historical_A .+ ( delta_out' * tmp_X ).^2 ;
A_weights_update = mom .* A_weights_update .+ ( learning_rate .* ( delta_out' * tmp_X ) ) ./ ( fudge_factor .+ sqrt( historical_A ) ) ;
historical_w = historical_w .+ ( tmp_X_through_sigmoid * delta_out ).^2 ;
w_weights_update = mom .* w_weights_update .+ ( learning_rate .* ( tmp_X_through_sigmoid * delta_out ) ) ./ ( fudge_factor .+ sqrt( historical_w ) ) ;
historical_B = historical_B .+ ( delta_hidden * tmp_X ).^2 ;
B_weights_update = mom .* B_weights_update .+ ( learning_rate .* ( delta_hidden * tmp_X ) ) ./ ( fudge_factor .+ sqrt( historical_B ) ) ;
% update the weight matrices with weight_updates
A_weights = A_weights + A_weights_update ;
B_weights = B_weights + B_weights_update ;
w_weights = w_weights + w_weights_update ;
end % end of training_example loop
% feedforward with this epoch's updated weights
epoch_trained_tmp_X_through_sigmoid = ( 1.0 ./ ( 1.0 .+ exp( -( ( B_weights./2 ) * dAuto_Encode_training_data' ) ) ) ) ;
epoch_trained_output = ( dAuto_Encode_training_data * ( A_weights./2 )' ) .+ ( epoch_trained_tmp_X_through_sigmoid' * ( w_weights./2 ) ) ;
% get sum squared error cost
cost( iter , 1 ) = sum( sum( ( dAuto_Encode_targets .- epoch_trained_output ) .^ 2 ) ) ;
% record best so far
if cost( iter , 1 ) <= lowest_cost
lowest_cost = cost( iter , 1 ) ;
best_A = A_weights ./ 2 ;
best_B = B_weights ./ 2 ;
best_w = w_weights ./ 2;
end
end % end of backpropagation loop
lowest_cost % print final cost to terminal
graphics_toolkit( "qt" ) ;
figure(1) ; plot( cost , 'r' , 'linewidth' , 3 ) ; % and plot the cost curve
% plot weights
graphics_toolkit( "gnuplot" ) ;
figure(2) ; surf( best_A ) ; title( 'Best A Weights' ) ;
figure(3) ; surf( best_B ) ; title( 'Best B Weights' ) ;
figure(4) ; surf( best_w ) ; title( 'Best w Weights' ) ;
% END OF CRBM WEIGHT PRE-TRAINING %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
```

Prior to this code being run there is a bit of data pre-processing to create data batches and their indexes, and of course afterwards the generated weights will be used as initial starting weights for the CRBM training.

I will not discuss the code very much in this post as it will probably be heavily revised in the near future. However, I will point out a few things about it, namely

- I have used Restricted Boltzmann machine pre-training as initial weights for the autoencoder training
- I have applied dropout to the single hidden layer in the autoencoder
- early stopping, Momentum and AdaGrad are employed
- bias units are not included
- the code plots some charts for visual inspection (see below), which probably will not be present in the code’s final version

Non embedded view

The lower pane shows the evolution of cost on the Y axis vs. epoch on the X axis. The upper three panes show the weights on what are called the A, B and w weights respectively. A are the direct auto-regressive weights from training data to target data without any intervening hidden layers, B are the weights to the hidden sigmoid layer and w are the weights from the hidden layer to the targets. It is interesting to see how the A weights remain relatively consistent over the different sessions.