Refactored Denoising Autoencoder Code

It has been quite a challenge but I now have a working first attempt at Octave code for autoencoder pre-training of weights for a Conditional Restricted Boltzmann Machine, which is shown in the code box below.

% PRE-TRAINING FOR THE VISABLE TO HIDDEN AND THE VISIBLE TO VISIBLE WEIGHTS %%%%
% First create a training set and target set for the pre-training
dAuto_Encode_targets = batchdata ; dAuto_Encode_training_data = [] ;
  
  % loop to create the dAuto_Encode_training_data ( n1 == "order" of the gaussian layer of crbm )
  for ii = 1 : n1
  dAuto_Encode_training_data = [ dAuto_Encode_training_data shift( batchdata , ii ) ] ;
  end

% now delete the first n1 rows due to circular shift induced mismatch of data and targets
dAuto_Encode_targets( 1 : n1 , : ) = [] ; dAuto_Encode_training_data( 1 : n1 , : ) = [] ; 

% DO RBM PRE-TRAINING FOR THE BOTTOM UP DIRECTED WEIGHTS W %%%%%%%%%%%%%%%%%%%%%
% use rbm trained initial weights instead of using random initialisation for weights
% Doing this because we are not using regularisation in the autoencoder pre-training
epochs = 5000 ;
hidden_layer_size = 2 * size( dAuto_Encode_targets , 2 ) ;
w_weights = gaussian_rbm( dAuto_Encode_targets , minibatch , epochs , hidden_layer_size ) ;
A_weights = gaussian_rbm( dAuto_Encode_training_data , minibatch , epochs , size( dAuto_Encode_targets , 2 ) ) ;
B_weights = gaussian_rbm( dAuto_Encode_training_data , minibatch , epochs , hidden_layer_size ) ;

% create weight update matrices
A_weights_update = zeros( size( A_weights ) ) ;
B_weights_update = zeros( size( B_weights ) ) ;
w_weights_update = zeros( size( w_weights ) ) ;

% for adagrad
historical_A = zeros( size( A_weights ) ) ;
historical_B = zeros( size( B_weights ) ) ;
historical_w = zeros( size( w_weights ) ) ;

% set some training parameters
n = size( dAuto_Encode_training_data , 1 ) ; % number of training examples in dAuto_Encode_training_data
input_layer_size = size( dAuto_Encode_training_data , 2 ) ;
fudge_factor = 1e-6 ; % for numerical stability for adagrad
learning_rate = 0.1 ; % will be changed to 0.01 after 50 iters through epoch loop
mom = 0 ;             % will be changed to 0.9 after 50 iters through epoch loop
noise = 0.5 ;
epochs = 1000 ;
cost = zeros( epochs , 1 ) ;
lowest_cost = inf ;

  % Stochastic Gradient Descent training over dAuto_Encode_training_data 
  for iter = 1 : epochs
   
      % change momentum and learning_rate after 50 iters
      if iter == 50
      mom = 0.9 ;
      learning_rate = 0.01 ;
      end
  
      index = randperm( n ) ; % randomise the order of training examples
     
      for training_example = 1 : n
      
      % Select data for this training batch
      tmp_X = dAuto_Encode_training_data( index( training_example ) , : ) ;
      tmp_T = dAuto_Encode_targets( index( training_example ) , : ) ;
      
      % Randomly black out some of the input training data
      tmp_X( rand( size( tmp_X ) ) &lt; noise ) = 0 ;
      
      % feedforward tmp_X through B_weights and get sigmoid e.g ret = 1.0 ./ ( 1.0 + exp(-input) )
      tmp_X_through_sigmoid = 1.0 ./ ( 1.0 .+ exp( - B_weights * tmp_X' ) ) ;
      
      % Randomly black out some of tmp_X_through_sigmoid for dropout training
      tmp_X_through_sigmoid( rand( size( tmp_X_through_sigmoid ) ) &lt; noise ) = 0 ;
    
      % feedforward tmp_X through A_weights and add to tmp_X_through_sigmoid * w_weights for linear output layer
      final_output_layer = ( tmp_X * A_weights' ) .+ ( tmp_X_through_sigmoid' * w_weights ) ;
    
      % now do backpropagation
      % this is the derivative of weights for the linear final_output_layer
      delta_out = ( tmp_T - final_output_layer ) ;
      
      % NOTE! gradient of sigmoid function g = sigmoid(z) .* ( 1.0 .- sigmoid(z) )
      sig_grad = tmp_X_through_sigmoid .* ( 1 .- tmp_X_through_sigmoid ) ; 
      
      % backpropagation only through the w_weights that are connected to tmp_X_through_sigmoid
      delta_hidden = ( w_weights * delta_out' ) .* sig_grad ;
      
      % apply deltas from backpropagation with adagrad for the weight updates
      historical_A = historical_A .+ ( delta_out' * tmp_X ).^2 ;    
      A_weights_update = mom .* A_weights_update .+ ( learning_rate .* ( delta_out' * tmp_X ) ) ./ ( fudge_factor .+ sqrt( historical_A ) ) ;
      
      historical_w = historical_w .+ ( tmp_X_through_sigmoid * delta_out ).^2 ;
      w_weights_update = mom .* w_weights_update .+ ( learning_rate .* ( tmp_X_through_sigmoid * delta_out ) ) ./ ( fudge_factor .+ sqrt( historical_w ) ) ;
      
      historical_B = historical_B .+ ( delta_hidden * tmp_X ).^2 ;
      B_weights_update = mom .* B_weights_update .+ ( learning_rate .* ( delta_hidden * tmp_X ) ) ./ ( fudge_factor .+ sqrt( historical_B ) ) ;
      
      % update the weight matrices with weight_updates
      A_weights = A_weights + A_weights_update ;
      B_weights = B_weights + B_weights_update ;
      w_weights = w_weights + w_weights_update ;
      
      end % end of training_example loop
  
  % feedforward with this epoch's updated weights
  epoch_trained_tmp_X_through_sigmoid = ( 1.0 ./ ( 1.0 .+ exp( -( ( B_weights./2 ) * dAuto_Encode_training_data' ) ) ) ) ;
  epoch_trained_output = ( dAuto_Encode_training_data * ( A_weights./2 )' ) .+ ( epoch_trained_tmp_X_through_sigmoid' * ( w_weights./2 ) ) ;
 
  % get sum squared error cost
  cost( iter , 1 ) = sum( sum( ( dAuto_Encode_targets .- epoch_trained_output ) .^ 2 ) ) ;
  
    % record best so far
    if cost( iter , 1 ) &lt;= lowest_cost
       lowest_cost = cost( iter , 1 ) ;
       best_A = A_weights ./ 2 ;
       best_B = B_weights ./ 2 ;
       best_w = w_weights ./ 2;
    end
  
  end % end of backpropagation loop

lowest_cost                                        % print final cost to terminal
graphics_toolkit( "qt" ) ;
figure(1) ; plot( cost , 'r' , 'linewidth' , 3 ) ; % and plot the cost curve

% plot weights
graphics_toolkit( "gnuplot" ) ;
figure(2) ; surf( best_A ) ; title( 'Best A Weights' ) ;
figure(3) ; surf( best_B ) ; title( 'Best B Weights' ) ;
figure(4) ; surf( best_w ) ; title( 'Best w Weights' ) ;

% END OF CRBM WEIGHT PRE-TRAINING %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Prior to this code being run there is a bit of data pre-processing to create data batches and their indexes, and of course afterwards the generated weights will be used as initial starting weights for the CRBM training.

I will not discuss the code very much in this post as it will probably be heavily revised in the near future. However, I will point out a few things about it, namely

I have used Restricted Boltzmann machine pre-training as initial weights for the autoencoder training
I have applied dropout to the single hidden layer in the autoencoder
early stopping, Momentum and AdaGrad are employed
bias units are not included
the code plots some charts for visual inspection (see below), which probably will not be present in the code’s final version

My reason for using RBM pre-training instead of random initialisation of weights is simple – RBM training generally seems to result in a lower average cost when the trained autoencoder itself is applied to the training data.

Below is a short video of a desktop slide show which shows typical, final, pre-trained weights from a sequence of 20 training sessions. The first 10 slides are random initialisation and the last 10 are initialisation with RBM training. Note the smoother cost curve for the RBM initialisation. Apart from this initialisation all other training parameters were exactly the same for all training sessions.

Non embedded view
The lower pane shows the evolution of cost on the Y axis vs. epoch on the X axis. The upper three panes show the weights on what are called the A, B and w weights respectively. A are the direct auto-regressive weights from training data to target data without any intervening hidden layers, B are the weights to the hidden sigmoid layer and w are the weights from the hidden layer to the targets. It is interesting to see how the A weights remain relatively consistent over the different sessions.

Leave a Reply Cancel reply