Building an Efficient Self-Organizing Neural Network
The Neuton Neural Network Framework is based on a patented machine learning algorithm that forgoes
error backpropagation and stochastic gradient descent. It grows the network structure automatically,
neuron by neuron, and produces minimum-size models that generalize well without a loss of accuracy.
Traditional approaches to building neural networks
Neural networks created today contain more and more coefficients and neurons and require
ever-increasing processing power. Hundreds of thousands of neural network parameters have long
ceased to be anything surprising or unique. However, this approach clearly has its limitations and
will soon run into hard limits on hardware capacity.
Building a neural network structure is generally a highly manual and somewhat random process. One
simply has to adjust too many variables simultaneously to build a model that is optimal in both size
and accuracy, including, but not limited to:
Number of Neurons
Number of Layers
Activation Function (Sigmoid, ReLU, etc.)
Number of Epochs
Cross-Validation Folds
An overwhelming majority of modern neural networks are based on an architecture (structure)
predetermined by the researcher, with parameters identified by stochastic gradient descent with
minor modifications. Only neuron parameters undergo optimization, while the architecture itself
remains constant. This is the main cause of the unnecessary growth of network sizes, which increases
prediction costs through redundant calculations in a fixed network.
The scientific community worldwide is actively working on this problem. Two main approaches to
reducing network volume can currently be distinguished:
1. Optimizing the structures of already-trained networks
2. Automated neural architecture search (NAS)
The methods and algorithms that implement the first approach mainly come down to discarding
"ineffective" neurons and connections in an already-trained network that meet certain criteria. The
inevitable trade-off for reduced network volume is loss of accuracy. Furthermore, the large network
still has to be trained. In other words, the issue of large size can be solved at the operational
The second, and undoubtedly more promising, approach allows for the generation of optimized network
architectures that match or exceed the performance of manually created architectures. Attempts at
including an automated neural network structure definition in the optimization loop mostly lead
to intelligent enumeration of finished architectures, parametric identification and selection of the
best option. Each time, the model is fully trained using the candidate architecture. Thus, building
a network with this approach means repeatedly proposing a candidate architecture, training it in
full, evaluating it, and keeping the best one.
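The enumerate, train, and select loop described above can be sketched roughly as follows. This is a minimal illustration, not any particular NAS system: the toy regression task, the candidate hidden-layer sizes, and the helper names (`train_network`, `search_architectures`) are all assumptions made for the example.

```python
import math
import random

def train_network(data, hidden, epochs=100, lr=0.02, seed=0):
    """Fully train a 1-input, 1-output tanh network with `hidden` neurons by SGD."""
    rng = random.Random(seed)
    w1 = [rng.uniform(-1, 1) for _ in range(hidden)]  # input -> hidden weights
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-1, 1) for _ in range(hidden)]  # hidden -> output weights
    b2 = 0.0
    data = list(data)  # copy so shuffling does not touch the caller's list
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            h = [math.tanh(w1[j] * x + b1[j]) for j in range(hidden)]
            err = sum(w2[j] * h[j] for j in range(hidden)) + b2 - y
            for j in range(hidden):
                grad_pre = err * w2[j] * (1.0 - h[j] ** 2)  # backprop through tanh
                w2[j] -= lr * err * h[j]
                w1[j] -= lr * grad_pre * x
                b1[j] -= lr * grad_pre
            b2 -= lr * err
    return w1, b1, w2, b2

def predict(params, x):
    w1, b1, w2, b2 = params
    return sum(w2[j] * math.tanh(w1[j] * x + b1[j]) for j in range(len(w1))) + b2

def mse(params, data):
    return sum((predict(params, x) - y) ** 2 for x, y in data) / len(data)

def search_architectures(train, val, candidates):
    """Enumerate candidate hidden-layer sizes, fully train each, keep the best."""
    scores = {}
    for hidden in candidates:
        params = train_network(train, hidden)  # one full training run per candidate
        scores[hidden] = mse(params, val)
    best = min(scores, key=scores.get)
    return best, scores

# Hypothetical toy task: fit y = sin(2x) on [-2, 2].
rng = random.Random(42)
points = [(x, math.sin(2 * x)) for x in (rng.uniform(-2, 2) for _ in range(200))]
train, val = points[:150], points[150:]
best, scores = search_architectures(train, val, candidates=[1, 2, 4, 8])
print("best hidden size:", best)
print("validation MSE per candidate:", scores)
```

Even in this toy setting the cost structure is visible: every candidate requires a complete training run, so the number of architectures that can be examined is bounded by the training budget.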
Note that this is a very extensive and resource-intensive process, so in a real-life scenario the
search space of architecture combinations has to be severely restricted. We are forced to
perform a “highly discretized” enumeration and, as a result, end up with a non-ideal architecture.
It is also worth noting that in order to achieve a consistent outcome, the approach needs to be
implemented in the context of model cross-validation, which multiplies the already-high overhead by
Another obstacle to obtaining an efficiently sized, highly accurate model is the choice of the
optimization algorithm. The well-known problem of local extrema and plateaus significantly
reduces the efficiency of using stochastic gradient descent for these purposes. In addition,
significant variation in the hyperparameters, such as the learning rate, batch size or weight
initialization technique, as well as complex and ambiguous detection of when the learning ends, all
add a lot of unknowns to this process thereby increasing the cost of each step and making the
Let us list the main problems that arise with the use of local gradient optimization methods in
modern neural network frameworks:
Getting stuck in multiple local minima or at saddle points. Due to the complex landscape of the
target function, plateau regions alternate with regions of strong non-linearity. The derivative
on a plateau is almost zero, while a sudden drop can, on the contrary, carry the search algorithm
too far from the desired optimum.
Non-uniform parameter updates. Certain parameters are updated much less frequently than others,
especially where the data contains informative but rare attributes. This impairs the network’s
ability to generalize from subtle patterns. That said, assigning too much importance to all rare
attributes can lead to overfitting.
Undetermined learning rate. A learning rate that is too low causes the algorithm to take a very long
time to converge or to get stuck in local minima. Conversely, a very high learning rate leads to
skipping of preferred minima or even to divergence.
The issue of vanishing and exploding gradients. The presence of a large number of successive layers
in a neural network leads to an uncontrolled decrease or increase in the error gradient as weight
correction progresses from the network output to the input. This is reflected in the learning
efficiency of the neural network layers that are located far from the output.
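Two of the issues listed above, the choice of learning rate and vanishing or exploding gradients, can be illustrated with a few lines of arithmetic. The sketch below is deliberately simplified and rests on assumptions: gradient descent is run on the one-dimensional function f(x) = x², and the "network" is a chain of single sigmoid neurons that all share the same weight and pre-activation.

```python
import math

def gradient_descent(lr, x0=1.0, steps=50):
    """Minimise f(x) = x**2 (gradient 2x) and return the final |x|."""
    x = x0
    for _ in range(steps):
        x -= lr * 2.0 * x
    return abs(x)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backpropagated_gradient(depth, weight=1.0, pre_activation=0.0):
    """Gradient magnitude after passing back through `depth` sigmoid neurons.

    Each layer multiplies the gradient by weight * sigmoid'(z), and
    sigmoid'(z) = s * (1 - s) never exceeds 0.25.
    """
    s = sigmoid(pre_activation)
    per_layer = weight * s * (1.0 - s)
    return abs(per_layer) ** depth

# Learning rate: too low barely moves, moderate converges, too high diverges.
for lr in (0.001, 0.4, 1.1):
    print(f"lr={lr}: |x| after 50 steps = {gradient_descent(lr):.3e}")

# Depth: small per-layer factors vanish, large ones explode.
for weight in (1.0, 8.0):
    print(f"weight={weight}: gradient after 20 layers = "
          f"{backpropagated_gradient(20, weight):.3e}")
```

With a per-layer factor below one the gradient shrinks geometrically with depth, and with a factor above one it grows geometrically, which is why layers far from the output are the ones that suffer.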
Major modifications of stochastic gradient descent use various heuristics in an attempt to address
these challenges. The most popular of these are the idea of accumulating momentum when moving along
the gradient and the idea of weaker weight updates for typical attributes. A whole series of
algorithms has been created from these ideas: Nesterov Accelerated Gradient, Adagrad, Momentum,
RMSProp, Adadelta, Adam, Adamax. However, even such a large number of algorithms cannot guarantee a
high-quality solution to all of the problems mentioned above, which simply demonstrates that the
scientific community continues to pursue an intensive search in this direction.
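To make the momentum heuristic concrete, here is a minimal sketch comparing plain gradient descent with the classical momentum update on an ill-conditioned quadratic. The test function, the learning rate, and the momentum coefficient `beta` are illustrative assumptions; the adaptive methods listed above (Adagrad, RMSProp, Adam and so on) are not shown.

```python
def grad(p):
    """Gradient of f(x, y) = 0.5 * (x**2 + 100 * y**2), an ill-conditioned bowl."""
    return (p[0], 100.0 * p[1])

def loss(p):
    return 0.5 * (p[0] ** 2 + 100.0 * p[1] ** 2)

def plain_gd(p, lr=0.01, steps=100):
    """Plain gradient descent: step directly along the negative gradient."""
    for _ in range(steps):
        g = grad(p)
        p = (p[0] - lr * g[0], p[1] - lr * g[1])
    return p

def momentum_gd(p, lr=0.01, beta=0.9, steps=100):
    """Classical momentum: step along a decaying sum of past gradients."""
    v = (0.0, 0.0)
    for _ in range(steps):
        g = grad(p)
        v = (beta * v[0] + g[0], beta * v[1] + g[1])  # accumulate momentum
        p = (p[0] - lr * v[0], p[1] - lr * v[1])
    return p

start = (1.0, 1.0)
plain_result = plain_gd(start)
momentum_result = momentum_gd(start)
print("plain GD loss:   ", loss(plain_result))
print("momentum GD loss:", loss(momentum_result))
```

The learning rate is capped by the steep direction, so plain gradient descent crawls along the shallow one; the accumulated velocity lets the momentum variant cover that direction faster with the same learning rate. It remains a local method, so it does not remove the problems listed above.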
In summary, successfully implementing an algorithm that creates a neural network with an ideal
structure requires a drastic change in the approach to building neural networks. In particular,
this calls for a solution to two main problems: the inefficiency of the training algorithm and the
discreteness of selecting an ideal architecture.
After analyzing and summarizing the experience of the world’s scientific communities, we designed a
completely different approach to creating perceptron neural networks with an optimal architecture,
one that is free from the aforementioned flaws.
Unlike most NAS methods, which are based on intelligent enumeration of predetermined neural network
structures, we use neuron-by-neuron network structure growth, with the minimum structural unit
involved in the optimization process being the neuron’s input. This minimizes the
“discretization” of the architecture search and creates minimum-size neural networks with no loss of
accuracy.
We then created and patented our own highly effective global optimization algorithm as an
efficient solution to the problems of local extrema and plateaus. Using this algorithm to
identify network parameters significantly improves each neuron’s efficiency in the
network and, as a result, reduces the network’s volume. The patented algorithm has considerable
potential for parallelization (multiple hosts, multiple GPUs) without loss of accuracy, allowing us
to run cross-validation while training a neural network within an acceptable time.
We have named the neural network framework “Neuton”. It is based on a process of automatic
neuron-by-neuron network growth with overfitting control. This approach allows dynamic growth of the
neural network until it achieves its maximum generalization ability. The use of our own global
optimization algorithm when learning the parameters of each neuron allows for a significant
reduction in the volume of the network, while maintaining its accuracy characteristics.
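The general idea of growing a network one neuron at a time and stopping when validation error no longer improves can be sketched as follows. This is not the patented Neuton algorithm, whose details are not given here. As a stand-in, each new neuron is chosen by a simple random search over candidate weights and fitted to the current residual by least squares, and a patience counter serves as a crude form of overfitting control. All names and settings (`grow_network`, `candidates`, `patience`, the toy data) are assumptions for illustration.

```python
import math
import random

def predict(neurons, x):
    """Output of a one-hidden-layer network stored as (w, b, a) triples."""
    return sum(a * math.tanh(w * x + b) for w, b, a in neurons)

def mse(neurons, data):
    return sum((predict(neurons, x) - y) ** 2 for x, y in data) / len(data)

def grow_network(train, val, max_neurons=30, candidates=50, patience=3, seed=0):
    """Add neurons one by one; keep the network with the best validation error."""
    rng = random.Random(seed)
    neurons = []
    best_neurons, best_val = [], mse([], val)  # start from the empty network
    stale = 0
    while len(neurons) < max_neurons and stale < patience:
        residual = [y - predict(neurons, x) for x, y in train]
        best_gain, best_unit = 0.0, None
        for _ in range(candidates):  # random search stands in for a real optimizer
            w, b = rng.uniform(-3, 3), rng.uniform(-3, 3)
            h = [math.tanh(w * x + b) for x, _y in train]
            hh = sum(v * v for v in h)
            if hh == 0.0:
                continue
            hr = sum(v * r for v, r in zip(h, residual))
            gain = hr * hr / hh  # drop in squared training residual
            if gain > best_gain:
                best_gain, best_unit = gain, (w, b, hr / hh)  # least-squares output weight
        if best_unit is None:
            break
        neurons.append(best_unit)
        val_err = mse(neurons, val)
        if val_err < best_val:
            best_neurons, best_val, stale = list(neurons), val_err, 0
        else:
            stale += 1  # validation stopped improving
    return best_neurons, best_val

# Hypothetical toy task: fit y = sin(2x) on [-2, 2].
rng = random.Random(1)
points = [(x, math.sin(2 * x)) for x in (rng.uniform(-2, 2) for _ in range(200))]
train, val = points[:150], points[150:]
model, val_err = grow_network(train, val)
print("neurons kept:", len(model), "validation MSE:", val_err)
```

In this sketch the network size is an outcome of training rather than an input to it: growth stops on its own once additional neurons stop helping on held-out data.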
The key differences between Neuton and traditional neural network frameworks are:
Fully automatic creation of a neural network structure without a data scientist’s involvement
Built-in cross-validation algorithm
Built-in overfitting control
Application of global optimization methods
High parallelizability with no loss of accuracy
The minimum possible size of a trained neural network without a loss of accuracy
Maximum prediction speed
It is important to note that none of the above-mentioned benefits of Neuton are separate
algorithm settings; they are all implemented automatically, by default. Data, a target
variable and a metric name are fed to the algorithm, and the entire process of training, validation
and production of the best model happens without the need for a data scientist.
demonstrate that neural networks created with Neuton possess a high level of accuracy, at minimal
model size, relative to alternative solutions.