Under construction: raw and unedited without links!

Part IV

If the human brain were so simple that we could understand it, we would be so simple that we couldn't!!!

If you think to build a tower, first reckon up the cost. - St. Jerome

13 Hierarchical Systems

So far the networks we have looked at have consisted of only one or two layers of neurodes: an input layer and possibly an output layer. Because they have so few layers, they are only able to take advantage of any natural coding of information that already exists in their input. These networks do not have the ability to interpret their input data or to organize them into any kind of internal worldview.

About twenty years ago, Marvin Minsky and Seymour Papert of MIT wrote a book, Perceptrons, which proved that 1- or 2-layered perceptron networks were inadequate for many real-world problems. Their book, combined with other contributing factors of the time, was so influential that neural network research and development was brought to a near-standstill for almost two decades. Only a few die-hard researchers continued to work in the field, and they had a great deal of difficulty in obtaining funding, tenure, and promotions. We need to look at Minsky and Papert's arguments to understand why they wrote what they did and why it had such a tremendous impact on the field. The discussion in Perceptrons was a thorough piece of reasoning. Minsky and Papert performed a careful analysis of the problem of mapping one pattern to another. In this context, mapping simply means association. That is, when we map A to 1, B to 2, and so on, we are correlating the letters with numbers. In this view, a mathematical function is a mapping of the function's value to each value of the variable(s). In many cases, all we want a neural network to do is to provide such a mapping.

...

Minsky and Papert concluded that it would be impossible for simple perceptron networks ever to solve problems with this characteristic. In other words, it appeared at the time that neural networks could solve only problems where similar input patterns mapped to similar output patterns. Unfortunately, many real-world problems, such as the parity problem and the exclusive OR problem, do not have this characteristic. The outlook appeared gloomy for neural network researchers. Minsky and Papert were correct in their analysis of perceptron neural networks. It eventually became clear, however, that what was needed to correct the problem was to make the networks slightly more complex. In other words, although a two-layer network cannot solve such problems, a three-layer network can. While Minsky and Papert recognized that this was possible, they felt it unlikely that a training method could be developed to find a multi-layered network that could solve these problems. As it turns out, there is strong evidence that multilayered networks intrinsically have significantly greater capabilities than one- or two-layered networks. A mathematical theorem exists that proves that a three-layer network can perform all real mappings; it is called Kolmogorov's theorem.
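To make the three-layer claim concrete, here is a minimal sketch of a three-layer network that computes the exclusive OR, written in Python with NumPy. The weights are hand-chosen for illustration (they are assumptions, not values from the text): one hidden neurode acts as an OR detector, the other as an AND detector, and the output neurode fires only when OR is true and AND is not.

```python
import numpy as np

def step(x):
    """Hard-limiting threshold neurode: fires (1) when its net input is positive."""
    return (x > 0).astype(int)

# Layer 1 -> 2: two hidden neurodes, one acting as OR, one as AND
# (illustrative weight choices, not values from the text).
W_hidden = np.array([[1.0, 1.0],    # OR unit weights
                     [1.0, 1.0]])   # AND unit weights
b_hidden = np.array([-0.5, -1.5])   # OR fires if the sum exceeds 0.5, AND if it exceeds 1.5

# Layer 2 -> 3: the output neurode computes OR AND (NOT AND), i.e. exclusive OR.
W_out = np.array([1.0, -1.0])
b_out = -0.5

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    hidden = step(W_hidden @ np.array(x) + b_hidden)
    y = step(W_out @ hidden + b_out)
    print(x, "->", int(y))          # prints 0, 1, 1, 0: the XOR mapping
```

No single-layer perceptron can reproduce this mapping, which is exactly the limitation Minsky and Papert identified.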

Kolmogorov's Theorem

In the mid-1950s Soviet mathematician A. N. Kolmogorov published the proof of a mathematical theorem that provides a sound basis for mapping networks.
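For reference, a conventional modern statement of the result (often called Kolmogorov's superposition theorem) runs roughly as follows; the notation is a standard reconstruction, not a quotation from this text.

```latex
% Kolmogorov's superposition theorem (conventional statement, assumed notation):
% every continuous function f on the n-dimensional unit cube can be written as
f(x_1,\dots,x_n) \;=\; \sum_{q=1}^{2n+1} \Phi_q\!\Bigl(\sum_{p=1}^{n} \psi_{q,p}(x_p)\Bigr)
% where the psi_{q,p} are fixed continuous one-variable functions independent of f,
% and the Phi_q are continuous one-variable functions chosen to suit f.
% Read as a network: the inner sums form a hidden layer of 2n+1 neurodes,
% and the outer sum forms the output layer -- a three-layer mapping network.
```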

...

Let's step back from this theoretical discussion and try to describe Kolmogorov's result more concretely. When we build a neural network of three layers, we are generating a system that performs the desired mapping in a two-step process. First, in moving from layer 1 to layer 2, the input pattern is translated into an internal representation that is specific and private to this network (thus the frequently used term hidden for the middle layer of a multilayer network).

Second, when the activity of the network moves from layer 2 to layer 3, this internal representation of the pattern is translated into the desired output pattern. The middle layer of the network somehow implements an internal code that the network uses to store the correct mapping instructions. This is important to understand because it is one of the chief reasons that a hierarchical, multilayer neural network is so much more powerful than a simple neural network. Adding a hierarchy of layers to the system allows for complex internal representations of the input patterns that are not possible with simpler systems. The internal representation generated by the hierarchical network may or may not be one that is meaningful to us as humans. Researchers have spent a great deal of time reverse engineering trained, multilayer networks to try to decipher the codes they use. A couple of important points emerge from such studies. First, the representation that the network develops is not cast in concrete. If the network is reinitialized (the weights are randomly set to new initial values) and the network retrained on the same training data in the same training regimen, the internal representation developed the second time will generally be similar to but not identical with the first representation. Furthermore, there is no way to predict which neurode will encode any specific portion of the representation. The second important point is that the encoding used by the network may or may not have a bearing on any encoding scheme animals use in their brains, nor need it make any particular sense to us. While reverse engineering a trained neural network can provide clues to the operation of biological networks, it is dangerous to take such clues too seriously and assume that biological networks have to work the same way.

What Kolmogorov Didn't Say

There are some questions that Kolmogorov's theorem does not answer. For example, it does not tell us that the network described is the most efficient implementation of a network for this mapping. Nor does it tell us whether there is a network with fewer neurodes that can also do this mapping. And, of course, the functions used in the Kolmogorov network are not specifically defined by the theorem.

Kolmogorov's theorem assures us that we need not go to hundreds or thousands of layers to make a good mapping; there is no need for neural network skyscrapers. Instead the theorem demonstrates that there is in fact a way to do any mapping we choose in as few as three layers.

This agrees with our knowledge about biological systems.

Our brains are incredibly complex, but the number of processing layers for any particular subsystem is remarkably small for the power of its operation.

You may notice something else about the Kolmogorov multilayer network. It appears that within each layer of the network, there is little interaction among neurodes. Instead neurodes take inputs from the previous layer and fan their output to the next layer, but they do not receive inputs or generate outputs to other neurodes on the same layer. This is quite typical of many neural network architectures.

Such a hierarchical structure is similar to the organization of biological systems. In fact, it now appears that much of the brain is organized as functional modules arranged in a parallel set of increasing hierarchies. At many vertical levels within a given subsystem of the brain, additional interaction occurs from other hierarchical subsystems, allowing, for example, the visual system to interact with the motor system.

In the next chapters, we take a close look at some hierarchical neural network architectures. ...

Application: The Neocognitron

...

Let's first review the neocognitron's general operation. The analog pattern to be recognized (one of a set of memorized symbols, for instance) is presented to the input stage. The second stage recognizes the constituent small features, or geometric primitives, of the symbol, and each succeeding stage responds to larger and larger groupings of these small features. Finally, one neurode in the output stage recognizes the complete symbol. There must be as many neurodes in the output stage as there are symbols in the set to be recognized. ...

Like the pace of a crab, backward. - Robert Greene

Backpropagation Networks

Arguably the most successful and certainly one of the most studied learning systems in the neural network arena is backward error propagation learning or, more commonly, backpropagation. Backpropagation has perhaps more near-term application potential than any other learning system in use today. Researchers have used this method to teach neural networks to speak, play games such as backgammon, and distinguish between sonar echoes from rocks and mines. In these and other applications, backpropagation has demonstrated impressive performance, often comparable only to the far more complex adaptive resonance systems discussed in chapter 16.

Features of Backpropagation Systems

Each training pattern presented to a backpropagation network is processed in two stages. In the first stage, the input pattern presented to the network generates a forward flow of activation from the input to the output layer. In the second stage, errors in the network's output generate a flow of information from the output layer backward to the input layer. It is this feature that gives the network its name. This backward propagation of errors is used to modify the weights on the interconnects of the network, allowing the network to learn.

Backpropagation is actually a learning algorithm rather than a network design. It can be used with a variety of architectures. The architectures themselves are less important than their common features: all are hierarchical, all use a high degree of interconnection between layers, and all have nonlinear transfer functions. ... Another way to think of the action of the middle layer is that it creates a map relating each input pattern to a unique output response. We have already seen one such mapping network, the Kohonen feature map. Because the backward transmission of errors allows backpropagation networks to generate a sophisticated and accurate internal representation of the input data, they are often more versatile than these other mapping networks. For example, counterpropagation networks, discussed in chapter 15, can have difficulty with discontinuous mappings or with mappings in which small changes in the input do not correspond to small changes in the output. Backpropagation networks can generally learn these mappings, often with great reliability. The middle layer, and thus the hierarchical structure of the system, is the source of the improved internal representation of backpropagation networks. Physically this representation exists within the synapses of the interconnects of the middle layers of the network; the higher the interlayer connectivity, the better the ability of the network to build a representation or model of the input data.

Building a Backpropagation System

Let's construct a typical backpropagation system in our minds and see the way in which it works. First we need to select an appropriate architecture. For our purposes, three feed-forward hierarchical layers are sufficient, with each layer fully connected to the following layer.

...

The Backpropagation Process

To teach our imaginary network something using backpropagation, we must start by setting all the adaptive weights on all the neurodes in it to random values. It won't matter what those values are, as long as they are not all the same and not equal to 1. To train the network, we need a training set of input patterns along with a corresponding set of desired output patterns. The first step is to apply the first pattern from the training set and observe the resulting output pattern. Since we know what result we are supposed to get, we can easily compute an error for the network's output; it is the difference between the actual and the desired outputs. This should sound familiar. We encountered the same rationale when we talked about the adaline in chapter 6. In that chapter, we used the delta rule, which computed the distance of the current weight vector from some ideal value and then adjusted the weights according to that computed distance. The learning rule that backpropagation uses is quite similar. It is a variation of the original delta rule called the "generalized delta rule." The original delta rule allowed us to adjust the weights using the following formula: multiply the error on each neurode's output by the input arriving along each weight and by a learning constant to determine the amount to change that weight. By this formula, the change in the weight vector is proportional to the error and parallel to the input vector. ...
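In symbols, that rule can be written as follows (a standard reconstruction, not the book's own notation):

```latex
% Delta rule (as used by the adaline): for a neurode with inputs x_i, output y,
% desired output d, and learning constant eta,
\Delta w_i \;=\; \eta\,(d - y)\,x_i
% so the weight change is proportional to the error (d - y) and parallel to the
% input vector x. The generalized delta rule of backpropagation keeps this form but
% replaces (d - y) with an error term propagated backward through the nonlinear
% transfer function of each layer.
```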

...

Let's review the entire backpropagation process. First we present an input pattern at the input layer of the network, and this generates activity in the input-layer neurodes. We allow this activity to propagate forward through each of the layers of the network until the output layer generates an output pattern. Remember that we initially set the weights on all modifiable interconnects randomly, so we are almost guaranteed that the first pass through the network will generate the wrong output. We compare this output pattern to the desired output pattern in order to evaluate errors that are propagated backward through the layers of the network, changing the weights of each layer as it passes.

This complete round of forward activation propagation and backward error propagation constitutes one iteration of the network. From here, we can present the same input pattern to the network again, or we can modify it and present a different input pattern, depending on what we are trying to accomplish. In any event, we do complete iterations of the network every time we present an input pattern for as long as we are training the network.
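A minimal sketch of this iteration loop, in Python with NumPy, for a three-layer, fully connected network with a sigmoid transfer function. The layer sizes, learning constant, number of iterations, and the XOR training set are illustrative assumptions, not values from the text; depending on the random starting weights, a run may occasionally stall in a local minimum, a point taken up below.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative sizes: 2 input, 3 hidden, 1 output neurode; weights start random.
W1 = rng.uniform(-0.5, 0.5, (3, 2)); b1 = rng.uniform(-0.5, 0.5, 3)
W2 = rng.uniform(-0.5, 0.5, (1, 3)); b2 = rng.uniform(-0.5, 0.5, 1)
eta = 0.5                                            # learning constant

patterns = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]   # example: XOR

for _ in range(10000):                               # many complete iterations are needed
    for x_raw, d in patterns:
        x = np.asarray(x_raw, dtype=float)
        # Stage 1: forward flow of activation, input layer -> hidden -> output.
        h = sigmoid(W1 @ x + b1)
        y = sigmoid(W2 @ h + b2)
        # Stage 2: errors flow backward, output layer -> hidden layer.
        delta_out = (d - y) * y * (1 - y)             # output error times transfer-function slope
        delta_hid = (W2.T @ delta_out) * h * (1 - h)  # error passed back to the hidden layer
        # Generalized delta rule: weight change proportional to error and to the input.
        W2 += eta * np.outer(delta_out, h); b2 += eta * delta_out
        W1 += eta * np.outer(delta_hid, x); b1 += eta * delta_hid

for x_raw, d in patterns:
    x = np.asarray(x_raw, dtype=float)
    y = sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2)
    print(x_raw, "->", round(float(y[0]), 3), "target", d)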

Limitations of Backpropagation Networks

Backpropagation is computationally quite complex. Many iterations, often hundreds or even thousands, are usually required for the network to learn a set of input patterns. This causes a backpropagation network to be a risky choice for applications that require learning to occur in real time, that is, on the fly. Of course, many applications do not require that learning occur in real time, only that the network be able to respond to input patterns as they are presented after it has been trained.

There is still more bad news about backpropagation systems, however. Backpropagation, unlike the counterpropagation system we will look at in chapter 15, is not guaranteed to arrive at a correct solution. It is a gradient descent system, just as the adaline was. For this reason it is bound by the problems of a class of systems called hill-climbing algorithms (or, rather, hill-descent algorithms in this case). Readers familiar with AI jargon will have heard this term before. The hill descent problem asks, "How do you find the bottom of a hill?"

One commonsense solution is simply to always walk downhill, which is exactly what gradient-descent algorithms do. If you have ever tried this on a real hill, however, you know that there is a hitch: you sometimes find yourself in the bottom of a dip halfway down the hill and are forced to climb out of the dip in order to continue downhill. If the dip has steep sides, this commonsense approach to getting down the hill may not actually get you there at all but may strand you in a dip part way down. The feature corresponding to a dip in the gradient descent method is called a local minimum.

Gradient-descent algorithms are always subject to this problem. There is no guarantee that they will lead the network to the bottom of the hill; they can be sidetracked by local minima and end up stranded with no further means of arriving at the bottom of the hill. In particular, although the adaline had a smooth parabolic bowl for its error function, the complex architecture of the backpropagation network has an equally complex error function with the potential of having many local hills and dips between the network's starting error and the desired minimum error position. In practice, backpropagation systems have been found to be remarkably good at finding the bottom of the hill. Even so, nearly every researcher has found that some trials do not work, and their backpropagation system fails to find the correct answer. A great deal of research is being conducted to determine how to identify such cases in advance or otherwise escape from local minima.

One method modifies the delta rule still more than the generalized delta rule does by adding a feature called the "momentum term." Consider how this might work. A sled moving down a snow-covered hill is a perfect example of a gradient-descent system. It has no internal power to move itself uphill unless the rider gets off and pulls it. However, a sled can overcome small bumps or even short rises in its path if it has generated enough physical momentum to carry it over such perturbations and to allow it to continue in its original direction, downhill. The momentum term in the modified delta rule works in the same fashion.

Adding momentum to the weight change law is easy. We just add a term to the existing formula that depends on the size and direction of the weight change in the previous iteration of the network. To use our sled analogy, each new iteration "remembers" the direction and speed it had in the last iteration. If the algorithm finds itself in a local minimum, this momentum term may make enough contribution to the formula to carry the system out the other side so it can continue on its way "downhill."
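A standard way of writing the momentum-modified weight change (again a conventional reconstruction; the symbols are assumptions, not the book's notation):

```latex
% Generalized delta rule with momentum: the change made at iteration t carries along
% a fraction alpha of the change made at iteration t-1.
\Delta w_{ij}(t) \;=\; \eta\,\delta_j\,x_i \;+\; \alpha\,\Delta w_{ij}(t-1)
% eta is the learning constant, delta_j the backpropagated error at neurode j,
% x_i the input along the connection, and alpha (with 0 <= alpha < 1) the momentum term.
```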

This momentum term adds to the complexity of an already tedious calculation. Do we really need it? The answer is that we do not need it, but we may want it. Research suggests that we can accomplish nearly the same result by making the weight change for each learning step very small. Of course, this action causes the network to require many more iterations to learn a pattern, and including the momentum term is usually the more desirable solution of the two.

The generalized delta rule is the most common implementation of backpropagation; however, there are variations on this scheme.

Variations of the Generalized Delta Rule

Many researchers have offered variations on the generalized delta rule theme. In general, these attempt to decrease the number of iterations required for the network to learn, to reduce the computational complexity of the network, or to improve the local computability of the network.

...

Scaling Problems

There is one more serious drawback to backpropagation networks: they do not scale up very well from small research systems to the larger ones required for real-world uses. ...

This scaling problem restricts the applicability of backpropagation to problems that can be solved with relatively small networks. There are many such problems, however, and sometimes a collection of small backpropagation networks can be used to solve large problems. Also even small backpropagation networks can master surprisingly difficult tasks. ...

Biological Arguments against Backpropagation

In addition to these pragmatic difficulties, backpropagation also faces other objections. One argument is that this learning system is not biologically plausible. These critics deem it unlikely that animal brains use backpropagation for learning. One reason they believe this is that while our brains do have reverse neural pathways (from the brain back to the eye, for instance), these are not the same interconnects and synapses that provide the "forward" activity. Recall that a backpropagation system traditionally uses the same interconnects for both the forward and backward passes through the network. A second and more serious reason that some critics believe backpropagation is not biologically plausible is that it requires each neurode in the network to compute its output and weight changes based on conditions that are not local to that neurode. Specifically, the neurode relies on knowledge of the errors in the next higher layer of the network to compute changes in its own weights. Most current research seems to support the idea that real synapses change their weights in response only to locally available information rather than relying on information about the activity of neurons farther up the computational chain. This dependence on information that is not locally available is thought by many to disqualify backpropagation systems as serious biological models.

Recently researchers investigating this issue have suggested that there is a backpropagation formulation consistent with biological systems and that requires only locally available information at each neurode to adjust synapse weights. If this proves to be the case, some critics of backpropagation will be silenced. To many supporters of the method, however, nothing will be changed because they were never concerned that it did not possess biological plausibility. We suggest that biological plausibility need not be weighted too heavily in the development of neural network paradigms. It is true that biological systems are good models for network architectures; they furnish architectures we know will work. Except for researchers who are actually trying to model the brain, however, there seems little reason to reject an effective system just because it is unlikely to be an accurate model of biological systems.

pg. 194...

[So what is the bias of Samsara? Since it is built into the structure of the neural networks by which we transform the holoprocess into a Self that experiences a reality of objects and consciousness, it is almost impossible to conceive that the brain is a secondary processor whose context is not the reality out there but a primary holoprocess. This holoprocess is like a hypercube ("Wish fulfilling Gem") that the Self reduces to lower dimensional slices. So from the perspective of lower dimensional slices, we model the brain's activities in adjustment of weights as being about "learning, memory and data acquisition", when in my model it is about creating and modifying neural network systems. Thus we are growing ways to create knowledge, not learning knowledge that is already present.

This next section is about other ways competition and filtering can take place. Why all this fuss about inhibition and winner take all? The driving force of human culture is the skill of ignorance! The act of ignoring is what we call concentration and what I call taking a slice in a lower dimension of a higher dimensional object. Being able to inhibit and ignore is the basis of separation from a chaos of unity of consciousness in the holomind and the basis of an ego.]

pg. 202-205

The Counterpropagation Network

The name counterpropagation derives from the initial presentation of this network as a five-layered network with data flowing inward from both sides, through the middle layer and out the opposite sides. There is literally a counterflow of data through the network. Although this is an accurate picture of the network, it is unnecessarily complex; we can simplify it considerably with no loss of accuracy.

In the simpler view of the counterpropagation network, it is a three-layered network. The input layer is a simple fan-out layer presenting the input pattern to every neurode in the middle layer. The middle layer is a straightforward Kohonen layer, using the competitive filter learning scheme discussed in chapter 7. Such a scheme ensures that the middle layer will categorize the input patterns presented to it and will model the statistical distribution of the input pattern vectors. The third, or output, layer of the counterpropagation network is a simple outstar array. The outstar, you may recall, can be used to associate a stimulus from a single neurode with an output pattern of arbitrary complexity.

In operation, an input pattern is presented to the counterpropagation network and distributed by the input layer to the middle, Kohonen layer. Here the neurodes compete to determine that neurode with the strongest response (the closest weight vector) to the input pattern vector. That winning neurode generates a strong output signal (usually a +1) to the next layer; all other neurodes transmit nothing. At the output layer we have a collection of outstar grid neurodes. These are neurodes that have been trained by classical (Pavlovian) conditioning to generate specific output patterns in response to specific stimuli from the middle layer. The neurode from the middle layer that has fired is the hub neurode of the outstar, and it corresponds to some pattern of outputs. Because the outstar-layer neurodes have been trained to do so, they obediently reproduce the appropriate pattern at the output layer of the network. In essence then, the counterpropagation network is exquisitely simple: the Kohonen layer categorizes each input pattern, and the outstar layer reproduces whatever output pattern is appropriate to that category.

What do we really have here? The counterpropagation network boils down to a simple lookup table. An input pattern is presented to the net, which causes one particular winning neurode in the middle layer to fire. The output layer has learned to reproduce some specific output pattern when it is stimulated by a signal from this winner. Presenting the input stimulus merely causes the network to determine that this stimulus is closest to stored pattern 17, for example, and the output layer obediently reproduces pattern 17. The counterpropagation network thus performs a direct mapping of the input to the output.
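That lookup-table behavior can be sketched in a few lines of Python with NumPy. The weight vectors and stored output patterns below are illustrative assumptions, and the Kohonen and outstar training is assumed to have already taken place.

```python
import numpy as np

# Middle (Kohonen) layer: one weight vector per category, assumed already trained.
kohonen_weights = np.array([[1.0, 0.0],      # category 0
                            [0.0, 1.0],      # category 1
                            [0.7, 0.7]])     # category 2

# Output (outstar) layer: the output pattern each middle-layer winner reproduces.
outstar_patterns = np.array([[1.0, 0.0, 0.0],
                             [0.0, 1.0, 0.0],
                             [0.0, 0.0, 1.0]])

def counterprop(x):
    """Fan the input to the Kohonen layer, pick the single winner, and let the
    outstar layer reproduce the pattern stored for that winner."""
    x = np.asarray(x, dtype=float)
    distances = np.linalg.norm(kohonen_weights - x, axis=1)   # closeness of each weight vector
    winner = int(np.argmin(distances))                        # winner-take-all competition
    return winner, outstar_patterns[winner]

print(counterprop([0.9, 0.1]))   # closest to category 0 -> reproduces stored pattern 0
```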

Training Techniques and Problems

...

The Size of the Middle Layer

Now let's consider the size of the middle layer not in the context of training issues but in terms of the accuracy of the network's response. If we are trying to model a mapping with 100 possible patterns, and we set up a counterpropagation network with 10 middle-layer neurodes, then we can expect some inaccuracies in the network's answer. It is not, by the way, so straightforward as saying we will get 10 percent accuracy; we might find a much higher degree of accuracy depending on how densely packed the probability density distribution of the input pattern data is.

In the simplest case, the input patterns form a uniform probability density function. If the data patterns are evenly distributed throughout the unit circle, we expect that the weight vectors will also be equally distributed after training. In the two-dimensional case, each weight vector will have to cover about 36 degrees (360 degrees divided by 10 vectors) of the unit circle. In other words, any input vector within this 36 degree arc of the circle will generate precisely the same output. If we use 100 weight vectors to cover this 360 degree span, we would expect that each weight vector will correspond to about 3.6 degrees of arc, so any input vector within 3.6 degrees of a weight vector will count as a hit.

The situation becomes more complex if the input patterns are not evenly distributed about the unit circle. In this case the weight vectors are clustered during training about the areas of the unit circle most likely to contain input vectors. Regions outside the most common input areas may end up with very few weight vectors in their region. Thus the occasional input vector that occurs in one of these sparsely populated regions may end up being approximated by a weight vector that is only a gross estimate of the actual input vector. On the other hand, input vectors that do occur in the area densely clustered with weight vectors will be quite closely approximated.

In reality, a nonuniform distribution of input patterns is much more likely in a real application than a uniform one. This means that the accuracy of the network's mapping is better in those parts of the unit circle that are more likely to contain input vectors. Rather than having a uniform accuracy, counterpropagation networks have higher accuracy in the more commonly used areas of the unit circle and lower accuracy in the areas less likely to receive input vectors. For many applications, this is quite acceptable and possibly preferable. We still have not answered the question of how many neurodes we need to have in the middle layer. We have only indicated that the answer depends on how accurate the network's output needs to be.

The more neurodes we have, the more accurate our mapping can be. This is one of the key drawbacks to the counterpropagation network (and the Kohonen network as well), in fact, because real problems may well demand middle-layer sizes too large to build today. If we can afford to have only a limited number of neurodes, the mapping will still work, of course, but it may be less precise than we need. There is no hard and fast rule to apply to this question. As in many other situations with neural networks, the answer is: It depends.

There is a way to improve the counterpropagation network's accuracy without requiring an unacceptably large middle layer: we can allow the network to interpolate its output. In other words, if we have trained the network to respond with a 1.0 to a blue input and a 2.0 to a red input, we would like the network to output, say, a 1.5 to a magenta input. It is quite simple to implement this kind of interpolation in the counterpropagation network. All we have to do is to change the middle layer to allow more than one winner. For example, we might allow the middle layer to have two winners: the two neurodes with weight vectors closest to the input vector. In this case, the network's output will be a melding of the outputs from the neurode categories in the middle layer. If we want to broaden its response, we might allow three winners. We must be careful not to allow too many winners, or the output pattern will be too ambiguous to be useful. However, permitting multiple winners in the middle layer does give the network the ability to interpolate between known patterns.
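Continuing the earlier sketch (and reusing its hypothetical kohonen_weights and outstar_patterns arrays), the multiple-winner interpolation might look like the following; the distance-weighted blending shown is one plausible choice, not a prescription from the text.

```python
def counterprop_interpolating(x, n_winners=2):
    """Allow the n closest middle-layer neurodes to win and blend their stored
    output patterns, weighting each by how close its weight vector is to the input."""
    x = np.asarray(x, dtype=float)
    distances = np.linalg.norm(kohonen_weights - x, axis=1)
    winners = np.argsort(distances)[:n_winners]
    weights = 1.0 / (distances[winners] + 1e-9)       # closer weight vectors count more
    weights /= weights.sum()
    return weights @ outstar_patterns[winners]

# An input lying between two stored categories yields a blended output pattern.
print(counterprop_interpolating([0.5, 0.5]))
```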

pg. 205...

[The last excerpt from this book is adaptive resonance. Resonance plays such an obvious part of normal human existence that I look to all the different implementations to explain such feelings as love, beauty, respect, awe and religious worship, to name a few. The "problems" that are listed with these nets at the end of this chapter are to me only indications that this model is very close to how we have constructed them in our brains. The problem of "noise" in corrupting a net's operation is not present in other nets that can "see through" noise to recognize degraded information in our consciousness of sight and hearing. But when it comes to conceptualization within the context of cultural shared wisdom, it seems humans play a game of whisper, where a secret is whispered down a line of people and never turns out the same when repeated by the last one to hear it. Also the problem of fineness of adjustment fits my model of levels of resolution and learning, which requires finer and finer levels of adjustment before humans acknowledge mastery of human skills. We do not (usually) object if a person is more or less skillful at walking or looking at a landscape, but we expect a high level of skill for a doctor or electrician.]

pg. 207-223

The ideas of adaptive resonance theory can be confusing initially, but the effort expended in understanding them is well spent. ART supplies a foundation upon which we may eventually be able to build genuinely autonomous machines. These networks are as close as anyone has yet come to achieving the goals for the autonomous systems listed in chapter 12.

The Principle of Adaptive Resonance

We can best present the basic idea of adaptive resonance with the two-layer network shown in figure 16.1. The broad arrows in the figure are a shorthand way of indicating that the network layers are fully interconnected with modifiable synaptic weights. Although it will not be important to our immediate discussion, let's assume our net uses the outstar learning model.

Each pattern presented to the network initially stimulates a pattern of activity in the input layer. We call this the "bottom-up" pattern; it is also called the "trial" pattern. Because of the outstar structure, this bottom-up pattern is presented to every neurode of the upper, storage layer. This pattern is modified (in the normal weighted-sum fashion) during its transmission through the synapses to the upper layer and stimulates a response pattern in the storage layer. We call this new activity the "top-down" pattern of activity; it may also be called the "expectation" pattern. It generally is quite different from the bottom-up pattern. Since the two layers are fully interconnected, this top-down pattern is in turn presented (by the synapses on the top-down interconnects) back to the input layer.

We can think of the operation of these two layers in another way. The basic mode of operation is one of hypothesis testing. The input pattern is passed to the upper layer, which attempts to recognize it. The upper layer makes a guess about the category this bottom-up pattern belongs in and sends it, in the guise of the top-down pattern, to the lower layer. The result is then compared to the original pattern; if the guess is correct (or, rather, is close enough as determined by a network parameter), the bottom-up trial pattern and the top-down guess mutually reinforce each other and all is well. If the guess is incorrect (too far away from the correct category), the upper layer will make another guess. Eventually either the pattern will be placed in an existing category or learned as the first example of a new category. Thus, the upper layer forms a "hypothesis" of the correct category for each input pattern; this hypothesis is then tested by sending it back down to the lower layer to see if a correct match has been made. A good match results in a validated hypothesis; a poor match results in a new hypothesis.

If the pattern of activity excited in the input-layer neurodes by the top-down input is a close match to the pattern excited in the input layer by the external input (if the guess is correct, in other words), then the system is said to be in adaptive resonance. The ART systems that we will describe are built on this principle. We will see, however, that we must introduce several complexities into this basic scheme in order to make a working neural network design.

[The statement "the upper layer forms a "hypothesis" of the correct category for each input pattern; this hypothesis is then tested by sending it back down to the lower layer to see if a correct match has been made" and the others in this book that I have previously noted confirms what my "mind of enlightenment" has been pointing to since 1958. Since my social / cultural mind is that of a scientist who became involved in the Arts, I have not expected to get any results or support for publishing this until There is the kind of support provided by research of the last 20 years. I could not have put this new vision into any but mythic language prior to the beautiful work outlined here in this chapter. But beyond this work, I see the connection with the cosmos as a meta model that can form all "hypothesis", but forms specific "hypothesis" for each individual and other hard wired species specific "hypothesis". Other species are restricted in their ability to form "hypothesis", whereas humans can use language to simulate the category of "hypothesis".]

Before we go on, we need to make a note of the internal architecture of these two layers. Recall that the lower layer is devoted to processing the input pattern and achieving adaptive resonance and that the top layer is devoted to pattern storage. In our discussion of the adaptive resonance principle, we concentrated on the interconnections between the input and storage layers. We now need to concentrate on the interconnections within layers, the intralayer connections, of the input and storage layers.

For the minimal ART 1 structure we will consider here, the nodes of the input layer are individual neurodes connected into a competitive internal architecture of the type presented in chapter 7. To simplify the discussion, only one winner will be allowed. In general, multiple winners can be permitted, but nearly all ART systems actually built force a single winner. The rivalry for activity inherent in the competitive structure is essential to the adaptive resonance process. (While this single-winner strategy is not inherent in the design of the network, it is much easier to implement and operate than a multiwinner strategy. ...)

...

Operation of the ART 1 Network

Let's move on to the nitty-gritty of the network and see how an ART 1 network operates in detail. Such an exploration will tell us much about the delicate balances necessary to build a good hybrid network.

An external pattern presented to the input layer causes some of the nodes in that layer to become active. Because ART 1 allows only binary inputs and the neurodes have binary output functions, this pattern will be identical to the external input pattern. Nevertheless, for consistency of discussion we call the pattern that becomes active in the input layer the "trial pattern."

Each of the neurodes in the input layer is connected to each of the neurodes in the upper, storage layer. By this means, the pattern of activity in the input layer is transmitted to the storage layer. In the process of moving across the synaptic junctions between the two layers, the pattern is modified so that the activity generated in the storage layer differs from the trial pattern.

The pattern arriving at the storage layer signals the beginning of a competition among the nodes in this layer. The winner of this competition generates an output signal and all others are suppressed. The resulting storage-layer activity pattern is called the "expectation" pattern. For the moment we will assume that this pattern consists of exactly one active node because we have designed it so that only one node can win the competition. This expectation pattern is, of course, transmitted (via the top-down synaptic junctions) back to the input layer.

[ The "expectation pattern" structure is the start of no-choice Fate and Karma viewpoint of self fulfilling predictions. This is the kind of backpropagation that exists in human "Software" networks constructed in language and culture.]

Everything we have said about the interconnection of the input layer to the storage layer is true of the return path from the storage layer to the input layer. Because the expectation pattern must pass through the synaptic junctions, it will be modified en route; thus, the pattern of activity generated in the input layer will be quite different from the expectation pattern itself. The pattern generated by the expectation pattern in general will involve a number of nodes in the input layer.

The input layer now has two inputs presented to it: the external input, which originally excited the trial pattern, and the top-down expectation pattern. These merge and generate a new pattern of activity, which replaces the old trial pattern. If this new and the original trial pattern are very similar, the network is in adaptive resonance, and the output of the storage layer becomes stable. The corresponding expectation (top-down) pattern is the stored symbol, or icon, for the external input pattern presented.

We have not yet considered all possibilities, however. What if the expectation pattern excites a pattern in the input layer entirely different from the original trial pattern? This will happen, for example, when the ART network is presented with a pattern very different from any it has yet seen. Such an input does not match any of the network's "known" patterns and results in a mismatch between the trial and expectation patterns. The ART network must be able to cope with the novelty arising from unusual patterns. We have already seen that its method of coping is hypothesis testing. In order to understand how it actually implements this strategy, we need to add a subsystem to the basic two-layer system of figure 16.1.

The Reset Subsystem

Figure 16.2 is nearly identical to figure 16.1, with the exception of a subsystem we call the "reset unit." The reset unit has two sets of inputs: the external input pattern and the pattern from the input layer. The structure of the reset unit depends on the details of the ART 1 system being considered. For our network, we can assume that it is a single neurode with a fixed-weight, inhibitory input from every node in the input layer and a fixed-weight, excitatory input from each external input line. The reset unit's output is simple; it is linked by a fixed-weight connection to every node in the storage layer. By now, you know another way of saying this: the reset unit is the hub of an outstar whose border is all the nodes of the storage layer. There are no reverse connections from the storage layer to the reset unit.

Before we go on, we must fully understand the structure of the nodes in this storage layer. We have briefly mentioned that these nodes are little groupings of neurodes called dipoles or toggles. Let's explore what that implies. These toggles have several useful properties. First, they act just like an individual neurode most of the time. The second, and the one of interest to us here, is the property called "reset." Reset is the process of persistently shutting off all currently active nodes in a layer without interfering with the ability of inactive nodes in that layer to become active.

The reset action of a toggle can be summarized in two statements: (1) If the toggle is active and it receives a special signal, called a "global reset," then it will become inactive, and it will be inhibited from reactivating for some period of time. (2) If the toggle is not active when it receives a global reset, then it is not inhibited from becoming active in the immediate future. With these two characteristics, it should be evident that sending a global reset signal to every toggle in the storage layer will shut off only the currently active toggles and furthermore will prevent only those toggles from becoming active in the immediate future.

In general, we can think of the storage layer as simply being made of special nodes that have this reset characteristic. It is not especially important here whether we call them toggles, dipoles, or nodes. It is important to realize, however, that no matter what we call them, they act as outstars and instars, just as an individual neurode does in any network.

Now we are ready to see what happens if the network does not reach adaptive resonance, that is, if the original trial pattern and the newly generated trial pattern do not match. If they do match, the two inputs to the reset unit (one from the external pattern and one from the input layer) balance, and it produces no output. If the original and the new trial patterns do not match, the activity of the input layer temporarily decreases as its nodes try to reconcile these two patterns. In fact, for reasons we will see, we can be absolutely guaranteed that if the bottom-up and top-down patterns do not match, the net activity (that is, the total number of active nodes) in the input layer will always decrease. As a result, the inhibitory input to the reset unit no longer exactly balances the excitatory input from the external pattern, and thus the reset unit becomes active.

The active reset unit now sends its global reset signal to the nodes of the storage layer. Because these nodes are toggles, this reset signal causes any active nodes to turn off and stay off for some period of time. This destroys the pattern then active in that layer and suppresses its immediate reemergence. With the old pattern destroyed, a new pattern of nodes is now free to attempt to reach resonance with the input layer's pattern. In effect, when the old and new trial patterns do not match, the reset subsystem signals the storage layer that that particular guess was wrong. That guess is "turned off," allowing another one to take its place. The cycle repeats as many times as necessary. When resonance is reached and the guess is deemed acceptable, the search automatically terminates. This is not the only way a search may terminate: the system can end its search by learning the unfamiliar pattern being presented.

As each trial of the parallel search is carried out, small weight changes occur in the synapses of both the bottom-up and the top-down pathways. These weight changes mean that the next time the trial pattern is passed up to the storage layer, a slightly different activity pattern will be received, providing a mechanism for the storage layer to change its guess. If a match is quickly found, the amount of modification of these synapses is insignificant. If the system cannot find a match, however, and if the input pattern persists long enough, the synapse weights eventually will modify enough that an uncommitted node in the storage layer learns to respond to the new pattern. This also explains why the storage layer's second or third guess may prove to be a better choice than the original one. The small weight changes ensure that the activity generated by the bottom-up pattern in the second pass is somewhat different from the activity generated in the first pass. Thus a node that was second-best the first time may well prove to be the best guess the second time. If the input is a slightly noisy version of a stored pattern, it may require a few synaptic weight changes before the truly best guess can be matched.

[In real social world contexts, my model provides complementary levels of resolution as a decision structure. The patterns are at different scales of a fractal with greater or less "detail" attached or ignored. If the middle layer of internal representation is at "x" level of resolution and the input pattern is at "y" level, and the patterns do not match, then instead of looking for a matching pattern, the middle layer can change the input layer's resolution level until a close match is found. It will also move to other vectors of the fractal until the features left after the ignorance function of that vector have a best match. Here I model the Star of David with each side generated by a different subset of the whole and each scale having more or less "selected" detail. Thus humans can "force" most any input to match their internal model and become agitated by the remaining mismatch as a threat. Hence the model of sin and bad people, etc. This is how humans can construct a global model which can fit any situation: a general theory or unified model. This is clearly what happens in Astrology and other prescientific "religious" models. Yet the very dysfunction of these models led to the scientific revolution and was a "first step" that can now be discarded.]

We can also supply ART 1 with the property of vigilance. This means that the accuracy with which the network guesses the correct match can be varied. By setting a new value for the reset threshold, we can control whether the network fusses with trifling details or concerns itself only with global features. Because of the way vigilance is defined, a low reset threshold implies high vigilance and thus close attention to detail, while a high threshold implies low vigilance and a more global view of the pattern. By controlling the threshold of the reset unit, we thus govern what the system calls "insignificant noise" and what it identifies as a "significant new pattern."

We can interpret vigilance in another way. In effect, by setting the threshold of the reset unit, we choose the coarseness of the categories into which the system sorts patterns. A low threshold (high vigilance) forces the system to separate patterns into a large number of fine categories, and a high threshold (low vigilance) causes the same set of patterns to be lumped into a small number of coarse categories.
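The hypothesis-test, reset, and vigilance machinery can be sketched compactly in Python with NumPy. The sketch below follows a common textbook formulation of ART 1 fast learning (binary inputs, logical AND for the top-down match, and a vigilance ratio between 0 and 1); note that it uses the vigilance parameter directly rather than the reset-threshold convention described above, and the choice-function constant, category capacity, and example patterns are all illustrative assumptions.

```python
import numpy as np

class ART1Sketch:
    """Illustrative ART 1-style hypothesis testing on binary patterns: a storage-layer
    competition proposes a category, and a vigilance test accepts it or forces a reset."""

    def __init__(self, n_inputs, max_categories, vigilance=0.8):
        self.vigilance = vigilance                                        # high value = fine categories
        self.prototypes = np.ones((max_categories, n_inputs), dtype=int)  # uncommitted nodes start all-ones
        self.committed = np.zeros(max_categories, dtype=bool)

    def present(self, pattern):
        pattern = np.asarray(pattern, dtype=int)
        suppressed = set()                              # storage-layer toggles shut off by global reset
        while len(suppressed) < len(self.committed):
            # Bottom-up competition: score every storage node against the trial pattern.
            overlaps = (self.prototypes & pattern).sum(axis=1)
            scores = overlaps / (0.5 + self.prototypes.sum(axis=1))       # a common choice function
            scores[list(suppressed)] = -1.0                               # reset nodes sit out this cycle
            winner = int(np.argmax(scores))
            # Top-down test: does the winner's expectation pattern match the input closely enough?
            match = (self.prototypes[winner] & pattern).sum() / max(pattern.sum(), 1)
            if match >= self.vigilance or not self.committed[winner]:
                # Resonance (or a fresh, uncommitted node): learn by keeping the common features.
                self.prototypes[winner] &= pattern
                self.committed[winner] = True
                return winner
            suppressed.add(winner)                      # mismatch: global reset, try another guess
        return -1                                       # every committed node rejected and none free

art = ART1Sketch(n_inputs=6, max_categories=4, vigilance=0.7)
print(art.present([1, 1, 0, 0, 1, 0]))    # first pattern claims a fresh node (prints 0)
print(art.present([1, 1, 0, 0, 0, 0]))    # a close variant resonates with that same node (prints 0)
```

Raising the vigilance value in this sketch splits the inputs into many fine categories; lowering it lets one prototype absorb a broader range of patterns, mirroring the coarseness control just described.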

[Describing my model of levels of resolution with emotional terms like "vigilance"!!]

The Gain Control Subsystem and the 2/3 Rule

We still do not have an operational ART 1 system. We must provide a way for the input layer to tell genuine input signals from spurious top-down signals that might be present even when no real-world input is being presented. Such a situation would exist, for instance, if some random system noise or other extraneous inputs activated the storage layer even when no external input was present. We must also make sure that a genuine external input always creates a pattern in the input layer in order to start the adaptive resonance process. Furthermore, we have not yet justified the assurance we made that the input layer's total activity is guaranteed to decrease in the event of a mismatch. ...

Ideally the external input pattern is presented to the input layer, the gain control system, and the reset system more or less simultaneously. The gain control system turns on, providing the second necessary source of stimulus to the input layer and in turn allowing that layer to become active and generate the original trial pattern. In the meantime, the external input has also turned on the reset system, which shuts off any active pattern in the storage layer. The input layer's activity is translated into a bottom-up pattern and sent to the storage layer. In addition, it goes to the reset system where it matches the external input and shuts off the global reset. This combination of actions allows the storage layer to respond to the bottom up pattern.

The storage layer now generates a top-down expectation pattern, which it sends to the input layer. This same expectation pattern also is sent as an inhibitory signal to the gain control; as a result the gain control system turns off. This removes one of the input layer's two sources of stimulation, but because the input layer now sees the top-down pattern (the new trial pattern) as well as the external pattern, it has sufficient stimulation to stay active.

[More levels of resolution!!]

...

The 2/3 rule also keeps noise damped in the network. Any activity in the storage layer keeps the input gain control from exciting the input layer. If the storage layer is firing, the only other available stimulus for the input layer is the external pattern; if this is present, the input layer's neurodes can activate. If the storage layer fires spontaneously, without an appropriate external pattern being present, the 2/3 rule will not be met and the input layer will not be stimulated into action. Noise from the storage layer thus immediately damps out.
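Read this way, the 2/3 rule says that an input-layer neurode fires only when at least two of its three possible signal sources (the external input, the gain control, and the top-down expectation) are active at that neurode; that is my paraphrase of the rule as it is used here, and the sketch below simply encodes it.

```python
def fires_under_two_thirds_rule(external_input, gain_control, top_down):
    """An input-layer neurode activates only if at least 2 of its 3 signal sources are on."""
    return (bool(external_input) + bool(gain_control) + bool(top_down)) >= 2

# A genuine external input supported by the gain control fires the layer;
# spurious top-down activity on its own damps out, as described above.
print(fires_under_two_thirds_rule(True, True, False))    # True
print(fires_under_two_thirds_rule(False, False, True))   # False
```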

The input layer can also be the source of noise and spontaneous firings. If this happens when there is no external pattern to support it, the noise pattern gets transmitted to the storage layer. But the top-down pattern will shut off the gain control (assuming it was on), so that the input layer will be left with only one source of input (the top-down pattern response) and the noise will be once more damped out. This keeps the storage layer from being bombarded with meaningless bottom-up signals that do not correspond to real inputs. If this were not done, the storage layer would constantly be learning nonsense, and its stored patterns would not be stable.

We have so far not addressed the bias gain control shown in figure 16.3. Its function is to allow the presence of an external input to predispose the nodes of the storage layer toward activity even before receiving a trial pattern from the input layer. It does this by applying a small excitatory signal to the nodes of the storage layer when an external input signal is applied to the network. In this way, the activity of the system is correlated with, or paced by, the rate of presentation of external inputs. In some implementations it also helps mediate the competition of the nodes in the storage layer, enhancing the activity of any node that gets an edge on its competitors and suppressing other nodes.

Troubles with ART 1

ART 1, even the simple version described, is a pretty good system. It possesses several of the characteristics listed in chapter 11 for a system or machine capable of truly autonomous learning. It learns constantly but learns only significant information and does not have to be told what information is significant; new knowledge does not destroy already learned information; it rapidly recalls an input pattern it has already learned; it functions as an autonomous associative memory; it can (by changing the vigilance parameter) learn more detail if that becomes necessary; and it reorganizes its associative categories as needed. Theoretically it can even be made to have an unrestricted storage capacity by moving away from single-node patterns in the storage layer.

This leaves, however, at least one major unsatisfied criterion: a truly autonomous machine must place no restriction on the form of its input signal. Unfortunately, an ART 1 network can handle only binary patterns. This limitation is built into the way the network subsystems interact, implying that it is fundamental to this architecture. This has two ramifications. First, it limits the total number of distinct input patterns that an n-node input layer can allow to 2^n. This is an important limitation only for small networks. Ten input nodes, for instance, can handle only 1023 distinct patterns. One hundred nodes, however, allow input of over 10^30 separate patterns. The second ramification of binary input nodes is that they place requirements on the type of preprocessing we must give real-world input signals. Under some circumstances, this can be costly in hardware complexity, hardware price, processing time, and total system power. For an application in which the ART 1 system is used with equipment already operating in a digital mode, however, this need not be a serious restriction.

[SO?? This so-called restriction may be the very essence of fractal I Ching: binary structures of many dimensions which seem to abound in the real non-language world!]

ART 2

ART 2 removes the binary input limitation of ART 1; it can process gray-scale input signals. The cost of this fix, however, is a considerable increase in complexity. ART 2 systems proposed so far have as many as five input sublayers, each with its own gain control subsystem. Further, each sublayer typically contains two networks.

[Thus later evolution speaks of the 5 elements! With this structure 5 elements allows fuzzy systems!]

There are two major reasons for ART 2's added complexity. The first concerns noise immunity. The noise problem of a network such as ART 1 designed to recognize binary patterns is relatively minor. Individual pattern elements have a sharp "yes-or-no" nature. Thus, to change one element of a binary pattern, a noise signal must be of virtually the same magnitude as the pattern element. The noise problem of a network designed to recognize gray-scale patterns is much more severe, however. Individual elements of such patterns do not have the sharp yes-or-no nature binary patterns possess; instead they can take on a range of values. Two patterns that differ by only one gray-scale value in one element are treated as quantitatively different. As a result, noise that changes only a single pattern element by one gray-scale value may make the input pattern unrecognizable.

The second reason for ART 2's added complexity is that alikeness becomes a much fuzzier concept with analog signals, even with no noise present, because each input element can take on as many values as are in the gray scale being used. Identical, similar, different, and very different are ambiguous terms that must be quantified by the network in some way.

Several ART 2 architectures have been designed by Carpenter and Grossberg and by others. All of these systems accept gray-scale inputs, and they differ mainly in the number and function of the sublayers in the input superlayer.

Grandmother Nodes and ART

As they are usually presented, both ART 1 and ART 2 use "grandmother" nodes in their storage layer. A grandmother node is one that alone represents a particular input pattern. This winner-take-all feature does not appear to be essential to the operation of either type of ART system. We have seen that more than one winner may be allowed in a competitive architecture. However, nearly all actual implementations of ART networks (both ART 1 and ART 2) use this scheme. If the output of a system must operate a relay or be viewed by a human, having a single output node correspond to a particular class of input patterns may make sense. If the output of the system is to be interfaced to other neural networks or to a digital system, however, using grandmother nodes may make no sense at all. Let's see why.

First, using grandmother nodes limits the storage capacity of an ART system to the number of nodes in its storage layer. The severity of this limitation becomes clear when we realize that even with binary storage-layer nodes, the memory capacity we could obtain if each combination of n nodes could be used to store a separate binary pattern would be 2^n. ...

A second way in which grandmother nodes may be a problem is that they reduce the reliability of a system. Failure of a single component in one node can cause the pattern coded into that node to be lost. Although there are ways to compensate for this danger, it is one we would rather not have in machines that we expect to use in applications requiring high fault tolerance.

Third, restricting the storage layer to patterns containing only one node limits the ability of the network to represent a hierarchy of concepts or objects. To understand this, we must look at the effect of pattern complexity not on how many input patterns we can store but on the way those stored patterns can be associated.

It may not be intuitively obvious, but it takes fewer storage nodes to encode an input pattern representing a high-level general concept or complex object than it does to encode an input pattern representing a single specific concept or object. The broad concept "tree" needs fewer nodes to encode it than the concept "cherry tree." Treeness must be part of the pattern for cherry tree, as must fruit treeness, hardwoodness, deciduousness, and a host of other concepts needed to distinguish cherry trees from other kinds and classes of trees. "The cherry tree in my front yard," a specific object, requires still more nodes to encode because its coding pattern must carry the extra information required to identify it as a particular tree yet still contain the subpatterns representing "cherry tree" and "tree." Thus, a layer designed to allow storage of hierarchical information must have the ability to form patterns consisting of different numbers of nodes. A storage layer that consists only of grandmother nodes simply will not do. Are there cases in which we would want to use the same complexity in both the input and storage layers of an ART network? Perhaps, but not normally. We usually want to reduce the complexity, or the dimensionality, of the input signal before we store it in the storage layer, so that only the essential features of the input pattern are preserved. That is one reason that we use a competitive structure in the storage layer; allowing competition in that layer guarantees just this outcome.
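One way to picture this, purely as an illustration (the feature names are invented): code each concept as the set of storage nodes it activates, so that the more specific concept is a strict superset of the general one.

```python
# Illustrative only: hierarchical concepts coded as nested sets of active
# storage nodes.  The node labels are made up for the example.
tree                 = {"woody", "branching", "rooted"}
cherry_tree          = tree | {"deciduous", "hardwood", "fruit-bearing", "cherry"}
my_front_yard_cherry = cherry_tree | {"location:my-front-yard", "height:4m"}

# The more specific the concept, the more nodes its pattern needs ...
print(len(tree), len(cherry_tree), len(my_front_yard_cherry))   # 3 7 9

# ... yet every specific pattern still contains the general one as a subpattern.
print(tree <= cherry_tree <= my_front_yard_cherry)              # True
```

A grandmother-node layer cannot express these nested patterns, because every stored concept, however general or specific, occupies exactly one node.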

...

Changing certain parameter values by as little as 5 percent can have disastrous consequences for the network's operation. Such fine-tuning requirements make clear that ART 2 has serious problems as a model of our fuzzy, imprecise biological brains.

In fact, both ART 1 and ART 2 have one more serious drawback as a biological model: the input and storage layers, as well as the reset and gain subsystems, must be fully interconnected with each other. The wiring this implies, compared with the actual structure of the brain, makes it unlikely that the brain makes wide use of an ART-like architecture. Combined with the grandmother-node scheme, these connectivity requirements leave ART 1 and ART 2 with serious drawbacks as models of the biological brain.

[Note to personal friends: I may not be able to include all these quotes directly in a published version, but in a CD-ROM or network version I will include even more depth. The point is to finish this book to express my ideas about this material. We no longer need to apologize for inclusion when hand-held books will soon be an odd custom of a wonderful part of our history. Anyway, I can include what is needed so the so-called beginner is not handicapped by not having these resources available. Also, these "sources" from 1990 are already dated and being replaced by new research.]

End of excerpt from Naturally Intelligent Systems.

Biological neural nets and their structuring logics are the central organizing principle of this book.

Since biological n.nets are self-organizing, and cannot be "influenced" or supervised by outside "forces", any reference to creator gods, to the influence of the stars, to heredity, to survival of the fittest, or to the environment is seen as only a culturally determined model of redundant cultural systems. This means that structures and elements of the cosmos must be developmentally incorporated into self-organized systems to model the environment and new functions and skills. For instance, the development of three-dimensional binocular vision needed a new dimensional computational apparatus. Since living systems use or alter successful structures to fit new circumstances, it follows that if previous levels of development incorporated the cosmos in predictive computational procedures, new procedures would follow this same line of development.

The mathematics of chaos theory shows that small permutations over the hundreds of millions of years of evolution of single-celled life could be incorporated in awareness structures. This means that single cells may detect and respond to small energy changes undetectable by larger multicellular organizations. It is my contention that multicellular organizations, whether at the size of protozoa or in animal organ systems, or in neural net or social networks, can be based on the STRUCTURE of these small energy permutations. These structures are themselves chaotic and cyclical in nature, which would tend to produce all forms of life structures and not be confined to particular "grooves" and determinacy. [Doesn't bode well for the future of monotheism, of determinacy by God or by survival of the fittest!]


10-1-94: 8 am

Control as a byproduct of lateral inhibition on layers of competitive n.nets.

Control as self-control needs the construction of a self to control, a self that does the controlling, and something that needs controlling. This control also points to the ability to concentrate by using the procedure of ignoring anything and everything outside of some prescribed boundary. From the point of view of n.net surfaces with multiple competing models of behaviors, one model of the "self" is allowed to learn and to inhibit or control all competing models. Thus the pattern of having a self with very well-defined boundaries can be described as a "together" self. All that means is that the other inner models are ignored in favor of the "rules" of a single isolated self. All input is channeled through this self, and any solutions that originate in the other areas are ignored. These other solutions are perceived as feelings, intuitions, emotions, and in general are labeled as disruptive.
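As a toy illustration of this idea (not a model from the book), a layer with simple all-to-all lateral inhibition lets the strongest "self" model silence its competitors; the inhibition constant and the model names are arbitrary choices for the sketch.

```python
import numpy as np

def lateral_inhibition(activations, inhibition=0.2, steps=50):
    """Toy competitive layer: each node suppresses every other node a little
    on each step; after enough steps only the strongest model survives."""
    a = np.asarray(activations, float).copy()
    for _ in range(steps):
        total = a.sum()
        a = a - inhibition * (total - a)   # inhibition received from all other nodes
        a = np.clip(a, 0.0, None)          # activations cannot go negative
    return a

# Several inner "models" compete; the dominant self ends up alone.
models = {"rule-following self": 0.9, "intuition": 0.7, "emotion": 0.6}
result = lateral_inhibition(list(models.values()))
print(dict(zip(models, result.round(3))))
# -> the "rule-following self" keeps a nonzero activation; the others go to 0
```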

[Hear notes on tape; include after the explanation of fractals.]

Another property of self-organizing neural nets is that, given networks in several different individuals, each network having identical inputs and outputs, the different nets organize the solution path differently yet come out with identical solutions: they are chaotically structured, and no model of their organization is "correct" or standard even though their behavior is identical. This has traditionally been stated as "there are as many different ways to truth as there are individuals." Since the end behavior of this "black box" is identical across the culture, a single pointer called "name/word" can be assigned to this network. Thus the illusion of equating meaning with structure. Thus the study of knowledge [epistemology] in and by humans that finds the "one right way" for the [global] organization of social, religious (especially monotheistic), political, or learning institutions is not based on the biology of the human or any other species. In my model, such efforts are attempts by some individuals to isolate the "One Mind" from its ability to self-organize and to substitute centralized self-control and bivalent "rationality". Since this is impossible, all that results is the creation of phase filters that arbitrarily select individuals as competitively more suitable and assign greater value to some individuals, among identically functioning biological individuals, because of their developmental similarity to the "standard model", which may have been developed in a religious context or around emperors or other "hero models".
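A small demonstration of a weak version of this claim, using an invented two-layer net: shuffling the hidden nodes yields a network whose internal wiring differs from the original but whose input-output behavior is exactly the same. Independently self-organized nets differ far more deeply than a mere shuffle, but the sketch shows how "different organization, identical behavior" can be checked directly.

```python
import numpy as np

rng = np.random.default_rng(1)

def forward(x, w_in, w_out):
    """Tiny two-layer net: input -> hidden (tanh) -> output."""
    return np.tanh(x @ w_in) @ w_out

# One network ...
w_in  = rng.normal(size=(4, 6))
w_out = rng.normal(size=(6, 2))

# ... and a second whose hidden nodes are rearranged: internally it is
# organized differently, yet it computes the same mapping.
perm    = np.array([2, 0, 5, 1, 4, 3])
w_in_2  = w_in[:, perm]
w_out_2 = w_out[perm, :]

x = rng.normal(size=(5, 4))                       # five arbitrary input patterns
print(np.allclose(forward(x, w_in, w_out),
                  forward(x, w_in_2, w_out_2)))   # True  -- identical behavior
print(np.allclose(w_in, w_in_2))                  # False -- different "wiring"
```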

Neural nets are like a traffic system in a closed or limited-access network of streets:

if one closes or changes the flow at one single node, it changes the pattern of conductivity throughout the system, if the baseline is the pattern of flow and not individual signals!! This models the holographic, distributed nature of mind [harmony is the message, not single objects] and models brain "activity" as single neurons producing a whole pattern, an image, because of their placement. It also models the whole-pattern flow of "Qi" in the "meridians" as a network pattern.

ccc13 The "experience" of emotion is pattern! Alien language is harmonic, cyclic, discrete "words", as is the genetic pattern, and proteins as folded patterns. [See what Terry recognizes as pattern and information.]



Neural Nets - "Naturally Intelligent Systems"

Maureen Caudill and Charles Butler

The training method of competitive systems of culture goes to peaks and valleys and minima. This applies to the level of incompetence as a local minimum, with each side as right-left, Republican-Democrat. These two sides tend to converge to a middle of stability, peacefulness. Status quo - stagnation - standstill - #12. Going to divergence and conceptualizing: making objects and nominalizations.

[Hand sketch in the original notes contrasting the US ECONOMY and JAPAN.]

This is also a general systems theory of emotion - a competition brain theory of sociological systems - political and economic, i.e., cultural - trained, supervised. Untrained, self-organizing systems are spiritual, creative, and artistic.

(Establishment).