Introduction
What is the most fundamental particle? If my assumption is correct, it is the pair Presence and Absence (PA), or at least there is a unique isomorphism between any other set of fundamental particles and this pair, which is sufficient as far as I am concerned. I have talked about them in [this] previous blog post of mine. Now, let’s try to find the most fundamental operation, action, or function that could be applied to the fundamental perceptions. I am going to refer to such a function as the differentiator from now on. I still don’t know much about the nature of this function; however, I can at least start to define some of its properties and describe them. First things first: because everything is made out of the fundamental particles PA, the output of the differentiator is no exception. If we embrace the assumption that the inputs and outputs of this function are essentially of the same type, then we can start imagining what beautiful recursive behaviors could emerge.
If I gave you an incomplete sequence and then asked you to find its next element, you would likely have one dominant guess. If I asked you to justify your answer, you could easily say that each element of the sequence is generated by the successor function applied to the number right before it. Note that such a justification is not the only valid one that gives the same prediction; in fact, there are infinitely many such justifications. You could then ask: how do we know which one I had in my mind to generate the “original sequence”? Well, the only way of knowing is to repeat this procedure and eliminate the previously valid justifications that no longer hold. However, it is important to keep in mind that one can never tell whether a justification is the “absolutely correct” one; the only things that can be claimed are that a justification is provably wrong, or that it still holds (without any proof of “absolute correctness”). By absolutely correct, I mean the same justification that I had in my mind to generate the sequence.

Let’s consider another example. Try to guess what the ? marks in the tables given above should correspond to. If, at some point, you had guesses of 0 and -1 respectively, let me ask you to justify these answers. Okay, it is not hard to realize that we usually justify our answers by using the concepts (i.e., functions, in this case) that we have defined. You could then ask, “what if we hadn’t had the notions of addition and multiplication?”, and you could also ask this question whenever you can hardly guess the next element of a given sequence. The reason why you cannot produce a justified guess for the next element, or in other words, why the sequence is very hard to follow, could very well be that we have not directly defined an appropriate action/function for it. That’s why you then have to combine what you have already defined in order to “define the appropriate function” for the sequence at hand. However, the story does not end here; let’s see what would happen if things went in the opposite direction of our expectations…
What if no combination of already existing functions gives you the appropriate function generating the sequence mentioned in the previous paragraph? It could be scary; however, there is still hope… a hope for discovering yet another “prime number in the set of natural numbers”. This is just an analogy I have made to say that we might discover a new function that is not found in the set of all combinations of the already defined functions.
But wait! How do we even define a function at this point? If there is a new function that cannot be represented in terms of the already existing ones, then how do we even capture its nature? Assigning a symbol or a name to the function is not a problem at this point; rather, our main concern is “understanding its nature” somehow. Well, since it turns out to be a deeply fundamental function, we do not get to understand it but just memorize it as one of the building blocks. This is also what we do in mathematics with our assumptions: our mathematical assumptions are not understood within mathematics itself but, instead, outside of it. But who knows how we shall expand our view of mathematics in the future?
Motivation
What is an abstraction/generalization? How are we capable of performing it? Let’s think about these two questions for a bit. First, we need to ask whether abstraction and generalization are the same thing as far as we are concerned. Abstraction is the process of keeping the parts of something that are relevant to the application at hand and getting rid of all the other, irrelevant details. That being said, the irrelevancy of something is determined by the fact that arbitrary modifications to it will not affect the final result in any way that we care about. Generalization is the process of moving from one level of understanding of something to a higher level of understanding, where we have more concepts and/or relations and can understand and talk about even more instances than we could before. Note that these are not formal definitions of abstraction and generalization; they are here to give the reader a high-level (informal) intuition about these notions.
One could see from the informal definitions of abstraction and generalization that they are very closely related, if not the same. If there were objects about which I would like to prove some statements, it might be enough to grab some known properties of those objects and proceed with them, rather than with all known properties. Okay, now you might wonder how this is related to generalization at all, since it is all about the same objects. Well, it is about the same objects indeed, but the statements about these very same objects can also be viewed as instances of the set of all statements about them. So, we understand and talk about even more instances than we could before, and therefore it is a form of generalization. Now imagine that I was able to understand or talk about more instances of, let’s say, cars by using some kind of generalization. Is that the same as abstraction? Well, it might or might not be in this case: if I used abstracted properties of cars to come to my conclusions about a specific subset of cars, then the generalization was obviously possible because of abstraction, whereas if I took some of the properties of car instances and turned those properties into statement(s) about the whole set of cars by “induction” (note that this is not mathematical induction, since mathematical induction is more like deduction in disguise), then I have done more than just abstraction in order to generalize.
How are we capable of performing abstraction and/or generalization? Although I am not sure at all, I have some guesses. Introspectively speaking, I myself, as someone who is conscious of his struggles with abstract math, have come to the realization that gaining a rich understanding of something is related to the ability to create patterns over patterns. For example, arithmetic is relatively simple to understand but is not enough, by itself, to solve more complex problems. On the other hand, algebra is a bit more complicated than arithmetic but has more applications in the real world. If we look carefully, we can see that the patterns of algebra are actually generalizations of the patterns of arithmetic, or in other words, they are patterns on patterns.
Theory
Recursive Perceptual Differentiator or RPD is a function f with the following properties:
- Domain-Range equivalence, i.e., f : Y × Y → Y for some space Y
- Right identity, i.e., f(x, f(x, y)) = y
- Negation, i.e., f(x, y) = -f(y, x)
RPDs can be used for learning abstractions by finding invariants in the given perceptions. They are, therefore, capable of generalization, although I am going to discuss later to what extent they can possibly generalize. Their recursive nature allows them to operate on different levels of abstraction in order to derive even higher-level abstractions.
DR equivalence
DR (Domain-Range) equivalence states that the input space of an RPD is the cartesian product of its output space with itself. This property allows RPDs to be used recursively if needed, and we could easily imagine nesting them because of it. There is actually not much more to this property than what has already been mentioned; to illustrate it, we can look at simple examples.
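As a minimal sketch, assume (purely for illustration) that plain subtraction over real numbers is a valid differentiator instance; DR equivalence is then exactly what lets calls nest:

```python
# The output of f lives in the same space Y as each input component,
# i.e., f : Y x Y -> Y, so outputs can be fed straight back in as inputs.

def f(a, b):
    # Assumed toy differentiator: subtraction over Y = real numbers.
    return a - b

x, y, z = 10.0, 6.0, 3.0
level_one = (f(x, y), f(y, z))             # differences of raw perceptions
level_two = f(level_one[0], level_one[1])  # a difference of differences
print(level_two)  # 1.0
```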
Right identity
The right identity property of RPDs is given as shown below:

f(x, f(x, y)) = y

Or, as a shorthand, currying the first argument as fx := f(x, ·), we could write the following:

fx ∘ fx = id

This property is really important and probably the most interesting one among the others. The intuition behind having this property actually comes from P.I.P. – my favorite theorem from high school. In some sense, it is a theorem that uses recursive procedures systematically for function approximation, and the procedure that it uses is actually an instance of the set of all RPDs. So, RPD is a generalization of PIP in the sense of allowing an arbitrary (recursive) procedure that has the already-mentioned properties. For example, the differentiator used in PIP is defined as d(a, b) = a - b and moreover, d(a, d(a, b)) = a - (a - b) = b. I highly recommend that the reader read more about PIP to gain intuition for the decisions that I have made while defining the properties of RPDs.
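If (as I am assuming here) PIP’s differentiator is plain subtraction applied recursively, its behavior is easy to demonstrate: differencing a polynomial-generated sequence over and over eventually produces a constant level, an invariant:

```python
def d(a, b):
    # The assumed PIP differentiator: plain subtraction.
    return a - b

def difference_layer(seq):
    # Apply the differentiator to every adjacent pair of observations.
    return [d(b, a) for a, b in zip(seq, seq[1:])]

# Observations generated by the degree-2 polynomial n^2.
observations = [n * n for n in range(6)]    # [0, 1, 4, 9, 16, 25]
depth_one = difference_layer(observations)  # [1, 3, 5, 7, 9]
depth_two = difference_layer(depth_one)     # [2, 2, 2, 2]  (constant!)
print(depth_two)
```

Since d(a, d(a, b)) = b for subtraction, the original sequence can be rebuilt from one starting value plus the lower-depth differences, which is exactly the flavor of reconstruction the right identity axiom is meant to enable.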
Negation
The negation property simply states that the order of the parameters matters; in fact, it defines the function’s behavior under the opposite argument order as the negation of its original behavior.
Let’s see what happens when we play with this property and the right identity: combining the two axioms yields further statements that also hold, and from those, yet more statements can be proven in turn.
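A quick numeric sanity check of how the two axioms interact, again under my working assumptions that subtraction is one concrete RPD instance and that the right identity takes the form f(x, f(x, y)) = y:

```python
def f(a, b):
    # Assumed toy RPD instance: subtraction.
    return a - b

for x in range(-3, 4):
    for y in range(-3, 4):
        assert f(x, f(x, y)) == y      # right identity (as assumed here)
        assert f(x, y) == -f(y, x)     # negation: swapping arguments negates
        # Playing with both at once: the unknown inner term of the right
        # identity can be rewritten through the negation axiom.
        assert f(x, -f(y, x)) == y

print("axioms are consistent for the subtraction instance")
```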
The Important Question
The important question is, what is the RPD function that we are looking for? To answer this question, first we may need to provide answers to the following questions:
- What are the (potential) implications of defining different RPDs?
- Soundness of RPDs?
- Completeness of RPDs?
- Decidability of RPDs?
- How could one prove the formal language/grammar of an RPD?
- What are possible and/or reasonable definitions for RPDs?
- Which kind of space is “good” for perceptual encodings?
As Gödel’s incompleteness theorem states, a (sufficiently expressive) contradiction-free system cannot be complete. So does this mean that RPDs are doomed like other formal systems? The answer is: it’s complicated. Since we are using RPDs to find the invariant(s) through which we gain predictive power, we could end up having different, dynamically changing formal systems that soundly complete each other’s incompleteness.
Some implications of the existence of the identity element
We can even prove extra statements by using the axioms of the right identity and the negation. Let e be a right identity element and substitute y = e in the right identity; then f(x, x) = f(x, f(x, e)) = e follows.
The unique right identity element is the one that has the following property: f(x, e) = x for all x.
The identity element need not exist, but if it did (e.g., the identity element is 0 in the case of PIP), then extra statements such as the one above could be deduced.
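One more such deduction, spelled out under my reading of the negation axiom as f(x, y) = -f(y, x); it checks out for PIP, where the differentiator is subtraction and e = 0:

```latex
% Assuming f(x, y) = -f(y, x) and an identity element e with f(x, e) = x:
f(e, x) = -f(x, e) = -x
% For PIP: 0 - x = -(x - 0) = -x.
```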
Implementation
In this section I am going to talk about just one simple implementation of this theory using artificial neural networks. I am also going to discuss the scalability, flexibility, transferability, and explainability of such a network, as well as give a relatively high-level interpretation of it. Let’s also pick an example problem to tackle with an RPD-NN (RPD Neural Network). It is also worth mentioning that the RPD-NN will be trained in a self-supervised manner. For example, we could try to have a world model predict the next frame of a given video input.

The architecture may look a bit like residual neural networks, although it is clearly different. Moreover, what makes this architecture different and unique, as far as I know, is that both ANN modules are actually the same network (i.e., sharing the same architecture and weights) that approximates the differentiator function. Positional concatenations are indicated as + in the architecture. That being said, there are basically two constraints on the model behavior: the first constraint states that the invariant representation must be the same for all input samples, and the second one states that the network must be able to generate the latter input by using the former input and the invariant. The input and output dimensions of the network (if the output dimension is n, the input dimension is 2n) form another constraint, which is analogous to the DR equivalence. While the latter behavioral constraint is there to reinforce the right identity axiom, I am not going to add an extra constraint for the axiom of negation, for simplicity.
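As a sketch of how the two behavioral constraints could be scored, here is a toy version in which the “network” is the assumed subtraction differentiator and the loss names are my own inventions:

```python
# Toy illustration of the two behavioral constraints. The "network" here is
# the assumed subtraction differentiator; a real RPD-NN would be a trained ANN.

def net(a, b):
    return [ai - bi for ai, bi in zip(a, b)]   # f(a, b), elementwise

def invariant_loss(frames):
    # Constraint 1: f(x_{t+1}, x_t) should be the same for every t.
    invariants = [net(frames[t + 1], frames[t]) for t in range(len(frames) - 1)]
    ref = invariants[0]
    return sum(sum((vi - ri) ** 2 for vi, ri in zip(v, ref)) for v in invariants[1:])

def reconstruction_loss(frames, invariant):
    # Constraint 2: the former input plus the invariant should regenerate
    # the latter input (for subtraction, the "inverse" of f is addition).
    total = 0.0
    for t in range(len(frames) - 1):
        pred = [xi + ii for xi, ii in zip(frames[t], invariant)]
        total += sum((pi - yi) ** 2 for pi, yi in zip(pred, frames[t + 1]))
    return total

# A stream whose consecutive differences are constant satisfies both constraints.
frames = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
inv = net(frames[1], frames[0])            # [1.0, 2.0]
print(invariant_loss(frames))              # 0.0
print(reconstruction_loss(frames, inv))    # 0.0
```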
Convergence of RPD-NNs. The convergence criterion for this network is different from that of usual DNNs. It is very common to set the termination criterion for training as a number of epochs, or a minimum progress threshold (e.g., if the loss does not decrease by more than some threshold during a number of successive epochs, then terminate), or the moment the network starts performing poorly on a validation set (a.k.a. early stopping), or various combinations and modifications of these. Obviously, such measures could also apply to RPD-NNs; however, one extra termination condition for this particular network is obtained by measuring the difference between successive invariant representations and halting the training process once the invariant has stabilized over some number of iterations.
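A sketch of that extra criterion (the converging invariant stream below is simulated, and the function name and thresholds are my own):

```python
def has_stabilized(invariant_history, patience=3, tol=1e-3):
    # True once the last `patience` successive changes are all within `tol`.
    if len(invariant_history) < patience + 1:
        return False
    recent = invariant_history[-(patience + 1):]
    return all(abs(b - a) <= tol for a, b in zip(recent, recent[1:]))

history = []
for step in range(100):
    invariant = 2.0 + 1.0 / (step + 1)  # pretend estimate converging to 2.0
    history.append(invariant)
    if has_stabilized(history):
        break  # halt training: the invariant has stabilized

print(step)  # training stops well before the 100-step budget is used up
```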
Training RPD-NNs. Due to the two constraints mentioned above and the structure of the network, the training process follows a different approach than the commonly observed ones. For example, since the two modules share their weights, the backpropagation algorithm has to take this sharing into consideration by accumulating the gradients of the shared weights across the different layers in which they appear. The loss function here is designed to satisfy the second constraint on the network behavior. For the first constraint, we can take different approaches:
- Randomly sample pairs of inputs and expect their invariant representations to be equal
- Switch to the testing regime and execute backpropagation (maybe near convergence)
Of course, we could use input batching, and it would probably be especially helpful for training this kind of network. We could verify the invariants for all pairs of inputs in the batch against each other, although this comes at the cost of increased run-time.
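Batch verification could then look like the following sketch, where every invariant in the batch is compared against every other (the quadratic cost mentioned above):

```python
from itertools import combinations

def f(a, b):
    # Assumed toy differentiator over vectors: elementwise subtraction.
    return tuple(ai - bi for ai, bi in zip(a, b))

# A batch of (former, latter) input pairs from a stream with a constant step.
batch = [((0, 0), (1, 2)), ((4, 4), (5, 6)), ((7, 1), (8, 3))]
invariants = [f(latter, former) for former, latter in batch]

# O(n^2) check: every invariant in the batch must agree with every other.
mismatches = [(i, j)
              for (i, vi), (j, vj) in combinations(enumerate(invariants), 2)
              if vi != vj]
print(mismatches)  # [] means the whole batch shares one invariant
```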
Using a trained RPD-NN for prediction. To test the performance of the trained network, feed an input (e.g., a frame) to the network as the first argument; the second argument is automatically set to the stabilized invariant representation (of the inference rules). This is to say that during the testing regime, the network takes an input and gives an output of the same dimensionality, thanks to the invariant formed by the network during the training process. To visualize this process, you can go ahead and look at the picture given in the Explainability section.
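In the subtraction toy model used throughout, the testing regime collapses to the following sketch (the generation step, addition, is the assumed inverse of the differentiator):

```python
def f(a, b):
    # Assumed toy differentiator: elementwise subtraction.
    return [ai - bi for ai, bi in zip(a, b)]

invariant = [1.0, 2.0]  # pretend this representation stabilized in training

def predict_next(frame):
    # The second argument is pinned to (the negation of) the stabilized
    # invariant, so f(frame, -I) = frame + I generates the next frame.
    return f(frame, [-i for i in invariant])

print(predict_next([3.0, 6.0]))  # [4.0, 8.0]
```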
Scalability
Scaling up an RPD-NN can be done in essentially two ways: (1) the usual (practical) scaling of the number of network parameters, and (2) theoretical scaling of the recursiveness depth of the network. Scaling up DNNs, such as Transformers, by adding extra parameters has been pursued very intensively, so I am not going to talk about this aspect of scaling in this blog post.
Theoretical scaling, roughly speaking, refers to increasing/decreasing the depth of recursiveness in the network. Although we keep the network parameters the same as they were before, the structure of the interconnections is modified in a systematic way; by systematic, I mean it obeys the recursive nature of the PIP theorem. For example, an RPD-NN with a depth of 2 has the architecture shown below:

If you look carefully, you can probably verify that it follows a systematic construction, in which IR is an abbreviation for the Invariant Representation and the (maximum) recursive depth tells how many nested applications of the differentiator are stacked. There is one small nuance when you truly try to predict the next perception/observation: you will not be able to use the first rule directly, because the second argument of the differentiator function would be unknown. Instead, you have to use the second formula, substituting the unknown term with the second argument of the differentiator function. The resulting expression is totally computable, since the first argument as well as the output of the function are already known, and the remaining piece follows from the third formula. To compute further, we can use the right identity property, and in fact this substitution can be repeated as a formula for backward computation: if the second argument of the differentiator function is still unknown, substitute it again on the left-hand side of the equation. Finally, the second argument reduces, by the third formula, to an already-known quantity. It is also true that the recursive depth is always strictly less than the number of perceptions/observations that have been used in a single training epoch.
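The depth bound is easy to see in the subtraction toy model: every differencing level consumes one element, so reaching depth d requires at least d + 1 observations:

```python
def difference_layer(seq):
    # One recursion level: difference every adjacent pair.
    return [b - a for a, b in zip(seq, seq[1:])]

observations = [n ** 3 for n in range(5)]  # 5 observations from a cubic stream
levels = [observations]
while len(levels[-1]) > 1:                 # the deepest possible level has size 1
    levels.append(difference_layer(levels[-1]))

max_depth = len(levels) - 1
print(max_depth, len(observations))        # 4 5: depth is strictly smaller
assert max_depth < len(observations)
```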
Flexibility
The flexibility of such a neural network comes naturally from the black-box nature of the differentiator function. Maybe the black-box approach could be even more useful than we think, after all. For example, we could easily use a CNN auto-encoder, take the latent representations, and use them as inputs/outputs of an RPD-NN for the next-frame prediction problem. We could even replace the ANN architecture with decision trees or gradient-boosted trees, probably increasing the interpretability of the model’s behavior. This is to say that it does not matter what type of algorithm you use, as long as the constraints put on the model behavior are satisfiable (usually through iterative training).
Transferability
If you were using this network for a car racing environment and now want to test it on another environment, such as cart-pole balancing, you could just change the inputs/outputs (except for the dimensionality) and keep the invariant representation. Of course, the network may not perform as well in the new environment as it did in the old one, but the point is that you might need to train it only a little to perform equally well, rather than spending a lot of time on training. This really could be the case if the interpretation of RPD-NNs that I describe later on is somewhat reasonable.
As far as I know, knowledge transfer between two models (i.e., an expert and a newbie) usually happens by taking the first layers of the expert’s network, putting them into the newbie’s network as its shallow layers, and then training to adjust for the task at hand. This is also called fine-tuning, and with RPD-NNs it is not the same process. Moreover, you only need to transfer/keep the invariant, and hopefully it will capture even more “pure” knowledge when you train the network on a new environment.
Explainability
To understand how explainable an RPD-NN would be, it is important to first ask ourselves what we mean by explainability. Defining this concept formally has not been easy for humans, since the degree of explainability usually keeps humans themselves in the loop. For example, one of the current definitions reads that it refers to a mechanism/interface which accurately represents the model behavior and is able to convert it into a human-understandable representation. Without going down the rabbit hole of word games, I would like to note that, in my view, not all processes must necessarily be “understandable” by humans. In my opinion, the whole explainability question stands on top of the concept of human understanding. So, what is the definition of understanding? I don’t know for sure, but I think it is related to the relational richness among the concepts formed by humans. Sure, this is too abstract, one might think; however, I will not discuss this very thoroughly in this blog post. Just to clarify for the confused reader: imagine that you think you already have a good understanding of some concepts, such as functions, sets, numbers, etc., and now you are asked to study group theory. At first, it is clear to you that you do not understand what group theory is or what one could do with it. Then, after your studies, you realize that you have a pretty good understanding of it. How did you get it, though? You claim to have more understanding of the subject because you have successfully related its components and working principles to the things that you already knew, such as sets, functions, numbers, and so on. In other words, the more relations you create between a new concept and the old ones, the more understanding you claim to have.
As we go deeper into the recursive architecture of the network, we are going to get more and more abstractions that are harder to grasp. However, the same is the case with arithmetic, algebra, …, (advanced) abstract math. It is important to note that there is at least some systematic way of understanding abstract math from the ground up, since fundamentally every statement is either a single axiom or derived from a set of axioms, and therefore the compositions used by more advanced math are grounded in the set of axioms as well. At this point it is natural to ask whether we have that kind of grounding and compositionality for RPD-NNs.
We should have compositionality because of the structure of the network: if we were to take one sub-invariant and process it with the (universal) invariant, then we would potentially get another sub-invariant that is useful for solving tasks belonging to some other specialized domain.
Let’s talk a bit about semantics. If we assume that our RPD-NN model works reasonably well, then it no longer needs the latter input, because it can generate it by using the former input and the invariant it has formed over time. Now the whole model is cut in half: we throw away the invariant-forming part and keep the output-producing part, and the whole thing looks as shown below:

The question about the picture above is: why do we have a network part of whose input is always the same? Couldn’t we just get rid of the stable part of the input, since it probably carries no useful information? We usually do get rid of such stable parts of the input to reduce the curse of dimensionality and hence the computational cost of convergence. Well, if the network is able to perform well anyway, then it could also be okay to train it without using the invariant. However, do we really want to do this? My intuition says that if the input distribution will not change much then go for it, but if you are not sure, then this invariant representation may carry something much richer than what it has just been trained with. Having the invariants also probably helps to increase modularity. That being said, note that the invariant is actually no different from a bias to the network and, therefore, can be seen as an “extracted bias”.
Interpretation
How could we interpret the behavior of RPD-NNs at a relatively high level of abstraction? Let’s sketch out the network architecture one more time for this purpose, but this time renaming the network modules and their outputs. It is as if the RPD-NN were an inference engine and the invariant output were the essential inference rules that are somehow applied to the input (e.g., known facts) in order to produce the output (e.g., new facts or theorems). The hierarchical structure of the network naturally allows somewhat abstract concepts or categories to compose and decompose, depending on the direction in which you move across the different depths.
Combinators are an essential part of combinatory logic, which eliminates the notion of quantified variables from mathematical statements. They are used to construct functions purely by (symbolic) pattern matching and substitution. One important and old pair of combinators is the S-K pair, invented by Moses Schönfinkel around the 1920s.
These two combinators alone are capable of performing universal computation, which means that whatever we can implement on a Turing machine, we can also implement using the S and K combinators. For example, let’s construct the logical AND and NOT functions using combinators:

AND x y = x y F
NOT x = x F T

where T = K and F = S K, which correspond to the boolean values of true and false respectively (T a b reduces to a, while F a b reduces to b). If we tried to resolve the AND function with x and y as boolean value placeholders, then the process would be pure pattern matching and substitution over these definitions, and the truth table could easily be obtained as shown below:
- If x = T and y = T then AND x y = T
- If x = T and y = F then AND x y = F
- If x = F and y = T then AND x y = F
- If x = F and y = F then AND x y = F
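These reductions are mechanical enough to replay in a few lines of Python. The encoding below (T = K, and F written as K I because Python evaluates eagerly) is one standard choice, and to_bool is just my own helper for decoding the result:

```python
# S and K as curried Python functions.
S = lambda x: lambda y: lambda z: x(z)(y(z))
K = lambda x: lambda y: x

I = S(K)(K)   # the identity combinator: I a -> a
T = K         # T a b -> a
# Symbolically F = S K, but under Python's eager evaluation the equivalent
# K I (which also reduces F a b -> b) is the safe encoding.
F = K(I)

AND = lambda x: lambda y: x(y)(F)   # pick y when x is T, else F
NOT = lambda x: x(F)(T)             # pick F when x is T, else T

def to_bool(b):
    # Decode a combinator boolean by letting it choose between Python values.
    return b(True)(False)

print([to_bool(AND(x)(y)) for x in (T, F) for y in (T, F)])  # [True, False, False, False]
print([to_bool(NOT(x)) for x in (T, F)])                     # [False, True]
```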
You are probably really fascinated by what you have just seen if you did not know all this before. I was, and I still am. Now the question arises: where the hell are the semantics in this computation? What were Schönfinkel’s intuitions while thinking about such combinators? I am not going to try to answer these questions in this blog post, but they are worth keeping in mind for any universal computational system.
One possible interpretation could emerge from an analogy between RPD-NNs and S-K combinators. In the case of combinators, there needs to be a machine, essentially a substitution machine, that processes the given input by using the pre-defined (invariant) combinators. RPD-NNs could be that inference/substitution machine, and the invariant inference rules/representation could be the output of the network at the predefined maximum recursiveness depth. Remember that the same questions about the semantics/meaning of computation arise for RPD-NNs as well.
References
- P.I.P. – my favorite theorem from the high school
- P.I.P. continued…
- Uncertainty, Presence and Absence
External resources
- Wikipedia contributors. (2022, December 25). Recursive neural network. In Wikipedia, The Free Encyclopedia. Retrieved 10:27, January 16, 2023.
- Wikipedia contributors. (2023, January 12). SKI combinator calculus. In Wikipedia, The Free Encyclopedia. Retrieved 13:24, January 16, 2023.