# Are Linear Output Activations Better than Tanh for XOR?

Discussion in 'MATLAB' started by Greg Heath, Nov 13, 2004.

1. ### Greg Heath (Guest)

Motivated by several recent newbie questions about solving XOR,
I used the 2-2-1 minimal topology (the minimal no. of hidden
nodes is ceil((n+1)/2) for the n-parity net) to compare the
performance of linear and tanh output activation functions.
These are the options 'purelin' and 'tansig' in the MATLAB
routine newff().

Experienced designers typically use the linear output activation.
Unfortunately, MATLAB has implemented the more complicated
tanh as the default in newff. This can be annoying to the
experienced, but more importantly, it can be confusing for
the newbie.

The inputs and outputs were bipolar binary {-1,1} and the hidden
node activations were 'tansig'. All of the default options were
used which led to the Levenberg-Marquardt algorithm with
Nguyen-Widrow weight initialization and stopping criteria

1. MSE = 0
2. Minimum MSE gradient <= 1e-10
3. Maximum no. of epochs = 100

Summary statistics were obtained on

1. Number of epochs to convergence or stopping
2. Training time
3. Mean-Square-Error

The results below were obtained from 100 trials for each type
of classifier. As a reference for performance, the naive model
that always outputs the target mean (0) has an MSE of 1.0.
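For intuition, the 2-2-1 architecture with tanh hidden units and a
linear output can represent bipolar XOR essentially exactly with
hand-chosen weights; here is a minimal sketch (in Python rather than
MATLAB, with a gain and weights of my own choosing, not taken from the
trained nets):

```python
import math

def net(x1, x2, g=10.0):
    """2-2-1 net: tanh hidden layer, linear output, bipolar I/O."""
    h1 = math.tanh(g * (x1 + x2 + 1))   # ~ -1 only for input (-1,-1)
    h2 = math.tanh(g * (x1 + x2 - 1))   # ~ +1 only for input (+1,+1)
    return h1 - h2 - 1                  # linear output node

# Bipolar XOR truth table: (x1, x2, target)
patterns = [(-1, -1, -1), (-1, 1, 1), (1, -1, 1), (1, 1, -1)]
mse = sum((t - net(x1, x2)) ** 2 for x1, x2, t in patterns) / 4
print(mse)  # tiny, and it shrinks further as the gain g grows
```

The residual error comes only from tanh saturating short of exactly
1; with a large enough gain it drops toward the astronomically small
MSEs reported below.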

In all TANH XOR designs the gradient stopping condition
prevailed. The 56 successful nets had MSE < 1e-11. Of the 44
unsuccessful nets, 39 stopped with an MSE of 0.5, 1 with 0.67,
and 4 with 1.0. The median MSEs were 3.5e-12 and 0.5 for the
successful and unsuccessful nets, respectively.

I didn't check whether the stopping points for the unsuccessful
nets were local minima or saddle points.

In the LINEAR XOR designs the gradient condition prevailed for
the 64 successful nets. These had MSE < 5e-20 with a median of
1.6e-27! All 36 of the unsuccessful nets were automatically
stopped at 100 epochs with a MSE of 0.5.

Results for the successful nets (56 Tanh and 64 Linear)
are summarized below:

|      | Epochs (tanh) | Train Time (tanh) | MSE (tanh) | Epochs (linear) | Train Time (linear) | MSE (linear) |
|------|---------------|-------------------|------------|-----------------|---------------------|--------------|
| min  | 11            | 0.3               | 2.7e-13    | 4               | 0.3                 | 2.8e-32      |
| med  | 15            | 0.4               | 3.5e-12    | 6               | 0.3                 | 1.6e-27      |
| mean | 18            | 0.4               | 3.9e-12    | 13              | 0.4                 | 7.4e-22      |
| std  | 6             | 0.1               | 2.0e-12    | 16              | 0.1                 | 5.7e-21      |
| max  | 37            | 0.6               | 9.4e-12    | 66              | 0.8                 | 4.6e-20      |

Next, the results for the unsuccessful nets (44 Tanh
and 36 Linear):

|      | Epochs (tanh) | Train Time (tanh) | MSE (tanh) | Epochs (linear) | Train Time (linear) | MSE (linear) |
|------|---------------|-------------------|------------|-----------------|---------------------|--------------|
| min  | 2             | 0.3               | 0.5        | 100             | 0.9                 | 0.5          |
| med  | 18            | 0.4               | 0.5        | 100             | 1.0                 | 0.5          |
| mean | 17            | 0.4               | 0.6        | 100             | 1.0                 | 0.5          |
| std  | 5             | 0.1               | 0.2        | 0               | 0.04                | 0.01         |
| max  | 32            | 0.6               | 1.00       | 100             | 1.1                 | 0.5          |

Finally, the results for the complete sample are
summarized:

|      | Epochs (tanh) | Train Time (tanh) | MSE (tanh) | Epochs (linear) | Train Time (linear) | MSE (linear) |
|------|---------------|-------------------|------------|-----------------|---------------------|--------------|
| min  | 2             | 0.3               | 2.7e-13    | 4               | 0.3                 | 2.8e-32      |
| med  | 17            | 0.4               | 6.4e-12    | 12              | 0.4                 | 5.9e-24      |
| mean | 17            | 0.4               | 0.24       | 44              | 0.6                 | 0.2          |
| std  | 5             | 0.1               | 0.29       | 44              | 0.3                 | 0.2          |
| max  | 37            | 0.6               | 1.00       | 100             | 1.1                 | 0.5          |

Although this is a limited example and proves nothing,
I recommend using the linear output activation unless
there is an overwhelming mathematical or physical
constraint that must be obeyed.

Only then would I consider tanh or softmax.

Hope this helps.

Greg

Greg Heath, Nov 13, 2004

2. ### A. Bulsari (Guest)

Respected Dr. Heath,

1. The "XOR problem" is an ill-defined problem until the actual
function or the logical expression being approximated is made clear
(which is almost never the case, and allows kids of all ages to play
with it forever).

2. You seem to be catching the infection of experimental mathematics,
even though you seem to have a fairly deep understanding of
approximation theory.

3. My point, which you ignored in our previous communication (about
classification as approximating discontinuous functions), applies
here also; you do not seem to see the two as different issues
with a little overlap. Of course, my point (1) above would resolve
this issue.

With kind regards,

A. Bulsari

A. Bulsari, Nov 17, 2004

3. ### Greg Heath (Guest)

A. Bulsari wrote:
Excellent way to start a reply! (Hear that, newbies?)
Sorry. I thought there was only one way to state the "XOR problem"
using bipolar binary I/O.

True. Some of us never grow up.

My post was in response to the numerous questions from newbies
via e-mail and newsgroup postings. It's probably that time of
the school year.

Since retirement, I've become very interested in learning
to use MATLAB for neural networks and other applications (my
previous experience was 40 years of FORTRAN, the last 25 years
of which was coded by software engineer support staff).

My original target was newbies in c.s-s.m (comp.soft-sys.matlab).
However, I thought cross-posting to c.a.n-n (comp.ai.neural-nets)
might be useful.

Sorry if you think otherwise.

Seem? You should see some of the stuff I haven't posted!

Also, keep in mind that I am (can I say that when I've retired?)
an engineer whose forte was in solving physical problems using
mathematics. Although my interest in semi-formal mathematics
is probably much more than the typical research engineer,
computer experimentation has been, and still is, one of my most
useful tools.

Now, I'm just having fun playing with MATLAB solutions to
interesting newsgroup questions in applied math.

I don't have a clue as to what this means. Please email me
with some details and we can hash this out between us. We
can post our conclusions in c.a.n-n whether or not we agree.

Ditto,

Greg

Greg Heath, Nov 18, 2004

4. ### Greg Heath (Guest)

A. Bulsari wrote:
I didn't ignore it. For some reason, I never saw it.

I'll read it and respond via c.a.n-n.

Sorry, but why didn't you email me?

Greg

Greg Heath, Nov 18, 2004
5. ### A. Bulsari (Guest)

As I see it, it is not a _problem_ at all, or is ill-defined at
best. From a logician's point of view, there are only four members
(four points) in the domain of the function (for Boolean logic), or
the closed unit square [0,1]^2 for fuzzy logicians (who would also
need to specify the T-norm and S-norm for the fuzzy operations), on
which we would want a feed-forward neural network (preferably an MLP)
to predict the XOR. From a function approximator's point of view
(and you may say that such people have no business being involved in
this discussion), the domain of the function would have to be defined
(perhaps extending even to negative values), the XOR of non-crisp
inputs would have to be defined, and only then can there be a question
of a neural network imitating or approximating that function.

I haven't written this up very clearly anywhere, nor do I know of
papers in the literature that do, but this result should be borne in
mind by anyone doing classification or logical operations with neural
networks:

* Any Boolean logical expression can be implemented (approximated with
zero error) with one hidden layer and one output layer with Heaviside
activation functions (step functions, or infinitely steep sigmoids). I
am not a mathematician and do not usually prove these things
rigorously, but I can write an outline of the proof if someone wants
to claim credit for it.
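The claim is easy to illustrate for XOR itself; a sketch in Python
with hand-set weights and thresholds (my own choice), using the
classic OR-but-not-AND decomposition:

```python
def step(z):
    """Heaviside activation: fires when the weighted sum crosses 0."""
    return 1 if z >= 0 else 0

def xor_net(x1, x2):
    # Hidden layer: one OR unit and one AND unit over {0,1} inputs
    h_or  = step(x1 + x2 - 0.5)
    h_and = step(x1 + x2 - 1.5)
    # Output unit: fires iff OR is true and AND is false, i.e. XOR
    return step(h_or - h_and - 0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))
```

The same construction generalizes: write the expression in disjunctive
normal form, use one hidden Heaviside unit per conjunction, and OR the
hidden units at the output.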

* All classification problems are "category B" (Boolean outputs)
problems, and are based on the ability of feed-forward neural networks
to implement any Boolean logical expression. There is no better way to
solve classification problems with real numbers as inputs than by
using feed-forward neural networks with one or two hidden layers and
with sigmoids also on the output layer. (Politically correct people
may do some other things in some countries. There are several amusing

The XOR Boolean expression can be implemented with a (2,2,1) MLP. That
would have 9 weights. Such a network cannot be trained sensibly from 4
observations. Sorry, I am not used to talking much with the
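The count of nine follows from full connectivity with a bias on every
hidden and output node; a quick check in Python (helper name my own):

```python
def n_weights(n_in, n_hid, n_out):
    # Each hidden node: n_in weights + 1 bias;
    # each output node: n_hid weights + 1 bias.
    return (n_in + 1) * n_hid + (n_hid + 1) * n_out

print(n_weights(2, 2, 1))  # 9 = 6 hidden-layer + 3 output-layer parameters
```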

The XOR fuzzy expression with product as the T-norm can be
approximated with the "productive" networks I wrote about in the early
1990s. I do not encourage anyone to waste much time on them, but they
are a result of an interesting observation about being able to detect
fuzzy relations between truth value inputs and truth value or crisp
outputs (in other words, one can determine the logical expression by
learning, somewhat like in inductive learning methods.)

As a function approximation problem, the XOR expression and the
function domain need to be specified, and this is never done. Still,
most kids of all ages play with neural networks as if the "XOR
problem" were a function approximation problem. This, in my opinion,
is childish.

My stand on the matter: there is no such thing as an XOR problem, or
it is an ill-defined one at best. Nobody has ever defined it, and if
it is supposed to mean the Boolean expression, then neural network
training is not the way to go about it. One can write the weights of
the network down in ten seconds, as many of you can do.

You are very welcome to c.a.n-n. also. You are doing a great service
to people new to neural networks.

I too am an engineer, used to writing in Fortran and Basic, and my
interest in deeper understanding of function approximation theory is
exclusively to be able to solve real world nonlinear modelling
problems in process engineering and materials science as best as
possible. Feed-forward neural networks turn out to be valuable for
approximating the unknown nonlinearities. For known nonlinearities, or
qualitatively known nonlinearities, one can do better. (Which reminds
me that nobody ever responded to my question of whether anyone had
bothered to formally classify qualitative nonlinearities.)

I think we can discuss that by e-mail. This c.a.n-n. is probably more
suited for people of all ages having trouble with back-propagation.

With kind regards,

A. Bulsari

A. Bulsari, Nov 20, 2004