Are Linear Output Activations Better than Tanh for XOR?

Discussion in 'MATLAB' started by Greg Heath, Nov 13, 2004.

  1. Greg Heath

    Greg Heath Guest

    Motivated by several recent newbie questions about solving XOR,
    I used the minimal 2-2-1 topology (the minimal number of hidden
    nodes for the n-parity net is ceil((n+1)/2); for XOR, n = 2,
    which gives 2 hidden nodes) to compare the performance of linear
    and tanh output activation functions. These correspond to the
    options 'purelin' and 'tansig' in the MATLAB routine newff().
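
    For concreteness, here is a minimal MATLAB sketch (assuming the
    newff(PR,[S1 S2],{TF1,TF2},BTF) calling form of that era) of the
    two 2-2-1 nets being compared; only the output activation differs:

        PR = [-1 1; -1 1];                  % ranges of the two bipolar inputs
        % 2 tansig hidden nodes, 1 output node; trainlm is the default trainer
        net_tanh = newff(PR, [2 1], {'tansig','tansig'},  'trainlm');
        net_lin  = newff(PR, [2 1], {'tansig','purelin'}, 'trainlm');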

    Experienced designers typically use the linear output activation.
    Unfortunately, MATLAB has implemented the more complicated
    tanh as the default in newff. This can be annoying to the
    experienced, but more importantly, it can be confusing for
    the newbie.

    The inputs and outputs were bipolar binary {-1,1} and the hidden
    node activations were 'tansig'. All of the default options were
    used, which led to the Levenberg-Marquardt training algorithm with
    Nguyen-Widrow weight initialization and the stopping criteria

    1. Performance goal: MSE = 0
    2. Minimum MSE gradient: 1e-10
    3. Maximum no. of epochs: 100
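
    As a sketch of how those defaults correspond to the toolbox
    settings (the trainParam field names are my assumption about the
    toolbox version used here, not something stated in the post):

        % net_lin as created in the sketch above
        net_lin.trainParam.goal     = 0;       % 1. stop if MSE reaches 0
        net_lin.trainParam.min_grad = 1e-10;   % 2. stop if MSE gradient <= 1e-10
        net_lin.trainParam.epochs   = 100;     % 3. stop after at most 100 epochs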

    Summary statistics were obtained on

    1. Number of epochs to convergence or stopping
    2. Training time
    3. Mean-Square-Error

    The results below were obtained from 100 trials for each type
    of classifier. As a reference for performance, the naive model
    that always outputs the target mean (0) has an MSE of 1.0.
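
    A minimal sketch of the kind of trial loop that would produce
    these statistics (this is my reconstruction, not Greg's actual
    script; the use of tic/toc and tr.epoch to measure training time
    and epoch counts is an assumption):

        P = [-1 -1 1 1; -1 1 -1 1];        % bipolar XOR inputs
        T = [-1  1 1 -1];                  % bipolar XOR targets
        % naive reference: always output mean(T) = 0, so MSE = mean(T.^2) = 1
        nepochs = zeros(100,1); ttime = zeros(100,1); msefin = zeros(100,1);
        for k = 1:100
            % newff draws fresh Nguyen-Widrow weights on each call
            net = newff([-1 1; -1 1], [2 1], {'tansig','purelin'}, 'trainlm');
            tic; [net, tr] = train(net, P, T); ttime(k) = toc;
            nepochs(k) = tr.epoch(end);    % epochs to convergence or stopping
            msefin(k)  = mean((T - sim(net, P)).^2);
        end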

    In all of the TANH XOR designs the gradient stopping condition
    prevailed. The 56 successful nets had MSE < 1e-11. Of the 44
    unsuccessful nets, 39 stopped at an MSE of 0.5, 1 at 0.67, and
    4 at 1.0. The median MSEs were 3.5e-12 and 0.5 for the
    successful and unsuccessful nets, respectively.

    I didn't check whether the stopping points for the unsuccessful
    nets were local minima or saddle points.

    In the LINEAR XOR designs the gradient condition prevailed for
    the 64 successful nets. These had MSE < 5e-20 with a median of
    1.6e-27! All 36 of the unsuccessful nets were automatically
    stopped at 100 epochs with an MSE of 0.5.

    Results for the successful nets (56 Tanh and 64 Linear)
    are summarized below:

               ------ TANH XOR ------      ----- LINEAR XOR -----
              No. of   Train      MSE     No. of   Train      MSE
              Epochs    Time   (TANH)     Epochs    Time (LINEAR)

    min           11     0.3  2.7e-13          4     0.3  2.8e-32
    med           15     0.4  3.5e-12          6     0.3  1.6e-27
    mean          18     0.4  3.9e-12         13     0.4  7.4e-22
    std            6     0.1  2.0e-12         16     0.1  5.7e-21
    max           37     0.6  9.4e-12         66     0.8  4.6e-20

    Next, the results for the unsuccessful nets (44 Tanh
    and 36 Linear):

               ------ TANH XOR ------      ----- LINEAR XOR -----
              No. of   Train      MSE     No. of   Train      MSE
              Epochs    Time   (TANH)     Epochs    Time (LINEAR)

    min            2     0.3      0.5        100     0.9      0.5
    med           18     0.4      0.5        100     1.0      0.5
    mean          17     0.4      0.6        100     1.0      0.5
    std            5     0.1      0.2          0    0.04     0.01
    max           32     0.6     1.00        100     1.1      0.5

    Finally, the results for the complete sample are
    summarized:

               ------ TANH XOR ------      ----- LINEAR XOR -----
              No. of   Train      MSE     No. of   Train      MSE
              Epochs    Time   (TANH)     Epochs    Time (LINEAR)

    min            2     0.3  2.7e-13          4     0.3  2.8e-32
    med           17     0.4  6.4e-12         12     0.4  5.9e-24
    mean          17     0.4     0.24         44     0.6      0.2
    std            5     0.1     0.29         44     0.3      0.2
    max           37     0.6     1.00        100     1.1      0.5

    Although this is a limited example and proves nothing,
    I recommend using the linear output activation unless
    there is an overwhelming mathematical or physical
    constraint that must be obeyed.

    Only then would I consider tanh or softmax.

    Hope this helps.

    Greg
     
    Greg Heath, Nov 13, 2004
    #1
  2. A. Bulsari

    A. Bulsari Guest

    Respected Dr. Heath,

    1. The "XOR problem" is an ill-defined problem until the actual
    function or the logical expression being approximated is made clear
    (which is almost never the case, and allows kids of all ages to play
    with it forever).

    2. You seem to be catching the infection of experimental mathematics,
    even though you seem to have a fairly deep understanding of
    approximation theory.

    3. My point which you ignored in our previous communication (about
    classification as approximating discontinuous functions) applies
    here also, and you do not seem to see the two as different issues
    with a little overlap. Of course, my point (1) above would resolve
    this issue.

    With kind regards,

    A. Bulsari
     
    A. Bulsari, Nov 17, 2004
    #2
  3. Greg Heath

    Greg Heath Guest

    A. Bulsari wrote in message ...
    Excellent way to start a reply! (Hear that newbies?)
    Sorry. I thought there was only one way to state the "XOR problem"
    using bipolar binary I/O.
    True. Some of us never grow up.

    My post was in response to the numerous questions from newbies
    via e-mail and newsgroup postings. It's probably that time of
    the school year.

    Since retirement, I've become very interested in learning
    to use MATLAB for neural networks and other applications (my
    previous experience was 40 years of FORTRAN, the last 25 years
    of which was coded by software engineer support staff).

    My original target was newbies in c.s-s.m. However, I thought
    cross-posting to c.a.n-n might be useful.

    Sorry if you think otherwise.
    Seem? You should see some of the stuff I haven't posted!

    Also, keep in mind that I am (can I say that now that I've retired?)
    an engineer whose forte was solving physical problems with
    mathematics. Although my interest in semi-formal mathematics is
    probably much greater than that of the typical research engineer,
    computer experimentation has been, and still is, one of my most
    useful tools.

    Now, I'm just having fun playing with MATLAB solutions to
    interesting newsgroup questions in applied math.
    I don't have a clue as to what this means. Please email me
    with some details and we can hash this out between us. We
    can post our conclusions in c.a.n-n whether or not we agree.
    Ditto,

    Greg
    ----------SNIP
     
    Greg Heath, Nov 18, 2004
    #3
  4. Greg Heath

    Greg Heath Guest

    A. Bulsari wrote in message ...
    I didn't ignore it. For some reason, I never saw it.

    I'll read it and respond via c.a.n-n.

    Sorry, but why didn't you email me?

    Greg
     
    Greg Heath, Nov 18, 2004
    #4
  5. A. Bulsari

    A. Bulsari Guest

    As I see it, it is not a _problem_ at all, or it is ill-defined at
    best. From a logician's point of view, there are only four members
    (four points) in the domain of the function (for Boolean logic), or
    the closed square [0,1]^2 for fuzzy logicians (who would also need
    to specify the T-norm and S-norm for the fuzzy operations), for
    which we would want a feed-forward neural network (preferably an
    MLP) to predict the XOR. From a function approximator's point of
    view (and you may say that they have no business being involved in
    this discussion), the domain of the function would have to be
    defined (perhaps extending even to negative values), the XOR of
    non-crisp inputs would have to be defined, and only then can there
    be a question of a neural network imitating or approximating that.

    I haven't written this down very clearly anywhere, nor do I know of
    papers in the literature which do, but the following results should
    be borne in mind by anyone doing classification or logical
    operations with neural networks:

    * Any Boolean logical expression can be implemented (approximated
    with zero error) with one hidden layer and one output layer of
    Heaviside activation functions (step functions, or infinitely steep
    sigmoids). I am not a mathematician and do not usually prove these
    things rigorously, but I can write an outline of the proof if
    someone wants to claim credit for it.

    * All classification problems are "category B" (Boolean outputs)
    problems, and are based on the ability of feed-forward neural networks
    to implement any Boolean logical expression. There is no better way to
    solve classification problems with real numbers as inputs than by
    using feed-forward neural networks with one or two hidden layers and
    with sigmoids also on the output layer. (Politically correct people
    may do some other things in some countries. There are several amusing
    stories about this.)

    The XOR Boolean expression can be implemented with a (2,2,1) MLP. That
    would have 9 weights. Such a network cannot be trained sensibly from 4
    observations. Sorry, I am not used to talking much with the
    back-propagation and cascade correlation people.
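
    To make the "write the weights by hand" point concrete, here is a
    minimal MATLAB sketch of one such hand-built (2,2,1) net with
    Heaviside activations and exactly 9 weights. The particular weight
    values are my own illustration, not taken from Bulsari's post:

        step = @(x) double(x > 0);              % Heaviside activation
        W1 = [1 1; 1 1];   b1 = [-0.5; -1.5];   % hidden 1 = OR, hidden 2 = AND
        W2 = [1 -1];       b2 = -0.5;           % output = OR AND NOT(AND)
        xornet = @(X) step(W2*step(W1*X + repmat(b1,1,size(X,2))) + b2);

        X = [0 0 1 1; 0 1 0 1];                 % the four Boolean input pairs
        disp(xornet(X))                         % prints 0 1 1 0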

    The XOR fuzzy expression with product as the T-norm can be
    approximated with the "productive" networks I wrote about in the early
    1990s. I do not encourage anyone to waste much time on them, but they
    are a result of an interesting observation about being able to detect
    fuzzy relations between truth value inputs and truth value or crisp
    outputs (in other words, one can determine the logical expression by
    learning, somewhat like in inductive learning methods.)

    As a function approximation problem, the XOR expression and the
    function domain need to be specified, and this is never done.
    Still, most kids of all ages play with neural networks as if the
    "XOR problem" were a function approximation problem. This, in my
    opinion, is childish.

    My stand on the matter: there is no such thing as an XOR problem,
    or it is ill-defined at best. Nobody has ever defined it, and if it
    is supposed to mean the Boolean expression, then neural network
    training is not the way to go about it. One can write the weights
    of the network down in ten seconds, as many of you can do.
    You are very welcome to c.a.n-n. also. You are doing a great service
    to people new to neural networks.

    I too am an engineer, used to writing in Fortran and Basic, and my
    interest in a deeper understanding of function approximation theory
    is exclusively to be able to solve real-world nonlinear modelling
    problems in process engineering and materials science as well as
    possible. Feed-forward neural networks turn out to be valuable for
    approximating the unknown nonlinearities. For known nonlinearities,
    or qualitatively known nonlinearities, one can do better. (Which
    reminds me that nobody ever responded to my question of whether
    anyone had bothered to formally classify qualitative
    nonlinearities.)
    I think we can discuss that by e-mail. This c.a.n-n. is probably more suited for
    people of all ages having trouble with back-propagation.

    With kind regards,

    A. Bulsari
     
    A. Bulsari, Nov 20, 2004
    #5