As of this morning, I have sunk 156 hours and 30 minutes into my term project for the compulsory elective module on adaptive systems, the aforementioned maze. That’s nearly 30 per cent of the entire time I have invested this term outside lectures, and — with four weeks to go! — already way beyond the notoriously unrealistic official workload for a single course. In fact, when I talked with the professor last week, he said that, for better or worse, I had done enough. Obviously I had learned a lot about reinforcement learning and neural networks, and if I simply summarized my experiences well in the presentation and paper required in lieu of an examination, I could be certain of a good grade.
That’s probably true. Yet I feel defeated nevertheless. You see, so far in this study program I have always found that, even when I felt desperately unable to make headway with a problem, it was only about being stubborn enough. Where many of my peers gave up, I simply continued banging my head against the wall, and finally, every time, the wall would yield. But the walls of this maze are solid as rock, and in spite of reading tons of stuff on reinforcement learning with neural networks, trying a dozen different techniques, playing with hyperparameters for several weeks, and debugging my code line by line, again and again, I’m simply not getting anywhere. My bloody networks obstinately refuse to learn.
Yes, this was a lot of trial and error. Yes, I have committed countless small mistakes, as in any programming project, and some big ones, particularly since I often changed the setup entirely to try a new idea. For instance, I had originally encoded the game board with a single network input value for each of the four edges of the board tiles, so as to indicate where a connection existed and where it did not. I later changed that to a single binary encoding for each tile, converted to a decimal number (e.g. a tile might have connections up, right, and down, but not left, encoded clockwise as 1110 binary, and then converted to 14 decimal), to simplify the input. Unfortunately, in doing so I forgot to encode the extra tile that is available for being pushed into the maze by the next player. So for a long time my network simply had no knowledge of the tile it had to push. Hard to learn anything that way. Even worse, a stupid comparison error in a helper method used to find the position of the next treasure made that method return true for every tile checked. So the treasure was always encoded as being on the first tile encountered, the one in the upper-left corner of the board. Again, hard to find something when you have no idea where it is.
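That clockwise encoding can be sketched in a few lines (the function name is made up for illustration; my actual code looks different):

```python
def encode_tile(up, right, down, left):
    """Encode a tile's open connections as a single number.

    The four bits are read clockwise starting at the top: up, right,
    down, left. A tile open on up, right, and down (but not left) is
    1110 in binary, i.e. 14 in decimal.
    """
    return (up << 3) | (right << 2) | (down << 1) | left

# The example from the text: connections up, right, and down, but not left.
print(encode_tile(1, 1, 1, 0))  # 14
```

One such number per tile — plus, crucially, one for the extra tile waiting to be pushed in — is all the network gets to see of the board.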
But even after I had ironed out these mistakes, the networks still refused to learn. I mean, learn anything at all. I had expected this to be hard, but hard in a recognizable way: painfully slow learning, perhaps, but with at least some evidence that something was being learned. That was simply never the case. I could run my networks for days on end and the results never changed. I went through all the established ways of improving the (non-existent) learning performance. The classic Q learning algorithm learns from the results of every single turn, which is generally acknowledged to be less than optimal for a sequential game: consecutive turns are strongly correlated, so the network can easily be caught in so-called local minima, i.e. learn only those things that follow from an earlier, in itself suboptimal, decision. The standard remedy is called experience replay: the results are stored in memory, and the network then learns from random (non-sequential) samples drawn at intervals from that memory. Experience replay itself can be improved further by different methods of selecting primarily, or only, the more relevant experiences, i.e. those that resulted in particularly valuable rewards or updates, either by replaying the last few results whenever a move produced a reward (enhanced experience replay), or by ranking the results according to the amount learned from each of them (prioritized experience replay). I tried all that, but vexingly, with experience replay my networks did even worse than with what I called “live” updates (learning once per turn from the results of the previous turn).
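Stripped of all refinements, a plain experience replay buffer amounts to something like this (a sketch, not my actual implementation; class and parameter names are made up):

```python
import random
from collections import deque

class ReplayMemory:
    """Plain experience replay: store turn results, sample random minibatches."""

    def __init__(self, capacity=10_000):
        # A bounded buffer: once full, the oldest experiences drop out.
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Random, non-sequential draws break up the correlation between
        # consecutive turns that plain per-turn updates suffer from.
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

memory = ReplayMemory()
for t in range(100):
    memory.store(t, 0, -1.0, t + 1, False)  # dummy transitions
batch = memory.sample(32)
print(len(batch))  # 32
```

The enhanced and prioritized variants only change the `sample` step — replaying recent rewarded transitions, or weighting the draw by how much was learned from each transition — not the storage itself.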
What really drove me nuts (and in fact still does) is that the networks are far from totally inert. Quite the contrary: they are very sensitive to changes in the myriad hyperparameters with which you can configure either them or the game. I am using the Keras library for building my networks, and Keras makes it very simple to fine-tune the network architecture. With a few keywords you can change the type and number of layers, the number of nodes in each layer, their initialization, their activation function, the loss function and optimizer for the output; you can apply regularizers, adjust parameters for all those functions, and so on. Apparently even for experts this is all a lot of trial and error, with only a few, generally quite vague, guidelines.
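To give an impression of how compactly those knobs are exposed, here is a configuration sketch with entirely made-up sizes and values (not my actual architecture), assuming the Keras Sequential API:

```python
from tensorflow import keras

# All of these — layer count and width, activations, initializer,
# optimizer, learning rate, loss — are knobs of the kind described above.
model = keras.Sequential([
    keras.Input(shape=(50,)),                     # one value per encoded board feature
    keras.layers.Dense(64, activation="relu",
                       kernel_initializer="he_uniform"),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(12, activation="linear"),  # one Q value per possible move
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
              loss="mse")
```

Changing any one keyword here gives you a measurably different network — which is exactly why the parameter space feels endless.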
Then of course you can play with your input, for instance normalize or standardize it, so that all values fall within a certain range. You can adjust the calculation of the error you provide as feedback to your network, i.e. how you come up with the numbers that tell it how to adjust its predictions. Change the learning rate, and the network will either overwrite what it has memorized whenever it is given new information, or conversely hardly change at all. Finally, the structure of the game rewards that feed into that calculation has a huge influence. In the purist’s version I give a large positive reward for finding a treasure and a small negative reward for each turn used, so as to make the networks aware of time pressure. But it also works with only the one or only the other, and in any case there is no knowing the appropriate relation between the two: I have tried everything between 1 to −1 and 20,000 to −1. And it is possible to provide hints to the network, such as a bonus for having a long contiguous path in the maze, or a path to the treasure, or for the proximity of the player’s piece to the treasure, and so on.
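For the record, in standard Q learning the feedback number in question is the so-called Bellman target; a minimal sketch, where the discount factor and the reward values are illustrative, not my actual settings:

```python
def q_target(reward, next_q_values, done, gamma=0.95):
    """Standard Q-learning target: r + gamma * max_a' Q(s', a').

    The network's error for the chosen action is target minus prediction.
    Both gamma (the discount) and the reward scale are knobs of the kind
    described above.
    """
    if done:  # terminal state: nothing left to discount
        return reward
    return reward + gamma * max(next_q_values)

# Finding a treasure (+20) vs. an ordinary move (-1), illustrative numbers:
print(q_target(20.0, [0.0], done=True))              # 20.0
print(q_target(-1.0, [3.0, 5.0, 1.0], done=False))   # -1.0 + 0.95 * 5.0 = 3.75
```

Rescale the rewards — 1 to −1 versus 20,000 to −1 — and every one of these targets changes with them, which is why the reward structure matters so much.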
In short, there are several dozen parameters in the model and in the networks that you can configure independently and whose interaction has a huge impact on the results. But the vexing thing is that in my case none of them seems to have any obvious effect on learning. What they do affect, strongly–and I have tried and tested this literally for weeks–is how long the networks take to find a treasure right from the start. In the very first tries, the networks would take about 30,000 to 50,000 moves to find all 24 treasures in a game. With a bad configuration, this could get as bad as half a million moves. Over many weeks and with constant experimentation I have found several configurations that get this number down to 1,000 to 2,000 moves. But again, this happens from game one. I can run the networks for 100,000 games, and the number of moves per game will stubbornly stay in the same range, oscillating but never going anywhere. I have records, logs, and graphs to prove it.
Of course this makes no sense at all. It is simply impossible to square this finding with what we know about the way neural networks, or reinforcement learning for that matter, work. I talked with the professor, in fact at length, and he was equally baffled. We could both have accepted the networks refusing to learn at all; but then how could the hyperparameters of the networks (the settings outlined above) affect their performance? The professor suggested I simplify the game, i.e. go from the original 7×7 tiles to a smaller board. This I did, first with 5×5 and finally with a rather trivial 3×3 tiles. Now the networks solved the game a lot faster, of course–the 3×3 version can be done in 10 to 20 moves–but still there was no evidence of learning. And besides, there remains the fact that my 8-year-old daughter can find all 24 treasures in the original 7×7 board game in less than a hundred moves. The 1,000 moves my networks need at the best of times are simply ridiculous in comparison.
There is no denying it–since Q learning and neural networks, and even the combination of both, evidently work (they play chess at grandmaster level, you know!), there simply has to be a fundamental error in my model. Somewhere. Still, I have already analysed and debugged this nearly to death, and I am actually quite confident that my implementation is correct. The game works. The update rule for the learning mechanism works. The networks work. And yet, nothing is ever learned.
In a desperate attempt to find out what was going on, I turned on the crude visualization I had created weeks ago when programming the game itself and watched my network play on the ridiculously small 3×3 board. Sure enough, this setup is trivial–there is an open path from the player piece to the treasure perhaps 80% of the time, and all the network needs to do is move the piece there and collect. Yet with about one treasure every game–there are six treasures in the small version–it obstinately avoided doing just that. Instead it ran past the treasure again and again, until the maze moved and the path was no longer open, and needed several hundred moves where it should have needed but one or two. After watching this for a couple of hours, it occurred to me that the moves repeated–both those for sliding the tiles and those for moving the player piece. The networks went into an infinite loop and didn’t manage to break out of it for hundreds or thousands of moves.
And that, finally, gave me the hint I needed to find one last, but crucial, error in my code. You see, in reinforcement learning there is a thing called the exploitation/exploration dilemma. With each move, the agent (the network) has to decide whether it wants to cash in on the experience it has already gained, i.e. choose the move it knows to be best so far–exploitation–or instead randomly select a different move so as to possibly learn something new–exploration–with the chance to improve its strategy and thus earn better rewards later. That choice, in any implementation, is governed by a parameter usually called epsilon, which should start rather high, so that the agent initially explores a lot, and then be steadily reduced, so that the agent mostly cashes in on its experience but never quite stops exploring. Now, due to a stupid coding error I had made, my epsilon was reduced to next to nothing within the first 4 or 5 games. So for the rest of the 100, or 1,000, or 10,000 games I ran the networks while trying to train them, they were virtually unable to learn anything new. They no longer explored! That is why they were never able to break out of their infinite loops–they simply were not given the chance to do anything but repeat what they had learned in the first handful of games.
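The mechanism can be sketched in a few lines (function names and decay values are illustrative, not my actual code). The second half shows one way a bug of the kind I made can kill exploration within a handful of games, e.g. by applying the decay once per move instead of once per game:

```python
import random

def choose_action(q_values, epsilon):
    """Epsilon-greedy: with probability epsilon explore (random move),
    otherwise exploit (the move with the highest predicted Q value)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

def decayed(epsilon, steps, decay=0.995, floor=0.01):
    """Multiplicative epsilon decay with a lower floor, applied `steps` times."""
    for _ in range(steps):
        epsilon = max(floor, epsilon * decay)
    return epsilon

# Decaying once per *game*: after 100 games there is still plenty of
# exploration left (epsilon is about 0.61).
print(round(decayed(1.0, 100), 3))
# Decaying once per *move* by mistake: at roughly 1,000 moves per game,
# 5 games mean 5,000 decay steps, and epsilon sits at the floor for good.
print(round(decayed(1.0, 5000), 3))
```

With epsilon pinned at the floor, the agent almost always takes the greedy move — and a deterministic greedy policy in a looping state sequence is exactly the infinite loop I was watching.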
That was quite a find, because it explained why the networks were unable to improve–and even why, quite vexingly, the best games had sometimes been the very first ones, after which the networks settled on a more average performance depending on the hyperparameter settings. For a moment, I was quite enthusiastic. Would my networks finally learn?
Unfortunately, it doesn’t really look that way. As before, they do fine while the epsilon (the exploration rate) is high, and they now do fine for longer, because epsilon stays high for longer. But once the exploration rate goes down, the networks start doing worse again. So I am left with the annoying conclusion that random exploration yields better performance than anything my networks were, or are, able to “learn.”
And now, unfortunately, stark reality interferes. Today is the 14th of June, and the presentation is on the 14th of July. I need to hand in the paper three days before that date, just four weeks from now. Besides, I have to remind myself that this is just one course of four, albeit by far the most interesting one; there will be three other exams before that presentation, and I urgently have to start preparing for them. So basically I am stuck with what I have, because I am out of time. As the professor said, I had better cut my losses and write down what I have found so far. Even though it means swallowing the bitter disappointment that, in spite of all the time and effort I have invested, this is the first wall in this course of studies that refused to yield. Regardless of how long and how hard I banged my head against it.