From the Trenches of a Bachelor Thesis Project

Three weeks ago I said I finally had all my simulated (fake) data in the database. So data generation was done. The next step was (and is) getting to grips with all the technologies and libraries involved in the actual machine learning process, particularly since this is supposed to eventually run in the cloud.

A lot of moving parts. Yes, I had all the data, but it was spread over a PostgreSQL database, numerous CSV files, and the internet. So I wrote a Scala application to collect all that stuff and come up with vectors of doubles (floating point numbers–incidentally, 42 of them!) as input for the neural network. Even though database access, Scala futures, and Scala collections were still somewhat unfamiliar, this took just a couple of days to get right. Next I did the basic setup for creating and training a neural network, using the Java-based machine learning library DL4J mentioned earlier. Some trial and error, but again nothing fancy. After a few hours this tiny setup actually did something–it collected the data, produced vectors of doubles, created a neural network, and trained it. Without success to be sure, but it was a first step.

Actually getting it to work was harder. For one thing, as with data generation, the data size far exceeded the memory limits of the Linux virtual machine and the JVM. With over 9 million prices/sales entries I could produce more than 27 million vectors, since each entry covers data for three different fuel categories. Streaming the data input was the obvious solution. Next it turned out the data needed to be shuffled. As it was, I was getting all location features and all prices for one gas station at a time from the database and then feeding them to the network in this same order.
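To illustrate the streaming idea, here is a minimal sketch, not the code from the project. It assumes the vectors can be produced lazily, one at a time; `StreamingBatches`, `vectors`, and `batches` are made-up names, and the counter-based source merely stands in for a streaming database query.

```scala
object StreamingBatches {
  // Hypothetical stand-in for one input vector of 42 doubles.
  type FeatureVector = Array[Double]

  // Lazily produces vectors one at a time. In a real application this
  // would wrap a streaming database query instead of a counter.
  def vectors(total: Int): Iterator[FeatureVector] =
    Iterator.range(0, total).map(i => Array.fill(42)(i.toDouble))

  // Groups the lazy stream into fixed-size batches. Only one batch is
  // held in memory at a time; the last batch may be smaller.
  def batches(source: Iterator[FeatureVector], batchSize: Int): Iterator[Seq[FeatureVector]] =
    source.grouped(batchSize)
}
```

Because both methods return iterators, nothing is computed until the training loop pulls the next batch, which keeps memory use bounded by the batch size rather than by the size of the data set.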

Randomizing the order of the just over 1,000 gas stations was simple, but didn’t help much, because each gas station provided over 26,000 data vectors (365 days times 24 hours times 3 fuel categories). So even if I used a much larger batch size when streaming the data, by the time the network ever saw a single data vector from a second station it was seriously overfitted on the first and started throwing NaN (not a number) values because the new numbers were quite outside its range. No go.

The obvious solution was not going station by station, but taking the actual price/sales entries in random order. Not so easily done. Yes, in theory you can have the database return the rows of a query in random order. With 9 million rows, however, this takes more system resources than most personal computers have. Shuffling the query result in the application instead was out for the same reason. My practicum partner had the great idea to instead shuffle an array of integer values the size of the data set, then fetch the entries by their numeric ID. No go either. On my machine, Scala would not shuffle a collection with 9 million values. It did shuffle 4.5 million values. I ended up just drawing random integer IDs from a range the size of the database table, at the slight risk of duplication. This worked. And the networks, with randomized data, now actually produced reasonable values instead of NaN.
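Drawing random IDs instead of shuffling could look roughly like the sketch below. It assumes the table's IDs are contiguous and run from 1 to the table size; `RandomIds` and `draw` are hypothetical names, not taken from the project.

```scala
import scala.util.Random

object RandomIds {
  // Draws `count` row IDs uniformly from 1..tableSize, with replacement,
  // so the same ID can occasionally come up more than once.
  // Assumes IDs are contiguous and start at 1.
  def draw(tableSize: Int, count: Int, rng: Random = new Random()): Iterator[Int] =
    Iterator.fill(count)(rng.nextInt(tableSize) + 1)
}
```

Unlike a shuffle, this never needs to hold all 9 million IDs in memory. The price is that some rows may be drawn twice and others not at all.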

However, now data generation was taking forever. Small wonder, since before I had calculated more than 26,000 vectors per station as a result of just 11 database queries (1 for the sales data of the station itself and 10 for its 10 nearest competitors) whereas now I was doing the same number of queries for every single vector! Clearly this was not the solution either. I struggled with this for a while, trying to find a compromise. I finally settled on again going gas station by gas station, but randomly sampling only a limited number of the 8760 price/sales entries for each station. This rationalized data generation but reduced the sample size.
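The compromise described above, random station order plus a limited random sample per station, might be sketched like this. All names (`StationSampling`, `sampleHours`, `plan`) are invented for illustration, and the sketch assumes each station's entries can be addressed by an hour index from 0 to 8759.

```scala
import scala.util.Random

object StationSampling {
  val HoursPerYear: Int = 365 * 24 // 8760 price/sales entries per station

  // Picks `sampleSize` distinct hour indices for one station. Shuffling
  // 8760 values is cheap, unlike shuffling the whole 9-million-row table.
  def sampleHours(sampleSize: Int, rng: Random): Vector[Int] =
    rng.shuffle((0 until HoursPerYear).toVector).take(sampleSize)

  // Visits the stations in random order and pairs each station ID
  // with its own random sample of hours.
  def plan(stationIds: Seq[Int], sampleSize: Int, rng: Random): Seq[(Int, Vector[Int])] =
    rng.shuffle(stationIds).map(id => id -> sampleHours(sampleSize, rng))
}
```

The database is still queried once per station rather than once per vector, and only the sampled hours are turned into training vectors.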

I trained a few networks for a day or so, visualizing and debugging the training process (DL4J, while at times quite buggy, comes with a very nice GUI). I played with layout, parameters (plenty of variables go into designing a neural network), data input normalization and so on, and tried to understand the “score” output by the evaluation function. You see, my previous networks did classification. This here is regression–predicting a continuous value, the sales volume for a given fuel price under given environmental parameters–and I am still not sure what these figures mean, except apparently lower is better.
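For intuition on why lower is better: in regression the score is typically the value of the loss function. The sketch below assumes a mean squared error loss, which may not be the one configured in the project; `RegressionScore` and `mse` are illustrative names.

```scala
object RegressionScore {
  // Mean squared error: the average squared difference between predicted
  // and actual values. 0.0 is a perfect fit; larger values are worse.
  def mse(predicted: Seq[Double], actual: Seq[Double]): Double = {
    require(predicted.size == actual.size && predicted.nonEmpty, "need equal, non-empty inputs")
    predicted.zip(actual).map { case (p, a) => (p - a) * (p - a) }.sum / predicted.size
  }
}
```

For example, predictions of 1.0 and 2.0 against actual values of 1.0 and 4.0 give squared errors of 0 and 4, so the score is 2.0. Note that the scale of such a score depends on how the target values are normalized.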

It was fascinating and easy to get lost in. But after a day and a half I called myself to order. After all, actually training and optimizing networks is not the aim of the project. It’s proving a concept. Efficient sales prediction by a neural network trained in the cloud.

So I returned to once more banging my head against the particular wall of this project, the one I had already once turned away from in despair a couple of months ago: How to get my data into the cluster so I can train my neural networks there. Evidently since there is no way of manually putting a huge database in the cluster, I had to recreate it there: Install the database, load the fuel prices dump, and run the data generator application once more. In the cluster. I thought.

What followed in the next 10 days or so was a veritable (and lonely) trial-and-error crash course in cloud-deploying a database and a Scala application and connecting the two. For a couple of days there I was ready to jump out of a window because it all made so little sense. Even though I had pushed aside the questions of parallelizing neural network training, using a Spark cluster, and all that, I still had to contend with Helm (a Kubernetes package manager that uses an incomprehensible multitude of .yaml files to configure deployments), Docker (containers), SBT assembly (building a Scala JAR–a runnable program) and of course Kubernetes itself.

All of this involved many hours of debugging problems I had never given any thought to, like how my Scala program would find the data files at runtime when the JAR has a completely different folder structure than the original Scala project. Not to mention how to access the PostgreSQL database that Helm thankfully created in my cluster without much fuss and even took care of its persistence (which is not at all automatic in a stateless container framework), and to actually load the fuel price data dump into it.
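One common way around the folder-structure problem is to read data files from the classpath rather than from the file system. The following is a sketch under that assumption, not necessarily what was done in the project; `ResourceFiles`, `readLines`, and the example path are hypothetical.

```scala
import scala.io.Source

object ResourceFiles {
  // Reads a text file from the classpath. With sbt's default layout, files
  // under src/main/resources are packaged at the root of the JAR, so the
  // same resource path works in the project and in the assembled JAR.
  // Returns None if the resource does not exist.
  def readLines(resourcePath: String): Option[List[String]] =
    Option(getClass.getResourceAsStream(resourcePath)).map { stream =>
      val source = Source.fromInputStream(stream, "UTF-8")
      try source.getLines().toList finally source.close()
    }
}
```

A call such as `ResourceFiles.readLines("/data/stations.csv")` would then look for `src/main/resources/data/stations.csv` during development and for `data/stations.csv` inside the JAR at runtime.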

As an illustration, this last step alone involved logging in to the actual database pod with Kubernetes’ command line control interface, finding out which Linux distribution ran on that pod, which package manager it used, updating that package manager, using it to install the wget tool, using wget to download the data dump, loading it into the database, creating a new user in the database, and giving it the necessary privileges.

And then I still didn’t know how to actually connect to that database from another pod in the cluster. Or from outside the cluster. Yes, eventually I figured this all out. But it took a lot of trial and error. And desperation.

At one point I wondered if I wasn’t overdoing it. Yes, the data needed to be in the database in the cluster. But did that mean it had to be generated in the cluster? Or could I generate it locally and write the data to the cluster database? After all, this whole data generation business was not central to the project, so any makeshift solution was good enough.

In theory. In practice at this point, when I had all the setup complete and just needed to generate the data once more and write it to the cluster database, I found I was basically nowhere. Data generation, which had already been a stretch when doing it locally, i.e. when writing to a local database, totally crashed when writing to the cluster database. I started rewriting the whole logic, breaking it down into smaller pieces, streaming data whenever possible. When I first toyed with database access in Scala using Slick I had been so scared by this whole asynchronous stuff (futures and all) that I limited myself to blocking while reading basically the entire database into memory, then going synchronously and sequentially from there. This was of course a recipe for running out of system resources.

Now I went the opposite way, step by step basically streaming and parallelizing the entire data generation process. At the same time, I learned how to deploy it to the cluster, so it could generate away day and night while I could do other stuff on my local machine. Isn’t that what a cluster is for?

I finally got it to work, both of it. And now reading the data from the database and doing the calculations was fast, I mean like really fast. Only a couple of hours. All that remained was writing the calculated sales figures to the database. I let the program run and went home. Just out of curiosity I logged into the cluster after dinner to see if it was done. And was shocked. At this time, after five hours, the entire program, running 16 threads in parallel, had written about 3,000 entries. Of an expected 9 million. I did a quick calculation and found that at this speed the data generation would take two months. Provided the pod did not fail in between and lose all the calculated data.

What had gone wrong?

I retraced my steps the next day and found that in breaking the logic down into tiny steps I had created a program that did two complete database transactions for every single sales figures entry–one for finding the row to update and one for updating it. So, 18 million transactions for writing the entire sales history.
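The usual remedy for this kind of problem is to update rows directly by key and to send many updates per transaction. The sketch below uses plain JDBC rather than Slick, purely for illustration; the table name `prices`, the columns `sales` and `id`, and all Scala names are placeholders, not the project's schema.

```scala
import java.sql.Connection

object BatchedSalesWriter {
  // Hypothetical row shape: primary key of the entry plus its computed sales figure.
  final case class SalesUpdate(entryId: Long, sales: Double)

  // Writes all updates through one prepared statement, sending them in
  // batches and committing once per batch instead of once per row.
  // Leaves auto-commit switched off on the connection.
  // Returns the number of batches (and therefore commits) executed.
  def write(conn: Connection, updates: Iterator[SalesUpdate], batchSize: Int): Int = {
    conn.setAutoCommit(false)
    val stmt = conn.prepareStatement("UPDATE prices SET sales = ? WHERE id = ?")
    try {
      var batches = 0
      updates.grouped(batchSize).foreach { batch =>
        batch.foreach { u =>
          stmt.setDouble(1, u.sales)
          stmt.setLong(2, u.entryId)
          stmt.addBatch()
        }
        stmt.executeBatch()
        conn.commit()
        batches += 1
      }
      batches
    } finally stmt.close()
  }

  // Number of commits needed for `rows` rows at a given batch size.
  def commits(rows: Long, batchSize: Int): Long = (rows + batchSize - 1) / batchSize
}
```

With a batch size of 1,000, for instance, 9 million rows would need 9,000 commits instead of 18 million separate transactions.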

You’ll laugh. Do. But that’s my illustration of how I am learning all this stuff on the go. Trial and error. Painful experiences. Being rather ashamed of how stupid I really am. All the time.

But bottom line is, after a lot of this (and I’m not even mentioning the dozens of stupid little bugs and errors and misconceptions that at times took a couple of hours to find and iron out) I finally had a logic that calculated and wrote my 9 million price and sales entries to the database in just over 2 hours. In fact, now writing the simulated sales history for my 1,000 or so gas stations took 3 minutes. Instead of a projected 2 months. How is that for speed-up?

So, two weeks after I had said I had all my generated data complete (locally that is), I really did have it available where I really need it–in the database in the cluster that is. But the upside is, I have learned a lot about building and deploying a Scala project to the cloud, about databases in the cloud, about interaction with pods and between pods, and so on. Which will stand me in good stead in the next step: Actually, finally deploying my neural network training to the cluster. Which is what I have been working on this past couple of days or so.

Containerizing the simple neural network training application written earlier and deploying it to the cluster surprisingly was (again) a bit more involved than I had expected. The various DL4J libraries used created a conflict between transitive dependencies that had to be resolved before SBT would build a JAR. I can’t say I understood what was going on there, and how to actually solve it properly, but again a lot of playing around with workarounds collected from internet forums made the error go away. Now I could build the JAR, but the application once started would fail with utterly cryptic error messages that when all was said and done resolved to this: Alpine Linux, a popular choice for slim Docker images to run Java on, does not use some standard Linux C libraries DL4J depends on. I tried to install those libraries and ended up with a segmentation fault error, at which point I gave up and used a standard Debian Linux image, which however took a full hour to pull. I can see why people find Alpine attractive as long as they don’t need those libraries.

But finally it worked. Now my thinking revolves around the next problems, such as: There has to be a way for the application to not just train one network hardcoded in the source code, but instead different networks, the parameters for which are somehow passed to it from outside. And a way to save those networks somewhere, with their parameters and their performance, so one can access and use them. Or rather, there need to be several instances of such an application, so several networks can be trained at the same time. After all, that’s what a cloud is for! So I am thinking about a Redis publish/subscribe, or let’s say queue, system, in a setup very similar to the one we used in the genetic algorithms for neural networks study project in the fifth term. And then save the trained networks in a document database, like MongoDB? This is the stuff I’m working on right now.
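One simple way to pass network parameters in from outside, short of a full queue system, is through environment variables set on the pod. The sketch below assumes that approach; the parameter names, the variable names (`HIDDEN_LAYERS`, `LEARNING_RATE`, `BATCH_SIZE`), and the default values are all made up for illustration.

```scala
object TrainingConfig {
  // Hypothetical set of parameters describing one network to train.
  final case class NetConfig(hiddenLayers: Int, learningRate: Double, batchSize: Int)

  // Fallback values used when a variable is not set (arbitrary examples).
  val defaults: NetConfig = NetConfig(hiddenLayers = 2, learningRate = 0.01, batchSize = 128)

  // Builds a configuration from a key/value map such as sys.env.
  // Throws NumberFormatException if a value is present but not numeric.
  def fromEnv(env: Map[String, String]): NetConfig = NetConfig(
    hiddenLayers = env.get("HIDDEN_LAYERS").map(_.toInt).getOrElse(defaults.hiddenLayers),
    learningRate = env.get("LEARNING_RATE").map(_.toDouble).getOrElse(defaults.learningRate),
    batchSize = env.get("BATCH_SIZE").map(_.toInt).getOrElse(defaults.batchSize)
  )
}
```

In the application this would be called as `TrainingConfig.fromEnv(sys.env)`. Taking a plain `Map` as input also means the same function could later be fed from a message pulled off a queue instead of from the environment.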

Should all this work (and if I don’t run into unexpected problems–but then one always does, doesn’t one?) I might have the basic setup ready in a couple of weeks. Then there would still be some time left to address the bonus question of whether using distributed training on a Spark cluster will somehow speed up the training. And basically then I’d also be done with the programming part of the bachelor thesis project. With a couple of months left to write the thing.

Because as I said I must not lose sight of the fact that I’m going through all this just so I have something I can write about. And even considering that not all of this trial and error tragedy is fit to be described in an academic thesis at full length (and be it only so I can avoid looking like a complete idiot!), I have by now made enough architectural decisions that can be presented in text and diagrams and experienced enough interesting conceptual and technological tradeoffs that it shouldn’t be a problem to fill another 20 to 30 pages. On top of the 20 pages I’ve already written on the data generation process, and then maybe another 20 I can write, by way of an introduction, on the fuel market, algorithmic retail pricing, machine learning, neural networks, plus a few words on actually training the networks, and then in closing some ideas on how this could be adapted to real data, deployed for production (actually predicting something), or expanded to cover more different products so it could be used in a general store rather than a gas station, I’ll have plenty enough to fill a thesis.

In fact, just to take the pressure off, I have begun to sometimes tell myself this: Whether this cloud deployment stuff works or not (because let’s face it, nobody will ever look at my code–I could claim whatever I want), I can start writing the thesis any time, and be probably done three weeks or so later and have a good chance at a good grade. From what I hear people have gotten an A for a lot less. Whenever this technology stuff seems totally overwhelming, I consider that and instantly feel better.

