The Die Is Cast

The sixth term is in full swing and I am still determined to enjoy it. After all, it’s my last.

Both my courses are in fact fun. I find the IT security lecture reasonably interesting (though most of my peers consider it boring), and in any case it’s well presented. I particularly like the crypto stuff we’ve been doing for the past couple of weeks. And the practicum assignments are easily doable within a few hours. The first consisted of a number of tiny exercises exploring sniffing, password cracking, cross-site scripting and SQL injection. It took us about four hours, and we would have been a lot faster still had we known that the minuscule in-memory SQLite database provided by the professor needed write permissions on the folder it was in.

The second assignment was of the kind so often encountered in the technical CS courses: writing small Java programs with just a single functionality. In this case, write a linear congruential pseudo-random number generator (basically a two-liner and a few constants) and use it to implement a stream cipher that works the same way for both encryption and decryption (the power of XOR!). Then try to do the same using Java’s SecureRandom class (not possible, because that random number generator is designed to be cryptographically strong, so its output can’t be reproduced the way decryption requires). Finally, implement a cipher feedback block cipher using 3DES, i.e. the infamous DES encryption (graciously provided by the professor) applied three times in a row. Combined, this was about 8 hours of work, and it really drove the contents of the crypto lecture home. I do like that sort of thing.
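For illustration, here’s a minimal sketch of the LCG-plus-XOR idea. The constants are the classic Numerical Recipes values, not necessarily the ones from the assignment, and of course an LCG keystream is cryptographically weak — which is precisely the lecture’s point:

```java
// Toy stream cipher built on a linear congruential generator (LCG).
// Because XOR is its own inverse, the same apply() method both
// encrypts and decrypts, given the same seed ("key").
// NOTE: an LCG is predictable; this is a teaching toy, not real crypto.
public class LcgStreamCipher {
    // Illustrative LCG constants (the "Numerical Recipes" values),
    // not necessarily those used in the course.
    private static final long A = 1664525L;
    private static final long C = 1013904223L;
    private static final long M = 1L << 32;

    private long state;

    public LcgStreamCipher(long seed) {
        this.state = seed;
    }

    // The "two-liner": advance the LCG and take the high byte,
    // since the high-order bits of an LCG are the more random ones.
    private byte nextKeyByte() {
        state = (A * state + C) % M;
        return (byte) (state >>> 24);
    }

    // XOR each input byte with the keystream; running this again
    // with the same seed restores the original input.
    public byte[] apply(byte[] input) {
        byte[] output = new byte[input.length];
        for (int i = 0; i < input.length; i++) {
            output[i] = (byte) (input[i] ^ nextKeyByte());
        }
        return output;
    }

    public static void main(String[] args) {
        byte[] plaintext = "attack at dawn".getBytes();
        byte[] ciphertext = new LcgStreamCipher(42L).apply(plaintext);
        byte[] decrypted  = new LcgStreamCipher(42L).apply(ciphertext);
        System.out.println(new String(decrypted)); // prints: attack at dawn
    }
}
```

This also makes the SecureRandom part of the exercise obvious: decryption depends on regenerating the exact same keystream from a shared seed, which an LCG does trivially.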

Today we’re trying to submit that assignment for the professor’s approval ahead of time. I’ll be on a short trip with my wife at the time of our actual practicum meeting and I’d rather have it out of the way. We’ll see.

The other course, the “compulsory choice” module “Certified Tester”, started out rather slow. In fact, even though this is the sixth week of term, the lecture has taken place only once, presenting just the basics of testing, well known to most of us after five terms of programming, software engineering, architecture and so on. Afterwards the professor got violently sick and the second bi-weekly lecture was cancelled. Tomorrow is the third (so, actually the second).

The practicum took place every time, however (supervised by the assistant), and it’s been fun. Above all, it’s designed so that it’s possible (in fact, easy) to complete the assignments during the meeting itself (something frowned upon by the CS professors, who sometimes go so far as to advise us not to show up without a completed solution). In the first meeting we explored a small Tic-Tac-Toe Java program purposely riddled with bugs. We had to find as many as possible without access to the source code. In the next meeting we were given the code and had to identify and fix the errors. I think we all enjoyed that. Kind of like a puzzle. Only the connection to the subject of testing seemed a bit remote, but never mind.

So both practica are totally manageable up to now. A far cry from the murderous monster assignments in distributed systems and architecture last term. Granted, I was wrong in saying in my last post that I’d hardly be here anymore, particularly since pair programming works well with my present practicum partner; we were here a full day last Thursday doing the encryption stuff. But I do have plenty of time for my bachelor thesis on fuel retail price optimization.

When I last wrote three weeks ago I said I’d probably have my simulated fuel sales data generated within a few more days. That was too optimistic. Yes, I had all the basic components available shortly after, even though the logic for calculating the effect of nearby competing gas stations on the sales volume was more involved than I had thought. In fact, writing my results to the database amazingly worked almost on the first try, even though Scala, Slick, and PostgreSQL were all rather new to me.

But the final step of actually wiring all the components together was still full of surprises. In fact, once everything worked in principle, it took me another couple of days to debug the process when it ran on larger samples. Errors cropped up in some parts of the data that had not appeared in others. For instance, I had not considered that a gas station might not have existed during parts of my time window. I had not realized that some stations might not sell all fuel categories (in fact some rural stations only have diesel). And so on. Each of these errors involved painfully reworking some of the script’s involved sequential logic.

Finally I had it all fixed, I thought. A few days ago I first tried to run the data generation process on the whole sample selected for the project: over 1,000 Northern German gas stations and their entire price history for the year 2015 (that’s several tens of millions of entries). The first two or three runs threw exceptions quickly, revealing yet more minor bugs. But then it finally seemed to work. For an hour (!) I sat there holding my breath, reading the debug log messages of my script. It didn’t crash on reading the data. It didn’t crash on computing the average hourly prices for all those stations and all those days in the year. It didn’t crash on calculating 9,000,000 simulated sales entries.

It crashed when writing the computed results to the database. Out of heap space.

That was a bitter disappointment, but I quickly recovered and decided to just do that computation in batches for individual post code areas. That solved the problem, at the expense of some accuracy because now a station just across a post code border might not be identified as a competitor of a station inside that border. But it’s fake data anyway, so why worry? I am probably already overdoing this.
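Sketched in Java for illustration (my actual pipeline is in Scala, and the station type and the generation step here are hypothetical stand-ins), the batching idea is simply:

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical sketch of the batching fix: instead of generating sales
// for all stations at once (and running out of heap), group stations by
// post code area and process one area per iteration, so only one area's
// working set is in memory at a time.
public class BatchByPostcode {
    record Station(String id, String postcode) {}

    // Stand-in for the real per-batch step, which reads the price history,
    // simulates sales and writes the results to PostgreSQL.
    static int processBatch(List<Station> batch) {
        return batch.size(); // placeholder for "rows written"
    }

    public static void main(String[] args) {
        List<Station> stations = List.of(
                new Station("a", "24103"), new Station("b", "24103"),
                new Station("c", "20095"));

        // One batch per post code area keeps the memory footprint bounded.
        Map<String, List<Station>> byArea = stations.stream()
                .collect(Collectors.groupingBy(Station::postcode));

        int total = 0;
        for (var entry : byArea.entrySet()) {
            total += processBatch(entry.getValue());
        }
        System.out.println(total); // prints: 3
    }
}
```

The accuracy cost mentioned above lives exactly in this grouping: a competitor just across a post code border lands in a different batch and is never seen by the current one.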

In fact, I took a few days off from programming last week to instead start writing up my thoughts and experiences on this step, as a first (in the end, more likely, second, but you have to start somewhere) chapter for the thesis. My thinking was I’d better do this while the memory is fresh, plus I’d get the setup figured out. In fact I fought for a couple of days with the LaTeX template provided by the department, which is so old (2009!) as to be practically useless, and then with a newer, yet unofficial template that’s a lot better but inexplicably insists on using the obsolete natbib package for bibliography management rather than the now-standard BibLaTeX. That was frustrating for a while, but finally I got it all figured out and started writing.
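For anyone wrestling with the same template problem: the switch is, in principle, just a preamble change. A minimal sketch, assuming the template doesn’t load natbib behind your back (the `thesis.bib` filename is a placeholder):

```latex
% Minimal BibLaTeX setup with the biber backend. Any \usepackage{natbib}
% and \bibliographystyle{...} lines in the template have to go first.
\usepackage[backend=biber, style=numeric]{biblatex}
\addbibresource{thesis.bib}

% ... document body, with \cite{...} as usual ...

\printbibliography
```

In practice the fight is usually with class options or packages the template loads internally, which is exactly where mine went wrong for a couple of days.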

And writing, after 25 years in the humanities, is of course second nature to me. It still took me a few hours to get into the mood and familiarize myself with the language of data generation and the gasoline market, but then I was back in the flow. The official template has generous margins and line spacing, and once I had thrown in an ample supply of tables and illustrations, the chapter was nearly 20 pages long. When the recommended target size for the entire thesis is 60 to 80 pages. Oops.

Well, as I said I’m probably overdoing it. On the other hand, the next step is getting the technology stack to work, which is likely to need a lot of time and effort but provide very little one can actually write about. So it’s probably alright that the data simulation chapter ended up being a little longer than expected.

On Friday I finally got access to my Kubernetes (cloud container framework) cluster. As I said, I always find that sort of thing scary, and rightly so, as the events of the next couple of hours proved. The setup chosen by the colleague administering the cluster required me to SSH onto the control machine, so that was already an unfamiliar level of abstraction. Surprisingly, with a little help from the internet I nevertheless quickly managed to deploy a few sample applications to the cluster. But then I had to start thinking about how to actually get my few gigabytes of database tables to the cluster for the machine learning setup. The best I could come up with was running a database in the cluster and repeating the entire data generation process there (after all, now that it worked that would be no problem).

But how do you run a database in a Kubernetes cluster? I tried just starting a pod with an official PostgreSQL image. No problem, except: how do I now access that database? It seemed much easier to just install PostgreSQL on the control machine. It’s a CentOS machine and I’m not familiar with that particular Linux distribution, but I got it to run. Yet the control machine has only 8 GB of RAM, and even the simplest SQL query on my database (like counting the number of rows in a table) took forever. That wouldn’t do.

So I tried installing a PostgreSQL cluster inside the Kubernetes cluster. And broke the cluster. So thoroughly, in fact, that the administrator (who was, to be fair, also doing this for the first time in his life) could not repair it and had to set it up from scratch.

Bad start.

I’ll try again this week. But in fact I am still unsure as to what exactly I am supposed to gain by doing my machine learning in the cloud. The combined RAM of the entire cluster is the same as that of my local machine. Is that really worth the effort? I am still very tempted to just do this entire thing locally, streaming the data to keep the memory footprint small. It should work.

Anyway, this all feels kind of ambivalent to me right now. I still have over four months or so, and not a lot left to write about, so investing a few weeks into getting this machine learning thing set up in the cloud could be worthwhile, and I might learn a lot. If it works. Big IF.

By the by, I had a somewhat anxious time over the last couple of weeks trying to figure out exactly how much time I have left. My employer wants me to start full-time in September. But it suddenly occurred to me that even after I hand in the thesis, it will still take my supervisors some time to read it and schedule the oral exam. Just how much? Particularly since it’s going to be during the summer term break, when people tend to be on vacation?

I tried to ask my professor, but she didn’t reply to my emails for over a fortnight. That sort of made me very jittery. Granted, I still haven’t officially registered my thesis project, but two months into it I am sort of invested, and time is of the essence. The prospect of trying to meet my employer’s deadline with a professor who takes her time responding was seriously uncomfortable. I finally ran into her in the cafeteria this morning, and she said she just had problems catching up with her email backlog after having been sick for a while, but in any case she’d be available even during the term break, so no problem. I’m still glad, though, that I had meanwhile told my employer that September might be a long shot and asked whether we could leave it open whether it’d be September or October. Takes some of the pressure out of the thing.

In any case, right after that I went to settle on a final title for the project and get my professor’s signature, and now I’ll make the thing official this Thursday (one can do that only once a week). After that I’ll be seriously committed. Three months minimum, six months maximum. But I don’t plan on taking until late October with this. Anyway, the die is cast.

