Two Horrible Days

I just lived through the two worst days in these nearly five terms of studies. So far.

My programming partner and I met on Tuesday morning to figure out the integration testing assignment for the architecture practicum. It took us an hour to realize that the test project would never work with my installation of the IntelliJ IDE. Even after updating to a new minor version, something remained seriously broken. I finally budged and installed the hated but trusted alternative, Eclipse, which I had dumped three terms ago without ever looking back, and then we could at least begin to work.

It was a positive nightmare, and remained one for the next two days. You see, the integration test, once uploaded to Gitlab, starts the runner environment on the Gitlab server using a script, which then launches the actual deployment of the three services on the Kubernetes cluster and executes the test. At least that’s what I make of it, because fortunately that is something my programming partner took care of (in his turn, I understand, mainly copying code from the deployment genius in our group, tutorials provided by the assistant, and the internet). For the test to pass, however, all three services have to be available, able to find one another, and be found by the test. This in turn means correctly resolving the URLs (internet addresses) configured dynamically by the environment (which was a precondition of the assignment).

And that’s where the nightmare started. Because half of the time this simply didn’t work. And small wonder, because we are dealing with three different URLs for each service. There is the simple one that just gives the service name, as in “some-service”. Supposedly Kubernetes is able to resolve that, given the correct number of sufficiently opaque .yaml files. Then there is a fully qualified URL that states the cluster name, our namespace, some other qualifiers, and the service name. And there is one two lines long that starts with localhost and proceeds through a lot of cryptic acronyms separated by slashes to finally end in the service name and “proxy”, which is the one by which you can reach the service from outside the cluster. We used that last, long one for querying the services via curl (the command line) and for local testing, whereas supposedly the middle one is the one the first, short one resolves to when used within the Kubernetes namespace.
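For the record, this is roughly what those three shapes look like. It is a sketch with hypothetical names (“some-service”, “our-namespace”, port 8080) standing in for our real ones, and it assumes the long one goes through kubectl proxy:

```java
// 1. Bare service name: resolvable only from inside the cluster,
//    given the right Kubernetes Service definition in those .yaml files.
String shortUrl = "http://some-service:8080";

// 2. Fully qualified cluster-internal DNS name that the short form
//    expands to within our namespace.
String qualifiedUrl = "http://some-service.our-namespace.svc.cluster.local:8080";

// 3. The long proxy URL for reaching the service from outside the cluster
//    (e.g. through kubectl proxy listening on localhost), the one we used
//    with curl and for local testing.
String proxyUrl =
    "http://localhost:8001/api/v1/namespaces/our-namespace/services/some-service:8080/proxy";
```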

Confused? Yeah, so am I. In fact I can’t even find words to describe just how confused I am.

So that’s what we fought with for the rest of the day, and I for the whole next day. Sometimes I used the wrong URL. Sometimes the services just wouldn’t find one another. Sometimes they would, but give the wrong replies. The crux is that in an environment this complex, confusing, and hard to inspect, it’s extremely hard to know just what goes wrong when an integration test fails. It might be their service, or ours, or the test, or just Gitlab or the cluster acting up. At the end of day one I finally had a very simple test that actually passed. Not a green pipeline (a pipeline is a continuous integration tool that builds the software in a container and runs the tests, signalling green if all is fine), because the tests run by the other service still failed, but a passing test. So I left with the feeling that I had cleared one more hurdle.

On the way home I began to have doubts. Actually my “integration” test didn’t test a lot of interaction. We downloaded data from the other service, changed it, then downloaded it from our service and checked whether it was indeed changed. In fact, the test didn’t explicitly interact with the other service at all. It just assumed that the regular pushing and pulling of data that I had programmed into the service had indeed taken place in between the checks; we might just as well only have been checking whether we had properly changed our own data. So I changed the test to allow time for that exchange of data to take place and then to query the other service directly for its data, even though I thought doing that was a no-no for an integration test.
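In code, the change amounted to something like this. A heavily simplified sketch, not our actual test: the endpoint, the item, the environment variable names, and the fixed waiting time are all invented, just to show the shape of the idea.

```java
import org.junit.jupiter.api.Test;
import org.springframework.web.client.RestTemplate;

import static org.junit.jupiter.api.Assertions.assertEquals;

class DataExchangeIntegrationTest {

    private final RestTemplate rest = new RestTemplate();

    // Hypothetical URLs read from the environment, as the assignment required.
    private final String ourService = System.getenv("OUR_SERVICE_URL");
    private final String otherService = System.getenv("OTHER_SERVICE_URL");

    @Test
    void changedDataShowsUpOnBothServices() throws InterruptedException {
        // Change a piece of data via our own service.
        rest.put(ourService + "/items/42", "new value");

        // Give the scheduled push/pull between the services time to run.
        Thread.sleep(10_000);

        // Old version of the test: only checked our own service again,
        // which proves nothing about the integration.
        String ours = rest.getForObject(ourService + "/items/42", String.class);
        assertEquals("new value", ours);

        // New version: also ask the other service directly whether the
        // change actually arrived there.
        String theirs = rest.getForObject(otherService + "/items/42", String.class);
        assertEquals("new value", theirs);
    }
}
```

Waiting a fixed time for the scheduled exchange is crude, but it was the simplest way to give the two services a chance to actually talk to each other before checking.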

The result was depressing. The test showed that nothing worked, at all, ever. We were not even talking to the other service. And with that new and depressing insight I had to get through a normal evening at home with the family, even attend choir practice, all the time knowing that we had to hand in the assignment 36 hours hence and that the entire next day would be taken up with lecture and practicum for the compulsory elective module.

At least next morning I had an inspiration regarding our not talking to the other service: the method that should regularly pull and push data to and fro was triggered by a scheduling annotation in the Spring Java framework, but I had forgotten to enable scheduling in the application, so it never ran. That was easy to fix. But first we went to the meeting of the next group of the distributed systems practicum and finally, belatedly, handed in our solution to the second assignment sheet. That took an hour, all of it time I was missing from the process mining lecture. When I finally got there I learned that my co-students had already talked to the professor about the examination date (remember, it had been moved up from week two to week one) and she wouldn’t budge, saying she had a professional development training booked in the second week. So it will definitely be three exams over three days in the first week, for me. I was seriously angry. And spent the entire rest of the lecture hardly listening and instead fighting with my integration tests. And the lunch break. And the first half hour of the process mining practicum.
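For anyone who has never been bitten by this particular Spring gotcha: a method annotated with @Scheduled only runs if scheduling has been switched on somewhere in the application, typically with @EnableScheduling on a configuration class. A minimal sketch, with all class and method names invented:

```java
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.scheduling.annotation.EnableScheduling;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@SpringBootApplication
@EnableScheduling   // the line I had forgotten; without it, @Scheduled is silently ignored
public class OurServiceApplication {
    public static void main(String[] args) {
        SpringApplication.run(OurServiceApplication.class, args);
    }
}

@Component
class DataExchangeJob {

    // Hypothetical sync method: pushes our changes to the other service and pulls theirs.
    @Scheduled(fixedRate = 30_000)   // every 30 seconds
    public void exchangeData() {
        // ... push our data, pull theirs ...
    }
}
```

One missing annotation, no error message anywhere, and the data exchange simply never happens.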

The process mining practicum was another let-down. So far it had consisted of a small number of simple assignments that could easily be completed within two hours or so, so there never was any actual homework. At least one practicum that didn’t run totally rampant. And all of a sudden the assignment for that day asked us to find ourselves a data source, some event logs available on the internet, start our own process mining project, and in the next and final meeting give a 20-minute presentation on our results. Really? That late in the term, and with a horrible first exam week looming ahead, suddenly an assignment that required creativity and a real investment of time. And then I realized that it was 1 p.m., the presentation for the architecture practicum was scheduled for 9 a.m. the next morning, and my tests were not passing. Not even close.

I feel bad about it, but I really had problems concentrating on the practicum assignments. That day, I know, I let my practicum partner down, something I had never done before. We spent two hours half-heartedly clicking through the internet and a process mining tool, not getting anywhere. I was glad when we finally decided to postpone working on this assignment until next week and I could return to my pipeline.

It remained a nightmare. Nothing worked. The environment was entirely erratic. In one build, everything worked, except the test didn’t pass. In the next, the other service wouldn’t reply at all, or only with a 400 error code. It was decidedly the lowest moment of my entire studies. For the first time I looked total failure in the face, not as a vague fear anticipated for a distant future, but as the naked realization that the presentation was only 15 hours away and there was nothing I could do to make these stupid tests pass.

I went home in a furious rain and continued to work. Or was it work, or just desperation? Experimentally changing things and changing them back, only to see that nothing moved that stupid red pipeline. One small step ahead was that I found an error in the other service’s code, which they first denied was there, but then fixed. Sadly, if that error had been sorted out earlier, it would have resulted in a passing test many hours before, because there had been one test that passed until failing on that final assertion.

And finally, at 7:30 p.m., I decided to simply hardcode the URL for the other service, in violation of the requirements for the assignment, instead of reading it from the environment. And suddenly the test passed!

I was relieved, triumphant, and baffled, all at the same time. I couldn’t explain why this tiny change should have made a difference, but was ready to accept it, just to have a passing test the next morning, however gained. And mind you, this was after I had reverted to the original testing setup that tested basically nothing at all. At that desperate moment that was good enough.

Then I looked at the code again and realized that my change had absolutely no effect. I had hardcoded the URL, but forgotten to delete the statement that read it from the environment. And that statement came later, so it overwrote the earlier one. Effectively I hadn’t changed one iota in the actual code.
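In hindsight the mistake was trivial. Roughly this, with invented variable and environment names:

```java
// What I meant to do: hardcode the other service's URL (against the assignment rules)
// and stop resolving it dynamically.
String otherServiceUrl = "http://some-service.our-namespace.svc.cluster.local:8080";

// What was still sitting a few lines further down: the original lookup,
// which simply overwrote the hardcoded value again.
otherServiceUrl = System.getenv("OTHER_SERVICE_URL");
```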

Yet the test had passed.

Leave it alone? There was a green pipeline, after all. We could hope to pass, with nobody ever noticing that the test didn’t really test anything.

I couldn’t. Desperation changed to fatalistic boldness. I added all the “real” tests back in, one after another. And the pipeline was still green. One final test. Checking if the other service had actually saved the changes we sent them.

And the pipeline was red.

And it failed at a point that had nothing to do with the change. In fact, it failed much earlier in the code.

The environment was decidedly erratic. Nothing I did, apparently, had anything to do with success or failure of the pipeline.

So I re-ran the pipeline. And it passed. With no change in the code, at all.

It was 8 p.m. I told all my teammates to leave the pipeline alone and went to eat with my kids. I was relieved and totally baffled. Had I had the worst day of the last two and a half years just because Gitlab or the Kubernetes cluster was acting up? Could the same set of tests have passed the morning before, if they hadn’t been? In fact, quite mysteriously, the pipeline had already turned green twice in between, but I could never reproduce it.

We presented our work this morning. My teammates talked at length about deployment and yaml files and all that, demonstrated the green pipeline, and I talked through my integration test in 1 minute or so. The assistant nodded, the professor didn’t say anything at all, and we got our checkmark.

Was that worth it? What I learned about microservices in getting this integration test to pass is simply that they are hell. All the deployment stuff had been done by my teammates, so I still have no idea about that. And the professor himself said in the lecture that microservices are a passing fad in the real software industry world, just not efficient and resilient enough for major enterprise applications. I can imagine! So the contents of the practicum, again, have little to no relevance either to the exam or to the real world. My suspicion is that, just like with the distributed systems practicum, it was the assistant rather than the professor who, in his excitement to try something new, was responsible for sending us through this over-the-top technology nightmare.

This term I do, occasionally, hate this course of studies.
