Yesterday, I wrote about legacy code and why it is often not possible to start out with test-driven development (TDD). In this post, I will detail a strategy which I feel is a good way to go about testing such code. I make no guarantees that this is the best approach (please feel free to suggest other tricks you use), but it works, and it has been useful to me in porting Medusa V2 code to Medusa V4 (there is a long story about why not V3, but there are reasons). And while this is written from a PHP/Laravel perspective, the same ideas apply whether you are using Java and JUnit, Python and its tools, or some other language/toolset.
- Get to know your data.
Regardless of whether you are talking about an API to a network or telephony switch, some other device, or a database, you must first understand your data and the API used to get at it. For relational databases residing in, say, MS SQL Server, PostgreSQL or MariaDB, the API is just your normal database programming interface, and the data can be determined by looking at the database schema and then digging deeper into things like JSON fields. For a MongoDB instance, the entire database may not have any schema defined whatsoever, and you will be forced to extract the shape of the data from the database itself, in some cases writing tools to do so. If someone has added schema validation for you (see this fine article for an example of how to do that when creating the database under Laravel), and you can access that validation document, great... you have a leg up in beginning your job. But if not, depending on the database, you may have a heavier lift. This is not to say that a product such as SoftTrax will make it easy when you do have a relational database, but it could be much worse!
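One way to start building that tooling for a schemaless store is a small script that walks a sample of exported documents and records which fields and PHP types actually occur. Here is a minimal sketch in plain PHP; the `inferSchema()` helper and the sample field names are my own inventions, not part of Medusa or any MongoDB driver:

```php
<?php
// Hypothetical helper: infer a rough schema from sample documents
// exported from a schemaless store (e.g. MongoDB). Each document is
// an associative array; we record every field name and the set of
// PHP types seen for it across the sample.
function inferSchema(array $documents): array
{
    $schema = [];
    foreach ($documents as $doc) {
        foreach ($doc as $field => $value) {
            $schema[$field][gettype($value)] = true;
        }
    }
    // Flatten each type set into a plain list for readability.
    return array_map(
        fn(array $types): array => array_keys($types),
        $schema
    );
}

$docs = [
    ['name' => 'Alpha', 'members' => 12],
    ['name' => 'Bravo', 'members' => 9, 'motto' => 'Semper paratus'],
];
var_export(inferSchema($docs));
// name => [string], members => [integer], motto => [string]
```

Even a crude report like this tells you which fields are optional and which fields hold inconsistent types, both of which you will need to know before writing assertions.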
This need not be done all at once, either. You may take it a table/collection at a time, and you may do so continuously while working on a specific model (if you are dealing with an MVC or similar architecture). But only by knowing your data can you present reasonable arguments to the code you are testing, and know how to set up a thorough set of assertions to verify that the code returned what it is supposed to return.
- Create seeders/factories to seed your tests.
Whether you are talking about tests which run in 100% isolation from the database and the APIs, or you are using a test database, you NEVER want to run your unit and functional tests against production data. Not only does it put that data at risk and pollute your production environment with test data, the data is constantly subject to change, so you risk your tests breaking. Nor is simply restoring your production data into a test database workable for TDD. Sometimes your data is of such a size that a restore takes minutes if you are lucky, and hours on specially provisioned servers if you are not. At two previous jobs, our database was so large that a restore took days. You want your tests to run against a well-known set of data, and to run quickly, taking at most a few seconds per test (and ideally multiple tests per second) for TDD.
Even in the case of Medusa, it is difficult to use a copy of production data. A restore takes a good number of seconds (maybe as much as a minute), and running your tests in isolation from one another is just not possible. Moreover, let us face it... it is difficult to know a large dataset as well as you need to in order to write tests in a reasonable amount of time. As of this writing, there are 706 chapters and almost 7800 members. Can you wrap your head around that many database records/documents? I sure cannot. At a minimum, you are going to need to exercise your ETL (Extract-Transform-Load) skills to create seeders and factories. Doing so also gives you a chance to remove any PII (Personally Identifying Information) from the test data, so that you can at some point hand off the code to be worked on by somebody else without having to get NDAs, security clearances or other preventative measures involved. Right now, I have used ETL to produce seeders and factories that give me just 22 chapters and 15 users, where a real chapter may run to 100 or more users. Imagine writing the assertions that all 100 users of a chapter are returned in alphabetical order! And each test, even when seeding, say, the chapters and users, still runs in roughly a second against a test database.
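As a concrete example of the PII-stripping part of that ETL work, a small helper can rewrite each exported record with obviously fake values before it is baked into a seeder. This is only a sketch; the field names here are assumptions, not Medusa's actual schema:

```php
<?php
// Hypothetical ETL step: scrub PII from an exported member record
// before writing it into a seeder. Field names are assumptions, not
// Medusa's real schema.
function scrubMember(array $member, int $index): array
{
    $member['first_name'] = 'Test';
    $member['last_name']  = sprintf('Member%03d', $index);
    $member['email']      = sprintf('member%03d@example.com', $index);
    // Drop fields the tests will never need.
    unset($member['phone'], $member['address']);
    return $member;
}

$clean = scrubMember(
    ['first_name' => 'Jane', 'last_name' => 'Doe',
     'email' => 'jane@real.example', 'phone' => '555-0100',
     'rank' => 'CAPT'],
    7
);
// $clean keeps 'rank' but now reads Test Member007 / member007@example.com
```

Deterministic fake values like `Member007` have a side benefit: your assertions can predict exactly what a sorted or filtered result should look like.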
Yes, some of you may be thinking "Why not mock the database calls?" While that certainly is possible for unit tests and something to consider, can you imagine setting up the mocks to traverse a tree of even 5 chapters? Besides, executing against a small test database allows you to verify your SQL or other database interactions, and helps you avoid writing code which does not execute properly once it hits the database. Save the mocking for error/exception injection, unless you can easily simulate the failure with a trick such as using a bad key. More on this later, but first...
- Expect to keep going back to update your unit tests as you update your seeders/factories.
You will find that in later tests, you might need to add a new user or chapter to properly test something, only to find that you broke the tests for a method you wrote earlier. But don't just blindly update your tests... think about whether the change makes sense, because there is always a chance you have exposed a bug or edge case instead.
- Take it easy on the mocking!
I once worked for a company where the expectation within the department was that you would test methods in 100% isolation from other methods which might be called, even within the same class. Such thinking ends up with you potentially producing code with SQL errors, or methods which do not work together. I have come up with an analogy for this: you can write unit tests for wings, a boulder and a pig with such a high degree of isolation and have everything pass, but putting wings on a pig or a boulder most certainly is not going to make either of them fly. Instead, consider the class as a whole as the unit, and write tests for the public methods, and perhaps the protected methods. But just because the tests for a single method happen to cover half a dozen other public methods is not a reason to get lazy. Write the tests for those methods as well, and get used to running a subset of the tests. For PHPUnit, such an execution looks something like this:

coverage tests/Unit/ChapterTest.php --filter='testGetChapters[^B]'
which runs all of the tests whose names start with testGetChapters and are not immediately followed by a capital B, in just a couple handfuls of seconds. This is the cycle time which you want for TDD, so you can quickly turn around and do things like adding type hinting or cleaning up a method's code to improve it. After all, when the full coverage run for a class may take minutes, that is way too long to wait to find out that your newest unit test failed.
One caveat to all this, however. If you are testing a complex function which calls a bunch of other methods which, say, talk to the API of a network/telephony switch, by all means mock those sorts of methods. Learn their signatures, required parameters and returned data (best done while writing the tests for those methods), then incorporate what you learned into the mocks. And don't be afraid to inject or return large XML or JSON documents; just make sure that those documents reflect reality. But don't mock for mocking's sake.
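For example, if the code under test talks to a switch through some client interface, even a hand-rolled fake can return a canned document shaped like the real thing. Everything here (the interface, the XML payload) is hypothetical; the point is that the fake's response should mirror what the real device actually sends:

```php
<?php
// Hand-rolled fake for an external switch API. The interface and
// payload shape are hypothetical stand-ins, not a real vendor API.
interface SwitchClient
{
    public function portStatus(string $port): string;
}

final class FakeSwitchClient implements SwitchClient
{
    public function portStatus(string $port): string
    {
        // Canned response; keep it faithful to the real device's XML.
        return "<port name=\"$port\"><state>up</state></port>";
    }
}

// The code under test depends only on the interface, so the fake
// can be swapped in without touching real hardware.
function isPortUp(SwitchClient $client, string $port): bool
{
    return str_contains($client->portStatus($port), '<state>up</state>');
}
```

Depending on an interface rather than a concrete client is what makes this swap cheap, whether you hand-roll the fake as above or let PHPUnit generate a mock for you.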
- Learn how to mock your OS system calls.
There is nothing like having a test rely on a date() call to cause your head to swell and burst. But rather than writing a call wrapper like this:
protected function mydate(string $format): string { return date($format); }
And calling the wrapper function, instead do what I did in this test. Here, I take advantage of the namespace: most system calls can be injected this way, since unless you put a backslash in front of the call, PHP resolves it in the current namespace before falling back to the global one.
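Stripped down to its essentials, the trick looks like this. It is shown in a single file for brevity; in practice the override lives in your test file, declared in the same namespace as the code under test (the namespace and function names here are made up):

```php
<?php
namespace App\Reports;

// Production-style code: calling date() WITHOUT a leading backslash
// lets a function in the current namespace win over the global one.
function today(): string
{
    return date('Y-m-d');
}

// The test-side stand-in, declared in the SAME namespace, shadows
// the built-in and pins the clock to a fixed moment.
function date(string $format): string
{
    // Backslashes force the real built-ins here.
    return \date($format, \mktime(0, 0, 0, 7, 4, 2020));
}

echo today(); // 2020-07-04
```

Because function calls are resolved at call time, any unqualified `date()`, `time()`, `rand()` and so on inside the namespace now hits your stand-in, with no wrapper method needed in the production code.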
- Be thorough with your assertions statements.
While some talk about "one assertion per test", sometimes it makes more sense to use the same arrange/act code and instead use multiple assertion statements to make your overall assertion... that the results are a very specific thing.
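A sketch of what I mean, using bare assert() calls instead of PHPUnit's assertion methods so it stands alone (getChapter() is a made-up stand-in for the code under test):

```php
<?php
// Hypothetical code under test.
function getChapter(): array
{
    return ['name' => 'HMS Example', 'members' => ['Abe', 'Bea', 'Cal']];
}

// Arrange/act ONCE...
$chapter = getChapter();

// ...then assert the pieces that together pin down one specific result.
assert($chapter['name'] === 'HMS Example');
assert(count($chapter['members']) === 3);
assert($chapter['members'] === ['Abe', 'Bea', 'Cal']); // order matters too
```

One act, several assertions, one logical claim: the chapter came back exactly as expected, including the ordering of its members.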
- Know your tools, and don't be afraid to use var_export() and other tools at your disposal.
In legacy projects, when putting code under test, using var_export() or similar tools to learn what your code actually returns is essential. This is not to say that blindly copying data from a var_export() into your expectations is a good idea... still think about whether or not the data makes sense, especially given your test dataset. After all, odds are that your existing code and UI are your ultimate requirements.
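For instance, dumping what a legacy function actually returns gives you a copy/paste-ready PHP literal to review before promoting it to an expected value (legacyTotals() is a hypothetical stand-in):

```php
<?php
// Stand-in for some legacy code being put under test.
function legacyTotals(): array
{
    return ['chapters' => 22, 'users' => 15];
}

// var_export(..., true) RETURNS valid PHP source for the value,
// ready to paste into a test as the expected result -- after you
// have sanity-checked it against your seeded data.
$literal = var_export(legacyTotals(), true);
echo $literal, PHP_EOL;
```

Writing the dump to output (or STDERR) during a throwaway test run lets you capture the literal without sprinkling debug code through the production path.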
Given all this, I think it is fair to say that names like "Brownfield development" for this sort of effort are good ones. Imagine trying to use a hoe on dirt which has been baked by the sun for years. The longer the baking, the harder it is going to be to hoe that row into a nice, fertile growing space for code again.