Here’s a skeleton of a class I built that loops through and deduplicates data – it’s in C# but the principles of the question aren’t language specific.
public static void DedupeFile(FileContents fc)
{
BuildNameKeys(fc);
SetExactDuplicates(fc);
FuzzyMatching(fc);
}
// algorithm to calculate fuzzy similarity between surname strings
public static bool SurnameMatch(string surname1, string surname2)
// algorithm to calculate fuzzy similarity between forename strings
public static bool ForenameMatch(string forename1, string forename2)
// algorithm to calculate fuzzy similarity between title strings
public static bool TitleMatch(string title1, string title2)
// used by above fn to recognise that "Mr" isn't the same as "Ms" etc
public static bool MrAndMrs(string title1, string title2)
// gives each row a unique key based on name
public static void BuildNameKeys(FileContents fc)
// loops round data to find exact duplicates
public static void SetExactDuplicates(FileContents fc)
// threads looping round file to find fuzzy duplicates
public static void FuzzyMatching(FileContents fc, int maxParallels = 32)
Now, in actual usage only the first function actually needs to be public. All the rest are only used inside this class and nowhere else.
Strictly that means they should of course be private. However, I’ve left them public for ease of unit testing. Some people will no doubt tell me I should be testing them via the public interface but that’s partly why I picked this class: it’s an excellent example of where that approach gets awkward. The fuzzy matching functions are great candidates for unit tests, and yet a test on that single “public” function would be near-useless.
This class won’t ever get used outside a small team at this office, and I don’t believe that the structural understanding imparted by making the other methods private is worth the extra faff of packing my tests with code to access private methods directly.
Is this “all public” approach reasonable for classes in internal software? Or is there a better approach?
I am aware there is already a question on How do you unit test private methods?, but this question is about whether there are scenarios where it’s worthwhile bypassing those techniques in favour of simply leaving methods public.
EDIT: For those interested, I added the full code on CodeReviewSE as restructuring this class seemed too good a learning opportunity to miss.
12
I’ve left them public for ease of unit testing
Ease of writing those tests, maybe. But you are then tightly coupling that class to a bunch of tests that interact with its inner workings. This results in brittle tests: they will likely break as soon as you make changes to the code. This creates a real maintenance headache, that often results in people simply deleting the tests as they become more trouble than they are worth.
Remember, unit testing doesn’t mean “testing the smallest possible piece of code”. It’s a test of a functional unit, ie for a set of inputs into part of the system, I expect these results. That unit might be a static method, a class, or a bunch of classes within an assembly. By targeting public APIs only, you mimic the behaviour of the system and so your tests become both less coupled and more robust.
So make the methods private, mock the “whole ‘FileContents’ DTO” and only test the one true public method. It’ll be more work initially, but over time, you’ll reap the benefits of creating useful tests like this.
6
I would normally expect you to exercise the private member functions via the public interface. In this instance I would write different tests to feed different file contexts in, with different data sets present in order to exercise those methods.
I don’t think your tests should know about those private methods. I think they’re part of the implementation, and your test should be concerned with how your component functions, not how it’s been implemented.
This class won’t ever get used outside a small team at this office
You might be surprised what gets reused as soon as you make it public. An alternative is to accept that making it public will result in re-use, and then pull the methods out into a general utility class. That way you can test those methods separately (since you’ve publicised them) and since it’s a public utility class, you’re not making any implicit assumptions that they’ll only be used in this scenario you’re currently focused on.
3
I think the issue comes from the design. My gut feeling says you either wrote the tests after the code, or that you already had a complete implementation in mind when you started writing the tests, and then you just shoehorned the tests to fit the design.
I think this type of trouble can be avoided by remembering the short TDD cycle of making a small assertion and then making it pass in the simplest way possible.
You speak of it being hard to exercise all the private methods via public methods. This is probably indicative of your class doing too much. If you can’t construct it by adding test after test and just seeing the requirements being met and then refactoring it into maintainable and readable code, then you either don’t have enough knowledge about the entity to implement it, or it’s too complex (yeah yeah, it’s a generalization, I know they’re evil).
By looking at your method names it looks to me like this class has way too many responsibilities. It has several algorithm implementations, it manages threads, it even possibly reads files from disk. Split up your work into manageable reusable chunks. Matching/validation algorithms can easily be injected (as strategies, more more preferably imo, as delegates), thread management should probably occur at a higher level, etc.
When you no longer have a large complicated class with billions of responsibilities (well, more or less), the testing becomes almost trivial.
8
I agree with @BrianAgnew and @kai, but would like to add more than a comment.
While an IDedupeFiler
(or whatever) should be tested through its public interface, the OP has decided there is value in testing the individual sub-routines. Irrespective of file size or line count (which is only a rough proxy count for class responsibilites), the OP has decided that there is too much complexity to test from the top of this class.
This is a good thing incidentally, one of the reasons people like TDD is that the need to test forces the coder to adapt (and improve) their design. It’s valid to point out that the earlier tests are written the earlier this process happens, but the OP isn’t at all far down the road of untestable design decisions and refactoring will not be onerous.
The question seems to be whether the OP should (1) make the methods public and achieve testability with less refactoring, or whether they should do something else. I’d agree with @kai and say the OP should (2) split this class up into isolated, separately testable chunks.
(1) breaks encapsulation and makes it less immediately obvious what the use and public interface of the class actually is. I think the OP acknowledges that this isn’t the best OOP design choice in their question.
(2) means more classes, but I don’t think that’s much of a problem, and it provides testability without design compromises.
If you really disagree with me that your sub-methods represent separate and separately testable concerns, then don’t try to dig into the class to test them. Exercise them through your top public method. How hard you find this will be a good indicator of whether this was the right choice.
1
I think you might benefit from a slight shift in the way you view unit tests. Instead of thinking about them as a way to guarantee that all your code is working, think of them as a way to guarantee that your public interface does what you claim it does.
In other words, don’t worry about testing the internals at all – write unit tests that prove that when you give your class X inputs, you get Y outputs – that’s it. How it manages to do that doesn’t matter at all. In fact, the private methods could be totally wrong, useless, redundant, etc., and as long as the public interface does what it is supposed to do, from the point of view of the unit test, it is working.
This is important because you want to be able to go back in and refactor that code later when a new library comes out that does it better, or you realize you were doing unnecessary work, or you just decide that the way things are named and organized could be made more clear. If your tests are only around the public interface, then you are free to do that without worrying about breaking things.
If you make this shift in the way you think about unit tests and find that it is still hard to write the test for a given class, then it is a good sign that your design needs some improvements. Maybe the class should try to do less – maybe some of those private methods really belong in a new smaller class. Or maybe you need to use dependency injection to manage external dependencies.
In this case, without knowing more details about how you are doing this deduplication, I’d guess that the methods you want to make public might actually be better off as separate classes, or as public methods in a utility library instead. That way, you could define the interface for each one along with its range of allowed inputs and expected outputs, and test them in isolation from the deduplication script itself.
In addition to the other answers, there’s another benefit to making these functions private and testing them via the public interface.
If you gather code coverage metrics on your code, it becomes easier to tell when a function is no longer being used. But, if you make all of these functions public, and make unit tests for them, then they will always have at least the unit test calling them even when nothing else does.
Consider this example:
public void foo() {
bar();
baz();
}
public void bar() { ... }
public void baz() { ... }
And then you make individual unit tests for foo
, bar
, and baz
. So they will all be called by your unit test framework, and they will all have code coverage saying they’re being used.
Now consider this change:
public void foo() {
bar();
}
public void bar() { ... }
public void baz() { ... }
Since your unit tests still call all of these functions, your code coverage is going to say they’re all being used. But baz
is no longer being called by any other part of your software other than the unit tests. It’s effectively dead code.
Had you written it like this:
public void foo() {
bar();
}
private void bar() { ... }
private void baz() { ... }
Then you would lose 100% of your code coverage of baz
when foo
changed, and that’s your signal to remove it from the code. OR, this raises a red flag because baz
is actually a super important operation, and its omission causes other problems (hopefully if this is the case, then other unit tests would catch it, but perhaps not).
I would seperate the function and the test cases to seperate classes. The one contains the function the other contains the test cases which calls the function and asserts if the result is equals to what you expect. I’m not in C but in JAVA I would use Junit for this.
This divides the logic and makes your class more readable. You also don’t have to worry about your problem.
You do not need to declare the methods public just to Unit-test them.
Unit test normally should belong to the same package where the class under testing belongs (of course, inside the different source code folder). As a result, also package private and protected methods can be accessed for testing. See Maven layout, for instance.
While it may not be worth writing very detailed tests for all internals, the visibility scope is not the only indicator to identify the unit.
1