Designs and practices to guard against erroneous null entries from database


One part of my program fetches data from many tables and columns in my database for processing. Some of the columns might be null, but in the current processing context that is an error.

This should “theoretically” not happen, so if it does it points to bad data or a bug in the code. The errors have different severities, depending on which field is null; i.e. for some fields the processing should be stopped and somebody notified, for others the processing should be allowed to continue and just notify somebody.

Are there any good architecture or design principles to handle the rare but possible null entries?

The solutions should be possible to implement with Java but I didn’t use the tag because I think the problem is somewhat language-agnostic.


Some thoughts that I had myself:

Using NOT NULL

Easiest would be to use a NOT NULL constraint in the database.

But what if the original inserting of the data is more important than this later processing step? So in case the insert would put a null into the table (either because of bugs or maybe even for some valid reason), I wouldn’t want the insert to fail. Let’s say that many more parts of the program depend on the inserted data, but not on this particular column. So I would rather risk the error in the current processing step instead of the insert step. That’s why I don’t want to use a NOT NULL constraint.

Naively depending on NullPointerException

I could just use the data as if I expect it to always be there (and that should really be the case), and catch resulting NPEs at an appropriate level (e.g. so that processing of the current entry stops, but not the whole run). This is the “fail fast” principle, and I often prefer it. If it is a bug, at least I get a logged NPE.
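
For concreteness, this is roughly what I mean (a minimal sketch only; the Map-based row type and the process method are just placeholders):

import java.util.List;
import java.util.Map;
import java.util.logging.Level;
import java.util.logging.Logger;

// Minimal sketch of the "fail fast" idea: let the NPE happen, but catch it at
// the per-entry level so one bad row does not abort the whole run.
// Rows are represented as plain Maps here; a real ORM entity would work the same way.
class FailFastProcessor {

    private static final Logger LOG = Logger.getLogger(FailFastProcessor.class.getName());

    void processAll(List<Map<String, Object>> rows) {
        for (Map<String, Object> row : rows) {
            try {
                process(row); // may throw NullPointerException on unexpected nulls
            } catch (NullPointerException e) {
                // Only this entry is lost; the run continues and the bug is at least logged.
                LOG.log(Level.SEVERE, "Unexpected null in row " + row.get("id"), e);
            }
        }
    }

    private void process(Map<String, Object> row) {
        // Dereference without null checks, as if the data were guaranteed to be present.
        String name = row.get("name").toString();
        // ... further processing ...
    }
}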

But then I lose the ability to differentiate between various kinds of missing data. E.g. for some missing data I could leave it out, but for others the processing should be stopped and an admin notified.

Checking for null before each access and throwing custom exceptions

Custom exceptions would let me decide the correct action based on the exception, so this seems like the way to go.

But what if I forget to check somewhere? Also, I then clutter my code with null checks which are rarely or never expected to fire (and so are definitely not part of the business logic flow).
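
For concreteness, this is roughly the kind of thing I have in mind (a sketch only; MissingFieldException, Severity and the Require helper are made-up names):

// A custom exception carrying enough context to decide what to do:
// stop the whole run, skip the entry, or just notify someone.
class MissingFieldException extends RuntimeException {
    enum Severity { SKIP_ENTRY, STOP_PROCESSING }

    private final Severity severity;

    MissingFieldException(String field, Severity severity) {
        super("Required field is null: " + field);
        this.severity = severity;
    }

    Severity getSeverity() {
        return severity;
    }
}

// Helper so the null check is a single expression at the point of use.
final class Require {
    private Require() {}

    static <T> T nonNull(T value, String field, MissingFieldException.Severity severity) {
        if (value == null) {
            throw new MissingFieldException(field, severity);
        }
        return value;
    }
}

A caller would then write something like String name = Require.nonNull(entity.getName(), "name", MissingFieldException.Severity.STOP_PROCESSING); and a single catch block higher up could branch on the severity.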

If I choose to go this way, what patterns are best suited for the approach?


Any thoughts and comments on my approaches are welcome. Also better solutions of any kind (patterns, principles, a better architecture for my code or models, etc.).

Edit:

There is another constraint, in that I am using an ORM to do the mapping from the DB to persistence objects, so doing null checks at that level would not work (as the same objects are used in parts where the null does not do any harm). I added this because the answers provided so far both mentioned this option.


I would put the null checks in your mapping code, where you build your object from the result set. That puts the checking in one place, and won’t allow your code to get halfway through processing a record before hitting an error. Depending on how your application flow works, you might want to perform the mapping of all the results as a pre-processing step instead of mapping and processing each record one at a time.

If you’re using an ORM, then you’ll have to perform all your null checks before processing each record. I’d recommend a recordIsValid(recordData)-type method; that way you can (again) keep all the null-checking and other validation logic in one place. I definitely wouldn’t intermingle the null checks with the rest of your processing logic.
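
In Java, such a recordIsValid-style check might look roughly like this (a sketch only; the required field names are invented):

import java.util.Map;

// Sketch: keep all the null validation in one place, run before processing starts.
class RecordValidator {

    // Fields that must never be null for this particular processing step.
    private static final String[] REQUIRED_FIELDS = {"id", "customerName", "amount"};

    boolean recordIsValid(Map<String, Object> record) {
        for (String field : REQUIRED_FIELDS) {
            if (record.get(field) == null) {
                return false;
            }
        }
        return true;
    }
}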


It sounds like inserting a null is an error but you are afraid to enforce this error on insertion because you don’t want to lose data. However, if a field shouldn’t be null but is, you are losing data. Therefore the best solution is to ensure that null fields are not erroneously getting saved in the first place.

To this end, enforce that the data is correct in the one authoritative, permanent repository for that data, the database. Do so by adding not null constraints. Then your code might fail but these failures immediately notify you of bugs, allowing you to correct issues that are already causing you to lose data. Now that you are easily able to identify bugs, test your code and test it twice. You’ll be able to correct bugs leading to data loss and in the process, greatly simplify downstream processing of the data because you won’t need to worry about nulls.


In regard to this sentence in the question:

This should “theoretically” not happen, so if it does it points to bad data or a bug in the code.

I have always appreciated this quote (courtesy of this article):

I find it amusing when novice programmers believe their main job is preventing programs from crashing. I imagine this spectacular failure argument wouldn’t be so appealing to such a programmer. More experienced programmers realize that correct code is great, code that crashes could use improvement, but incorrect code that doesn’t crash is a horrible nightmare.

Basically: it sounds like you are endorsing Postel’s Law, “be conservative in what you send, be liberal in what you accept”. While great in theory, in practice this “robustness principle” leads to software that is not robust, at least in the long term — and sometimes in the short term as well. (Compare Eric Allman’s paper The Robustness Principle Reconsidered, which is a very thorough treatment of the subject, albeit mostly focused on network protocol use cases.)

If you have programs that are incorrectly inserting data into your database, those programs are broken and need to be fixed. Papering over the problem only allows it to continue to get worse; this is the software engineering equivalent of enabling an addict to continue their addiction.

Pragmatically speaking, though, sometimes you do need to enable “broken” behavior to continue, at least temporarily, especially as part of a seamless transition from a lax, broken state to a strict, correct state. In that case, you want to find a way to allow the incorrect insertions to succeed, but still allow the “canonical” data store to always be in a correct state. There are various ways to do this:

  • Use a database trigger to convert malformed inserts into correct inserts, e.g. by replacing missing/null values with defaults
  • Have the incorrect programs insert into a separate database table that is allowed to be “incorrect”, and have a separate scheduled process or other mechanism that moves corrected data from that table into the canonical data store
  • Use query-side filtering (e.g. a view) to ensure that the data retrieved from the database is always in a correct state, even if the data at rest is not

One way to sidestep all of these issues is to insert an API layer that you control between programs that issue writes and the actual database.

It sounds like part of your problem is that you don’t even know all the places that are generating incorrect writes — or that there are simply too many of them for you to update. That’s a scary state to be in, but it should never have been allowed to arise in the first place.

As soon as you get more than a handful of systems that are allowed to modify data in a canonical production data store you are going to be in trouble: there is no way to centrally maintain anything about that database. Better would be to allow as few processes as possible to issue writes, and use those as “gatekeepers” that can preprocess the data before inserting as necessary. The exact mechanism for this really depends on your specific architecture.
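
As a rough illustration of the gatekeeper idea, every writer could be forced through one small service that repairs or rejects a row before it reaches the canonical table (all names below are invented for the sketch):

import java.util.HashMap;
import java.util.Map;

// Sketch of a gatekeeper: the only code path allowed to write to the canonical store.
// Incoming data is normalized (or rejected) here instead of in a trigger or in every client.
class CustomerWriteGateway {

    /** Hypothetical data-access abstraction for the canonical table. */
    interface CustomerDao {
        void insert(Map<String, Object> row);
    }

    private final CustomerDao dao;

    CustomerWriteGateway(CustomerDao dao) {
        this.dao = dao;
    }

    void insert(Map<String, Object> row) {
        Map<String, Object> clean = new HashMap<>(row);
        // Repair what can be repaired (putIfAbsent also replaces a null mapping)...
        clean.putIfAbsent("country", "UNKNOWN");
        // ...and reject what cannot.
        if (clean.get("customerId") == null) {
            throw new IllegalArgumentException("customerId must not be null");
        }
        dao.insert(clean);
    }
}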


Are there any good architecture or design principles to handle the rare but possible null entries?

Simple answer – yes.

ETL

Carry out some up-front processing to ensure the data is of sufficient quality to go into the database. Anything that ends up in the drop file should be reported back, and any clean data can be loaded into the database.

As someone who has been both poacher (dev) and gamekeeper (DBA), I know from bitter experience that 3rd parties simply will not resolve their data issues unless they’re forced to. Constantly bending over backwards and massaging bad data through sets a dangerous precedent.

Mart/Repository

In this scenario, raw data is pushed into the repository DB and then a sanitised version is pushed to the mart DB which applications then have access to.

Default values

If you can apply sensible default values to columns then you should, although this can involve some work if this is an existing database.

Fail early

It is tempting to simply address data issues at the gateway to the application, report suite, interface, etc. I’d strongly advise you not to rely solely on this. If you hook some other widget into the DB, you’ll potentially be faced with the same problems again. Address the data quality issues.


Whenever your use case allows you to replace NULL safely with a good default value, you can do the conversion in the SELECT SQL statements using ISNULL or COALESCE. So instead of

 SELECT MyColumn FROM MyTable

one can write

 SELECT ISNULL(MyColumn,DefaultValueForMyColumn) FROM MyTable

Of course, that will only work when the ORM allows you to manipulate the SELECT statements directly, or provides customizable templates for their generation. One should make sure that no “real” errors are masked this way, so apply it only if replacing NULL with a default value is exactly what you want.

If you are able to change the database and the schema, and your db system supports this, you may consider adding a default value clause to the specific columns, as suggested by @RobbieDee. However, this will also require modifying the existing data in the database to remove any previously inserted NULL values, and it will remove the ability to distinguish between correct and incomplete import data afterwards.

From my own experience, I know that using ISNULL can work surprisingly well – in the past I had to maintain a legacy application where the original devs had forgotten to add NOT NULL constraints to lots of columns, and we could not easily add those constraints later for various reasons. But in 99% of all cases, 0 as the default for number columns and the empty string as the default for text columns was fully acceptable.


The OP is presuming an answer that couples business rules with database technical details.

This should “theoretically” not happen, so if it does it points to bad data or a bug in the code. The errors have different severities, depending on which field is null; i.e. for some fields the processing should be stopped and somebody notified, for others the processing should be allowed to continue and just notify somebody.

This is all business rules. The business rules do not care about null per se. For all they know, the database could have null, 9999, “BOO!” … It is just another value. That, in an RDBMS, null has interesting properties and unique uses is moot.

The only thing that matters is what “null-ness” means for the given business object(s)…

Are there any good architecture or design principles to handle the rare but possible null entries?

Yes.

  • Put business rules in classes.
  • The translation between the two should live in an appropriate code layer that decouples the business classes from the data store. If you can’t put it in the ORM code, at least don’t put it in the database.
  • Make the database as dumb as possible, with no business rules in it. Even innocuous things like defaulting a value will bite you. Been there.
  • Validate data going to and coming from the database. And of course this is done within the context of the business objects (see the sketch below).
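
A minimal sketch of what validating within the context of the business object could look like (the Invoice class and its rules are invented for illustration):

import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

// Sketch: the business object knows its own rules; the database stays dumb.
// Whether a null amount merely warns or blocks is a business decision, not a schema one.
class Invoice {
    private final String number;     // may come back null from the database
    private final BigDecimal amount; // may come back null from the database

    Invoice(String number, BigDecimal amount) {
        this.number = number;
        this.amount = amount;
    }

    List<String> validate() {
        List<String> problems = new ArrayList<>();
        if (number == null) {
            problems.add("invoice number missing - stop processing and notify an admin");
        }
        if (amount == null) {
            problems.add("amount missing - skip this invoice and notify an admin");
        }
        return problems;
    }
}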

Throwing an exception upon data retrieval does not make sense.

The question is “should I store ‘bad’ data”? It depends:

  • Bad data might be used – Never save invalid objects or object composites. Complex data/business relationships exist all over the place, and users can perform any function at any given time, possibly using that business entity in a number of contexts. The effect (if any) of bad data at the time it is saved is not known, because it is highly dependent on future use; there is no unified/single process for that data.
  • Cannot progress if there is bad data – Allow saving of bad data; however, the next step in the process cannot continue until everything is valid. For example, doing one’s income taxes: when the return is retrieved from the database, the software points out the errors, and it cannot be submitted to the IRS until they are fixed.

There are many ways to handle nulls, so we’ll move on from the database layer up to the application layer.


Database layer

You can forbid nulls outright, although in your case that is impractical.

You can configure a default on a per-column basis:

  • it requires that the column be absent from the insert, so it does not cover explicit NULL insertion
  • it prevents detecting rows where the insert erroneously omitted this column

You can configure a trigger, so that upon insertion the missing values are automatically computed:

  • it requires that the necessary information to perform this computation be present
  • it will slow down the insert

Query layer

You can skip rows where an inconvenient null is present:

  • it simplifies the main logic
  • it prevents detecting the “bad rows”, so another process would be needed to check on them
  • it requires that each query be instrumented

You can provide a default value in the query:

  • it simplifies the main logic
  • it prevents detecting the “bad rows”, so another process would be needed to check on them
  • it requires that each query be instrumented

Note: instrumenting each query is not necessarily an issue if you have some automated way of generating them.


Application layer

You can pre-check the table for forbidden nulls (see the sketch after this list):

  • it simplifies the main logic
  • it improves the time-to-failure
  • it requires keeping the pre-check and application logic consistent
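
For illustration, such a pre-check could be a single count query run before the batch starts (a JDBC sketch; the table and column names are invented):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Sketch of a pre-check: count the rows that would violate the "no null" expectation
// before the main processing starts.
class NullPreCheck {

    long countOffendingRows(Connection connection) throws SQLException {
        String sql = "SELECT COUNT(*) FROM orders WHERE customer_id IS NULL OR amount IS NULL";
        try (PreparedStatement stmt = connection.prepareStatement(sql);
             ResultSet rs = stmt.executeQuery()) {
            rs.next();
            return rs.getLong(1);
        }
    }
}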

You can interrupt the processing when encountering a forbidden null:

  • it avoids duplicating the knowledge of which columns can be null and which cannot
  • it is still relatively simple (just a check + return/throw)
  • it requires that your process be resumable (if you have already sent an e-mail, you don’t want to send it twice, or a hundred times!)

You can skip the row when encountering a forbidden null:

  • it avoids duplicating the knowledge of which columns can be null and which cannot
  • it is still relatively simple (just a check + return/throw)
  • it does not require that your process be resumable

You can send a notification when encountering a forbidden null, either one at a time or by batch, which is complementary to the other ways presented above. What matters most, however, is “what then?”: if you expect the row to be patched and re-processed, you may need to ensure that you have some way to distinguish already-processed rows from rows that still need re-processing.


Given your situation, I would handle this at the application layer, using either:

  • interrupt and notify
  • skip and notify

I would tend toward just skipping, if possible, to guarantee a modicum of progress, especially if the processing can take time.

If you do not need to re-process the skipped rows, then simply logging them should be sufficient, and an e-mail sent at the end of the process with the number of skipped rows will be an apt notification.

Otherwise, I would use a side-table for the rows to be fixed (and re-processed). This side-table may either be a simple reference (without a foreign key) or a full-blown copy: the latter, even if more expensive, is necessary if you do not have the time to address the nulls before having to clean up the main data.
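
A sketch of the skip-and-notify combination described above, assuming the rows are already mapped and that notifyAdmin is some project-specific alerting hook:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of "skip and notify": bad rows are collected instead of aborting the run,
// and a single notification with the number of skipped rows is sent at the end.
class SkipAndNotifyProcessor {

    void run(List<Map<String, Object>> rows) {
        List<Object> skippedIds = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            if (row.get("dueDate") == null) {   // a forbidden null for this step
                skippedIds.add(row.get("id"));
                continue;                        // a modicum of progress is preserved
            }
            process(row);
        }
        if (!skippedIds.isEmpty()) {
            notifyAdmin("Skipped " + skippedIds.size() + " rows with missing data: " + skippedIds);
        }
    }

    private void process(Map<String, Object> row) { /* the actual work */ }

    private void notifyAdmin(String message) { /* e-mail, ticket, log alert, ... */ }
}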

Nulls can be handled in the translation or mapping of database types to language types. For example in C#, here’s a generic method that handles null for you for any type:

public static T Convert<T>(object obj)
{
    if (obj == DBNull.Value)
    {
        return default(T);
    }

    return (T) obj;
}

public static T Convert<T>(object obj, T defaultValue)
{
    if (obj == DBNull.Value)
    {
        return defaultValue;
    }

    return (T) obj;
}

Or, if you want to perform an action…

public static T Convert<T>(object obj, T defaultValue)
{
    if (obj == DBNull.Value)
    {
        // Send an alert; we might want to pass in the name
        // of the column or other details as well
        SendNullAlert();
        // Fall back to the default so we can keep processing
        return defaultValue;
    }

    return (T) obj;
}

And then in the mapping, in this case to an object of type “Sample”, we will handle null for any of the columns:

public class SampleMapper : MapperBase<Sample>
{
    private const string Id = "Id";
    private const string Name = "Name";
    private const string DataValue = "DataValue";
    private const string Created = "Created";

    protected override Sample Map(IDataRecord record)
    {
        return new Sample(
            Utility.Convert<Int64>(record[Id]),
            Utility.Convert<String>(record[Name]),
            Utility.Convert<Int32>(record[DataValue]),
            Utility.Convert<DateTime>(record[Created]));
    }
}

Finally, all the mapping classes can be automatically generated based on the SQL query or tables involved by looking at the SQL data types and translating them to the language-specific data types. This is what many ORMs do for you automatically. Note that some database types may not have a direct mapping (geospatial columns, etc.) and may need special handling.
