A recurring pattern I see in my code is chaining together a lot of functions, which comes from the large number of processing steps a given task needs. This could be, e.g., data processing and visualization, or my most recent example:

For some financial analysis I want to read in a couple of PDFs which are bank statements, extract the text, get the transactions using regular expressions, and do some conversions from strings to the proper data types. After that, I would analyze those transactions in some way.

After some problem solving, I ended up with something like this in my main function:

transactions = []
for page_number, page in enumerate(pages):
    num_of_pages = len(pages)
    text_in_lines = extract_page_to_text(page_number, page, num_of_pages)
    matches = match_dates(text_in_lines)
    groups = assume_groups(matches)
    curr_transactions = create_transactions(text_in_lines, groups)
    transactions.extend(curr_transactions)

This code is for a single PDF (so it would be wrapped in another for loop), and I haven't done any analysis with the data yet. My issue is that I pretty quickly start to lose any structure if I'm just chaining those functions together.

So my thought is: there should be a design pattern for this kind of problem, right? Long chains of functions with some for loops and not a lot of conditionals should be easy to structure, in theory?

One refactoring possibility I thought of was to split the functions up into classes: a PDF class, a Page class, and a Transaction class. Is OOP the way to go, even if it just means gathering a bunch of functions in a class and then calling them from inside the class? But this opens up new questions for me again: how do I structure those classes? Do they exist in parallel, so in the end it's the same and I just call methods instead of functions? Or do I nest them inside each other, something like this:

class Pdf:
    def __init__(self, pdf):
        self.pdf = pdf
        self.pages: list[Page] = []

    def get_page_content(self):
        # do something with self.pdf to extract the content of each page
        for page_content in page_contents:
            self.pages.append(Page(page_content))

class Page:
    def __init__(self, page_content):
        self.page_content = page_content
        self.transactions: list[Transaction] = []

    def get_transactions(self):
        # do something with self.page_content to extract raw transactions
        for transaction_raw in raw_transactions:
            self.transactions.append(Transaction(transaction_raw))

class Transaction:
    def __init__(self, transaction_raw):
        self.transaction_raw = transaction_raw

    def some_further_processing(self):
        ...

This looks better in my opinion, but I’m nesting classes three levels deep and it seems like a lot of OOP just to chain some “simple” functions together.

You may have noticed that I'm kind of lost on this topic, so I'm looking forward to any kind of feedback 🙂

The term you’re looking for is “processing pipeline”.

More generally we might have a DAG of data dependencies, as with make or airflow. But the simplest DAG is just a sequence of processing steps in a straight line.


Unix | pipelines are a powerful means of composing simple functions over text records.

Analogously, in Python we can write generators which yield a sequence of records to subsequent processing stages. They compose quite nicely.
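
For instance, here is a minimal sketch of that idea (the stage functions and file names are hypothetical stand-ins, not anything from the question):

def read_lines(paths):
    # stage 1: yield one raw text line at a time from each input file
    for path in paths:
        with open(path) as f:
            yield from f

def parse_records(lines):
    # stage 2: turn each raw line into a structured record
    for line in lines:
        yield line.strip().split(",")

def keep_nonempty(records):
    # stage 3: drop records that have no content
    for record in records:
        if record and record[0]:
            yield record

# composing the stages reads like a Unix pipeline
pipeline = keep_nonempty(parse_records(read_lines(["a.csv", "b.csv"])))
for record in pipeline:
    print(record)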


Obligatory code review remark:

for page_number, page in enumerate(pages):
    num_of_pages = len(pages)

Better to hoist that constant out of the loop, as it isn't changing:

num_of_pages = len(pages)
for page_number, page in enumerate(pages):

Or even better, save typing a few characters by simply using the expression len(pages) as the third argument to extract_page_to_text, since it's cheap to compute and needn't be cached.
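
In other words, a sketch of that same loop from the question:

for page_number, page in enumerate(pages):
    text_in_lines = extract_page_to_text(page_number, page, len(pages))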


Your pipeline is already well structured.

Notice that once we have extracted a list of text lines, we no longer refer to page. This suggests a cleavage point where we might break out a stage of the pipeline.

So stage1 iterates over pages and yields text line lists. And stage2 iterates over such lists to come up with transactions, which can then be conveniently collected by a list comprehension. Or those records might be sent onward to a third stage of processing.
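
Here is a minimal sketch of those two stages, reusing the helper functions from the question (stage1 and stage2 are just names for the split described above):

def stage1(pages):
    # yield one list of text lines per page
    for page_number, page in enumerate(pages):
        yield extract_page_to_text(page_number, page, len(pages))

def stage2(pages_of_lines):
    # yield the transactions extracted from each page's text lines
    for text_in_lines in pages_of_lines:
        matches = match_dates(text_in_lines)
        groups = assume_groups(matches)
        yield from create_transactions(text_in_lines, groups)

transactions = [t for t in stage2(stage1(pages))]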

What does that accomplish? It reduces coupling, letting local temp vars go out of scope so we have fewer things to worry about at any given stage. And the clearly defined interfaces make it easier to create unit tests for individual stages in the middle of the pipeline.
