Is placing text markers inside of strings bad style? Is there an alternative?

03/11/2022 softwareengineering

I work with massive strings which need a lot of manipulation.

For example, I might generate a string like this:

Part 1
Boat

Section A
Programming

Part 2
Partitioning boats for programming.

Section AA
Section SQL Entries.

The string would be too large to manually check every part of it. Now I need to split this string into a stringlist by sections and parts. I can think of two options:

A Regular Expression:

QStringList sl = s.split(QRegularExpression("\\n(?=Part [0-9]+|Section [A-Z]+)"));

That looks like it should work, but sometimes exceptions slip through (e.g. the content line "Section SQL Entries." would erroneously be treated as a header and split).

Otherwise what I could do is place a marker when I generate the initial string:

??Part 1
Boat

??Section A
Programming

??Part 2
Partitioning boats for programming.

??Section AA
Section SQL Entries.

Which means that splitting the string would become easy:

QStringList sl = s.split("??");

Something tells me, though, that neither of these is good style or programming practice, but up to this point I have neither discussed it with anyone nor found an alternative.

If you were my project manager, would you accept either of these methods?
If not, what would you suggest I do as a best practice?

It’s not bad practice to have document encoding embedded as text in a string. Think of markdown, HTML, XML, JSON, YAML, LaTeX, etc.

What is bad practice is reinventing the wheel. Rather than writing your own text processor, think about using an existing standard. There is plenty of free software that does much of the parsing for you, and much of it has a non-restrictive license that lets you use it in your own proprietary software.

Using some common separator should work fine when splitting larger arbitrary strings, but I would recommend against using an arbitrary symbol. Someone reading that string as plaintext could be confused, not to mention troubles with UTF encodings and the question of whether the symbol can appear inside the sections themselves.

The most important part of this is that each section remains intact, while each “section header” needs to be appropriately identified.

Why not use a common separator but keep it readable? Something like:

[SECTION]
Part 1
Boat

[SECTION]
Section A
Programming

[SECTION]
Part 2
Partitioning boats for programming.

[SECTION]
Section AA
Section SQL Entries.

The problem is deciding what the separator should be, as it needs to be something that is guaranteed not to show up in any section.
You could further identify it as a separator by requiring that it is at the start of a line and is the only text on that line.

Without further knowledge of what text is expected in each section it’s hard to make a recommendation on what common separator would be best in this case.
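
The "separator alone on its own line" rule from this answer can be sketched as follows. This is only a minimal illustration in standard C++ rather than Qt, so that it compiles on its own; the function name splitOnSeparatorLines is made up for this example, and dropping any text before the first separator is an assumption, not something the answer specifies.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: starts a new chunk at every line that consists of
// exactly `separator` and nothing else. The separator lines themselves are
// dropped; every other line is kept verbatim, including its newline.
std::vector<std::string> splitOnSeparatorLines(const std::string& text,
                                               const std::string& separator) {
    assert(!separator.empty());
    std::vector<std::string> chunks;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        if (line == separator) {
            chunks.emplace_back();      // begin a new, empty chunk
        } else if (!chunks.empty()) {
            chunks.back() += line;      // append the line to the current chunk
            chunks.back() += '\n';
        }
        // Assumption: lines before the first separator are ignored.
    }
    return chunks;
}
```

Because the whole line is compared, a content line such as "Section SQL Entries. See [SECTION] above." stays inside its chunk: the separator text only counts when it is the entire line.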

The accepted answer seems to have missed what you wrote in a comment:

The reason is that a lot of the manipulation I do requires the full string

and gave this as an example:

s.replace("boat", "programming");

If that is what you want, it is IMHO a really bad idea to use some “markdown” or textual separator for your whole string. This always carries a certain risk of interfering with the manipulation and will not lead to robust code. Especially when you start using regular expressions on such a combined string, you will probably run into the same problems people observed when trying to parse HTML or XML with regular expressions.

Especially because you wrote there might be “thousands of [such manipulation] functions”, that risk might become a real problem. Even if you use some markup like XML to store the string list internally, you need to make sure the manipulation processes only the content, not the markup. That would mean splitting the string into parts before you do any processing and joining it again afterwards, which carries a high risk of bad performance.
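
To make the interference risk concrete, here is a small sketch in standard C++ (not Qt). Both helper names, replaceAll and splitOnMarker, are invented for this illustration; they only mimic a whole-string replace followed by a split on the "??" marker from the question.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: replaces every occurrence of `from` in `s` with `to`
// (plain text, no regular expressions) and returns the result.
std::string replaceAll(std::string s, const std::string& from,
                       const std::string& to) {
    assert(!from.empty());
    for (std::size_t pos = 0;
         (pos = s.find(from, pos)) != std::string::npos;
         pos += to.size()) {
        s.replace(pos, from.size(), to);
    }
    return s;
}

// Hypothetical helper: splits `s` at every occurrence of `marker`,
// skipping empty parts.
std::vector<std::string> splitOnMarker(const std::string& s,
                                       const std::string& marker) {
    assert(!marker.empty());
    std::vector<std::string> parts;
    std::size_t start = 0;
    while (true) {
        const std::size_t pos = s.find(marker, start);
        const std::size_t len =
            (pos == std::string::npos) ? std::string::npos : pos - start;
        const std::string part = s.substr(start, len);
        if (!part.empty()) parts.push_back(part);
        if (pos == std::string::npos) break;
        start = pos + marker.size();
    }
    return parts;
}
```

With s = "??Part 1\nBoat\n??Section A\nProgramming\n", splitOnMarker(s, "??") yields two parts. After an innocent-looking whole-string edit such as replaceAll(s, "Boat", "Which boat??"), the same split yields three parts, because the manipulation introduced text that looks like a marker.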

The better design alternative here is to provide an abstract datatype (use a class if you like), let's call it MyStringList, and provide a small set of basic operations which allow you to implement your “thousands of functions” in terms of those operations. For example, there might be generic find and replace operations, or a generic functional map operation. You can also add something like a JoinToString operation if you really need the whole list in one string for certain purposes.

Using these operations, your fear that the code becomes more complicated because “everything would have to be done in a for loop” becomes pointless, because the only for loops you get are encapsulated inside the datatype’s operations. And I would not be concerned about the performance until you have a real, measurable performance impact (which I doubt you will get if you implement the basic operations correctly).
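
A minimal sketch of such a datatype could look like the following. It uses standard C++ containers instead of QStringList so that it is self-contained; apart from the name MyStringList, every member name and signature here is an assumption made for illustration, not a prescribed interface.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Illustrative sketch only: the parts are stored separately, so no
// operation can ever touch a separator or marker.
class MyStringList {
public:
    explicit MyStringList(std::vector<std::string> parts)
        : parts_(std::move(parts)) {}

    // Generic map: replaces every part with the result of fn(part).
    void map(const std::function<std::string(const std::string&)>& fn) {
        for (auto& part : parts_) part = fn(part);
    }

    // Generic replace: plain-text replacement inside each part.
    void replace(const std::string& from, const std::string& to) {
        assert(!from.empty());
        for (auto& part : parts_) {
            for (std::size_t pos = 0;
                 (pos = part.find(from, pos)) != std::string::npos;
                 pos += to.size()) {
                part.replace(pos, from.size(), to);
            }
        }
    }

    // Generic find: index of the first part containing `needle`, or -1.
    int find(const std::string& needle) const {
        for (std::size_t i = 0; i < parts_.size(); ++i) {
            if (parts_[i].find(needle) != std::string::npos) {
                return static_cast<int>(i);
            }
        }
        return -1;
    }

    // JoinToString: builds one string only when a caller really needs it.
    std::string joinToString(const std::string& separator) const {
        std::string out;
        for (std::size_t i = 0; i < parts_.size(); ++i) {
            if (i > 0) out += separator;
            out += parts_[i];
        }
        return out;
    }

    const std::vector<std::string>& parts() const { return parts_; }

private:
    std::vector<std::string> parts_;
};
```

A call like list.replace("Boat", "Programming") then reads almost the same as the whole-string version, and the loop over the parts lives in exactly one place.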

The format that is described is very similar to INI files:

https://en.wikipedia.org/wiki/INI_file

In that format the section name is enclosed in square brackets [], so marking the sections in some fashion, as you describe, makes sense: it adds meaning to that text.

For example, I might generate a string like this:

Question: From what do you “generate” this string?

Would that be any easier to manipulate?
