I’ve been thinking about this for a while now. For example, I have a UI that lets the user choose which IP to connect to. I use numericUpDown4 to limit the user to numeric values only; then, when the user chooses save, I save the value as a string and concatenate the dot (.) to it to form the address.
Another example is the port number. A port number is a pure integer, but then again it doesn’t make sense to perform calculations on it. Making it a string also seems wrong, because there are no characters involved when using a port.
Should I only declare the variable as integer if I can perform calculations on it?
You should choose integer over string if the values an integer can have and the operations an integer supports are a better fit for the data in question than the values and operations a string has.
It is okay if some of the values/operations of that type don’t make sense for that data, simply because there are so many different kinds of real-world data that if we tried to make all databases and programming languages have separate built-in types that perfectly matched them, we’d never get any real work done.
First, IP addresses. Assuming for simplicity we only care about IPv4 addresses, then RFC 760 says “Addresses are fixed length of four octets (32 bits).” This immediately tells us that the set of possible values for a 32-bit unsigned integer is exactly the same as the set of valid IP addresses. A string representation would in principle allow all sorts of clearly invalid IP addresses like “9999.9999.9999.-42e5” and “Hello World!” unless we write a bunch of validation code. That alone is more than enough reason to use integers as your “backend” representation of IP addresses, even if the rest of your code prefers to use a string or some object with a pretty printing method to ensure you normally get the “dot-decimal” notation humans like. If another argument is required, note that part of the reason dot-decimal notation for IP addresses is so common is that the four 8-bit components of an IP address often have separate meanings. Thus, we’ll probably want to extract those four separate 8-bit numbers from an IP address from time to time, and taking the first or last 8 bits of a 32-bit integer is a much simpler and faster operation than tokenizing a string.
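To make the bit-extraction point concrete, here is a minimal sketch of keeping an IPv4 address as a 32-bit unsigned integer and only producing dot-decimal text for display. The function names (pack_ipv4, unpack_ipv4, to_dotted) are my own, made up for illustration; they do not come from any library.

```python
def pack_ipv4(a, b, c, d):
    """Combine four octets (each 0-255) into one 32-bit unsigned integer."""
    for octet in (a, b, c, d):
        if not 0 <= octet <= 255:
            raise ValueError(f"octet out of range: {octet}")
    return (a << 24) | (b << 16) | (c << 8) | d


def unpack_ipv4(addr):
    """Split a 32-bit integer back into its four octets with shifts and masks."""
    return tuple((addr >> shift) & 0xFF for shift in (24, 16, 8, 0))


def to_dotted(addr):
    """Render the integer in dot-decimal notation for display."""
    return ".".join(str(octet) for octet in unpack_ipv4(addr))


addr = pack_ipv4(192, 168, 1, 5)
print(addr)                  # 3232235781
print(unpack_ipv4(addr)[0])  # 192 (first octet)
print(addr & 0xFF)           # 5 (last octet)
print(to_dotted(addr))       # 192.168.1.5
```

The only validation needed is the range check on each octet at the point where the address is built; after that, every value the integer can hold is a valid address.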
Then we have port numbers. The TCP and UDP protocols define a port as a 16-bit unsigned integer, so once again, that more or less settles that. But another argument that applies to ports is that there are many important “ranges” of ports, such as 0 to 1023 being the “well-known ports” used by system processes, and these ranges are obviously defined with integer ordering in mind rather than string ordering. No one in their right mind would claim port 50 falls outside the range of well-known ports just because the string “50” is greater than the string “1023”.
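A quick sketch of the ordering argument, in case it helps. The helper is_well_known is a hypothetical name of mine; the 0 to 1023 range and the 16-bit limit are the ones mentioned above.

```python
def is_well_known(port):
    """Return True if the port lies in the 0-1023 well-known range."""
    if not 0 <= port <= 65535:
        raise ValueError(f"not a valid 16-bit port: {port}")
    return port <= 1023


print(is_well_known(50))               # True: 50 <= 1023 as integers
print("50" <= "1023")                  # False: strings compare character by character
print(sorted(["1023", "50", "8080"]))  # ['1023', '50', '8080']
print(sorted([1023, 50, 8080]))        # [50, 1023, 8080]
```

With strings, the range check and the sort both give answers that are wrong for ports, which is why integer ordering is the better fit here.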
You’ll notice that in both of these examples, I did not describe any “calculations” such as addition or subtraction, so the literal answer to your question is “no”. I don’t know of any situation where it would make sense to add two IP addresses or port numbers. Again, most real-world data will never be a perfect fit for any type we give it.
And since I argued for “integer” on both of those examples, let me include a few counterexamples: phone numbers and street addresses. For street addresses I probably don’t even have to make any arguments; it’s just so obvious that no numeric type could ever hope to represent that sort of information adequately. For phone numbers it’s less obvious, but consider the following: the length of a phone number varies by country; the length is always measured in digits, not in bits/bytes/octets; various symbols like +, # and () are sometimes used to represent important information like country codes and area codes; I can’t think of any reason to add, subtract or compare two phone numbers; extracting a country code or area code from a complete phone number is a genuine string tokenization problem we can’t reduce to a bit shift operation because all of the above have variable lengths.
When choosing a type, think about:
- How the variable is used,
- What are the valid/invalid values, if relevant,
- How and where the variable is stored, and whether efficient storage is important.
Example of an IPv4 address you use for filtering HTTP requests from unwanted machines:
- The IP will be used to match it with a range of IP addresses,
- Each part should be between 0 and 255,
Given these two points, an array of four bytes seems a good choice.
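As a rough sketch of the filtering case, assuming the addresses arrive as dotted text and the blocked range is a single contiguous block: the names parse_ipv4 and is_blocked and the range 10.0.0.0 to 10.0.0.255 are hypothetical, chosen only for illustration.

```python
def parse_ipv4(text):
    """Parse dotted text into four bytes, rejecting anything malformed."""
    parts = text.split(".")
    if len(parts) != 4 or not all(p.isascii() and p.isdigit() for p in parts):
        raise ValueError(f"not a dotted IPv4 address: {text!r}")
    values = [int(p) for p in parts]
    if any(v > 255 for v in values):
        raise ValueError(f"octet out of range in: {text!r}")
    return bytes(values)


# Hypothetical blocked range, used only for this example.
BLOCKED_LOW = parse_ipv4("10.0.0.0")
BLOCKED_HIGH = parse_ipv4("10.0.0.255")


def is_blocked(text):
    """Check whether an address falls inside the blocked range."""
    # Equal-length bytes objects compare byte by byte from the left,
    # which matches numeric address order.
    return BLOCKED_LOW <= parse_ipv4(text) <= BLOCKED_HIGH


print(is_blocked("10.0.0.7"))  # True
print(is_blocked("10.0.1.7"))  # False
```

The per-part 0 to 255 rule is enforced once, at parse time, and the range match is then a plain comparison of two four-byte values.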
Example of an IPv4 address you store in audit logs (syslog format):
- The IP address will be converted to text,
- The IP comes from a trusted source (the framework used by your application which gives you the IP number of the client); there is no need for further sanitization of values,
- The data is stored as a string.
Therefore, a string format (192.168.1.5) seems a good choice.
Example of a batch which processes thousands of IPv4 addresses per second, matches them doing an exact match, and requires compact storage of those addresses:
- The IP address is matched to other values. No ranges are involved; we just need to know if address A is exactly equal to address B. The addresses rarely need to be shown in their 0.0.0.0 form and are parsed from this form only once.
- Each part should be between 0 and 255.
- Efficient storage is crucial.
Here, a DWORD looks like a possible way to keep the addresses. It makes it cumbersome to extract individual parts of the address, but given the actual usage, we don’t necessarily need that.
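A possible sketch of the batch case, using Python's standard ipaddress and array modules. The sample addresses, the watchlist and the helper name to_dword are made up for illustration, and the code assumes the "I" typecode is 4 bytes wide, which holds on common platforms but is not guaranteed everywhere.

```python
import ipaddress
from array import array


def to_dword(text):
    """Parse dotted text once and keep only the 32-bit integer value."""
    return int(ipaddress.IPv4Address(text))


# Hypothetical input batch and watchlist, for illustration only.
incoming = ["192.168.1.5", "10.0.0.7", "192.168.1.5"]
watchlist = {to_dword("10.0.0.7")}

# Compact storage: one unsigned int per address
# (assumes a 4-byte "I" typecode on this platform).
packed = array("I", (to_dword(text) for text in incoming))

# Exact matching is plain integer equality; no ranges, no string parsing.
matches = [value for value in packed if value in watchlist]

print(packed.itemsize)  # bytes per stored address on this platform
print(matches)          # [167772167]
```

Parsing happens once on the way in; everything after that works on fixed-size integers.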
The same works for other types of data as well. Phone numbers are usually stored as strings because it makes it possible to handle different formats (+33 (0)6 12 34 56 78), because sanitization of phone numbers is usually based on a string anyway, and because storage efficiency doesn’t matter.
If you store national phone numbers which have a well-known format and validation is important and storage efficiency is important (for instance if you need to transfer thousands of phone numbers through a slow internet connection), storing those numbers as a number could be a solution.
I think the question can be applied to numbers in general (and other data types), not just to IP addresses and ports.
Even though numbers are ostensibly just digits, how they’re displayed can vary considerably. A common example is digit groupings. A value such as 1999 is often displayed as 1,999 here in the UK. How this number is displayed elsewhere in the world is locale specific.
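A small sketch of that separation between the stored value and how it is shown. The helper group_digits is a hypothetical name, and swapping the separator by hand is a simplification; a real application would more likely rely on a proper localization library.

```python
count = 1999  # stored once, as a plain integer


def group_digits(value, separator=","):
    """Format an integer with a chosen thousands separator for display."""
    return f"{value:,}".replace(",", separator)


print(group_digits(count))       # 1,999
print(group_digits(count, "."))  # 1.999
print(group_digits(count, " "))  # 1 999
print(count + 1)                 # 2000: the stored value is still a number
```

Only the displayed text changes with the convention; the underlying value stays the same everywhere.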
Even if you never plan on doing any math on such numbers, the very fact that it is defined as a numeric at source gives developers that come after you a clue as to the type of information that can be stored therein. The same applies to other data types. Say you had a boolean value that was stored everywhere as a string. If a developer found that the data was missing, they may be tempted to store the value as “Unknown” which is clearly beyond the bounds of valid boolean values.
Also (going back to your IP address example), the address may well always be passed around as a string, but knowing that it is made up of numbers makes it easier to validate at the point of entry. If it were always treated as just a string, additional validation would be needed whenever the value is entered into the system.
What I’m getting at is that while it is important to present such information in a handy form, it is also important that the raw data is stored in a commonly agreed format that will be understood across different systems, platforms, languages and locales wherever possible.