This is a question which I have wondered (and been asked) about for a long time.
In (most? all?) programming languages, an index begins at zero for an array, string, etc. I recognize it became convention over time, adopted in many languages, but can anyone point to the origin of this?
I thought, perhaps, it had to do with everything being rooted in binary, but I am not sure how that idea would carry over to the decimal system: why not start an index at 1?
Does anyone have historical knowledge of programming languages where the decision to begin indexes at zero may have been explained?
EDIT: The Dijkstra writings are further helpful from a mathematical standpoint, but even as he noted, not all languages are zero-indexed. WBT’s explanation also makes sense as to why one would start with zero based on memory addresses. (I know some languages handle indexing slightly differently based on array manipulation.)
I’m not necessarily looking for the why (which I very much appreciate, because it helps further an understanding) but more for when this became the convention and/or whether it can be traced to a specific language.
So, for instance in K&R’s C, when discussing array indexes, K or R matter-of-factly explains, “Array subscripts always start at zero in C…” (p. 22). Later, in discussing a function to process character arrays, “… a more useful design would be to return the length of the line, or zero if end of file is encountered. Zero is an acceptable end-of-file return because it never is a valid line length.” (p. 127)
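For context, here is a minimal sketch (my reconstruction, not the book’s verbatim code) of the kind of function being described: it returns the line length, and 0 works as the end-of-file sentinel precisely because every real line contains at least its newline.

```c
#include <stdio.h>

/* Sketch of a K&R-style line reader: returns the length of the line
   read into s (including the '\n'), or 0 at end of file. Zero can
   never be a valid line length, so it is a safe EOF signal. */
int get_line(char s[], int lim) {
    int c = EOF;
    int i = 0;
    while (i < lim - 1 && (c = getchar()) != EOF && c != '\n')
        s[i++] = c;
    if (c == '\n')
        s[i++] = '\n';
    s[i] = '\0';
    return i;
}
```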
Based on K&R, I gather a) the convention is adopted from elsewhere, so C is not the inspiration behind zero-indexing, and b) there are possibly deeper reasons for its use, based on the second example. K&R is widely regarded for its clear prose, which is another reason I include it: it gives an example of what I had hoped another documented language would do to explain the reasoning behind zero-indexing.
I think both WBT and btilly offer equally good reasons; I wondered whether anyone knew of old (pre-C?) languages whose documentation explained the design decision. At the same time, I recognize such information may not exist.
It’s about offsets. You have an address, which points to the location in memory where the array begins. Then to access any element, you multiply the array index by the size of the element and add it to the starting address, to find the address for that element.
The first element sits at the starting address itself: you multiply the element size by an index of zero to get an offset of zero, which you add to the starting address to find the location of the first element.
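A minimal C sketch of that arithmetic (the array name and values here are just illustrative): the address of element i is exactly the base address plus i times the element size, so index 0 contributes an offset of 0.

```c
#include <stdio.h>

int main(void) {
    int a[4] = {10, 20, 30, 40};

    /* &a[i] is the base address of a plus i * sizeof(int),
       so index 0 lands exactly at the start of the array. */
    for (int i = 0; i < 4; i++) {
        printf("a[%d] = %d, offset %td bytes from the start\n",
               i, a[i], (char *)&a[i] - (char *)a);
    }
    return 0;
}
```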
The convention spread because programmers started out working in very low-level languages where memory addresses were manipulated directly, and in most cases built up from there, keeping the same convention at each step so that they would not have to relearn it or risk mistakes when switching between conventions. It is still important to understand how this addressing works, especially when working with lower-level languages. I agree this can be a stumbling block for people who are first learning to program in a higher-level language.
The Wikipedia article on this topic also cites a common machine instruction used when working “backwards” and detecting the end of a loop, namely “decrement and jump if zero.”
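As a rough illustration of that pattern in C (the same shape, not the machine instruction itself): counting down lets the loop condition double as the zero test.

```c
/* Counting down toward zero: the loop test doubles as the
   termination check, which is exactly the shape a
   "decrement and jump if zero" instruction supports directly. */
void clear(int *a, int n) {
    while (n--)      /* executes n times; the test fails when n hits 0 */
        a[n] = 0;    /* inside the body, n runs n-1, n-2, ..., 0 */
}
```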
An exception: MATLAB and some other languages bucked the trend and start their indexes at 1, apparently on the theory that they would be a first programming language for many of their target users, and that for those folks starting at 1 makes more intuitive sense. This causes some frustration for the (relatively small subset of?) programmers who frequently switch between programming languages that start counting at different values.
The statement “In (most? all?) programming languages, an index begins at zero” is simply not correct. Those languages whose heritage derives formally or informally from C follow this convention. Others may not.
C did it that way because C was fundamentally intended to be a “high-level” assembler. It put a fair amount of the workload on the programmer, where other languages had the compiler and the machine do the heavy lifting. At the time C was developed, 1-based counting was the norm, but requiring the compiler to keep track of that silly extra 1 was considered to be too much work for the compiler.
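To make that concrete (a sketch of the C view, not a quote from any text): C defines the subscript a[i] as *(a + i), so with zero-based indexing the first element needs no adjustment at all, whereas a 1-based scheme would cost a hidden subtraction on every access.

```c
/* a[i] is defined as *(a + i); with zero-based indexing the first
   element is *(a + 0), i.e. the base address itself. A 1-based
   scheme would make every subscript compile to *(a + i - 1). */
int first(const int *a) {
    return a[0];          /* same as *(a + 0), same as *a */
}

int ith(const int *a, int i) {
    return *(a + i);      /* exactly equivalent to a[i] */
}
```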
C++ got it from C because of the requirement that C++ be backward-compatible (some might say bug-compatible) with C. Java got it from C. Languages developed by C programmers with no significant exposure to anything else copied C, because they either wanted to be popular with other C programmers or they didn’t know any other way to do it.
FORTRAN, which predates almost everything else out there, started at 1, because engineers, mathematicians, and scientists had been counting from 1 for millennia. MATLAB copied FORTRAN, as it was aimed at almost precisely the same user community.
PASCAL actually requires the programmer to say where the indices start and finish, allowing one to define, for example, an array whose indices run from, say, -7 to +7. (This allows a very concise, very nice algorithm for the 8-queens problem.) Ada followed PASCAL. (Mentioning Ada should be good for at least three downvotes right there.)
I believe COBOL started at 1, but I do not recall for certain, and I have no intention of refreshing some very painful memories, because accountants, like engineers, scientists, and mathematicians, start counting at 1.
It is my distant recollection that PL/I allowed you to start and stop wherever you liked. Full Disclosure: I’ve never done PL/I coding, just skimmed a book, and I have no intention of changing that.
I never used arrays in GPSS (IBM’s discrete event simulation package), during my brief exposure to it, so I can’t tell you how GPSS did it.
Assembly languages typically started from 0 because arrays are traditionally defined in terms of a starting address and an offset from that address. (This is not always the case. The IBM 1130 Executive had a large resident vector table whose “starting address” was actually in the middle of the table. They did this because the 1130’s indexed addressing allowed signed offsets, and requiring offsets to start at zero would have thrown away half of the possible size of the table, and that table NEEDED to be big.)
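The same trick can be sketched in C (the sizes and names below are purely illustrative, not the 1130’s actual layout): anchor the “base” at the middle of the table, and signed offsets reach both halves.

```c
/* "Base in the middle": with signed offsets, anchoring the base
   pointer at the center doubles the reachable range on either side.
   (Sizes and names are illustrative only.) */
static int table[15];
static int *mid = table + 7;   /* mid[-7] .. mid[7] cover all 15 slots */

int lookup(int offset) {       /* offset must lie in [-7, 7] */
    return mid[offset];
}
```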
Trying a short answer.
Counting from zero is popular not just in programming languages but in mathematics more generally.
Counting is much older than the zero. But since the zero and positional notation were invented, everyone counts 10s, 100s, 1000s, etc. from zero: it is the new lowest digit. Counting units from zero too brings a few consistency advantages, notably with half-open intervals and (multi-dimensional) arrays. For more details and examples, see https://en.wikipedia.org/wiki/Zero-based_numbering
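A small C sketch of the half-open-interval advantage (the process helper here is assumed for illustration): with 0-based indices and ranges [lo, hi), the length is simply hi - lo, and a range splits cleanly with no off-by-one adjustments.

```c
/* Half-open ranges [lo, hi): length is hi - lo, the empty range is
   [k, k), and any split point k partitions [0, n) into [0, k) and
   [k, n) with nothing dropped or counted twice. */
void process(const int *a, int lo, int hi) {   /* assumed helper */
    for (int i = lo; i < hi; i++) {
        /* ... work on a[i] ... */
        (void)a[i];
    }
}

void split_work(const int *a, int n, int k) {
    process(a, 0, k);   /* first k elements  */
    process(a, k, n);   /* remaining n - k   */
}
```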
Every possible convention of counting has been tried. The counting from zero convention has become dominant because the alternatives tend to be more accident prone.
See https://www.cs.utexas.edu/users/EWD/transcriptions/EWD08xx/EWD831.html for one explanation of why this convention works out better.