In many languages, the substring function works like this:
substring(startIndex, endIndex)
returns the substring from startIndex through endIndex-1 (if you view startIndex and endIndex as 0-based), or equivalently from startIndex+1 through endIndex (if you view them as 1-based).
This is confusing. I understand that the two parameters can be interpreted as “startIndex” and “length of the substring”, but in my opinion that is still confusing and even in this case, startIndex is 0-based while length is 1-based.
Why not stick to one convention for both function arguments? And why do newer languages like Ruby and Python continue to stick to this standard?
The second argument is not the “length of the substring”; that reading only works if you start at the very beginning of the string. The point is that specifying “from”..“to” is linguistically ambiguous: you name two limit values, but do you want those two values themselves included in the extracted range or not?
In normal parlance, there isn’t a strong conventional preference: “I knew her from first to fourth grade” means “for four years”, but “training is from one to three” means “two hours”, not three.
Therefore, it would be a major source of confusion if the indices referred to characters in a string. What they really refer to is the positions between the characters: 0 means “before the string”, 1 means “after the first character”, 2 means “after the second character” etc. Therefore, s.substring(0,2) means “The first two characters of s”, unambiguously. (The fact that endIndex - startIndex == length(extract) is admittedly a nice bonus.)
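To make the "positions between characters" picture concrete, here is a minimal Python sketch. It uses Python's slice syntax s[start:end] in place of the generic substring(startIndex, endIndex) from the question; the sample string and the variable names are only illustrative.

```python
s = "substring"

# Indices name the gaps between characters, not the characters themselves:
#
#   | s | u | b | s | t | r | i | n | g |
#   0   1   2   3   4   5   6   7   8   9

# From gap 0 to gap 2: the first two characters.
first_two = s[0:2]
assert first_two == "su"

# From gap 3 to gap 7: everything lying between those two gaps.
start, end = 3, 7
extract = s[start:end]
assert extract == "stri"

# The "nice bonus": the length is simply the distance between the gaps.
assert len(extract) == end - start
```

Read this way, neither index is "included" or "excluded"; the slice is just whatever lies between the two positions.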
There are reasons why Python and Ruby use the convention you describe. These reasons may not be immediately evident — but experience, and reasoned explanations (transcribed) by well-respected language designers, give good reason to think that zero-based, half-open array ranges are the least error-prone of many possible options.
@KilianFoth’s answer gives a common and useful mnemonic for visualizing and using this convention. But the reason newer languages use the convention is that it works.
For situations where [3,6] would mean the characters at indices 3, 4, and 5, but not the one at index 6, I would suggest that it may be helpful to imagine the characters as occupying space on a number line, with indices representing points between the characters. The first character would sit in the space between 0 and 1, the next between 1 and 2, then 2..3, etc. The range [3,6] would include the spaces from 3..4, 4..5, and 5..6. By convention, using just a single subscript n will refer to the range (n)..(n+1).
Viewed in this fashion, it becomes clear that [3,6] and [6,8] refer to two touching but not overlapping ranges of characters, since they occupy touching but non-overlapping space on the number line.
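The touching-but-not-overlapping property can be checked directly. The following is a small Python sketch, with s[a:b] standing in for the range written [a,b] above; the sample string is arbitrary.

```python
s = "abcdefghij"

left = s[3:6]    # spaces 3..4, 4..5, 5..6
right = s[6:8]   # spaces 6..7, 7..8

assert left == "def"
assert right == "gh"

# Adjacent ranges share the boundary point 6 but no character,
# so concatenating them gives exactly the combined range.
assert left + right == s[3:8]

# A single subscript n covers the same space as the range n..n+1.
n = 4
assert s[n] == s[n:n + 1]

# Splitting at any boundary k neither loses nor duplicates a character.
for k in range(len(s) + 1):
    assert s[:k] + s[k:] == s
```

With inclusive end indices, the same split would need a +1 or -1 on one side, which is where off-by-one errors tend to appear.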