In Python, there is a standard library module urllib.parse
that deals with parsing URLs:
>>> import urllib.parse
>>> p = urllib.parse.urlparse("https://127.0.0.1:6443")
>>> p
ParseResult(scheme='https', netloc='127.0.0.1:6443', path='', params='', query='', fragment='')
There are also properties on urllib.parse.ParseResult
that return the hostname and the port:
>>> p.hostname
'127.0.0.1'
>>> p.port
6443
And, by virtue of ParseResult being a namedtuple, it has a _replace()
method that returns a new ParseResult with the given field(s) replaced:
>>> p._replace(netloc="foobar.tld")
ParseResult(scheme='https', netloc='foobar.tld', path='', params='', query='', fragment='')
However, it cannot replace hostname or port, because they are dynamic properties rather than fields of the tuple:
>>> p._replace(hostname="foobar.tld")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.11/collections/__init__.py", line 455, in _replace
    raise ValueError(f'Got unexpected field names: {list(kwds)!r}')
ValueError: Got unexpected field names: ['hostname']
It might be tempting to simply concatenate the new hostname with the existing port and pass it as the new netloc:
>>> p._replace(netloc='{}:{}'.format("foobar.tld", p.port))
ParseResult(scheme='https', netloc='foobar.tld:6443', path='', params='', query='', fragment='')
However, this quickly turns into a mess if we consider:
- the fact that the port is optional;
- the fact that netloc may also contain the username and possibly the password (e.g. https://user:[email protected]);
- the fact that IPv6 literals must be wrapped in brackets (i.e. https://::1 isn’t valid but https://[::1] is);
- and maybe something else that I’m missing.
What is the cleanest, correct way to replace the hostname in a URL in Python?
The solution must handle IPv6 (both as a part of the original URL and as the replacement value), URLs containing username/password, and in general all well-formed URLs.
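For reference, one possible direction (a rough sketch, not necessarily the cleanest way) is to split netloc by hand into userinfo, host, and port, swap only the host, and reassemble. The helper name replace_hostname is hypothetical. The sketch assumes the URL actually has a netloc, and it does not validate the new host beyond adding brackets to a bare IPv6 literal:

```python
from urllib.parse import urlsplit

def replace_hostname(url, new_host):
    """Hypothetical helper: return url with only the host part of netloc replaced."""
    parts = urlsplit(url)
    # Userinfo (user or user:password) is everything before the last "@", if any.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    if hostport.startswith("["):
        # Bracketed IPv6 literal: whatever follows "]" is "" or ":port".
        _, _, port = hostport.partition("]")
    else:
        # Plain host: keep the ":" together with the port so an absent port stays absent.
        _, colon, port_digits = hostport.partition(":")
        port = colon + port_digits
    # Assumption: a replacement containing ":" is a bare IPv6 literal that needs brackets.
    if ":" in new_host and not new_host.startswith("["):
        new_host = "[" + new_host + "]"
    return parts._replace(netloc=userinfo + at + new_host + port).geturl()

print(replace_hostname("https://127.0.0.1:6443", "foobar.tld"))
# https://foobar.tld:6443
print(replace_hostname("https://user:pw@[::1]:6443/a?b=1#c", "example.org"))
# https://user:pw@example.org:6443/a?b=1#c
print(replace_hostname("https://example.com/path", "::1"))
# https://[::1]/path
```

This keeps the original userinfo and port text untouched instead of rebuilding them from the hostname and port properties.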