Why migrate?
Python 2 reached end of life (EOL) on 1 January 2020, which marked the end of bug fixes and even security patches for Python 2 from the Python maintainers. The code for the final Python 2 release, 2.7.18 (released in April 2020), was also frozen in January 2020. As a widely used cross-browser test suite for the Web-platform stack, the web-platform-tests (WPT) project uses Python in many places, from infrastructure to test scripts. From the point of view of maintenance and support for active development, it is imperative for WPT to make its code Python 3 compatible sooner rather than later.
Challenges
Both the dynamic nature of the Python language and the complexity of WPT present significant challenges to the upgrade.
Language challenges
Python is a dynamically typed language with no formal semantics. Its de facto reference implementation, CPython, maintains high coding standards but is not written with legibility as its primary focus. This means that code paths in Python can contain invalid semantics that are hard to detect even with static analyzers. Python 3 is a new version of Python, but it is not backwards compatible with code written for Python 2. The changes between Python 2 and Python 3 are not just syntactic; many of them are semantic. In particular, string literals are fundamentally different types in Python 2 and Python 3. Along with the change in the language, library support has also shifted. Many older libraries created for Python 2 are not forward-compatible, and many newer libraries can only be used with Python 3. We can run tools such as caniusepython3 on a set of dependencies to figure out which of them are holding us back from porting to Python 3. The tricky part, though, is finding, and porting to, replacement libraries that work.
Project challenges
WPT is a massive suite of tests (over one million in total) and serves many auxiliary functions. It uses Python in many places, including but not limited to:
- The majority of the infrastructure code. This is the code underlying the main wpt commands, such as ‘wpt run’.
- WPT file handlers, which test authors can define to run custom code in response to a particular request to the WPT server.
- WebDriver tests, which use pytest structured tests.
- Linting
- Interacting with Docker and CI systems
- Rebasing expectations, …
The Porting Plan
The WPT community was well aware of the challenges of moving the project to Python 3. It set principles, suggested possible approaches and planned timelines before and while the major practical work took place.
Principles
- The migration work should happen in the background since the project is quite active.
- The pathway to Python 3 was to make code dual Python 2 and Python 3 compatible and gradually switch over the runtime to Python 3.
- The porting should not reduce test coverage without explicit agreement from test authors.
Approaches
To make the porting tractable, it was decided to start with two very specific goals, each approaching the problem from a different angle. One was to get the actual runner utility up and running in Python 3, starting by getting a basic ‘wpt run’ command to execute under Python 3. The other was to target wider coverage by running all relevant unit tests under Python 3.
Timelines
For a project of non-trivial size like WPT, a flag-day transition from Python 2 to Python 3 was simply not viable at the early stage of the project. Before 2020, there were already a few in-depth discussions and some work going on within the community for the migration. The major work, though, happened in 2020. As the porting progressed, the timelines became clearer. A concrete timeline for dropping Python 2 support in WPT was set in September 2020:
- “Py3-first” targeting 2021-01-01: switch test runs to Python 3 on CI, but keep running unit tests and infrastructure tests in Python 2 and 3.
- “Py3-only” on 2021-02-01: drop all Python 2 tests from CI, and start accepting Python 3-only changes.
Implementations
Porting test runner utility
As mentioned earlier, one of the starting points was to have the actual runner utility, the ‘wpt run’ command, execute under Python 3. This porting was pretty straightforward. We came across some typical Python 2 to Python 3 migration issues, such as:
- Absolute imports. Absolute imports have become the default in Python 3, and relative imports must be explicit. For example, “from conftest import product, flatten” in Python 2 needs to be declared as “from .conftest import product, flatten” in Python 3.
- Built-in types comparison. In Python 2, objects of mismatched built-in types can be ordered: the result is arbitrary but consistent within one execution of a program. In CPython 2, mismatched types are ordered by type name, with numbers sorting before everything else, so a “list” is always greater than an “int”. In Python 3, ordering comparisons such as “<” and “>” between mismatched types raise TypeError instead. For example, in Python 2 we have
latest_release = 0
version = [int(item) for item in m.groups()]
if version > latest_release:
For Python 3, both sides of the comparison must be of the same sequence type, so latest_release has to be declared as
latest_release = (0,0,0)
(which in turn requires version to be a tuple rather than a list).
- API changes. There are some API changes between the two versions, for example in the optional parameter strict of HTTPConnection(). In Python 2 we have “httplib.HTTPConnection(self.host, self.port, strict=True, **conn_kwargs)”. In Python 3, where the parameter is no longer accepted, it has become “HTTPConnection(self.host, self.port, **conn_kwargs)”.
- Order of dict. In Python 2, dict is organized as a hash table and puts the keys into buckets according to their hash() value, so iteration order is not guaranteed. In Python 3.6+, dict retains insertion order (an implementation detail of CPython 3.6 that became a language guarantee in 3.7). One way to make code work for both versions is to use the alternative type OrderedDict instead of the plain dict.
- Iteration. Python 3 changes the return values of several basic functions from lists to iterators or views. The main reason for this change is that iterators usually consume less memory than lists. This change has little impact on common use cases. Furthermore, the iter* counterparts (which return iterators in Python 2) have been removed. To make code work for both versions without a memory regression in Python 2, we can replace these calls with the six.iter* APIs. For example, six.iteritems(dictionary) corresponds to dictionary.iteritems() in Python 2 and dictionary.items() in Python 3. six is a Python 2 and 3 compatibility library. It provides utility functions that smooth over the differences between the Python versions, with the goal of writing code that is compatible with both. We called the six library APIs in a few places during the dual Python 2/3 compatible stage. These API calls were removed after WPT moved to Python 3 only.
- Bytes vs. str. In Python 2, bytes is basically an alias of str. In Python 3, binary data is a different type from a string. We had to convert some binary data to the string type in order to be compatible with both Python 2 and Python 3. This issue, at the utility script level, presented different challenges from those at the core level, which we discuss in the next section. Most cases in the utility scripts can be resolved by adding a prefix to quoted string literals. Quoted string literals can be prefixed with “b” or “u” to get bytes or Unicode, respectively. In other words, prefix a native string with “u” in Python 2 to get a Unicode object, and prefix it with “b” in Python 3 to get bytes. Note that in Python 3 the “u” prefix does nothing; likewise, the “b” prefix does nothing in Python 2. In the context of this blog, we are mostly talking about prefixing a native string with “b” to get bytes in Python 3.
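Several of the pitfalls in the list above can be reproduced in a few lines. The following is a minimal sketch meant to be run under Python 3 only; the helper name can_order and the sample dictionary contents are made up for illustration and do not come from the WPT code base.

```python
from collections import OrderedDict


def can_order(a, b):
    """Return True if `a > b` can be evaluated, False if it raises TypeError."""
    try:
        a > b
        return True
    except TypeError:
        return False


# Ordering mismatched built-in types raises TypeError in Python 3,
# whereas Python 2 would silently pick an order.
mixed_ok = can_order([1, 2, 3], 0)               # list vs int
same_type_ok = can_order((1, 2, 3), (0, 0, 0))   # tuple vs tuple

# OrderedDict keeps insertion order regardless of the Python version.
settings = OrderedDict()
settings["product"] = "firefox"
settings["channel"] = "nightly"
keys_in_order = list(settings)

# In Python 3, items() returns a lazy view object, not a list.
items_is_list = isinstance(settings.items(), list)

# Bytes and text are distinct types in Python 3 and never compare equal.
bytes_equal_text = (b"abc" == "abc")
decoded_equal_text = (b"abc".decode("ascii") == "abc")

print(mixed_ok, same_type_ok, keys_in_order,
      items_is_list, bytes_equal_text, decoded_equal_text)
```

Wrapping the comparison in a helper keeps the sketch runnable from top to bottom; in real code the fix is to make both operands the same type rather than to catch the exception.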
Handling string types in core
One of the biggest hurdles in our porting effort was overcoming the string literal type mismatch between Python 2 and 3 in core, specifically in the infrastructure and file handlers. As discussed earlier, in Python 2 a string literal is a sequence of bytes, while in Python 3 a string literal is a sequence of Unicode code points. The rationale behind the change was to move to a Unicode-by-default world. The Web Platform Test Server (wptserve), however, often intends to use byte sequences. To overcome this mismatch, we needed to either always use byte sequences or always use str. [RFC49] illustrates the pros and cons of both approaches. The community decided to take the byte sequence path in order to keep a consistent and semantically correct encoding model, that is, to always use byte sequences: str in Python 2 and bytes in Python 3. This incurred some noticeable changes in WPT core. In wptserve:
- It introduced a pair of ISO-8859-1 encode and decode helper functions. Both of them can accept either binary or text strings, but always return binary or text strings respectively, regardless of the Python version.
- Most public APIs for custom handlers can only accept and return binary strings, with the notable exception of the response body.
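To illustrate what such a pair of helpers might look like, here is a minimal Python 3 sketch. The names latin1_encode and latin1_decode are hypothetical and are not claimed to be the names or implementation used in wptserve. ISO-8859-1 is a convenient choice because it maps every byte value 0-255 to the code point with the same number, so any byte sequence survives a decode/encode round trip unchanged.

```python
def latin1_encode(value):
    """Return bytes, whether `value` is given as bytes or as text.

    Hypothetical helper: text is encoded as ISO-8859-1.
    """
    if isinstance(value, bytes):
        return value
    return value.encode("iso-8859-1")


def latin1_decode(value):
    """Return text, whether `value` is given as bytes or as text.

    Hypothetical helper: bytes are decoded as ISO-8859-1.
    """
    if isinstance(value, str):
        return value
    return value.decode("iso-8859-1")


# Every possible byte value round-trips without loss.
raw = bytes(range(256))
round_trip_ok = latin1_encode(latin1_decode(raw)) == raw

# Both helpers are idempotent with respect to their output type.
encoded = latin1_encode(u"caf\u00e9")
encoded_again = latin1_encode(encoded)
```

Text containing code points above U+00FF cannot be encoded this way and would raise UnicodeEncodeError, which is one reason to keep such helpers at the boundary between text and bytes only.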
Writing Python 3 compatible tests
According to the guideline, the rule of thumb for porting is to make sure all strings are either always text or always bytes; all string literals in handlers should be prefixed with "b" or "u".
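The reason for this rule can be seen in a short Python 3 sketch: an unprefixed literal is text, and combining it with bytes fails at run time. The function name try_concat is made up for illustration.

```python
def try_concat(left, right):
    """Concatenate two values, returning None if their types cannot be mixed."""
    try:
        return left + right
    except TypeError:
        return None


# An unprefixed literal is text in Python 3, so mixing it with bytes fails.
mixed = try_concat(b"Content-", "Type")

# Consistently prefixed literals always work.
all_bytes = try_concat(b"Content-", b"Type")
all_text = try_concat(u"Content-", u"Type")
```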
Headers of request and response
Header data should always be binary strings for both keys and values. Prefer adding "b" prefixes to literals over encoding/decoding.
- The Request.headers dictionary-like interface (accessed via […], get, items).
headers = [(b"Content-Type", b"text/html")]
if b"allow_csp_from" in request.GET:
    headers.append((b"Allow-CSP-From", request.GET[b"allow_csp_from"]))
- The Request.headers.get_list method example:
assert isinstance(headers.get_list(b'x-bar')[0], bytes)
- Response.headers.{get,set,append,update,items} examples:
response.headers.set(b'Access-Control-Allow-Origin', request.headers.get(b"origin"))
response.headers.append(b"Access-Control-Allow-Origin", b"*")
HTTP Basic Authentication
Request.auth.{username,password} are binary strings. For example,
response.headers.set(b'Access-Control-Allow-Origin', request.headers.get(b"origin"))
response.headers.append(b"Access-Control-Allow-Origin", b"*")
response.headers.set(b'Content-type', b'text/plain')
content = b""
Cookies
- Request.cookies (similar to Request.headers; it’s a MultiDict with all APIs of dict plus first, last, get_list). For example,
response.content = request.cookies[b"foo"].value
- Response.{set,unset,delete}_cookie.
response.set_cookie(b"name", b"value")
response.unset_cookie(b"name")
Request URL/form parameters
- Both the keys and values of URL/form parameters for the request (accessible via request.GET or request.POST) are binary strings. Prefer adding “b” prefixes to literals over encoding/decoding.
b"realm" in request.POST
request.GET.first(b"type", None) == b"value"
Response Status Message
- The response status message is a binary string, as follows.
response.status = 401
response.headers.set(b'Status', b'401 Authorization required')
response.headers.set(b'WWW-Authenticate', b'Basic realm="test"')
Response body
The data put into the response body can be either text or binary strings, but the two types should never be mixed and string literals must be prefixed.
response.writer.write(b"This is a body!")
return u"Hello, 世界!"
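Putting the rules from this section together, a bytes-only handler might look like the sketch below. It assumes the usual WPT convention of a main(request, response) function that returns a (headers, body) tuple. FakeParams and FakeRequest are hypothetical stand-ins written only so the sketch runs outside the WPT server; they are not wptserve classes, and the parameter name "name" is invented for the example.

```python
class FakeParams(dict):
    """Hypothetical stand-in for request.GET, holding bytes keys and values."""

    def first(self, key, default=None):
        return self.get(key, default)


class FakeRequest(object):
    """Hypothetical stand-in for the request object passed to a handler."""

    def __init__(self, params):
        self.GET = FakeParams(params)


def main(request, response):
    # Header names and values are bytes.
    headers = [(b"Content-Type", b"text/plain")]
    if b"allow_csp_from" in request.GET:
        headers.append((b"Allow-CSP-From", request.GET[b"allow_csp_from"]))

    # URL parameter keys and values are bytes too, so the default is bytes.
    name = request.GET.first(b"name", b"world")

    # The body is built from bytes only; text and bytes are never mixed.
    body = b"Hello, " + name + b"!"
    return headers, body


headers, body = main(FakeRequest({b"name": b"WPT"}), None)
```

Because every literal carries a "b" prefix, the handler contains no implicit encoding step, which is the point of the "always bytes" convention.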