Tuesday, September 6, 2016

Solving Multi-Core Python

tl;dr I proposed a solution for Python's weak multi-core story (note I didn't say "support"), but didn't have time to follow through.  I still have hope, both for the project as a whole and for several standalone parts.

Contents:
  Backstory
  The Proposal
  The Outcome
  The Details


Backstory

For the longest time I've heard the argument that multi-processing is the Pythonic way to get multi-core.  Threads are an anti-pattern, some say.  They are a thing only because Microsoft made them a thing, some say.  I've heard anecdotal evidence from people who actually do parallel (not just concurrent) computing (e.g. scientific, render farms) that they favor multi-processing approaches, especially because they typically go multi-host anyway.

You could also argue that in practice nearly all concurrent programming is IO-bound and that asyncio solves that for us.  Recent-ish discussions (e.g. PEP 492) lead me to believe that it's not nearly that simple.

Ultimately, whatever the merits of multi-processing and asyncio over multi-core threading, and whether or not the arguments for them are valid or sufficient, they do not mitigate the popular and pervading *sentiment* that Python is weak when it comes to leveraging multiple cores and handling concurrency (or, more accurately, computational parallelism).  And perception "is 9/10th of the law" (or arguably higher).

Folks looking for a solution are going to search for one that matches the model in their brain.  Much like with organic molecules, if the conceptual bind points don't line up they aren't going to connect with what Python is offering.  The power of Python is that it maps well onto our brains.  Concurrency/parallelism isn't well suited to our brains in the first place, but it is still one key place where Python doesn't do a good job of matching conceptual expectations at large.

The Proposal

In short, Python's multi-core story is murky at best.  Not only can we be clearer on the matter, we can improve Python's support.  The result of any effort must make multi-core (and concurrency/parallelism) support in Python obvious, unmistakable, and undeniable (and keep it Pythonic).

Early in 2015 I'd reached my limits with all the criticisms and misunderstandings.  So in the spirit of open source I resolved to do something about it.  Since this was not an area of expertise for me I did a lot of reading and reached out to experts I know (thanks to Guido, Nick, Sarah, Graham, and others).  In a few months I felt like I had a good enough understanding and a good solution.

In June of 2015 I introduced my solution on the python-ideas mailing list.  The gist is to use CPython's existing subinterpreters to isolate GIL-free execution threads with a CSP front end.  My hope was to finish the first stage of work in time for Python 3.6 (i.e. right about now).  The reception was generally positive.  There was even discussion on reddit.  I was encouraged.

The Outcome

Going in I knew it would be a challenging project.  However, the solution I proposed was tractable in the desired time frame, building on a lot of existing parts and decomposing into manageable stages.  The blockers were well understood, mainly involving subinterpreter bugs and PEP 432.  I also received several solid offers for help.  In October I even went to PyCon UK to coordinate some of the efforts.

At the same time it became clear that my life was getting too busy to make much progress.  In early 2016 I decided to table the project.  I wasn't giving up yet and hoped to get back to it.  Furthermore, many parts of the project stand on their own as useful features.  That's about where things stand right now.

The Details

This is where I try to summarize my proposal and relevant information.

Summary

  • expose subinterpreters in Python (a low-level stdlib module)
  • support passing objects between subinterpreters
  • add subinterpreter serial-execution mode
  • add a high-level CSP module to the stdlib
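To give a feel for the CSP style the summary aims at, here is a minimal sketch.  It is not the proposed API: it uses ordinary threads and queue.Queue as stand-ins for subinterpreters and channels, so it still runs under the GIL.  The names producer, squarer, run_pipeline, and the _DONE sentinel are all made up for illustration.

```python
import queue
import threading

_DONE = object()  # hypothetical end-of-stream marker, not part of any proposal


def producer(out_channel, items):
    """Send each item down the channel, then signal completion."""
    for item in items:
        out_channel.put(item)
    out_channel.put(_DONE)


def squarer(in_channel, out_channel):
    """Receive numbers and send their squares; touches no shared state."""
    while True:
        item = in_channel.get()
        if item is _DONE:
            out_channel.put(_DONE)
            return
        out_channel.put(item * item)


def run_pipeline(items):
    """Wire two workers together with channels and collect the results."""
    numbers = queue.Queue()  # stand-in for a channel
    squares = queue.Queue()  # stand-in for a channel
    workers = [
        threading.Thread(target=producer, args=(numbers, items)),
        threading.Thread(target=squarer, args=(numbers, squares)),
    ]
    for worker in workers:
        worker.start()
    results = []
    while True:
        item = squares.get()
        if item is _DONE:
            break
        results.append(item)
    for worker in workers:
        worker.join()
    return results
```

The workers only ever talk through the two queues.  The idea in the proposal is that, with each worker in its own subinterpreter, that same shape would no longer need a shared GIL.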

Phases

  1. resolve blockers
  2. add "interpreters" module
  3. minimal multi-core solution
    • subinterpreter "serial execution mode" (no GIL)
    • channels supporting immutable objects
    • no extension modules
  4. csp module
  5. expanded support
    • support more types in channels
    • performance optimization
    • extension module support (PEP 489 compliant only)

Requirements

  • "make multi-core support in Python obvious, unmistakable, and undeniable (and keep it Pythonic)"
  • no significant impact on single-threaded performance
  • maintain backward compatibility (C-API, etc.)
  • (pseudo-)compatibility with multiprocessing/threading/concurrent.futures APIs
  • a multi-core concurrency model/approach that fits our brains
  • Python APIs
  • supportable on other Python implementations

Blockers

Standalone Improvements

  • faster/cleaner interpreter startup?
  • better multiple-interpreters-per-process support
    • named subinterpreters
    • interpreters module (a la threading)
      • _interpreters module (a la _thread)
    • more efficient sharing between interpreters (e.g. builtins)
    • faster/cleaner subinterpreter startup
      • share some modules
      • leverage object sharing
    • refactor C-API to take interpreter arg
    • leaner subinterpreters?
  • refcounts in own memory page
  • factor out pickle-independent parts of multiprocessing
  • better object immutability
    • truer immutability?
    • immutable mode?
    • frozen objects
    • issue #24991: Define instance mutability explicitly on type objects
    • several PEPs
  • isolated object graphs
    • "isolated" object
    • memory model (all in same page)
    • related to RDM project
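As a rough illustration of why "truer immutability?" is on the list, the sketch below uses dataclasses (added in Python 3.7, so after this post was written) to show how shallow today's "frozen" objects are.  The Point class and try_rebind helper are hypothetical names used only for this example.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class Point:
    x: int
    tags: list  # a mutable field inside a "frozen" object


def try_rebind(point):
    """Return True if a plain attribute assignment on the point succeeds."""
    try:
        point.x = 99
    except FrozenInstanceError:
        return False
    return True


point = Point(1, ["a"])
rebound = try_rebind(point)          # normal assignment is blocked
point.tags.append("b")               # but the contained list is still mutable
object.__setattr__(point, "x", 99)   # and the freeze itself can be bypassed
```

Neither loophole matters much within one interpreter, but both would matter if such an object were shared between interpreters that assume it never changes.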

Specific Additions


  • channels (a la queue)
    • object sharing between interpreters
      • immutable objects: int, float, str, tuple, bool, frozenset, complex, bytes, None
        • containers (tuple, frozenset) must hold only immutable objects
      • types that implement __shared__
      • "frozen" objects
      • read-only views
      • "owned" objects (transfer ownership)
    • C channels
    • in own module?
  • PEP-489 slot for subinterpreter support?
  • subinterpreter "serial execution mode"
    • add mode management
    • start each in own thread
    • disallow threading
    • disallow forking
    • eliminate GIL within each subinterpreter
  • csp module
    • inspired by python-csp
    • shared-nothing "thread" concurrency model
    • uses subinterpreters in serial execution mode by default
  • object ownership
  • "Local Interpreter Lock"
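The rule for which objects a channel would accept can be sketched roughly as follows.  This is only an assumption about how such a check might look, covering just the immutable types listed above; is_shareable and Channel are hypothetical names, and queue.Queue again stands in for a real cross-interpreter channel.  The __shared__, "frozen", read-only view, and "owned" cases are left out because the list does not define them.

```python
import queue

# Scalar types from the list above; None is handled separately.
_IMMUTABLE_SCALARS = (int, float, str, bool, complex, bytes)


def is_shareable(obj):
    """Return True if obj is one of the listed immutable types, recursively."""
    if obj is None:
        return True
    # Exact type checks, so that (possibly mutable) subclasses are rejected.
    if type(obj) in _IMMUTABLE_SCALARS:
        return True
    if type(obj) in (tuple, frozenset):
        # Containers must hold only immutable objects.
        return all(is_shareable(item) for item in obj)
    return False


class Channel:
    """Queue-like channel that refuses objects failing is_shareable()."""

    def __init__(self):
        self._queue = queue.Queue()

    def send(self, obj):
        if not is_shareable(obj):
            raise TypeError(
                f"cannot send {type(obj).__name__!r} object over a channel"
            )
        self._queue.put(obj)

    def recv(self):
        return self._queue.get()
```

With this sketch, sending (1, "a", (2.0, None)) is accepted, while sending (1, [2]) raises TypeError because the tuple holds a list.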

Specific Changes

  • drop GIL between interpreters?

Python Alternatives

  • threading
  • multiprocessing
  • asyncio
  • STM (Armin Rigo, PyPy)
  • pyparallel (Trent Nelson)
  • dask
  • gilectomy (Larry Hastings, CPython)
  • otherwise remove the GIL
  • better concurrency primitives for threaded programming
  • add multi-core support to the asyncio event loop
  • better documentation
  • Jython
  • IronPython
  • other Python implementations
  • fibers
  • do nothing

1 comment:

  1. This is great. Will it be possible to have subinterpreters with different Python binaries (e.g. cpython <-> jython)?
