Cthulix: the future of distributed computing.
November 2020
Pitch
(Cthulix <=> MPI) is as (Unix <=> Multics)
Introduction
Think of Cthulix as a toolkit for building event-driven distributed
systems that scale well.
Conventional techniques grow infrastructure around algorithms. In
Cthulix, algorithms grow into an infrastructure.
Build To Scale
This is how some businesses create software systems,
1. Algorithms. A vanguard developer or quant prototypes some
algorithm code, and then iterates this towards a solution.
2. Deployment. They pass the software to another team, who are
responsible for deploying it.
3. Integration. Interactions with other systems are an ad-hoc
afterthought.
The result is unwieldy, operationally inefficient, and scales poorly.
(See "Machine Learning: The High Interest Credit Card of Technical
Debt", https://research.google/pubs/pub43146/)
Operations teams erect scheduling, configuration, build and monitoring
systems as an exoskeleton around algorithm code. But those
arrangements are fundamental to a platform, and should be the
skeleton.
Algorithm code should be a citizen of the infra, not a director of it.
The Cthulix approach,
1. Deployment. Cthulix provides a clear model for boot, init,
monitoring, configuration, upgrade and resource management.
2. Integration. Cthulix sets clear rules about interactions
between systems.
3. Algorithms. Developers and quants build code to live within
this paradigm.
Cthulix guides developers to build systems that scale well and behave
as good neighbours.
Quants and developers benefit from this. Their systems scale better,
and they are not weighed down running quirky systems of their own
creation.
Technology Overview
Each Cthulix System is a distributed computer that operators can ssh
into.
The Shell
The shell resembles the Bourne shell. You can navigate using cd and
ls, and you can ./run_commands.
There is a mount point for interacting with the Kernel.
Grid-applications (Gridapps) are mounted as directories in the
Cthulix filesystem, and expose commands to the shell.
APIs
Developers can expose functionality to third-party systems as
async protocols.
The Kernel
The Kernel tracks the resources of associated physical hosts -
RAM, cpus, gpus, etc.
Gridapps request resources from the Kernel.
A Cthulix system is divided into Rings,
Kernel Ring. This manages hardware allocation, service
discovery, permissions, the mount tree, and hosts the shell.
Gridapp Ring. Each Gridapp has its own Ring.
The Kernel launches a Gridapp by installing the Gridapp's Init
service into a compute node.
The Init service requests resources from the Kernel, and raises
and tends Processes within its Gridapp.
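To make that conversation concrete, here is a minimal Rust sketch of a
Gridapp Init service acquiring compute from the Kernel. No Cthulix code
has been released, so every name here (ResourceRequest, KernelHandle,
acquire) is invented for illustration, not a real API.

    // Hypothetical sketch only: these types and method names are invented;
    // they are not a released Cthulix API.
    struct ResourceRequest {
        cores: u32,   // pegged logical cores to reserve
        ram_gb: u32,  // RAM per core, in gigabytes
        gpus: u32,    // whole GPUs, if the workload needs them
    }

    struct Allocation {
        node_ids: Vec<u64>,
    }

    trait KernelHandle {
        // The Init service asks the Kernel Ring for compute; the Kernel
        // answers with the nodes it has set aside for this Gridapp's Ring.
        fn acquire(&self, req: ResourceRequest) -> Result<Allocation, String>;
    }

    fn boot_gridapp(kernel: &dyn KernelHandle) -> Result<(), String> {
        let alloc = kernel.acquire(ResourceRequest { cores: 16, ram_gb: 4, gpus: 0 })?;
        // The Init service would now raise one Process per allocated core
        // and tend those Processes for the life of the Gridapp.
        for node in alloc.node_ids {
            println!("raising Process on node {node}");
        }
        Ok(())
    }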
The Kernel ring also supplies a logging backbone and a monitoring
service, and maintains the shell.
The Cthulix Kernel is replicated using Raft, so that a cluster can
survive the unexpected loss of a physical host.
Multiple Gridapps can coexist within a single Cthulix System.
Operators can tell the Kernel to add and remove compute during
normal system operation.
Under the hood
Each Cthulix host runs a stripped install of Linux. The Linux
kernel introduced io_uring in 2019 (kernel 5.1), which overcomes
historic async deficiencies of that platform.
The reference implementation is built with Arch Linux, but you
could build a BSD container if you wanted to. (You would prepare
your own TFTP/PXE image.)
Each logical core in the cluster can host a pegged Linux process.
This is a Process. As each Process starts, it registers with the
Kernel (if it is in the Kernel Ring, or an Init process) or with
its Init process (in other cases).
Within a Cthulix System, Processes talk to one another using typed
messages. Cthulix provides the mechanism; the developer controls
the schema.
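For illustration, a developer-owned schema might be no more than a pair
of Rust enums. The names below are invented; the point is that the
types are Ring-specific and live in the Gridapp's own codebase.

    // Hypothetical sketch: Cthulix would supply the transport and codec;
    // the Gridapp owns a Ring-specific schema such as this one.
    enum PricerMsg {
        LoadBook { book_id: u64 },
        Price { instrument: String, quantity: f64 },
    }

    enum PricerReply {
        BookLoaded { book_id: u64 },
        Priced { instrument: String, value: f64 },
    }

    fn handle(msg: PricerMsg) -> PricerReply {
        match msg {
            PricerMsg::LoadBook { book_id } => PricerReply::BookLoaded { book_id },
            PricerMsg::Price { instrument, quantity } => PricerReply::Priced {
                value: quantity * 1.0, // placeholder pricing logic
                instrument,
            },
        }
    }

Because every node on a Ring runs the same build, sender and receiver
always agree on these types.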
Each Cthulix System has a skeleton of standard services: Init,
Monitoring, Analysis, Logging, Configuration. You can override the
system to use your own implementations.
The main language is Rust. Rust stabilised async/await (coroutines) in
late 2019. This enables something new: coroutine-based development in a
performant, GC-free systems programming language.
Requirements for Cthulix: a TFTP server, a set of physical hosts
configured to PXE-boot, a network backbone connecting the hosts.
The project will supply detailed documentation to help you set
this up, but you are free to substitute in your own arrangements
if you want to.
Hardware is not necessarily tied to a single Cthulix System. You
could configure a physical host to be split across several Cthulix
Systems. (e.g. for blue/green deployment.)
The Standard
The modestly titled 'Generally Accepted Cthulix Principles' (GACP)
lays out the design rules. This outlines the domain language, sets
rules for message-contracts, and mandates the shape of a standard
deployment.
It does not care about parentheses-placement or unit test coverage.
Industry Language
People involved in distributed computing use nuanced language - HPC,
HTC, grid computing, concurrent computing, parallel computing, cluster
computing, supercomputing.
Citation-heavy Wikipedia entries encourage the reader to view each
category as a necessary specialisation.
This author disagrees. At this time, commodity hardware has a major
role to play in all those domains. Linux has a major role to play in
all those domains. The separations persist because there is no
effective governance over the way that teams build grid software.
Decades ago, there was a meaningful difference between 'server' and
'workstation'. Around 2000, commodity hardware and Linux became
good enough that the distinction became academic.
Cthulix will cause a similar shakeup in the domain language around
distributed computing.
Compare to Standard Approaches
Let's discuss three approaches that you could use to build a
distributed system in 2020.
Third-party scheduler and NFS.
* Easy to set up, operations-light
* Weak at managing jobs with varied needs and runtimes.
* Poor for workloads with data-locality needs.
* Poor for workloads involving multi-stage queries.
* Vulnerable to IO bottle-necking as you scale.
Apache ecosystem. (Hadoop, Spark, Flink, Kafka, Zookeeper, etc.)
* Complex, ever-changing ecosystem.
* Difficult to reason about what the computer is doing. (JVM)
* Moderate hardware efficiency is achievable.
* Brittle contracts between constituent systems.
* Operations-heavy.
MPI-based architecture, commonly used in supercomputers,
* High barrier to entry.
* Experienced developers achieve strong hardware efficiency.
* Complex API
Now, the candidate system.
Cthulix-based development,
* Easy to set up, operations-light
* One developer can fit the whole model in their head
* One operations engineer can fit the whole model in their head
* Clear path to strong hardware utilisation
* Malleable. Grow adaptable systems that solve many problems.
Highlights of the Cthulix model
Network programming is a hard-won, specialist skill. Cthulix gives
devs powerful, async network mechanisms through a straightforward API
and spares them needing to become network programming gurus.
Due to its pure async model, a Cthulix dev can still reason about the
whole system, down to cache behaviour on individual cores.
Cthulix allows you to mix different job profiles in a single hardware
cluster (e.g. gpu-heavy + cpu-intensive large jobs + high-volume,
latency-sensitive ram-large jobs).
Standard-library features, e.g. for storing and filtering data.
Redeployment is atomic, fast and reliable.
Cthulix is operations-efficient. An operator who understands the
Cthulix model can maintain a variety of systems built in that
tradition, and can quickly find their way around a system they have
not seen before.
Due to its simple model, you can expect to get high value out of your
operations staff.
As your Cthulix infra grows, you benefit from economies-of-scale.
The Author
The author has a decade of experience building platforms for trading
firms, and worked as a custom-shop developer for the decade before
that.
When designing a system, his focus goes to interactions, delivery and
deployment, rather than algorithms. There is an analogy with how you
assess a house. The obvious focus is kitchens and bathrooms. Old hands
focus on foundations and drainage.
[Contact details removed - not currently accepting new clients]
--------------------------------------------------------------------------
Queries
Q: "What is 'Init'?"
Startup and temporal scheduling.
In modern unix, you have (bsd-init + cron). In Cthulix, we have a
single concept that covers those things, and we call that Init.
Q: "Can you contrast MPI and Cthulix?"
MPI is shallow but wide.
Shallow: it is focused on messaging only.
Wide: it supports a diverse set of use-cases.
Cthulix is deep but narrow.
Deep: it has a view on build pipeline, programming language,
deployment model, and supplies a skeleton of standard services
for logging and analytics, and a console. And, it offers
messaging functionality.
Narrow: at each level, it focuses on a small number of
powerful techniques.
MPI is not opinionated.
It gives developers the widest possible range of options for
how systems should interact with one another.
Cthulix is opinionated.
It gives clear guidance about how systems should interact.
Let's explore that by working through the Cthulix guidance.
Cthulix distinguishes between these interactions,
1/ Communication that happens between Processes that are
part of the same Cthulix Ring instance. ("Internal
Communication")
2/ All other communication. ("External Communication")
Building on this, it directs the developer on how Message
Contracts should work,
1/ All "Internal Communication" methods are defined in a
typed messaging schema that is Ring-specific. These nodes
will never suffer protocol-version confusion, because
nodes on the same ring are always running the same
software version.
2/ All "External Communication" must be done through a
Well-Defined API.
3/ Well-Defined APIs should be established as a
/bilateral/ contract between the participating deployment
units.
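As a sketch of what a bilateral Well-Defined API could look like, here
is one way to write the contract down, expressed as a Rust module for
concreteness. The module and type names are invented, and putting the
version in the name is just one possible convention; the point is that
exactly two parties own the contract, so a change needs agreement from
exactly two teams.

    // Hypothetical sketch of a bilateral Well-Defined API. Exactly two
    // parties own this contract; all names are invented for illustration.
    mod risk_to_pricing_v2 {
        // Contract between the Risk system (consumer) and the Pricing
        // system (provider). The version lives in the name, so both sides
        // can see at a glance which contract they are speaking.
        pub enum Request {
            PortfolioValue { portfolio_id: u64 },
        }

        pub enum Response {
            PortfolioValue { portfolio_id: u64, value_usd: f64 },
            UnknownPortfolio { portfolio_id: u64 },
        }
    }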
Here are some features that MPI supports, which are outside
Cthulix guidance,
'Dynamic process management'. This would be inconsistent with
Cthulix Message Contract rules. A Cthulix developer should
never need to do this: (1) Cthulix Rings and Init already give
nodes the network connections they need; (2) having external
systems reach past that model would be chaotic.
'Shared Memory Programming'. Shared state leads to complexity
and bugs. In Cthulix, the developer would use message-driven
concurrency to achieve the same goals. The message-driven
approach is easier to reason about, and the result would be
easier to maintain.
Cthulix has the advantage of vertical integration. Since it has a
strong view on the design of the Deployment, it is able to set
tight rules for Message Contract. It seeks to offer a small number
of powerful techniques that cooperate well.
There may be scenarios where an optimal MPI solution will slightly
outperform an optimal Cthulix solution. But the Cthulix solution
will be faster to build, faster to iterate on, less vulnerable to
bugs, and better to support.
This may resemble the assembly/C dynamic. Hypothetically, a
well-written assembly program is faster. But even people who care a
lot about speed prefer to work in C.
Q: "Hey - Cthulix is not really an operating system, right?"
It is a Single System Image (SSI), but nobody outside of academia
uses that term.
Developing against Cthulix feels like developing against a
conventional OS API, but everything is async, networking is
easier, and you are free of single-host assumptions.
When you ssh into a Cthulix system, it feels like you are
interacting with a single coherent system, even though that
machine may be spread across hundreds of physical hosts.
Q: "What if I want to subdivide my cluster?"
You can do this.
A typical situation: you want to guarantee that some resources are
dedicated to a tenant application; but also want other tenants to
be able to compete for other resources opportunistically.
Cthulix incorporates a resource manager.
Q: "Why are you using Linux?"
The author has experience building a protected-mode kernel in
assembly and C. That approach gives fine-grained control, but it
has become impractical for supercomputer development because of
driver coverage.
GPUs are significant to the grid-computing space, and the only
viable options at this time are Windows and Linux. (Apple will
probably establish themselves as a third option.)
Once you accept that, it is inevitable that Linux must play a role
in this platform.
Linux is free-to-use and flexible. It runs on a variety of cpu
architectures. It has a full-featured tcp/ip stack. It allows
processes to have dedicated cores. As io_uring matures, Linux
becomes viable as a true async platform.
Hence, Cthulix uses Linux as its 'driver layer'. Physical hosts
network-boot a stripped Linux image that contains just enough
logic to attach itself to a Cthulix grid.
Q: "What is released so far?"
So far, there has been no release of code.
Q: "What are the license arrangements?"
The intent is to license Cthulix under GPL2.
Custody of the Cthulix system will gradually move to cthulix.org.
Cthulix.com will remain dedicated to commercial consulting around
related topics.
Q: "Why coroutines rather than threads?"
1. On recent hardware, threads carry a performance disadvantage:
CPU cache has grown important for performance, and
thread-switching invalidates your CPU cache. (This is not strictly
true, but it is true for practical purposes.)
2. It is easier to reason about what is happening on the metal in
systems built with asynchronous coroutines than in multithreaded
systems.
3. Coroutine code reads easily.
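To illustrate the third point, here is a small self-contained sketch.
The Grid and Order types are stand-ins invented for this example,
since Cthulix has not published an API: a multi-step network exchange
reads as straight-line coroutine code, with each .await marking a
point where the core can pick up other work.

    // Illustrative only: Grid and Order are stand-ins, not a Cthulix API.
    struct Grid;
    struct Order { qty: i64 }

    impl Grid {
        async fn positions(&self) -> Result<Vec<i64>, String> { Ok(vec![5, -3]) }
        async fn submit(&self, _order: Order) -> Result<(), String> { Ok(()) }
    }

    async fn flatten_book(grid: &Grid) -> Result<(), String> {
        // Reads top-to-bottom; each .await yields while a reply is in flight.
        let positions = grid.positions().await?;
        for p in positions {
            grid.submit(Order { qty: -p }).await?; // one checked send per order
        }
        Ok(())
    }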
Q: "Is this like Apache Yarn?"
There is overlap.
Apache Hadoop is built on top of a separate layer called Yarn.
Yarn is responsible for resource management and node control.
Similarities,
Both offer flexible resource management.
Both are multi-tenant. (e.g. you can run several gridapps in a
single Cthulix System)
Where Cthulix has a 'Gridapp Init Service', Yarn has an
'ApplicationMaster'.
You could build something Cthulix-like on top of Yarn. But that
system would be tied to the JVM ecosystem. Developers would not be
able to program using coroutines, and it would take more effort to
reason about what was running on the metal than in the (Rust)
Cthulix model.
Q: "Is there some equivalent of HDFS in Cthulix?"
No.
Brace yourself for a heresy.
The things that we know as Files function, in effect, as a
multilateral contract.
Cthulix is all about bilateral contracts, and hence discourages
Files and multiparty filesystems.
Rather than having a filesystem, you would create an ETL that had
multiple inputs (one from each producer) and multiple outputs (one
to each consumer).
Future work: build a gridapp that makes it easy to create
bilateral File-like things.
Q: "What if I need to resource-manage custom FPGA hardware?"
That is fine.
The Kernel contains a resource manager. Via the API, you can
customise it to track new types of resource. Or, you can
replace the resource manager module in the kernel with your
own implementation.
This applies broadly in Cthulix. If you want to replace the
logging subsystem, you can switch it out for your own module that
satisfies the same interface.
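As a sketch of what such a swap might look like (the trait and type
names below are invented, not a published Cthulix interface): the
replacement module implements the same interface as the stock one,
here extended to track FPGAs alongside cores, RAM and GPUs.

    // Hypothetical sketch: an FPGA-aware resource manager slotted in behind
    // the same interface the stock module would satisfy. All names invented.
    use std::collections::HashMap;

    trait ResourceManager {
        // Record what a physical host offers, keyed by resource kind
        // ("core", "ram_gb", "gpu", "fpga", ...).
        fn register_host(&mut self, host_id: u64, capacity: HashMap<String, u32>);
        // Try to claim resources for a Gridapp; None if the claim cannot be met.
        fn claim(&mut self, host_id: u64, kind: &str, amount: u32) -> Option<()>;
    }

    struct FpgaAwareManager {
        free: HashMap<(u64, String), u32>,
    }

    impl ResourceManager for FpgaAwareManager {
        fn register_host(&mut self, host_id: u64, capacity: HashMap<String, u32>) {
            for (kind, amount) in capacity {
                self.free.insert((host_id, kind), amount);
            }
        }

        fn claim(&mut self, host_id: u64, kind: &str, amount: u32) -> Option<()> {
            let slot = self.free.get_mut(&(host_id, kind.to_string()))?;
            if *slot < amount {
                return None;
            }
            *slot -= amount;
            Some(())
        }
    }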
Q: "What Parallel Programming Model does Cthulix use?"
None. Cthulix provides the infrastructure, and then the developer
implements the PPM they want. The end-developer makes algorithm
choices.
With that said, the Cthulix design will steer developers towards
message-passing approaches.
IO bottle-necking is a key concern in distributed systems. A
developer who keeps close to the messages is less vulnerable to
this, and better able to engage with situations that do emerge.
Still, if you wanted to build a map-reduce like thing on Cthulix,
you could do this.
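For instance, a map-reduce-shaped job falls out of plain message
passing. The sketch below is illustrative only (Worker and ask are
invented names), and it awaits each chunk in turn for brevity; a real
Gridapp would issue the sends concurrently and route chunks by data
locality.

    // Illustrative sketch: Worker stands in for a remote Process; ask stands
    // in for a checked request/reply over the Ring.
    struct Worker;

    impl Worker {
        async fn ask(&self, chunk: &[u64]) -> u64 {
            chunk.iter().sum() // stand-in for the remote map step
        }
    }

    async fn grid_sum(workers: &[Worker], data: &[u64]) -> u64 {
        let mut total = 0; // the reduce step happens locally
        for (i, chunk) in data.chunks(1024).enumerate() {
            total += workers[i % workers.len()].ask(chunk).await;
        }
        total
    }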
Q: "What do you think about the Dursi paper?"
Jonathan Dursi's paper, "HPC is dying, and MPI is killing it"
(https://www.dursi.ca/post/hpc-is-dying-and-mpi-is-killing-it.html)
may be the most influential thing written about distributed
computing in the last decade.
It is critical of message-passing as a general approach to
distributed computing problems. We contend that (1) an expressive
and macro-capable language like Rust restores relevance to
message-passing techniques, and (2) coroutines plus exceptions
supply a clear path to recovering from single-host failure
scenarios.
Q: "Does Cthulix support [active messages / lambda-passing]?"
Not currently.
There is less need for it in this architecture than in some others,
because every Process in a Gridapp runs the same code. If the
instruction comes from within the Gridapp, you can enumerate the
function to be run, which is more efficient than passing the
function itself.
There are situations where the function comes from outside the
Gridapp. (There is an analogy here with SQL or Apache Pig.) For
those situations, there is no standard option. You could encode
the function as Scheme, pass it to the data Process, and interpret
it there. Then, return a sexp.
Caution! A single naive query can gridlock your cluster with its
many-many-gigabyte sexp result.
Anyone who has worked with relational databases has at some time
issued a query like "select * from a, b" that fails to qualify
the join. To reiterate points made in earlier answers, you should
challenge your assumptions, and see whether you can instead
structure your ecosystem around well-defined APIs.
Q: "How does Cthulix compare to Plan9?"
Both are Single System Image architectures.
Plan9 emphasises threads, whereas Cthulix emphasises async
coroutines.
Plan9 is a true operating system. It has seen a lot of driver
improvement in recent years, but its driver support does not come
close to Linux. Cthulix sits on Linux, so you can deploy to the
metal and find drivers for high-performance ethernet adapters and
gpus.
Both Plan9 and Cthulix have a console environment against a
filesystem-like structure. The plan9 console is powerful and
expressive. The Cthulix console is not. It lets you navigate the
tree, call exposed-methods from grid-apps, inspect the output, and
nothing else. (Cthulix encourages devs to implement algorithms in
their codebase, rather than in scripts.)
Plan9 emphasises grid communication via synthetic file systems.
Cthulix emphasises grid communication via typed messages.
Plan9 accommodates shared filesystems. Cthulix discourages shared
filesystems as it is difficult to enforce contracts against the
contents of files. Explicit api relationships are better.
Plan9 is self-hosting. Cthulix is not. You develop Cthulix systems
from a mainstream OS.
Plan9 has a desktop. Cthulix does not.
Q: "What about [third-party work-scheduler]?"
(Someone who wants to grok Cthulix may ask to compare it to an
already-familiar platform.)
Network programming is a specialist skill. People who do not have
it are sensibly wary of it.
As a result, there is a market for third-party work-schedulers
that let operators manage queues of work across compute clusters
without having to write netcode.
But committing to a third-party work scheduler sets you up for
disappointment. Inevitably, you will find that it lacks features
or flexibility that you need.
The author's experience is that you should seek to own the
scheduling code for any non-trivial workload. Until now, this has
meant building it yourself.
Cthulix will make network communication straightforward. This
allows developers to focus on their work-scheduling algorithms.
Q: "Does Cthulix align with 'microservices'?"
No.
Microservice convention is to create multilateral APIs. For
example: you may have an authentication-service api, with several
departments talking to it.
Each time you want to change a multilateral api, you need to have
a meeting or email thread including partners who may be affected.
Effects of this,
It discourages ambitious change.
It encourages hacks similar to the "add a column on the end"
culture commonly used with relational databases.
You will hire people who like meetings and scare off talent.
In Cthulix, each External Communication relationship should be a
bilateral contract between the api provider and consumer. Now,
developers can coordinate changes in a phone call.
Sometimes a provider will have several clients which have
identical needs at a point in time. The key concern is that you
should be able to upgrade one of them without upgrading all of
them. The contract you have with Party B should not be affected by
any change you are making with Party A.
Changes may require some 'grandfathering' of other users. That is
healthy. It allows you to roll out system changes over several
weeks with your partners. This can be managed as a background
activity, with full engineer agency.
Q: "What about Kubernetes?"
K8s can be used to abstract deployment and resource management
concerns away from the developer. Instead of having an operations
team responsible for the deployment, the developer deploys it
themselves.
This suits the algorithms-first school of system design. Imagine a
developer who has reached a milestone with their new software
system,
"Hey, I have been building a thing, and it now works on my
machine. How do I deploy it?"
Docker and K8s have seen rapid adoption because they answer that
question.
If you use K8s like that, it will lead to the algorithms-first
mess warned about early in this paper.
K8s has a strong story for cloud providers. It allows them to
provide policy-limited compute capacity to customers, using Linux
cgroups.
Those nodes ('pods') could participate in a Cthulix System.
There is potential to extend the Cthulix Kernel to integrate to
the Kubernetes API in order to support dynamic compute management
("Expand/Contract Computing"). During bursts, you could scale up
the number of nodes you were hosting, and then wind them down
again later, keeping your cloud costs down. (This is not on the
roadmap at the time of writing)
The Linux mechanism that Docker is built on is called cgroups.
It is valuable, and Cthulix will probably incorporate it later,
although it is not currently on the roadmap.
Q: "What about Python?"
It would be straightforward to build a python3/asyncio gridapp on
the Cthulix foundation. Numpy is an efficient, high-quality
number-crunching library. Through Cthulix, you could install data
into cluster ram, and then dispatch work towards python processes
running numpy.
Q: "What about Julia?"
As for Python, it would be straightforward.
Q: "What about Pandas or Dask?"
There is a class of tools that let quants arrange and analyse
data, and quickly test hypotheses. Pandas and Dask fit here.
Cthulix could complement these tools,
You could build a Cthulix Gridapp that offered Pandas or Dask
compute to quants who work from a Linux workstation or
console.
You could build your data aggregation arrangements in Cthulix,
and then have separate Pandas/Dask systems access these as
data sources.
Caution! Exploratory tools for quants need to be applied with
care. Without discipline, their use steers you down the
algorithms-first evolutionary path.
Pandas data frames are less efficient than equivalent numpy
representations. They are useful for exploratory work, but may not
be ideal for jobs that run regularly.
Much of the reply to "What about [third-party work-scheduler]?"
applies to Dask.
Q: "What about message queues?"
GACP recommends against the use of message queues like Tibco and
Kafka.
Some reasons,
They complicate message contracts. (Does the sender or the
receiver own them, or a separate team? How many people do you
need to get onto a telephone call to investigate an issue?)
They are at constant risk of protocol mismatch issues.
They don't support bidirectional interaction.
They complicate system restarts/upgrades.
They need to be maintained.
They are vulnerable to running out of capacity, sometimes
because of a misbehaving system run by another department.
Inexperienced developers are prone to using them as databases.
They introduce software dependencies.
They add extra hops to troubleshooting.
They add latency.
Instead, use Well-Defined APIs. If somebody is receiving messages
faster than they can process them, then they should either (1) buffer
the payloads within infra that they control or (2) revisit the
contract.
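A minimal sketch of option (1), using only the standard library and
assuming nothing about Cthulix internals: the consumer owns a bounded
inbox with an explicit policy for the full case, instead of leaning on
a shared broker.

    // Minimal sketch: a consumer-owned bounded buffer with an explicit
    // full-buffer policy, kept inside infrastructure the consumer controls.
    use std::collections::VecDeque;

    struct Inbox<T> {
        queue: VecDeque<T>,
        capacity: usize,
    }

    impl<T> Inbox<T> {
        fn new(capacity: usize) -> Self {
            Inbox { queue: VecDeque::new(), capacity }
        }

        // Returns false when full; the caller then applies its own policy
        // (shed load, push back on the sender, or revisit the contract).
        fn offer(&mut self, msg: T) -> bool {
            if self.queue.len() >= self.capacity {
                return false;
            }
            self.queue.push_back(msg);
            true
        }

        fn next(&mut self) -> Option<T> {
            self.queue.pop_front()
        }
    }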
Message queues get a lot of use in industry because they give you
access to network transit without network programming. Cthulix
makes network programming accessible.
Q: "What about Erlang?"
The Actor model of concurrency has a lot of mind-share, but the
author doesn't rate it as a paradigm.
When you are building a distributed system, these are key
concerns,
Achieving reliable single-message delivery in the absence of
hardware failure.
Mitigating hardware failure when it happens.
The actor model sidesteps both of these concerns by declaring
message delivery to be unreliable. If an actor crashes while it is
handling a message, the message is lost.
This pushes responsibility for those concerns to the application
layer, where it is a constant distraction and arduous to test. In
this way, the actor model dodges problems that are inherent to its
chosen domain.
Cthulix communicates via checked message-sends. In normal
operation, message delivery is reliable. When a Process vanishes,
(1) the sender gets a clear exception and (2) the Ring-local Init
process learns about the problem and has an opportunity to take
remedial action.
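A sketch of how that might look to the sending coroutine (SendError,
Ring and send_checked are invented names; in Rust the 'exception'
would surface as an error value):

    // Hypothetical sketch: the shape is what matters. A checked send
    // surfaces the failure to the sender, while Init learns of the loss
    // separately and can take remedial action.
    enum SendError { PeerVanished, Timeout }

    struct Ring;

    impl Ring {
        async fn send_checked(&self, _to: u64, _msg: Vec<u8>) -> Result<(), SendError> {
            Ok(()) // stand-in; the real transport is Cthulix's concern
        }
    }

    async fn publish(ring: &Ring, peer: u64, msg: Vec<u8>) {
        match ring.send_checked(peer, msg).await {
            Ok(()) => {}
            Err(SendError::PeerVanished) => {
                // Park the work locally and retry once Init has raised a
                // replacement Process, or escalate - an application choice.
            }
            Err(SendError::Timeout) => {
                // Application-level policy.
            }
        }
    }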
Some domains are better suited to unchecked message sends. For
example, market data diffs. If you need these in your Cthulix
application, you can create multicast senders/receivers separate
to the core messaging layer.
Q: "What about the sequencer architecture?"
It would be straightforward to implement a sequencer-architecture
platform upon Cthulix, and it would be an excellent choice for some
domains.
Commonly-licensed sequencer systems include Island/INET. These
have weaknesses in their deployment model,
Core code is managed separately to deployment code, with a
large amount of engineered coincidence. How this manifests:
morning startup breaks because one of the classes has been
instantiated incorrectly.
Failover logic is separated from core logic. How this
manifests: operators need to manually intervene in failovers,
which can create gaps of a couple of minutes.
Differences if you used Cthulix as your foundation,
Island has a set of functionality kept in 'commands files'. In
Cthulix, this functionality is part of the same codebase as
application code.
Island manages resources statically, through operator-edited
configuration. Cthulix manages resources dynamically, in its
kernel.
Benefits of a Cthulix-based sequencer arch,
It is straightforward for a developer with no access to the
deployment environment to validate a deployment configuration.
Code and deployment are plainly versioned inside a single
codebase.
If you wanted to have your Cthulix Init process switch off a
troubled server via a remote-hands api, you could do that.
You could have your Cthulix Init service automatically execute
a failover.
System startup logging naturally appears alongside core
logging. There is no inherent separation between system init
and application init, whereas Island does separate these.
You could have your Init service generate the Cisco
configuration for external communications. Again, such
functionality reduces your potential failover downtime.
There are certain applications where the sequencer architecture is
not the natural choice. For example, when implementing trading
systems that need to be aware of overnight positions, such as
Good-Til-Cancel orders - you need to carefully maintain that state
somewhere.
In Cthulix, it is straightforward to create separate gridapps
within a single codebase, and to ensure that they are atomically
deployed in concert with one another.
Hence, you could create a Good-Til-Cancel management subsystem as
part of your system, storing data on the disks of a couple of
resource-labelled servers.
Cthulix organises systems into well-defined deployments. This
avoids the situation where the original system sets the rules to
suit itself, leading to neglected side systems growing up around it.
In the Cthulix model, all systems are peers within a common
infrastructure.