cthulix.com

Cthulix: the future of distributed-computing.

Pitch

(Cthulix <=> MPI) is as (Unix <=> Multics)

Introduction

Think of Cthulix as a toolkit for building event-driven distributed systems that scale well.

Conventional techniques grow infrastructure around algorithms. In Cthulix, algorithms grow into an infrastructure.

Build To Scale

This is how most businesses create software systems,

1. Algorithms. A vanguard developer or quant prototypes some algorithm code, and then iterates this towards a solution.
2. Deployment. They pass the software to another team, who are responsible for deploying it.
3. Integration. Interactions with other systems are an ad hoc afterthought.

The result is unwieldy, operationally inefficient, and scales poorly. (See "Machine Learning: The High Interest Credit Card of Technical Debt", https://research.google/pubs/pub43146/)

Operations teams erect scheduling, configuration, build and monitoring systems as an exoskeleton around algorithm code. But those arrangements are fundamental to a platform, and should be the skeleton. Algorithm code should be citizens of the infra, not directors of it.

The Cthulix approach,

1. Deployment. Cthulix provides a clear model for boot, init, monitoring, configuration, upgrade and resource management.
2. Integration. Cthulix sets clear rules about interactions between systems.
3. Algorithms. Developers and quants build code to live within this paradigm.

Cthulix guides developers to build systems that scale well and behave as good neighbours.

Quants and developers benefit from this. Their systems scale better, and they don't get weighed down running quirky systems of their own creation.

Technology Overview

Each Cthulix System is a distributed computer that operators can ssh into.

The Shell

The shell resembles Bourne Shell. You can navigate using cd and ls, you can ./run_commands. There is a mount point for interacting with the Kernel.
Grid-applications (Gridapps) are mounted as directories in the Cthulix filesystem, and expose commands to the shell.

APIs

Developers can expose functionality to third-party systems as async protocols.

The Kernel

The Kernel tracks the resources of associated physical hosts - RAM, cpus, gpus, etc. Gridapps request resources from the Kernel.

A Cthulix system is divided into Rings,

* Kernel Ring. This manages hardware allocation, service discovery, permissions, the mount tree, and hosts the shell.
* Gridapp Ring. Each Gridapp has its own Ring.

The Kernel launches a Gridapp by installing the Gridapp's Init service into a compute node. The Init service requests resources from the Kernel, and raises and tends Processes within its Gridapp.

The Kernel ring also supplies a logging backbone, a monitoring service, and maintains the shell.

The Cthulix Kernel is rafted so that a cluster can survive the unexpected loss of a physical host.

Gridapps can coexist within a Cthulix System.

Operators can tell the Kernel to add and remove compute during normal system operation.

Under the hood

Each Cthulix host runs a stripped install of Linux. The Linux kernel introduced io_uring in 2019 (Linux 5.1). This overcomes historic async deficiencies of that platform.

The reference implementation is built with Arch Linux, but you could build a BSD container if you wanted to. (You would prepare your own TFTP/PXE image.)

Each logical core in the cluster can host a pegged Linux process. This is a Process.

As each Process starts, it registers with the Kernel (if it is in the Kernel Ring, or an Init process) or with its Init process (in other cases).

Within a Cthulix System, Processes talk to one another using typed messages. Cthulix provides this mechanism; the developer controls the schema.

Each Cthulix System has a skeleton of standard services: Init, Monitoring, Analysis, Logging, Configuration. You can override the system to use your own implementations.

The main language is Rust.
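To make the typed-message idea concrete, here is a minimal sketch in Rust. It is not the Cthulix API, since no code has been released. The names RingMsg, checked_send and run_process are hypothetical, and a thread with a standard-library channel stands in for a Process on another core or host. It shows a developer-owned message schema, and a send that reports failure to the sender when the receiving side has gone away.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;

// Hypothetical Ring-local schema. The developer owns these types.
#[derive(Debug, PartialEq)]
enum RingMsg {
    Work { job: u64 },
    Shutdown,
}

// Checked send: the caller learns straight away if the peer has vanished.
// (std's Sender::send returns Err once the Receiver has been dropped.)
fn checked_send(tx: &Sender<RingMsg>, msg: RingMsg) -> Result<(), String> {
    tx.send(msg)
        .map_err(|e| format!("peer vanished, undelivered: {:?}", e.0))
}

// Stand-in for a Process: handles messages until Shutdown arrives,
// then returns the job ids it saw.
fn run_process() -> (Sender<RingMsg>, thread::JoinHandle<Vec<u64>>) {
    let (tx, rx) = channel();
    let handle = thread::spawn(move || {
        let mut jobs = Vec::new();
        for msg in rx {
            match msg {
                RingMsg::Work { job } => jobs.push(job),
                RingMsg::Shutdown => break,
            }
        }
        jobs
    });
    (tx, handle)
}

fn main() {
    let (tx, handle) = run_process();
    checked_send(&tx, RingMsg::Work { job: 42 }).unwrap();
    checked_send(&tx, RingMsg::Shutdown).unwrap();
    let jobs = handle.join().unwrap();
    println!("jobs handled: {:?}", jobs);
    // The receiving side is gone now, so a further send fails visibly.
    println!("{:?}", checked_send(&tx, RingMsg::Work { job: 43 }));
}
```

Because the schema is an ordinary Rust enum, the compiler rejects any message that is not part of it, and the match in the receiving loop must cover every variant.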
Rust stabilised async/await, its coroutine syntax, in late 2019. This enables something new: coroutine-based development in a performant, gc-free systems programming language.

Requirements for Cthulix: a TFTP server, a set of physical hosts configured to PXE-boot, a network backbone connecting the hosts. The project will supply detailed documentation to help you set this up, but you are free to substitute in your own arrangements if you want to.

Hardware is not necessarily tied to a single Cthulix System. You could configure a physical host to be split across several Cthulix Systems. (e.g. for blue/green deployment.)

The Standard

The modestly titled 'Generally Accepted Cthulix Principles' (GACP) lays out the design rules. This outlines the domain language, sets rules for message-contracts, and mandates the shape of a standard deployment. It does not care about parentheses-placement or unit test coverage.

Industry Language

People involved in distributed computing use a nuanced language - HPC, HTC, grid computing, concurrent computing, parallel computing, cluster computing, supercomputer. Citation-heavy Wikipedia entries encourage the reader to view each category as a necessary specialisation. This author disagrees.

At this time, commodity hardware has a major role to play in all those domains. Linux has a major role to play in all those domains. The separations persist because there is no effective governance in the way that teams build grid software.

Decades ago, there was a meaningful difference between 'server' and 'workstation'. Around 2000, commodity hardware and Linux became good enough that the distinction became academic. Cthulix will cause a similar shakeup in the domain language around distributed computing.

Compare to Standard Approaches

Let's discuss three approaches that you could use to build a distributed system in 2020.

Third-party scheduler and NFS.
* Easy to set up, operations-light
* Weak at managing jobs with varied needs, runtime.
* Poor for workloads with data-locality needs.
* Poor for workloads involving multi-stage queries.
* Vulnerable to IO bottle-necking as you scale.

Apache ecosystem. (Hadoop, Spark, Flink, Kafka, Zookeeper, etc)
* Complex, ever-changing ecosystem.
* Difficult to reason about what the computer is doing. (JVM)
* Moderate hardware efficiency is achievable.
* Brittle contracts between constituent systems.
* Operations-heavy.

MPI-based architecture, commonly used in supercomputers,
* High barrier to entry.
* Experienced developers achieve strong hardware efficiency.
* Complex API

Now, the candidate system.

Cthulix-based development,
* Easy to set up, operations-light
* One developer can fit the whole model in their head
* One operations engineer can fit the whole model in their head
* Clear path to strong hardware utilisation
* Malleable. Grow adaptable systems that solve many problems.

Highlights of the Cthulix model

Network programming is a hard-won, specialist skill. Cthulix gives devs powerful, async network mechanisms through a straightforward API and spares them needing to become network programming gurus.

Due to its pure async model, a Cthulix developer can still reason about the whole system, down to cache behaviour on individual processes.

Cthulix allows you to mix different job profiles in a single hardware cluster (e.g. gpu-heavy + cpu-intensive large jobs + high-volume, latency-sensitive ram-large jobs).

Standard-library features. e.g. for storing and filtering data.

Redeployment is atomic, fast and reliable.

Cthulix is operations-efficient. An operator who understands the Cthulix model can maintain a variety of systems built in that tradition, and can discover a system that they have not seen before.

Due to its simple model, you can expect to get high value out of your operations staff. As your Cthulix infra grows, you benefit from economies-of-scale.
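Mixing job profiles in one cluster rests on the Kernel tracking what each physical host has free, and granting requests against that. The sketch below is one hypothetical way to picture that bookkeeping. It is an illustration under assumptions, not the Cthulix resource manager: the names Resources, Host, Kernel, request and demo_kernel, the host names and the capacities are all invented, and first-fit placement is chosen only because it is the simplest policy to show.

```rust
// Hypothetical resource kinds. The text names RAM, cpus and gpus.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Resources {
    cores: u32,
    gpus: u32,
    ram_gb: u32,
}

impl Resources {
    // True when every requested quantity fits inside `other`.
    fn fits_within(&self, other: &Resources) -> bool {
        self.cores <= other.cores && self.gpus <= other.gpus && self.ram_gb <= other.ram_gb
    }

    // Remaining capacity after granting `other`. Callers check fits_within first.
    fn minus(&self, other: &Resources) -> Resources {
        Resources {
            cores: self.cores - other.cores,
            gpus: self.gpus - other.gpus,
            ram_gb: self.ram_gb - other.ram_gb,
        }
    }
}

struct Host {
    name: String,
    free: Resources,
}

struct Kernel {
    hosts: Vec<Host>,
}

impl Kernel {
    // First-fit placement (an assumption): grant the request on the first
    // host with enough free capacity, or return None if no host has room.
    fn request(&mut self, want: Resources) -> Option<String> {
        let host = self.hosts.iter_mut().find(|h| want.fits_within(&h.free))?;
        host.free = host.free.minus(&want);
        Some(host.name.clone())
    }
}

// Two invented hosts: one with gpus, one with many cores and a lot of ram.
fn demo_kernel() -> Kernel {
    Kernel {
        hosts: vec![
            Host { name: "host-a".to_string(), free: Resources { cores: 32, gpus: 4, ram_gb: 128 } },
            Host { name: "host-b".to_string(), free: Resources { cores: 64, gpus: 0, ram_gb: 1024 } },
        ],
    }
}

fn main() {
    let mut kernel = demo_kernel();
    let profiles = [
        ("gpu-heavy", Resources { cores: 4, gpus: 4, ram_gb: 64 }),
        ("cpu-intensive", Resources { cores: 48, gpus: 0, ram_gb: 64 }),
        ("ram-large", Resources { cores: 8, gpus: 0, ram_gb: 512 }),
    ];
    for (label, want) in profiles {
        println!("{}: {:?}", label, kernel.request(want));
    }
}
```

A real resource manager would also have to release capacity when a Gridapp shrinks or exits, and would likely use a smarter placement policy than first-fit. The sketch only shows why a single ledger of free capacity is enough to let unlike jobs share one cluster.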
The Author

The author has a decade of experience building platforms for trading firms, and worked as a custom-shop developer for the decade before that.

When building a system, his focus goes to its delivery pipeline and deployment model. There is an analogy with how you assess a house. The obvious focus is kitchens and bathrooms. Old hands focus on foundations and drainage.

He consults. Write to cratuki at the search engine mail service. London. Zoom fine.

--------------------------------------------------------------------------

Queries

Q: "What is 'Init'?"

Startup and temporal scheduling. In modern unix, you have (bsd-init + cron). In Cthulix, we have a single concept that covers those things, and we call that Init.

Q: "Can you contrast MPI and Cthulix?"

MPI is shallow but wide. Shallow: it is focused on messaging only. Wide: it supports a diverse set of use-cases.

Cthulix is deep but narrow. Deep: it has a view on build pipeline, programming language, deployment model, and supplies a skeleton of standard services for logging and analytics, and a console. And, it offers messaging functionality. Narrow: at each level, it focuses on a small number of powerful techniques.

MPI is not opinionated. It gives developers the widest possible range of options for how systems should interact with one another.

Cthulix is opinionated. It gives clear guidance about how systems should interact.

Let's explore that, by working through the Cthulix guidance.

Cthulix distinguishes between these interactions,

1/ Communication that happens between Processes that are part of the same Cthulix Ring instance. ("Internal Communication")
2/ All other communication. ("External Communication")

Building on this, it directs the developer on how Message Contracts should work,

1/ All "Internal Communication" methods are defined in a typed messaging schema that is Ring-specific.
These nodes will never suffer protocol-version confusion, because nodes on the same ring are always running the same software version.
2/ All "External Communication" must be done through a Well-Defined API.
3/ Well-Defined APIs should be established as a /bilateral/ contract between the participating deployment units.

Here are some features that MPI supports, which are incompatible with Cthulix guidance,

'Dynamic process management'. This would be inconsistent with Cthulix Message Contract rules. A Cthulix developer should never need to do this: (1) Cthulix Rings and Init already give nodes the network connections they need; (2) Having external systems reaching past that model would be chaotic.

'Shared Memory Programming'. Shared state leads to complexity and bugs. In Cthulix, the developer would use message-driven concurrency to achieve the same goals. The message-driven approach is easier to reason about, and the result would be easier to maintain.

Cthulix has the advantage of vertical integration. Since it has a strong view on the design of the Deployment, it is able to set tight rules for Message Contract. It seeks to offer a small number of powerful techniques that cooperate well.

There may be scenarios where an optimal MPI solution will slightly outperform an optimal Cthulix solution. But the Cthulix solution will be faster to build, faster to iterate on, less vulnerable to bugs, and better to support.

This may resemble the assembly/C dynamic. Hypothetically, a well written assembly program is faster. But even people who care a lot about speed prefer to work in C.

Q: "Hey - Cthulix is not really an operating system, right?"

It is a Single System Image (SSI), but nobody outside of academia uses that term.

Developing against Cthulix feels like developing against a conventional OS system api, but everything is async, networking is easier, and you are free of single-host assumptions.
When you ssh into a Cthulix system, it feels like you are interacting with a single coherent system, even though that machine may be spread across hundreds of physical hosts.

Q: "Why are you using Linux?"

In early 2020, the author built a small protected mode kernel to explore the extreme option of building an asynchronous kernel in assembler and C. Through this work, he realised that driver coverage is an insurmountable obstacle to outsider operating systems at this time.

GPUs are significant to the grid-computing space, and the only viable options at this time are Windows and Linux. (Apple looks likely to become an option, also.) Once you accept that, it is inevitable that Linux must play a role in this platform.

Linux is free-to-use and flexible. It has a full-featured tcp/ip stack. It allows processes to have dedicated cores. As io_uring improves, the platform becomes capable as a true async platform.

Hence, Cthulix uses Linux as its 'driver layer'. Physical hosts network-boot a stripped Linux image that contains just enough logic to attach itself to a Cthulix grid.

Q: "What is released so far?"

So far, there has been no release of code.

Q: "What are the license arrangements?"

The intent is to license Cthulix under GPL2. Custody of the Cthulix system will gradually move to cthulix.org. Cthulix.com will remain dedicated to commercial consulting around related topics.

Q: "Why coroutines rather than threads?"

1. On recent hardware, threads have a performance disadvantage because CPU cache has grown important for performance, and thread-switching invalidates your CPU cache. (This is not explicitly true, but is true for practical purposes)
2. It is easier to reason about what is happening on the metal in systems built with asynchronous coroutines than in multithreaded systems.
3. Coroutine code reads easily.

Q: "Is this like Apache Yarn?"

There is overlap. Apache Hadoop is built on top of a separate layer called Yarn.
Yarn is responsible for resource management, and node control.

Similarities,

* Both offer flexible resource management.
* Both are multi-tenant. (e.g. you can run several gridapps in a single Cthulix System)
* Where Cthulix has a 'Gridapp Init Service', Yarn has an 'Application Submission Client'.

You could build something Cthulix-like on top of Yarn. But that system would be tied to the JVM ecosystem. Developers would not be able to program using coroutines, and it would take more effort to reason about what was running on the metal than in the (Rust) Cthulix model.

Q: "Is there some equivalent of HDFS in Cthulix?"

No. Brace yourself for a heresy.

The things that we know as Files are generally regarded as a multilateral contract. Cthulix is all about bilateral contracts, and hence discourages Files and multiparty filesystems.

Rather than having a filesystem, you would create an ETL that had multiple inputs (one from each producer) and multiple outputs (one to each consumer).

Future work: build a gridapp that makes it easy to create bilateral File-like things.

Q: "What if I need to resource-manage custom FPGA hardware?"

That is fine. The Kernel contains a resource manager. Via the API, you can customise it to track new types of resource. Or, you can replace the resource manager module in the kernel with your own implementation.

This applies broadly in Cthulix. If you want to replace the logging subsystem, you can switch it out for your own module that satisfies the same interface.

Q: "What Parallel Programming Model does Cthulix use?"

None. Cthulix provides the infrastructure, and then the developer implements the PPM they want. The end-developer makes algorithm choices.

With that said, the Cthulix design will steer developers towards message-passing approaches. A key danger when building distributed systems is IO bottle-necking. A developer who keeps close to the messages is less vulnerable to this.
Still, if you wanted to build a map-reduce like thing on Cthulix, you could do this.

Q: "What do you think about the Dursi paper?"

Jonathan Dursi's paper, "HPC is dying, and MPI is killing it" (https://www.dursi.ca/post/hpc-is-dying-and-mpi-is-killing-it.html) may be the most influential thing to be written about distributed compute in the last decade. It is critical of message-passing as a general approach to distributed computing problems.

We contend that (1) an expressive and macro-capable language like Rust restores relevance to message-passing techniques and (2) coroutines+exceptions supply a clear path to working yourself out of single-host failure scenarios.

Q: "Does Cthulix support [active messages / lambda-passing]?"

Not currently. There is less need for it in this architecture than some, because every Process in a Gridapp has the same code. If the instruction comes from within the Gridapp, you can enumerate the function to be run, which is more efficient than passing the function itself.

There are situations where the function comes from outside the Gridapp. (There is an analogy here with SQL or Apache Pig.) For those situations, there is no standard option. You could encode the function as Scheme, pass it to the data Process, and interpret it there. Then, return a sexp.

Caution! A single naive query can gridlock your cluster with its many-many-gigabyte sexp result. Anyone who has worked with relational databases has written a query like "select * from a, b" where they have forgotten to qualify the join.

To reiterate points made in earlier answers, you should challenge your assumptions, and see if you can instead structure your ecosystem around well-defined apis.

Q: "How does Cthulix compare to Plan9?"

Both are Single System Image architectures.

Plan9 emphasises threads, whereas Cthulix emphasises async coroutines.

Plan9 has seen a lot of driver improvement in recent years, but its driver support does not come close to Linux.
Cthulix is on Linux, so you can deploy to the metal and find drivers for high-performance ethernet adapters and gpus.

Both Plan9 and Cthulix have a console environment against a filesystem-like structure. The plan9 console is powerful and expressive. The Cthulix console is not. It lets you navigate the tree, call exposed-methods from grid-apps, inspect the output, and nothing else.

Plan9 emphasises grid communication via synthetic file descriptors. Cthulix emphasises grid communication via typed messages.

Plan9 accommodates shared filesystems. Cthulix discourages shared filesystems as it is difficult to enforce contracts against the contents of files. Explicit api relationships are better.

Plan9 is self-hosting. Cthulix is not. You develop Cthulix systems from a mainstream OS.

Plan9 has a desktop. Cthulix does not.

Q: "What about [third-party work-scheduler]?"

(Someone who wants to grok Cthulix may ask to compare it to an already-familiar platform.)

Network programming is a specialist skill. People who do not have it are sensibly wary of it. As a result, there is a market for third-party work-schedulers that allow operators to manage queues of work to computer clusters without having to write netcode.

But committing to a third-party work scheduler sets you up for disappointment. Inevitably, you will find that it lacks features or flexibility that you need.

The author's experience is that you should seek to own the scheduling code for any non-trivial workload. Until now, this has meant building it yourself. Now, Cthulix makes network communication straightforward, whilst leaving the work scheduling algorithm to the developer.

Q: "Does Cthulix align with 'microservices'?"

No. Microservice convention is to create multilateral APIs. For example: you may have an authentication-service api, with several departments talking to it.

Each time you want to change a multilateral api, you need to have a meeting or email thread including partners who may be affected.
Effects of this,

* It discourages ambitious change.
* It encourages hacks similar to the "add a column on the end" culture used in relational databases.
* It encourages you to hire people who will participate in the meetings, which waters down engineer agency.

In Cthulix, each External Communication relationship should be a bilateral contract between the api provider and consumer. Now, developers can coordinate changes in a phone call.

Sometimes a provider will have several clients which have identical needs at a point of time. The key concern is that you should be able to upgrade one of them without upgrading all of them. The contract you have with Party B should not be affected by any change you are making with Party A.

Changes may require some 'grandfathering' of other users. That is healthy. It allows you to roll out system changes over several weeks with your partners. This can be managed as a background activity, with full engineer agency.

Q: "What about Kubernetes?"

K8s can be used to abstract deployment and resource management concerns away from the developer. Instead of having an operations team responsible for the deployment, the developer deploys it themselves.

This suits the algorithms-first school of system design. Imagine a developer who has reached a milestone with their new software system, "Hey, I have been building a thing, and it now works on my machine. How do I deploy it?" Docker and K8s have seen rapid adoption because they answer that question.

If you use K8s like that, it will lead to the algorithms-first mess warned about early in this paper.

K8s has a strong story for cloud providers. It allows them to provide policy-limited compute capacity to customers, using Linux Cgroups. Those nodes ('pods') could participate in a Cthulix System.

There is potential to extend the Cthulix Kernel to integrate to the Kubernetes API in order to support dynamic compute management ("Expand/Contract Computing").
During bursts, you could scale up the number of nodes you were hosting, and then wind them down again later, keeping your cloud costs down. (This is not on the roadmap at the time of writing)

Q: "What about Python?"

It would be straightforward to build a python3/asyncio gridapp on the Cthulix foundation.

Numpy is an efficient, high-quality number-crunching library. Through Cthulix, you could install data into cluster ram, and then dispatch work towards python processes running numpy.

Q: "What about Julia?"

As for Python, it would be straightforward.

Q: "What about Pandas or Dask?"

There is a class of tools that allow quants to arrange and analyse data, and quickly test hypotheses. Pandas and Dask fit here.

Cthulix could complement these tools,

* You could build a Cthulix Gridapp that offered Pandas or Dask compute to quants who work from a Linux workstation or console.
* You could build your data aggregation arrangements in Cthulix, and then have separate Pandas/Dask systems access these as data sources.

Caution! Exploratory tools for quants need to be applied with care. Without discipline, their use steers to the algorithms-first evolutionary path.

Pandas data frames are less efficient than equivalent numpy representations. They are useful for exploratory work, but may not be ideal for jobs that run regularly.

Much of the reply to "What about [third-party work-scheduler]?" applies to Dask.

Q: "What about message queues?"

GACP recommends against use of message queues like tibco and kafka.

Some reasons,

* They complicate message contracts. (Does the sender or receiver own them, or a separate team? How many people do you need to get to a telephone call to investigate them?)
* They are at constant risk of protocol mismatch issues.
* They don't support bidirectional interaction.
* They complicate system restarts/upgrades.
* They need to be maintained.
* They are vulnerable to running out of capacity, sometimes because of a misbehaving system being run by another department.
* Inexperienced developers are prone to using them as databases.
* They introduce software dependency.
* They add extra hops to troubleshooting.
* They add latency.

Instead, use Well-Defined APIs. If somebody is receiving messages faster than they can process, then they should either (1) buffer the payloads within infra that they control or (2) revisit the contract.

Message queues get a lot of use in industry because they give you access to network transit without network programming. Cthulix makes network programming accessible.

Q: "What about Erlang?"

The Actor model of concurrency has a lot of mind-share, but the author doesn't rate it as a paradigm.

When you are building a distributed system, these are key concerns,

* Achieving reliable single-message delivery in the absence of hardware failure.
* Mitigating hardware failure when it happens.

The actor model avoids both of these issues by effectively declaring message delivery to be unreliable in that paradigm. If an actor crashes while it is handling a message, the message is lost. This pushes responsibility for those concerns to the application layer, where it is a constant distraction and arduous to test. In this way, the actor model dodges problems that are inherent to its chosen domain.

Cthulix communicates via checked message-sends. In normal operation, message delivery is reliable. When a Process vanishes, (1) the sender gets a clear exception and (2) the Ring-local Init process learns about the problem and has an opportunity to take remedial action.

Some domains are better suited to unchecked message sends. For example, market data diffs. If you need these in your Cthulix application, you can create multicast senders/receivers separate to the core messaging layer.

Q: "What about the sequencer architecture?"

It would be straightforward to implement a sequencer-architecture platform upon Cthulix, and a natural complement.
There are some notable weaknesses in the traditional deployment model of the Island/INET tradition,

* Core code is managed separately to deployment code, with a large amount of engineered coincidence. How this manifests: morning startup breaks because one of the classes has been instantiated incorrectly.
* Failover logic is separated from core logic. How this manifests: operators need to manually intervene in failovers, which can create gaps of a couple of minutes.

Differences if you used Cthulix as your foundation,

* Island has a set of functionality kept in 'commands files'. In Cthulix, this functionality is part of the same codebase as application code.
* Island manages resources statically, through operator-edited configuration. Cthulix manages resources dynamically, in its kernel.

Benefits of a Cthulix-based sequencer arch,

* It is straightforward for a developer with no access to the deployment environment to validate a deployment configuration.
* Code and deployment are plainly versioned inside a single codebase.
* If you wanted to have your Cthulix Init process switch off a troubled server via a remote-hands api, you could do that.
* You could have your Cthulix Init service automatically execute a failover.
* System startup logging naturally appears alongside core logging. There is no inherent separation between system init and application init, whereas Island does separate these.
* You could have your Init service generate the cisco configuration for external communications. Again, such functionality reduces your potential failover downtime.

There are certain applications where the sequencer architecture is not the natural choice. For example, when implementing trading systems that need to be aware of overnight positions, such as Good-Til-Cancel orders, you need to carefully maintain that state somewhere.

In Cthulix, it is straightforward to create separate gridapps within a single codebase, and to ensure that they are atomically deployed in concert with one another.
Hence, you could create a Good-Til-Cancel management subsystem as part of your system, storing data on the disks of a couple of resource-labelled servers. That subsystem could still feel like a full member of the same codebase and deployment arrangements as your sequencer-based Exchange and Smart Order Router subsystems.