cthulix.com index preview status consulting


preview

    Cthulix: the future of distributed-computing.


    Pitch

        (Cthulix <=> MPI) is as (Unix <=> Multics)


    Introduction

        Think of Cthulix as a toolkit for building event-driven distributed
        systems that scale well.

        Conventional techniques grow infrastructure around algorithms. In
        Cthulix, algorithms grow into an infrastructure.


    Build To Scale

        This is how most businesses create software systems,

            1. Algorithms. A vanguard developer or quant prototypes some
            algorithm code, and then iterates this towards a solution.

            2. Deployment. They pass the software to another team, who are
            responsible for deploying it.

            3. Integration. Interactions with other systems are an adhoc
            afterthought.

        The result is unwieldy, operationally inefficient, and scales poorly.
        (See "Machine Learning: The High Interest Credit Card of Technical
        Debt", https://research.google/pubs/pub43146/)

        Operations teams erect scheduling, configuration, build and monitoring
        systems as an exoskeleton around algorithm code. But those
        arrangements are fundamental to a platform, and should be the
        skeleton.

        Algorithm code should be citizens of the infra, not directors of it.

        The Cthulix approach,

            1. Deployment. Cthulix provides a clear model for boot, init,
            monitoring, configuration, upgrade and resource management.

            2. Integration. Cthulix sets clear rules about interactions
            between systems.

            3. Algorithms. Developers and quants build code to live within
            this paradigm.

        Cthulix guides developers to build systems that scale well and behave
        as good neighbours.

        Quants and developers benefit from this. Their systems scale better,
        they don't get weighed down running quirky systems of their own
        creation.


    Technology Overview

        Each Cthulix System is a distributed computer that operators can ssh
        into.

        The Shell

            The shell resembles Bourne Shell. You can navigate using cd and
            ls, you can ./run_commands.

            There is a mount point for interacting with the Kernel.

            Grid-applications (Gridapps) are mounted as directories in the
            Cthulix filesystem, and expose commands to the shell.

        APIs

            Developers can expose functionality to third-party systems as
            async protocols.

        The Kernel

            The Kernel tracks the resources of associated physical hosts -
            RAM, cpus, gpus, etc.
            
            Gridapps request resources from the Kernel.

            A Cthulix system is divided into Rings,

                Kernel Ring. This manages hardware allocation, service
                discovery, permissions, the mount tree, and hosts the shell.

                Gridapp Ring. Each Gridapp has its own Ring.

            The Kernel launches a Gridapp by installing the Gridapp's Init
            service into a compute node.

            The Init service requests resources from the Kernel, and raises
            and tends Processes within its Gridapp.

            The Kernel ring also supplies a logging backbone, a monitoring
            service, and maintains the shell.

            The Cthulix Kernel is rafted so that a cluster can survive the
            unexpected loss of a physical host.

            Gridapps can coexist within a Cthulix System.

            Operators can tell the Kernel to add and remove compute during
            normal system operation.

        Under the hood

            Each Cthulix host runs a stripped install of Linux. The Linux
            kernel has introduced io_uring in 2020. This overcomes historic
            async deficiencies of that platform.
            
            The reference implementation is built with Arch Linux, but you
            could build a BSD container if you wanted to. (You would prepare
            your own TFTP/PXE image.)

            Each logical core in the cluster can host a pegged Linux process.
            This is a Process. As each Process starts, it registers with the
            Kernel (if it is in the Kernel Ring, or an Init process) or with
            its Init process (in other cases).

            Within a Cthulix System, Processes talk to one another using typed
            messages. Cthulix provides this mechanism, the developer controls
            the schema.

            Each Cthulix System has a skeleton of standard services: Init,
            Monitoring, Analysis, Logging, Configuration. You can override the
            system to use your own implementations.

            The main language is Rust. Rust added coroutines in late 2019.
            This enables something new: coroutine-based development in a
            performant, gc-free systems programming language.

            Requirements for Cthulix: a TFTP server, a set of physical hosts
            configured to PXE-boot, a network backbone connecting the hosts.
            The project will supply detailed documentation to help you set
            this up, but you are free to substitute in your own arrangements
            if you want to.

            Hardware is not necessarily tied to a single Cthulix System. You
            could configure a physical host to be split across several Cthulix
            Systems. (e.g. for blue/green deployment.)


    The Standard

        The modestly titled 'Generally Accepted Cthulix Principles' (GACP)
        lays out the design rules. This outlines the domain language, sets
        rules for message-contracts, and mandates the shape of a standard
        deployment.

        It does not care about parentheses-placement or unit test coverage.


    Industry Language

        People involved in distributed computing use a nuanced language - HPC,
        HTC, grid computing, concurrent computing, parallel computing, cluster
        computing, supercomputer.

        Citation-heavy Wikipedia entries encourage the reader to view each
        category as a necessary specialisation.

        This author disagrees. At this time, commodity hardware has a major
        role to play in all those domains. Linux has a major role to play in
        all those domains. The separations persist because there is no
        effective governance in the way that teams build grid software.

        Decades ago, there was a meaningful difference between 'server' and
        'workstation'. Around 2000, commodity hardware and Linux became
        good-enough that the distinction became academic.

        Cthulix will cause a similar shakeup in the domain language around
        distributed computing.


    Compare to Standard Approaches

        Let's discuss three approaches that you could use to build a
        distributed system in 2020.

        Third-party scheduler and NFS.
            * Easy to set up, operations-light
            * Weak at managing jobs with varied needs, runtime.
            * Poor for workloads with data-locality needs.
            * Poor for workloads involving multi-stage queries.
            * Vulnerable to IO bottle-necking as you scale.

        Apache ecosystem. (Hadoop, Spark, Fink, Kafka, Zookeeper, etc)
            * Complex, ever-changing ecosystem.
            * Difficult to reason about what the computer is doing. (JVM)
            * Moderate hardware efficiency is achievable.
            * Brittle contracts between constituent systems.
            * Operations-heavy.

        MPI-based architecture, commonly used in supercomputers,
            * High barrier to entry.
            * Experienced developers achieve strong hardware efficiency.
            * Complex API

        Now, the candidate system.


        Cthulix-based development,
            * Easy to set up, operations-light
            * One developer can fit the whole model in their head
            * One operations engineer can fit the whole model in their head
            * Clear path to strong hardware utilisation
            * Malleable. Grow adaptable systems that solve many problems.


    Highlights of the Cthulix model

        Network programming is a hard-won, specialist skill. Cthulix gives
        devs powerful, async network mechanisms through a straightforward API
        and spares them needing to become network programming gurus.

        Due to its pure async model, a Cthulix dev can still reason about the
        whole system, down to cache behaviour on individual processes.

        Cthulix allows you to mix different job profiles in a single hardware
        cluster (e.g. gpu-heavy + cpu-intensive large jobs + high-volume,
        latency-sensitive ram-large jobs).

        Standard-library features. e.g. for storing and filtering data.

        Redeployment is atomic, fast and reliable.

        Cthulix is operations-efficient. An operator who understands the
        Cthulix model can maintain a variety of systems built in that
        tradition, and can discover a system that they have not seen before.

        Due to its simple model, you can expect to get high value out of your
        operations staff.

        As your Cthulix infra grows, you benefit from economies-of-scale.


    The Author

        The author has a decade of experience building platforms for trading
        firms, and worked as a custom-shop developer for the decade before
        that.

        When designing a system, his focus goes to interactions, delivery and
        deployment, rather than algorithms. There is an analogy with how you
        assess a house. The obvious focus is kitchens and bathrooms. Old hands
        focus on foundations and drainage.

        He consults. Write to cratuki at the search engine mail service.
        London. Zoom fine.

    --------------------------------------------------------------------------

    Queries

        Q: "What is 'Init'?"

            Startup and temporal scheduling.
            
            In modern unix, you have (bsd-init + cron). In Cthulix, we have a
            single concept that covers those things, and we call that Init.


        Q: "Can you contrast MPI and Cthulix?"

            MPI is shallow but wide.
            
                Shallow: it is focused on messaging only.

                Wide: it supports a diverse set of use-cases.

            Cthulix is deep but narrow.
            
                Deep: it has a view on build pipeline, programming language,
                deployment model, and supplies a skeleton of standard services
                for logging and analytics, and a console. And, it offers
                messaging functionality.

                Narrow: at each level, it focuses on a small number of
                powerful techniques.

            MPI is not opinionated.

                It gives developers the widest possible range of options for
                how systems should interact with one another.

            Cthulix is opinionated.

                It gives clear guidance about how systems should interact.

            Let's explore that, by working through the Cthulix guidance.

                Cthulix distinguishes between these interactions,

                    1/ Communication that happens between Processes that are
                    part of the same Cthulix Ring instance. ("Internal
                    Communication")

                    2/ All other communication. ("External Communication")

                Building on this, it directs the developer on how Message
                Contracts should work,

                    1/ All "Internal Communication" methods are defined in a
                    typed messaging schema that is Ring-specific. These nodes
                    will never suffer protocol-version confusion, because
                    nodes on the same ring are always running the same
                    software version.

                    2/ All "External Communication" must be done through a
                    Well-Defined API.

                    3/ Well-Defined APIs should be established as a
                    /bilateral/ contract between the participating deployment
                    units.

            Here are some features that MPI supports, which are incompatible
            with Cthulix guidance,

                'Dynamic process management'. This would be inconsistent with
                Cthulix Message Contract rules. A Cthulix developer should
                never need to do this: (1) Cthulix Rings and Init already give
                nodes the network connections they need; (2) Having external
                systems reaching past that model would be chaotic.

                'Shared Memory Programming'. Shared state leads to complexity
                and bugs. In Cthulix, the developer would use message-driven
                concurrency to achieve the same goals. The message-driven
                approach is easier to reason about, and the result would be
                easier to maintain.

            Cthulix has the advantage of vertical integration. Since it has a
            strong view on the design of the Deployment, it is able to set
            tight rules for Message Contract. It seeks to offer a small number
            of powerful techniques that cooperate well.

            There may be scenarios where an optimal MPI solution will slightly
            outperform an optimal Cthulix solution. But the Cthulix solution
            will be faster to build, faster to iterate on, less vulnerable to
            bugs, and better to support.

            This may resemble the assembly/C dynamic. Hypothetically, a well
            written assembly program is faster. But even people who care a lot
            about speed prefer to work in C.


        Q: "Hey - Cthulix is not really an operating system, right?"

            It is a Single System Image (SSI), but nobody outside of academia
            uses that term.

            Developing against Cthulix feels like developing against a
            conventional OS system api, but everything is async, networking is
            easier, and you are free of single-host assumptions.

            When you ssh into a Cthulix system, it feels like you are
            interacting with a single coherent system, even though that
            machine may be spread across hundreds of physical hosts.


        Q: "What if I want to subdivide my cluster?"

            You can do this.
            
            A typical situation: you want to guarantee that some resources are
            dedicated to a tenant application; but also want other tenants to
            be able to compete for other resources opportunistically.

            Cthulix incorporates a resource manager.


        Q: "Why are you using Linux?"

            The author has experience at building a protected mode kernel in
            assembly and C. You can get fine-grained control through this
            approach. But it has become an impractical approach to
            supercomputer development due to driver coverage.

            GPUs are significant to the grid-computing space, and the only
            viable options at this time are Windows and Linux. (Apple will
            probably establish themselves as a third option.)

            Once you accept that, it is inevitable that Linux must play a role
            in this platform.

            Linux is free-to-use and flexible. It runs on a variety of cpu
            architectures. It has a full-featured tcp/ip stack. It allows
            processes to have dedicated cores. As io_uring improves, the
            platform becomes capable as a true async platform.

            Hence, Cthulix uses Linux as its 'driver layer'. Physical hosts
            network-boot a stripped Linux image that contains just enough
            logic to attach itself to a Cthulix grid.


        Q: "What is released so far?"

            So far, there has been no release of code.
    

        Q: "What are the license arrangements?"

            Intent is to License Cthulix under GPL2.

            Custody of the Cthulix system will gradually move to cthulix.org.
            Cthulix.com will remain dedicated to commercial consulting around
            related topics.


        Q: "Why coroutines rather than threads?"

            1. On recent hardware, threads have a performance disadvantage
            because CPU cache has grown important for performance, and
            thread-switching invalidates your CPU cache. (This is not
            explicitly true, but is true for practical purposes)

            2. It is easier to reason about what is happening on the metal in
            systems built with asynchronous coroutines than in multithreaded
            systems.

            3. Coroutine code reads easily.


        Q: "Is this like Apache Yarn?"

            There is overlap.

            Apache Hadoop is built on top of a separate layer called Yarn.
            Yarn is responsible for resource management, and node control.

            Similarities,

                Both offer flexible resource management.

                Both are multi-tenant. (e.g. you can run several gridapps in a
                single Cthulix System)

                Where Cthulix has a 'Gridapp Init Service', Yarn has an
                'Application Submission Client'.

            You could build something Cthulix-like on top of Yarn. But that
            system would be tied to the JVM ecosystem. Developers would not be
            able to program using coroutines, and it would take more effort to
            reason about what was running on the metal than in the (Rust)
            Cthulix model.


        Q: "Is there some equivalent of HDFS in Cthulix?"

            No.

            Brace yourself for a heresy.

            The things that we know as Files are generally regarded as a
            multilateral contract.

            Cthulix is all about bilateral contracts, and hence discourages
            Files and multiparty filesystems.

            Rather than having a filesystem, you would create an ETL that had
            multiple inputs (one from each producer) and multiple outputs (one
            to each consumer).
            
            Future work: build a gridapp that makes it easy to create
            bilateral File-like things.


        Q: "What if I need to resource-manage custom FPGA hardware?"

            That is fine.

            The Kernel contains a resource manager. Via the API, you can
            customise it to track new types of resource. Or, you can
            replace the resource manager module in the kernel with your
            own implementation.

            This applies broadly in Cthulix. If you want to replace the
            logging subsystem, you can switch it out for your own module that
            satisfies the same interface.


        Q: "What Parallel Programming Model does Cthulix use?"

            None. Cthulix provides the infrastructure, and then the developer
            implements the PPM they want. The end-developer makes algorithm
            choices.

            With that said, the Cthulix design will steer developers towards
            message-passing approaches.

            A key dangers when building distributed systems is of IO
            bottle-necking. A developer who keeps close to the messages is
            less vulnerable to this.

            Still, if you wanted to build a map-reduce like thing on Cthulix,
            you could do this.


        Q: "What do you think about the Dursi paper?"

            Jonathan Dursi's paper, "HPC is dying, and MPI is killing it"
            (https://www.dursi.ca/post/hpc-is-dying-and-mpi-is-killing-it.html)
            may be the most influential thing to be written about distributed
            compute in the last decade.

            It is critical of message-passing as a general approach to
            distributed computer problems. We contend that (1) an expressive
            and macro-capable language like Rust restores relevance to
            message-passing techniques and (2) coroutines+exceptions supply a
            clear path to working yourself out of single-host failure
            scenarios.


        Q: "Does Cthulix support [active messages / lambda-passing]?"

            Not currently.

            There is less need for it in this architecture than some, because
            every Process in a Gridapp has the same code. If the instruction
            comes from within the Gridapp, you can enumerate the function to
            be run, which is more efficient than passing the function itself.

            There are situations where the function comes from outside the
            Gridapp. (There is an analogy here with SQL or Apache Pig.) For
            those situations, there is no standard option. You could encode
            the function as Scheme, pass it to the data Process, and interpret
            it there. Then, return a sexp.

            Caution! A single naive query can gridlock your cluster with its
            many-many-gigabyte sexp result.

            Anyone who has worked with relational databases has written a
            query like "select * from a, b" where they have forgotten to
            qualify the join. To reiterate points made in earlier answers, you
            should challenge your assumptions, and see if you instead
            structure your ecosystem around well-defined apis.


        Q: "How does Cthulix compare to Plan9?"

            Both are Single System Image architectures.
            
            Plan9 emphasises threads, whereas Cthulix emphasises async
            coroutines.

            Plan9 has seen a lot of driver improvement in recent years, but
            its driver support does not come close to Linux. Cthulix is on
            Linux, so you can deploy to the metal and find drivers for
            high-performance ethernet adapters and gpus.

            Both Plan9 and Cthulix have a console environment against a
            filesystem-like structure. The plan9 console is powerful and
            expressive. The Cthulix console is not. It lets you navigate the
            tree, call exposed-methods from grid-apps, inspect the output, and
            nothing else.

            Plan9 emphasises grid communication via synthetic file
            descriptors. Cthulix emphasises grid communication via typed
            messages.

            Plan9 accommodates shared filesystems. Cthulix discourages shared
            filesystems as it is difficult to enforce contracts against the
            contents of files. Explicit api relationships are better.

            Plan9 is self-hosting. Cthulix is not. You develop Cthulix systems
            from a mainstream OS.

            Plan9 has a desktop. Cthulix does not.


        Q: "What about [third-party work-scheduler]?"

            (Someone who wants to grok Cthulix may ask to compare it to an
            already-familiar platform.)

            Network programming is a specialist skill. People who do not have
            it are sensibly wary of it.
            
            As a result, there is a market for third-party work-schedulers
            that allow operators to manage queues of work to computer clusters
            without having to write netcode.

            But committing to a third-party work scheduler sets you up for
            disappointment. Inevitably, you will find that it lacks features
            or flexibility that you need.

            The author's experience is that you should seek to own the
            scheduling code for any non-trivial workload. Until now, this has
            meant building it yourself.

            Now, Cthulix makes network communication straightforward, whilst
            leaving the work scheduling algorithm to the developer.


        Q: "Does Cthulix align with 'microservices'?"

            No.

            Microservice convention is to create multilateral APIs. For
            example: you may have an authentication-service api, with several
            departments talking to it.

            Each time you want to change a multilateral api, you need to have
            a meeting or email thread including partners who may be affected.

            Effects of this,

                It discourages ambitious change.

                It encourages hacks similar to the "add a column on the end"
                culture used in relational databases.

                It encourages you to hire people who will participate in the
                meetings, which waters down engineer agency.

            In Cthulix, each External Communication relationship should be a
            bilateral contract between the api provider and consumer. Now,
            developers can coordinate changes in a phone call.

            Sometimes a provider will have several clients which have
            identical needs at a point of time. The key concern is that you
            should be able to upgrade one of them without upgrading all of
            them. The contract you have with Party B should not be affected by
            any change you are making with Party A.

            Changes may require some 'grandfathering' of other users. That is
            healthy. It allows you to roll out system changes over several
            weeks with your partners. This can be managed as a background
            activity, with full engineer agency.


        Q: "What about Kubernetes?"

            K8s can be used to abstracts deployment and resource management
            concerns away from the developer. Instead of having an operations
            team responsible for the deployment, the developer deploys it
            themselves.

            This suits the algorithms-first school of system design. Imagine a
            developer who has reached a milestone with their new software
            system,

                "Hey, I have been building a thing, and it now works on my
                machine. How do I deploy it?"

            Docker and K8s have seen rapid adoption because they answer that
            question.

            If you use K8s like that, it will lead to the algorithms-first
            mess warned about early in this paper.

            K8s has a strong story for cloud providers. It allows them to
            policy-limited compute capacity to customers, using Linux Cgroups.

            Those nodes ('pods') could participate in a Cthulix System.

            There is potential to extend the Cthulix Kernel to integrate to
            the Kubernetes API in order to support dynamic compute management
            ("Expand/Contract Computing"). During bursts, you could scale up
            the number of nodes you were hosting, and then wind them down
            again later, keeping your cloud costs down. (This is not on the
            roadmap at the time of writing)


        Q: "What about Python?"

            It would be straightforward to build a python3/asyncio gridapp on
            the Cthulix foundation. Numpy is an efficient, high-quality
            number-crunching library. Through Cthulix, you could install data
            into cluster ram, and then dispatch work towards python processes
            runing numpy.


        Q: "What about Julia?"

            As for Python, it would be straightforward.


        Q: "What about Pandas or Dask?"

            There is a class of tools that allow quants to arrange data and
            analyse data, and quickly test hypotheses. Pandas and Dask fit
            here.

            Cthulix could complement these tools,

                You could build a Cthulix Gridapp that offered Pandas or Dask
                compute to quants who work from a Linux workstation or
                console.

                You could build your data aggregation arrangements in Cthulix,
                and then have separate Pandas/Dask systems access these as
                data sources.

            Caution! Exploratory tools for quants need to be applied with
            care. Without discipline, their use steers to the algorithms-first
            evolutionary path.

            Pandas data frames are less efficient than equivalent numpy
            representations. They are useful for exploratory work, but may not
            be ideal for jobs that run regularly.

            Much of the reply to "What about [third-party work-scheduler]?"
            applies to Dask.


        Q: "What about message queues?"

            GACP recommends against use of message queues like tibco and
            kafka.

            Some reasons,

                They complicate message contracts. (Does the sender or
                receiver own them, or a separate team? How many people do you
                need to get to a telephone call to investigate them?)
                
                They are at constant risk of protocol mismatch issues.

                They don't support bidirectional interaction.

                They complicate system restarts/upgrades.
                
                They need to be maintained.

                They are vulnerable of running out of capacity, sometimes
                because of a misbehaving system being run by another
                department.

                Inexperienced developers are prone to using them as databases.

                They introduce software dependency
                
                They add extra hops to troubleshooting,
                
                They add latency.
                
            Instead, use Well-Defined APIs. If somebody is receiving messages
            faster than they can process, then they should either (1) buffer
            the payloads within infra that they control or (2) revisit the
            contract.

            Message queues get a lot of use in industry because they give you
            access to network transit without network programming. Cthulix
            makes network programming accessible.


        Q: "What about Erlang?"

            The Actor model of concurrency has a lot of mind-share, but the
            author doesn't rate it as a paradigm.

            When you are building a distributed system, these are key
            concerns,

                Achieving reliable single-message delivery in the absence of
                hardware failure.

                Mitigating hardware failure when it happens

            The actor model avoids both of these issues by effectively
            declaring message delivery to be unreliable in that paradigm. If
            an actor crashes while it is handling a message, the message is
            lost.

            This pushes responsibility for those concerns to the application
            layer, where it is a constant distraction and arduous to test. In
            this way, the actor model dodges problems that are inherent to its
            chosen domain.

            Cthulix communicates via checked message-sends. In normal
            operation, message delivery is reliable. When a Process vanishes,
            (1) the sender gets a clear exception and (2) the Ring-local Init
            process learns about the problem and has an opportunity to take
            remedial action.

            Some domains are better suited to unchecked message sends. For
            example, market data diffs. If you need these in your Cthulix
            application, you can create multicast senders/receivers separate
            to the core messaging layer.


        Q: "What about the sequencer architecture?"

            It would be straightforward to implement a sequencer-architecture
            platform upon Cthulix, and a natural complement.

            There are some notable weaknesses in the traditional deployment
            model of the Island/INET tradition,

                Core code is managed separately to deployment code, with a
                large amount of engineered coincidence. How this manifests:
                morning startup breaks because one of the classes has been
                instantiated incorrectly.

                Failover logic is separated from core logic. How this
                manifests: operators need to manually intervene in failovers,
                which can create gaps of a couple of minutes.

            Differences if you used Cthulix as your foundation,

                Island has a set of functionality kept in 'commands files'. In
                Cthulix, this functionality is part of the same codebase as
                application code.

                Island manages resources statically, through operator-edited
                configuration. Cthulix manages resources dynamically, in its
                kernel.

            Benefits of a Cthulix-based sequencer arch,

                It is straightforward for a developer with no access to the
                deployment environment to validate a deployment configuration.

                Code and deployment is plainly versioned inside a single
                codebase.

                If you wanted to have your Cthulix Init process switch off a
                troubled server via a remote-hands api, you could do that.

                You could have your Cthulix Init service automatically execute
                a failover.

                System startup logging naturally appears alongside core
                logging. There is no inherent separation between system init
                and application init, whereas Island does separate these.

                You could have your Init service generate the cisco
                configuration for external communications. Again, such
                functionality reduces your potential failover downtime.

            There are certain applications where the sequencer architecture is
            not the natural choice. For example, when implementing trading
            systems that need to be aware of overnight positions, such as
            Good-Til-Cancel orders, you need to carefully maintain that state
            somewhere.

            In Cthulix, it is straightforward to create separate gridapps
            within a single codebase, and to ensure that they are atomically
            deployed in concert with one another.

            Hence, you could create a Good-Til-Cancel management subsystem as
            part of your system, storing data on the disks of a couple of
            resource-labelled servers. That subsystem could still feel like a
            full member of the same codebase and deployment arrangements as
            your sequencer-based Exchange and Smart Order Router subsystems.