April 15, 2024

Slack Connect, AKA shared channels, allows communication between different Slack workspaces via channels shared by participating organizations. Slack Connect has existed for a few years now, and the sheer volume of channels and external connections has increased significantly since the launch. The increased volume introduced scaling problems, but also highlighted that not all external connections are the same, and that our customers have different relationships with their partners. We needed a system that allowed us to customize each connection, while also allowing admins to easily manage the ever-growing number of connections and connected channels. The existing configuration system didn’t allow customization by external connection, and admin tools weren’t built to handle the ever-growing scale. In this post, we’ll talk about how we solved these challenges on the backend (the frontend implementation is its own story, and deserves a separate blog post).

Our first attempt at per-connection configuration

Slack Connect was built with security in mind. In order to establish a shared channel between two organizations, an external user must first accept a Slack Connect invitation, then the admins on both sides must approve the new shared channel, and only after these steps can communication begin. This works fine for one-off channels between two companies, but the manual approval delay can become a nuisance, and potentially a barrier, when you need new channels created daily by many users in your organization. It also places a heavy burden on admins, who must review and approve an ever-growing number of channels they may lack context on.

The solution was to add the ability for admins to automate the approval process. We created a MySQL table which represented a connection between two teams. Team A could authorize automatic approvals for requests from team B, and vice versa. We needed several database columns to represent how the automatic approvals should work. Slack admins got a dashboard where they could go in and configure this setting. This approach worked well, and further accelerated the growth of Slack Connect. But soon after, we realized we needed to customize more than just approvals.

General solution to managing per-connection configuration

In addition to auto-approvals, we also needed connection-level settings to control restrictions on file uploads in Slack Connect channels and the ability to limit visible user profile fields for external users. In the long run, the plan was to customize the Slack Connect experience on a partner-by-partner level. The prospect of adding a new database table per setting was not appealing. We needed an extensible solution that could accommodate new settings without requiring infrastructure changes. The main requirements were support for a built-in default configuration, a team-wide configuration, and the ability to set per-connection configurations. A connection/partner-level configuration allows a specific setting to be applied to a target partner. The default configuration is what comes out of the box, and is the setting applied when the admin doesn’t customize anything. An org/team-level configuration allows admins to override the default out-of-the-box setting, and is applied when a connection-level setting doesn’t exist. The diagram below describes the sequence in which settings are evaluated and applied.

Slack Connect prefs
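The evaluation order described above can be sketched in a few lines of Python. This is a minimal illustration, not Slack’s implementation: the function name `resolve_pref`, the `DEFAULT_PREFS` dictionary, and the pref names are all hypothetical, and plain dictionaries stand in for stored configuration.

```python
from __future__ import annotations

from typing import Any, Optional

# Hypothetical built-in defaults; the pref names are illustrative only.
DEFAULT_PREFS: dict[str, Any] = {
    "auto_approve": False,
    "allow_file_uploads": True,
}


def resolve_pref(
    name: str,
    connection_prefs: Optional[dict[str, Any]],
    team_prefs: Optional[dict[str, Any]],
) -> Any:
    """Return the most specific value: connection, then team, then default."""
    for layer in (connection_prefs, team_prefs):
        if layer is not None and name in layer:
            return layer[name]
    return DEFAULT_PREFS[name]
```

A connection-level value wins over a team-level one, and the default is used only when neither layer sets the pref.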

We borrowed from the database schema of the approvals table, and created a new table with source and target team IDs, and a payload column. The table looked like this:

CREATE TABLE `slack_connect_prefs` (
  `team_id` bigint unsigned NOT NULL,
  `target_team_id` bigint unsigned NOT NULL,
  `prefs` mediumblob NOT NULL,
  `date_create` int unsigned NOT NULL,
  `date_update` int unsigned NOT NULL,
  PRIMARY KEY (`team_id`,`target_team_id`),
  KEY `target_team_id` (`target_team_id`)
)

We modeled org-level configuration by setting the target team ID to 0. Partner-level configuration had the team ID of the connection. We created an index on source and destination team IDs, which allowed us to query the table efficiently. The table was also partitioned by source team ID, which means all rows belonging to the source team lived on the same shard. This is a common sharding strategy at Slack which allows us to scale horizontally. Instead of using a set of columns to model each setting, we opted to use a single column with a Protobuf blob as the payload. This allowed us to have complex data types for each setting, while also reducing DB storage needs and avoiding the 1,017 columns-per-table limit. Here at Slack we have existing tooling for handling Protobuf messages, which makes it easy to operate on the blob columns inside the application code. The default configuration was implemented in application code by essentially hardcoding values.
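To make the row layout concrete, here is a small sketch using Python’s built-in `sqlite3` module in place of MySQL. It is only an approximation under stated assumptions: the helper names `put_prefs` and `get_rows` are hypothetical, the date columns are omitted, and JSON stands in for the Protobuf payload.

```python
import json
import sqlite3

ORG_LEVEL_TARGET = 0  # target_team_id 0 marks the org-level row, as in the post

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE slack_connect_prefs (
           team_id INTEGER NOT NULL,
           target_team_id INTEGER NOT NULL,
           prefs BLOB NOT NULL,
           PRIMARY KEY (team_id, target_team_id))"""
)


def put_prefs(team_id, target_team_id, prefs):
    """Store one payload per (source team, target team) pair."""
    blob = json.dumps(prefs).encode("utf-8")  # JSON stands in for Protobuf
    conn.execute(
        "INSERT OR REPLACE INTO slack_connect_prefs VALUES (?, ?, ?)",
        (team_id, target_team_id, blob),
    )


def get_rows(team_id, target_team_id):
    """Fetch the partner-level and org-level rows for a team in one query."""
    cursor = conn.execute(
        "SELECT target_team_id, prefs FROM slack_connect_prefs "
        "WHERE team_id = ? AND target_team_id IN (?, ?)",
        (team_id, target_team_id, ORG_LEVEL_TARGET),
    )
    return {target: json.loads(blob) for target, blob in cursor}


put_prefs(1, ORG_LEVEL_TARGET, {"pref_one": False})
put_prefs(1, 42, {"pref_one": True})
rows = get_rows(1, 42)
```

Because both rows share the same source team ID, a lookup like this touches a single shard under the partitioning scheme described above.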

Now that we had a solid storage layer, we had to build the application layer. We applied an existing Slack pattern of creating a Store class to handle all database interactions with a given table or a related set of tables. A store is a similar concept to a service in a microservices architecture. We created a SlackConnectPrefsStore class whose main job was to give clients a simple API for interacting with Slack Connect prefs. Under the hood, this involved reading from the database or cache, running validation logic, sending events and audit logs, and parsing Protobufs. The Protobuf definition looked like this, with the SlackConnectPrefs message being the container for all subsequent prefs:

message SlackConnectPrefs {
    PrefOne pref_one = 1;
    PrefTwo pref_two = 2;
    ...
}

message PrefOne {
    bool value = 1;
}
Our Store class supports get, set, remove, and list operations, and uses Memcached to reduce database calls when possible. The initial Store implementation was tightly coupled to the prefs it was operating on. For example, some prefs needed to send fanout messages to clients about a pref state change, so inside our set function we had a block like this:

function set(PrefContainer container) {
    ...
    // Some prefs need to notify clients when their state changes.
    if (container.pref_one != null) {
        send_fanout_message(container.pref_one);
    }
    ...
}
We had code blocks to handle transformation and validation for each pref, to bust the cache, and to handle errors. This pattern was unsustainable: the code grew very long, and making changes to a store function for a single pref carried a risk of breaking all prefs. The store design needed to evolve to provide isolation between prefs, and to be easily and safely extendable for new prefs.
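The caching side of the Store (get, set, remove, and list, with a cache in front of the database) can be sketched as follows. This is an illustration under assumptions rather than the real SlackConnectPrefsStore: the class name, the `list_prefs` method name, and the invalidate-on-write policy are hypothetical, and two dictionaries stand in for MySQL and Memcached.

```python
class SlackConnectPrefsStoreSketch:
    """Read-through cache in front of a backing table (both are dicts here)."""

    def __init__(self):
        self.db = {}       # stands in for the MySQL table
        self.cache = {}    # stands in for Memcached
        self.db_reads = 0  # counts cache misses that reached the "database"

    def get(self, team_id, target_team_id):
        key = (team_id, target_team_id)
        if key in self.cache:
            return self.cache[key]
        self.db_reads += 1
        value = self.db.get(key)
        if value is not None:
            self.cache[key] = value  # populate the cache on a miss
        return value

    def set(self, team_id, target_team_id, prefs):
        key = (team_id, target_team_id)
        self.db[key] = dict(prefs)
        self.cache.pop(key, None)  # bust the cache instead of updating it

    def remove(self, team_id, target_team_id):
        key = (team_id, target_team_id)
        self.db.pop(key, None)
        self.cache.pop(key, None)

    def list_prefs(self, team_id):
        """Return every stored row for a source team, keyed by target team."""
        return {k[1]: v for k, v in self.db.items() if k[0] == team_id}
```

Repeated reads of the same connection hit the cache, and any write or removal busts the cached entry so the next read goes back to the database.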

Evolution of the application layer

We had two competing ideas for addressing the isolation and extendability problems. One option was to use code generation to handle the transformation, and possibly the validation tasks as well. The other option was to create wrapper classes around each pref Protobuf message and have the store delegate tasks to these classes. After some discussion and design doc reviews, our team decided to go with the wrapper class option. While code generation has extensive tooling, each pref was too different to specify as a code-generated template, and would still require developers to customize certain aspects related to the pref.

We modeled our class structure to mirror the Protobuf definition. We created a container class which was a registry of all supported prefs and delegated tasks to them. We created an abstract pref class with some common abstract methods like transform, isValid, and migrate. Finally, individual prefs would inherit from the abstract pref class and implement any required methods. The container class was created from a top-level Protobuf message, SlackConnectPrefs in the example above. The container then orchestrated creation of individual pref classes (PrefOne in the example above) by taking the relevant Protobuf sub-messages and passing them to their respective classes. Each pref class knew how to handle its own sub-message. The extensibility problem was solved, because each new pref had to implement its own class. The implementer didn’t need to have any knowledge of how the store works and could focus on coding up the abstract methods. To make that job even easier, our team invested in creating detailed documentation (and still continues to update it as the code evolves). Our aim is to make the Slack Connect prefs system self-serve, with little-to-no involvement from our team.
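A rough Python sketch of this class structure is shown below. It assumes several things the post does not specify: dictionaries stand in for Protobuf messages, the method names are converted to snake_case, and what each method returns is invented for illustration. Only the overall shape (abstract pref, concrete pref, registry container) follows the description above.

```python
from abc import ABC, abstractmethod


class Pref(ABC):
    """Base class for one pref; wraps that pref's sub-message."""

    def __init__(self, message):
        self.message = message  # a dict stands in for the Protobuf sub-message

    @abstractmethod
    def transform(self):
        """Return the client-facing representation of this pref."""

    @abstractmethod
    def is_valid(self):
        """Return True when the wrapped sub-message is acceptable."""

    @abstractmethod
    def migrate(self):
        """Return the sub-message upgraded to the current shape."""


class PrefOne(Pref):
    def transform(self):
        return {"pref_one": self.message["value"]}

    def is_valid(self):
        return isinstance(self.message.get("value"), bool)

    def migrate(self):
        return self.message  # nothing to migrate in this sketch


class PrefContainer:
    """Registry of supported prefs; delegates work to each pref class."""

    REGISTRY = {"pref_one": PrefOne}

    def __init__(self, top_level_message):
        # Hand each sub-message to the class that knows how to handle it.
        self.prefs = {
            name: pref_class(top_level_message[name])
            for name, pref_class in self.REGISTRY.items()
            if name in top_level_message
        }

    def transform(self):
        merged = {}
        for pref in self.prefs.values():
            merged.update(pref.transform())
        return merged
```

Adding a new pref in this sketch means writing one new subclass and registering it; the container and the store code stay untouched.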

The final application layer looked something like this:

The isolation problem was partially solved by this design, but we needed an extra layer of protection to ensure that an exception in one pref didn’t interfere with others. This was handled at the container level. For example, when the Store needed to check that all messages in the Protobuf are valid, it would call the container’s isValid method. The container would then iterate through each pref and call that pref’s isValid method; any exceptions would be caught and logged.
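The container-level isolation can be illustrated with a short Python sketch. The class and function names here are hypothetical, and treating a pref whose validation raises as invalid is an assumption; the post only says that exceptions are caught and logged.

```python
import logging

logger = logging.getLogger("slack_connect_prefs_sketch")


class AlwaysValid:
    def is_valid(self):
        return True


class Exploding:
    def is_valid(self):
        raise RuntimeError("bug in this pref's validation")


def container_is_valid(prefs):
    """Validate every pref; a raising pref is logged and counted as invalid."""
    results = {}
    for name, pref in prefs.items():
        try:
            results[name] = pref.is_valid()
        except Exception:
            # Assumption: a failed check marks only this pref as invalid.
            logger.exception("is_valid failed for pref %s", name)
            results[name] = False
    return results
```

The key point is that the loop continues after a failure, so one misbehaving pref cannot prevent the others from being checked.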

Simplifying management at scale

So far, we have a solid database layer and a flexible application layer which can be plugged into places where we need to consume pref configuration. On the admin side, we have some dashboards which show information about external connections, pending invitations, and approvals. The APIs behind the dashboards had a common pattern of reading rows from multiple database tables, combining them, and then applying search, sort, and filtering based on API request parameters.

This approach worked fine for a few thousand external connections, but the latency kept creeping up, and the number of timeouts (and consequently triggered alerts) kept growing. The admin dashboard APIs were making too many database requests, and the resulting data sets were unbounded in the number of rows. Adding caching helped to a degree, but as the number of connections kept going up, the existing sorting, filtering, and search functionality was not meeting user needs. Performance issues and missing functionality led us to consider a different pattern for admin API handlers.

We quickly ruled out combining multiple database calls into a single SQL statement with many joins. While a database-level join would have reduced the number of individual queries, the cost of doing a join over partitioned tables is high, and something we generally avoid at Slack. Database partitioning and query performance is its own topic, and is described in more detail in Scaling Datastores at Slack with Vitess.

Our other option was to denormalize the data into a single data store and query it. The debate centered on which technology to use, with MySQL and Solr being the two options. Both of these options would require a mechanism to keep the denormalized view of the data in sync with the source-of-truth data. Solr required that we build an offline job which could rebuild the search index from scratch. MySQL guaranteed reading the data immediately after a write, while Solr had a 5 second delay. On the other hand, Solr documents are fully indexed, which gives us efficient sorting, filtering, and text search capabilities without the need to manually add indexes to support a given query. Solr also offers an easy query mechanism for array-based fields, which aren’t supported in MySQL. Adding new fields to a Solr document is easier than adding a new column to a database table, should we ever need to expand the data set we operate on. After some internal discussions, we opted to go with the Solr option for its search capabilities. In the end it proved to be the right choice: we now have a dashboard which can scale to handle hundreds of thousands of external connections, while providing fast text-based searching and filtering. We also took advantage of the ability to dynamically add fields to a Solr document, which allowed all newly created Slack Connect settings to be automatically indexed in Solr.
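As an illustration of how new settings could land in a search document without schema changes, the sketch below flattens one connection and its prefs into a single dictionary. Everything here is an assumption: the function names and field names are hypothetical, and the type suffixes (`_b`, `_i`, `_s`, `_ss`) mirror the dynamic-field patterns shipped in Solr’s default configset, which may not match the schema Slack actually uses.

```python
def dynamic_field_name(pref_name, value):
    """Pick a suffix by value type so a new pref needs no schema change."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return f"{pref_name}_b"
    if isinstance(value, int):
        return f"{pref_name}_i"
    if isinstance(value, list):
        return f"{pref_name}_ss"  # multi-valued string field
    return f"{pref_name}_s"


def build_connection_doc(team_id, target_team_id, target_team_name, prefs):
    """Denormalize one external connection and its prefs into a flat document."""
    doc = {
        "id": f"{team_id}-{target_team_id}",
        "team_id": team_id,
        "target_team_id": target_team_id,
        "target_team_name": target_team_name,
    }
    for name, value in prefs.items():
        doc[dynamic_field_name(name, value)] = value
    return doc
```

A document built this way carries every pref as its own field, so a newly added setting becomes searchable and filterable as soon as it is written to the index.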

What will we build next?

The ability to have configuration per external connection has opened a lot of doors for us. Our existing permission and policy controls are not connection-aware. Making permissions like WhoCanCreateSlackConnectChannels connection-aware can unlock a lot of growth potential. Our scaling work is never finished, and we’ll continue to have looming challenges to overcome when it comes to the number of connected teams and the number of connected external users.

If you found these technical challenges interesting, you can also join our network of employees at Slack!