April 15, 2024
Pinterest Engineering

Jeremy Krach | Staff Security Engineer, Platform Security

A few years ago, Pinterest had a brief incident due to oversights in the policy delivery engine. This engine is the technology that ensures a policy document written by a developer and checked into source control is fully delivered to the production system evaluating that policy, similar to OPAL. This incident began a multi-year journey for our team to rethink policy delivery and migrate hundreds of policies to a new distribution model. We shared details about our former policy delivery system in a conference talk from KubeCon 2019.

At a high level, there are three important architectural decisions we'd like to call attention to for this story.

Figure 1: Old policy distribution architecture, using S3 and Zookeeper.
  1. Pinterest provides a wrapper service around OPA in order to manage policy distribution, agent configuration, metrics, logging, and simplified APIs.
  2. Policies were fetched automatically via Zookeeper as soon as a new version was published.
  3. Policies lived in a shared Phabricator repository that was published via a CI workflow.

So where did this go wrong? Primarily, bad versions (50+ at the time) of every policy were published simultaneously due to a bad commit to the policy repository. These bad versions were published to S3, with new versions registered in Zookeeper and pulled directly into production. This caused many of our internal services to fail simultaneously. Fortunately, a quick re-run of our CI published known-good versions that were (again) pulled directly into production.

This incident led several teams to begin rethinking global configuration (like OPA policy). Specifically, the Security team and Traffic team at Pinterest began collaborating on a new configuration delivery system that would provide a mechanism to define deployment pipelines for configuration.

This blog post is focused on how the Security team moved hundreds of policies and dozens of clients from the Zookeeper model to a safer, more reliable, and more configurable config deployment approach.

The core configuration delivery story here isn't the Security team's to tell: Pinterest's Traffic team worked closely with us to understand our requirements, and that team was ultimately responsible for building out the core technology to enable our integration.

Generally speaking, the new configuration management system works as follows:

  1. Config owners create their configuration in a shared repository.
  2. Configs are grouped by service owners into "artifacts" in a DSL in that repository.
  3. Artifacts are configured with a pipeline, also in a DSL in that repository. This defines which systems receive the artifact and when.

Each pipeline defines a set of steps and a set of delivery scopes for each step. These scopes are generated locally on each system that would like to retrieve a configuration. For example, one might define a pipeline that first delivers to the canary system and then the production system (simplified here):
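A minimal sketch of what such a pipeline definition might look like, expressed here as a Python structure rather than the real internal DSL; all names (the artifact, scope strings, and promotion modes) are invented for illustration:

```python
# Hypothetical two-step pipeline: canary first, then production.
# The scope strings and field names are illustrative stand-ins for
# Pinterest's internal pipeline DSL.
PIPELINE = {
    "artifact": "authz-policies",
    "steps": [
        {
            "name": "canary",
            "scopes": ["spiffe://pinterest/authz-service/canary"],
            "promotion": "automatic-business-hours",
        },
        {
            "name": "production",
            "scopes": ["spiffe://pinterest/authz-service/prod"],
            "promotion": "manual",
            # Metric thresholds that must not be exceeded before this
            # step is allowed to proceed.
            "metric_thresholds": {"policy_eval_error_rate": 0.01},
        },
    ],
}
```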

The DSL also allows for configuration of how pipeline steps are promoted: automatic (within business hours), automatic (24×7), and manual. It also allows for configuration of metric thresholds that must not be exceeded before proceeding to the next step.

The actual distribution technology is not dissimilar to the original architecture. Now, instead of publishing policy in a global CI job, each artifact (a group of policies and other configuration) has a dedicated pipeline defining the scope of delivery and the triggers for delivery. This ensures each policy rollout is isolated to just that system and can have whatever deployment strategy and safety checks the service owner deems appropriate. A high-level architecture can be seen below.

Figure 2: New policy distribution architecture, using Config server/sidecar and dedicated UI.

Phase 1: Tooling and Inventory

Before we could begin migrating policies from a global, instantaneous deployment model to a targeted, staged deployment model, a lot of information needed to be collected. Specifically, for each policy file in our old configuration repository we needed to identify:

  1. The service and GitHub team associated with the policy
  2. The systems using the policy
  3. The preferred deploy order for the systems using the policy

Fortunately, most of this information was readily available from a handful of data sources at Pinterest. During this first phase of the migration, we developed a script to collect all this metadata about each policy. This involved: reading each policy file to pull the associated service name from a mandatory tag comment, fetching the GitHub team associated with the service from our internal inventory API, getting metrics for all systems with traffic for the policy, and grouping those systems into a rough classification based on several common naming conventions. Once this data was generated, we exported it to Google Sheets in order to annotate it with some manual tweaks. Specifically, some systems were misattributed to owners due to stale ownership data, and many systems didn't follow standard, predictable naming conventions.
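The collection step can be sketched roughly as follows. The tag-comment format, the classification prefixes, and the function names are all assumptions for illustration; the real script also queried an internal inventory API and traffic metrics, which are omitted here:

```python
import re

# Hypothetical mandatory tag comment, e.g. "# service: ads-authz".
SERVICE_TAG = re.compile(r"#\s*service:\s*(\S+)")


def classify_system(hostname: str) -> str:
    """Rough grouping based on common naming conventions (illustrative)."""
    for prefix in ("canary-", "staging-"):
        if hostname.startswith(prefix):
            return prefix.rstrip("-")
    return "prod"


def collect_policy_metadata(policy_text: str, hostnames: list[str]) -> dict:
    """Pull the owning service from the tag comment and bucket the systems
    that evaluate this policy into rough deployment groups."""
    match = SERVICE_TAG.search(policy_text)
    return {
        "service": match.group(1) if match else None,
        # In the real tooling, the GitHub team came from the inventory API
        # and the host list from per-policy traffic metrics.
        "systems": sorted({classify_system(h) for h in hostnames}),
    }
```

The output of a pass like this is exactly what was exported to Google Sheets for manual correction.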

The next piece of tooling we developed was a script that took several pieces of input: the path to the policy to be migrated, the team names, and the deployment steps. It automatically moved the policy from the old repository to the new one, generated an artifact that included the policy, and defined a deployment pipeline for the relevant systems attributed to the service owner.
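In spirit, that generator looked something like the sketch below; the field names are invented stand-ins for the real DSL, and the file-moving and repository plumbing are omitted:

```python
from pathlib import Path


def generate_migration(policy_path: Path, teams: list[str],
                       steps: list[str]) -> dict:
    """Produce an artifact plus pipeline definition for one policy.

    A hypothetical sketch: the real script also moved the policy file
    from the old repository into the new one.
    """
    name = policy_path.stem
    return {
        "artifact": {
            "name": name,
            "owners": teams,
            "files": [policy_path.name],
        },
        # One pipeline step per target system group, in deploy order.
        "pipeline": [{"step": s, "promotion": "manual"} for s in steps],
    }
```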

With all this tooling in hand, we were ready to start testing the migration tooling against some simple examples.

Phase 2: Cutover Logic

Prior to the new policy delivery model, teams would define their policy subscriptions in a config file managed by Telefig. One of our goals for this migration was ensuring a seamless cutover that required minimal or no customer changes. Since the new configuration management system provided the concept of scopes and defined the policy subscription in the configuration repository, we could rely purely on the new repository to define where policies were needed. We needed to update our sidecar (the OPA wrapper) to generate subscription scopes locally during start-up based on system attributes. We chose to generate these scopes based on the SPIFFE ID of the system, which allowed us to couple the deployments closely to the service and environment of the host.
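Deriving a scope from a SPIFFE ID can be sketched as below. The path layout (`spiffe://<trust-domain>/<service>/<environment>`) and the scope format are assumptions for illustration, not Pinterest's actual scheme:

```python
from urllib.parse import urlparse


def scope_from_spiffe(spiffe_id: str) -> str:
    """Derive a subscription scope from a host's SPIFFE ID.

    Assumes a hypothetical layout of
    spiffe://<trust-domain>/<service>/<environment>, which couples the
    delivered configuration to both service and environment.
    """
    parsed = urlparse(spiffe_id)
    service, environment = parsed.path.strip("/").split("/")[:2]
    return f"{service}:{environment}"
```

Because the scope is computed at sidecar start-up from the host's own identity, no per-team subscription file is needed.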

We also recognized that since the configuration system can deliver arbitrary configs, we could also deliver a configuration telling our OPA wrapper to change its behavior. We implemented this cutover logic as a hot-reload of configuration in the OPA wrapper. When a new configuration file is created, the OPA wrapper detects the new configuration and changes the following properties:

  1. Where the policies are stored on disk (reload of the OPA runtime engine)
  2. How the policies are updated on disk (ZooKeeper subscription defined by a customer-managed configuration file vs. doing nothing and allowing the configuration sidecar to manage it)
  3. Metric tags, to allow detection of cutover progress
Figure 3: Flowchart of the policy cutover logic.
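The cutover decision can be sketched as follows. The paths and the `use_config_sidecar` key are hypothetical; only the three-way behavior change (policy directory, update mechanism, metric tags) comes from the description above:

```python
import json
from pathlib import Path

# Illustrative paths, not the wrapper's real layout.
LEGACY_POLICY_DIR = "/var/opa/zookeeper-policies"
MANAGED_POLICY_DIR = "/var/opa/managed-policies"


def apply_cutover_config(config_path: Path) -> dict:
    """Decide the OPA wrapper's behavior from the (possibly absent)
    cutover file delivered by the new configuration system."""
    if not config_path.exists():
        # No cutover config delivered yet: keep the legacy ZooKeeper flow.
        return {"policy_dir": LEGACY_POLICY_DIR,
                "subscribe_zookeeper": True,
                "metric_tags": {"delivery": "zookeeper"}}
    cutover = json.loads(config_path.read_text())
    if cutover.get("use_config_sidecar"):
        # The config sidecar manages files on disk; the wrapper only
        # reloads the OPA runtime when they change.
        return {"policy_dir": MANAGED_POLICY_DIR,
                "subscribe_zookeeper": False,
                "metric_tags": {"delivery": "config-sidecar"}}
    # Explicit revert to legacy behavior, delivered through the new system.
    return {"policy_dir": LEGACY_POLICY_DIR,
            "subscribe_zookeeper": True,
            "metric_tags": {"delivery": "zookeeper"}}
```

Keying the revert path off the same delivered file is what makes rollback a config change rather than a code change.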

One benefit of this approach is that reverting the policy distribution mechanism could be done entirely within the new system. If a service didn't work well with the new deployment system, we could use the new deployment system to update the new configuration file to tell the OPA wrapper to use the legacy behavior. Switching between modes could be done seamlessly with no downtime or impact to customers using policies.

Since both the policy setup and the cutover configuration could happen in a single repository, each policy or service could be migrated with a single pull request without any need for customer input. All files in the new repository could be generated with our previously-built tooling. This set the stage for a long series of migrations with localized impact to only the policy being migrated.

At this point, the foundation was laid to begin the migration in earnest. Over the course of a month or two, we began auto-generating pull requests scoped to single teams or policies. Primarily Security and Traffic team members generated and reviewed these PRs to ensure the deployments were properly scoped, associated with the correct teams, and rolled out successfully.

As mentioned before, we had hundreds of policies that needed to be migrated, so this was a steady but long process of moving policies in chunks. As we gained confidence in our tooling, we ramped up the number of policies migrated in a given PR from 1–2 to 10–20.

As with any plan, there were some unforeseen issues that came up as we deployed policies to a more diverse set of systems. What we found was that some of our older stateful systems were running an older machine image (AMI) that didn't support subscription declaration. This presented an immediate roadblock for progress on systems that could not easily be relaunched.

Fortunately, our Continuous Deployment team was actively revising how the Telefig service receives updates. We worked closely with the CD team to ensure that we dynamically upgraded all systems at Pinterest to use the latest version of Telefig. This unblocked our work and allowed us to continue migrating the remaining use cases.

Once we resolved the old Telefig version issue, we quickly worked with the few teams that owned the bulk of the remaining policies to get everything moved over to the new configuration deployment model. Below is a rough timeline of the migration:

Figure 4: Timeline of the migration to the new policy framework.

Once the metrics above stabilized at 100%, we began cleaning up the old tooling. This allowed us to delete hundreds of lines of code and drastically simplify the OPA wrapper, since it no longer had to build in policy distribution logic.

At the end of this process, we now have a safer policy deployment platform that allows our teams to have full control over their deployment pipelines and fully isolate each deployment from policies not included in that deployment.

Migrating things is hard. There's always resistance to a new workflow, and the more people who have to interact with it, the longer the tail on the migration. The main takeaways from this migration are as follows.

First, focus on measurement. In order to stay on track, you need to know who will be impacted, the scope of what work remains, and what big wins are behind you. Having good measurement also helps justify the project and provides a great set of resources to brag about accomplishments at milestones along the way.

Secondly, migrations often follow the Pareto principle. Specifically, 20% of the use cases to be migrated will often account for 80% of the results. This is visible in the timeline chart above: there are two huge spikes in progress (one in mid-April and one several weeks later). These spikes represent migrations for two teams, but they account for an outsized proportion of the overall progress. Keep this in mind when prioritizing which systems to migrate, as sometimes spending a lot of time just to migrate one team or system can have a disproportionate payoff.

Finally, anticipate issues but be ready to adapt. Spend time early in the process thinking through your edge cases, but leave yourself extra time on the roadmap to account for issues that you could not predict. A little bit of buffer goes a long way for peace of mind, and if you happen to deliver the results early, that's a great win to celebrate!

This work wouldn't have been possible without a huge group of people working together over the past few years to build the best system possible.

Huge thanks to our partners on the Traffic team for building out a robust configuration deployment system and onboarding us as the first large-scale production use case. Specifically, thanks to Tian Zhao, who led most of our collaboration and was instrumental in getting our use case onboarded. Additional thanks to Zhewei Hu, James Fish, and Scott Beardsley.

The Security team was also a huge help in reviewing the architecture, migration plans, and pull requests. Special thanks to Teagan Todd, who was a huge help in running many of these migrations, as well as Yuping Li, Kevin Hock, and Cedric Staub.

When we encountered issues with older systems, Anh Nguyen was a huge help in upgrading systems under the hood.

Finally, thank you to our partners on teams that owned a large number of policies, as they helped us push the migration forward by performing their own migrations: Aneesh Nelavelly, Vivian Huang, James Fraser, Gabriel Raphael Garcia Montoya, Liqi Yi (He Him), Qi LI, Mauricio Rivera, and Harekam Singh.

To learn more about engineering at Pinterest, check out the rest of our Engineering Blog and visit our Pinterest Labs site. To explore and apply to open roles, visit our Careers page.