May 18, 2024
  • UPM is our internal standalone library for performing static analysis of SQL code and improving SQL authoring.
  • UPM takes SQL code as input and represents it as a data structure called a semantic tree.
  • Infrastructure teams at Meta use UPM to build SQL linters, catch user errors in SQL code, and perform data lineage analysis at scale.

Executing SQL queries against our data warehouse is important to the workflows of many engineers and data scientists at Meta for analytics and monitoring use cases, either as part of recurring data pipelines or for ad-hoc data exploration.

While SQL is extremely powerful and very popular among our engineers, we’ve also faced some challenges over the years, namely:

  • A need for static analysis capabilities: In a growing number of use cases at Meta, we must understand programmatically what happens in SQL queries before they are executed against our query engines, a task called static analysis. These use cases range from performance linters (suggesting query optimizations that query engines cannot perform automatically) to data lineage analysis (tracing how data flows from one table to another). This was hard for us to do for two reasons: First, while query engines internally have some capabilities to analyze a SQL query in order to execute it, this query analysis component is typically deeply embedded inside the query engine’s code. It is not easy to extend, and it is not intended for consumption by other infrastructure teams. Second, each query engine has its own analysis logic, specific to its own SQL dialect; as a result, a team that wants to build a piece of analysis for SQL queries would have to reimplement it from scratch inside each SQL query engine.
  • A limiting type system: Initially, we used only the fixed set of built-in Hive data types (string, integer, boolean, etc.) to describe table columns in our data warehouse. As our warehouse grew more complex, this set of types became insufficient, as it left us unable to catch common categories of user errors, such as unit errors (imagine making a UNION between two tables, both of which contain a column called timestamp, but one is encoded in milliseconds and the other in nanoseconds), or ID comparison errors (imagine a JOIN between two tables, each with a column called user_id, where the IDs are in fact issued by different systems and therefore cannot be compared).

How UPM works

To address these challenges, we have built UPM (Unified Programming Model). UPM takes in a SQL query as input and represents it as a hierarchical data structure called a semantic tree.

For example, if you pass this query to UPM:

SELECT
COUNT(DISTINCT user_id) AS n_users
FROM login_events

UPM will return this semantic tree:

SelectQuery(
    items=[
        SelectItem(
            name="n_users",
            type=upm.Integer,
            value=CallExpression(
                function=upm.builtin.COUNT_DISTINCT,
                arguments=[ColumnRef(name="user_id", parent=Table("login_events"))],
            ),
        )
    ],
    parent=Table("login_events"),
)

Other tools can then use this semantic tree for different use cases, such as:

  1. Static analysis: A tool can inspect the semantic tree and then output diagnostics or warnings about the query (such as a SQL linter).
  2. Query rewriting: A tool can modify the semantic tree to rewrite the query.
  3. Query execution: UPM can act as a pluggable SQL front end, meaning that a database engine or query engine can use a UPM semantic tree directly to generate and execute a query plan. (The term front end in this context is borrowed from the world of compilers; the front end is the part of a compiler that converts higher-level code into an intermediate representation that will ultimately be used to generate an executable program.) Alternatively, UPM can render the semantic tree back into a target SQL dialect (as a string) and pass that to the query engine.
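
The three uses above can be illustrated with a toy model. The sketch below is not UPM's API: the dataclasses only borrow the node names from the tree printed earlier, and lint, rewrite_to_approx, and render are hypothetical helpers. The rewrite target APPROX_DISTINCT is chosen purely for illustration (Presto offers an approx_distinct function).

```python
from dataclasses import dataclass, replace
from typing import Tuple, Union

# Hypothetical stand-ins for semantic tree nodes; not UPM's real classes.
@dataclass(frozen=True)
class Table:
    name: str

@dataclass(frozen=True)
class ColumnRef:
    name: str
    parent: Table

@dataclass(frozen=True)
class CallExpression:
    function: str                      # e.g. "COUNT_DISTINCT"
    arguments: Tuple["Expr", ...]

Expr = Union[ColumnRef, CallExpression]

@dataclass(frozen=True)
class SelectItem:
    name: str
    value: Expr

@dataclass(frozen=True)
class SelectQuery:
    items: Tuple[SelectItem, ...]
    parent: Table

def _is_count_distinct(expr):
    return isinstance(expr, CallExpression) and expr.function == "COUNT_DISTINCT"

def lint(query):
    """Static analysis: inspect the tree and return warnings."""
    return [
        f"{item.name}: exact COUNT(DISTINCT ...) may be expensive; "
        "consider an approximate count"
        for item in query.items
        if _is_count_distinct(item.value)
    ]

def rewrite_to_approx(query):
    """Query rewriting: return a new tree with COUNT_DISTINCT swapped out."""
    new_items = tuple(
        replace(item, value=replace(item.value, function="APPROX_DISTINCT"))
        if _is_count_distinct(item.value)
        else item
        for item in query.items
    )
    return replace(query, items=new_items)

def render_expr(expr):
    if isinstance(expr, ColumnRef):
        return expr.name
    args = ", ".join(render_expr(a) for a in expr.arguments)
    if expr.function == "COUNT_DISTINCT":
        return f"COUNT(DISTINCT {args})"
    return f"{expr.function}({args})"

def render(query):
    """Rendering: turn the tree back into a SQL string."""
    cols = ", ".join(f"{render_expr(i.value)} AS {i.name}" for i in query.items)
    return f"SELECT {cols} FROM {query.parent.name}"

events = Table("login_events")
query = SelectQuery(
    items=(
        SelectItem(
            "n_users",
            CallExpression("COUNT_DISTINCT", (ColumnRef("user_id", events),)),
        ),
    ),
    parent=events,
)

print(lint(query))
print(render(query))
print(render(rewrite_to_approx(query)))
```

Because every tool in this sketch works on the same tree, the linter and the rewriter never touch SQL text; only render deals with dialect-specific syntax.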

A unified SQL language front end

UPM allows us to provide a single language front end to our SQL users so that they only need to work with a single language (a superset of the Presto SQL dialect), whether their target engine is Presto, Spark, or XStream, our in-house stream processing service.

This unification also benefits our data infrastructure teams: teams that own SQL static analysis or rewriting tools can use UPM semantic trees as a standard interop format, without worrying about parsing, analysis, or integration with different SQL query engines and SQL dialects. Similarly, much like Velox can act as a pluggable execution engine for data management systems, UPM can act as a pluggable language front end for data management systems, saving teams the effort of maintaining their own SQL front end.

Enhanced type-checking

UPM also allows us to provide enhanced type-checking of SQL queries.

In our warehouse, each table column is assigned a “physical” type from a fixed list, such as integer or string. Additionally, each column can have an optional user-defined type; while it does not affect how the data is encoded on disk, this type can supply semantic information (e.g., Email, TimestampMilliseconds, or UserID). UPM can take advantage of these user-defined types to improve static type-checking of SQL queries.

For example, a SQL query author might want to UNION data from two tables that contain information about different login events.

Suppose the author tries to combine timestamps in milliseconds from the table user_login_events_mobile with timestamps in nanoseconds from the table user_login_events_desktop, an understandable mistake, as the two columns have the same name. But because the tables’ schemas have been annotated with user-defined types, UPM’s typechecker catches the error before the query reaches the query engine; it then notifies the author in their code editor. Without this check, the query would have completed successfully, and the author might not have noticed the error until much later.
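
A minimal sketch of this kind of check follows. It is an illustration under assumptions, not UPM's typechecker: Column and check_union are hypothetical names, the schemas are invented for the example, and TimestampNanoseconds is assumed by analogy with the TimestampMilliseconds type mentioned above.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Column:
    name: str
    physical_type: str                # how the data is encoded, e.g. "bigint"
    user_type: Optional[str] = None   # optional semantic annotation

def check_union(left: List[Column], right: List[Column]) -> List[str]:
    """Compare UNION inputs position by position and collect type errors."""
    if len(left) != len(right):
        return [f"UNION inputs have {len(left)} and {len(right)} columns"]
    errors = []
    for l, r in zip(left, right):
        if l.physical_type != r.physical_type:
            errors.append(
                f"{l.name}: physical types differ "
                f"({l.physical_type} vs {r.physical_type})"
            )
        # Only compare user-defined types when both sides are annotated.
        elif l.user_type and r.user_type and l.user_type != r.user_type:
            errors.append(
                f"{l.name}: user-defined types differ "
                f"({l.user_type} vs {r.user_type})"
            )
    return errors

# Hypothetical schemas for the two tables named in the text.
user_login_events_mobile = [
    Column("user_id", "bigint", "UserID"),
    Column("timestamp", "bigint", "TimestampMilliseconds"),
]
user_login_events_desktop = [
    Column("user_id", "bigint", "UserID"),
    Column("timestamp", "bigint", "TimestampNanoseconds"),
]

errors = check_union(user_login_events_mobile, user_login_events_desktop)
for message in errors:
    print(message)
```

Both timestamp columns share the same physical type, so a check on physical types alone would accept the query; only the user-defined annotation exposes the unit mismatch. In this sketch, columns without an annotation fall back to the physical-type check, which is one possible way to tolerate partially annotated schemas.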

Column-level data lineage

Data lineage, understanding how data flows within our warehouse and through to consumption surfaces, is a foundational piece of our data infrastructure. It enables us to answer data quality questions (e.g., “This data looks incorrect; where is it coming from?” and “Data in this table was corrupted; which downstream data assets were impacted?”). It also helps with data refactoring (“Is this table safe to delete? Is anyone still depending on it?”).

To help us answer these critical questions, our data lineage team has built a query analysis tool that takes UPM semantic trees as input. The tool examines all recurring SQL queries to build a column-level data lineage graph across our entire warehouse. For example, given this query:

INSERT INTO user_logins_daily_agg
SELECT
   DATE(login_timestamp) AS day,
   COUNT(DISTINCT user_id) AS n_users
FROM user_login_events
GROUP BY 1

Our UPM-powered column lineage analysis would deduce these edges:

[
  {
    from: "user_login_events.login_timestamp",
    to: "user_logins_daily_agg.day",
    transform: "DATE"
  },
  {
    from: "user_login_events.user_id",
    to: "user_logins_daily_agg.n_users",
    transform: "COUNT_DISTINCT"
  }
]

By putting this information together for every query executed against our data warehouse each day, the tool shows us a global view of the full column-level data lineage graph.
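
The sketch below shows one way such edges could be derived and merged into a graph. It assumes a heavily simplified query summary in place of a real semantic tree: OutputColumn, InsertQuery, lineage_edges, build_graph, and downstream are hypothetical names, and the weekly aggregation query is invented to give the graph a second hop.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

@dataclass(frozen=True)
class OutputColumn:
    name: str                  # column written in the target table
    transform: str             # function applied to the source columns
    sources: Tuple[str, ...]   # source columns read by the expression

@dataclass(frozen=True)
class InsertQuery:
    target: str
    source: str
    outputs: Tuple[OutputColumn, ...]

def lineage_edges(query: InsertQuery) -> List[dict]:
    """Emit one edge per (source column, output column) pair."""
    return [
        {
            "from": f"{query.source}.{src}",
            "to": f"{query.target}.{out.name}",
            "transform": out.transform,
        }
        for out in query.outputs
        for src in out.sources
    ]

def build_graph(queries: List[InsertQuery]) -> Dict[str, Set[str]]:
    """Merge the edges of many queries into one adjacency map."""
    graph: Dict[str, Set[str]] = {}
    for query in queries:
        for edge in lineage_edges(query):
            graph.setdefault(edge["from"], set()).add(edge["to"])
    return graph

def downstream(graph: Dict[str, Set[str]], column: str) -> Set[str]:
    """Return every column reachable from the given column."""
    seen: Set[str] = set()
    stack = [column]
    while stack:
        for nxt in graph.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

# The query from the text, summarized by hand.
daily = InsertQuery(
    target="user_logins_daily_agg",
    source="user_login_events",
    outputs=(
        OutputColumn("day", "DATE", ("login_timestamp",)),
        OutputColumn("n_users", "COUNT_DISTINCT", ("user_id",)),
    ),
)
# A hypothetical second pipeline that reads the daily table.
weekly = InsertQuery(
    target="user_logins_weekly_agg",
    source="user_logins_daily_agg",
    outputs=(
        OutputColumn("week", "DATE_TRUNC", ("day",)),
        OutputColumn("max_daily_users", "MAX", ("n_users",)),
    ),
)

edges = lineage_edges(daily)
graph = build_graph([daily, weekly])
impacted = downstream(graph, "user_login_events.user_id")
print(edges)
print(sorted(impacted))
```

A reachability query like downstream is what answers the "which downstream data assets were impacted?" question; asking whether a table is safe to delete amounts to checking that none of its columns has outgoing edges.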

What’s next for UPM

We look forward to more exciting work as we continue to unlock UPM’s full potential at Meta. Eventually, we hope all Meta warehouse tables will be annotated with user-defined types and other metadata, and that enhanced type-checking will be strictly enforced in every authoring surface. Most tables in our Hive warehouse already leverage user-defined types, but we are rolling out stricter type-checking rules gradually, to facilitate the migration of existing SQL pipelines.

We have already integrated UPM into the main surfaces where Meta’s developers write SQL, and our long-term goal is for UPM to become Meta’s unified SQL front end: deeply integrated into all our query engines, exposing a single SQL dialect to our developers. We also intend to iterate on the ergonomics of this unified SQL dialect (for example, by allowing trailing commas in SELECT clauses and by supporting syntax constructs like SELECT * EXCEPT <some_columns>, which already exist in some SQL dialects) and to ultimately raise the level of abstraction at which people write their queries.