Written by
Dimitri Yatsenko, PhD
Founder • Chief Science & Technology Officer
August 12, 2017

Strong Entity Integrity: Part 5 — Aggregation

Dimitri Yatsenko, PhD
Founder • Chief Science & Technology Officer

DataJoint’s principle of Strong Entity Integrity (SEI) requires that queries must not introduce new classes of entities: projection and restriction operators preserve the entity class of its operand whereas the join operator yields a new entity class but one comprising the combination of the entity classes of its operands. A DataJoint query make obvious the entity class of its result.

One operator that most egregiously violates this principle in relational algebra and SQL is the aggregation operator also known as GROUP BY. In its traditional formulation, aggregation allows performing summary operations on groups of tuples identified by a combination of grouping attributes. The effective primary key of the result are the grouping attributes. Therefore the query creates a new type of entity in the middle of a query that is not traced to any explicitly defined entity class. SQL allows a further deviation from entity integrity with its ROLLUP functionality producing results comprising mixtures of entity classes with different primary keys.

DataJoint respects SEI and defines aggregation as a binary operator allowing summary operations on subsets of entity set B grouped by entities from class A. It has the following notation: A.aggr(B, ...') where ... is a list of aggregation computations such as n='sum(attr)`. In this way aggregation is a form of projection, preserving the entity class of A.

A.aggr(B, ...) is equivalent to SELECT ... FROM B GROUP BY . As always, the primary key cannot be omitted.

DataJoint simply makes explicit what programmers imply implicitly when they perform a GROUP BY operation. When grouping, we conceive of some sort of entity represented by the grouping attributes. In most cases, that entity class is already defined, often in the form of a table. Yet SQL forces users to spell out the list of attributes rather than explicitly specify the aggregating entity set.

For example, consider the following table definition for college course grades:

@schema

class Grade(dj.Manual):

   definition = """

   -> Student

   -> Course

   ---

   grade: decimal(3, 2)  # grade e.g. 3.67

   """

Then studentss GPA can be computed as

Student().aggr(Grade(), gpa='AVG(grade)')

The entity class of the result is the same as that of the aggregating entity: the result is still a set entities of class Student.

Related posts

Updates Delivered *Straight to Your Inbox*

Join the mailing list for industry insights, company news, and product updates delivered monthly.

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.