Insights & Ideas - DataJoint

When Scientific AI Forgets How It Got There

Pharma R&D has spent the last three years quietly making AI load-bearing. Models now propose targets, score compounds, design assays, summarize literature, and increasingly draft sections of regulatory documents. The conversation has been about capability — what AI can do. The conversation we are not having, and need to, is about provenance — what the institution can prove about how it did it.

The argument is narrow and specific: scientific AI without deterministic provenance is not just messy. It is institutionally dangerous.

By deterministic provenance, I mean the ability to point at any AI-derived result — a hit list, a predicted structure, an annotated cohort, a generated experimental protocol — and reconstruct, bit-for-bit, the data, code, model version, parameters, random seeds, and upstream transformations that produced it. Not “logged somewhere.” Not “the analyst remembers.” Reproducible on demand, by someone other than the person who first ran it.
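One way to make this definition concrete is to fingerprint everything that went into a result. The sketch below is illustrative only: the field names, the `make_record` helper, and the sample values are assumptions for this example, not part of any particular platform. It shows that two runs differing only in their random seed produce different fingerprints, while an identical record always hashes the same way.

```python
import hashlib
import json


def fingerprint(record: dict) -> str:
    """Return a stable SHA-256 digest of a provenance record."""
    # sort_keys and fixed separators give a canonical serialization,
    # so the same record always produces the same digest.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_record(data_bytes, code_version, model_version, params, seed, upstream):
    """Bundle the ingredients needed to rebuild a result (hypothetical fields)."""
    return {
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
        "code_version": code_version,
        "model_version": model_version,
        "params": params,
        "seed": seed,
        "upstream": sorted(upstream),  # fingerprints of upstream results
    }


data = b"compound,score\nA,0.91\n"  # stand-in for an input file
record_a = make_record(data, "abc123", "model-1.0", {"threshold": 0.5}, 7, [])
record_b = make_record(data, "abc123", "model-1.0", {"threshold": 0.5}, 8, [])

same = fingerprint(record_a) == fingerprint(dict(record_a))
differs = fingerprint(record_a) != fingerprint(record_b)
print(same, differs)
```

A fingerprint like this only identifies the ingredients. Rebuilding the result still requires that the data, code, and model behind each field remain retrievable.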

Most pharma AI today does not meet this bar. Notebooks reference data files whose schemas have since drifted. Foundation models are version-pinned on paper but rarely in practice. Pipelines stitch together SaaS APIs, internal scripts, and curated spreadsheets. The output looks crisp; the lineage is a smear.

This is fine when AI is a brainstorming partner. It becomes dangerous the moment AI output enters the institutional record — IND-enabling packages, IP filings, partnership data rooms, internal go/no-go decisions. Three failure modes follow.

First, regulatory exposure. FDA and EMA expectations are converging on reconstructable analyses. “Our model said so” is not a defense if the model, weights, and inputs cannot be reproduced. Sponsors who accelerate with AI but cannot defend the chain of derivation are accumulating a regulatory debt that comes due during inspection, not before.

Second, audit blindness when programs fail. Most clinical and preclinical programs fail. The institutional value of a failed program is the post-mortem — what did we believe, why did we believe it, where did the evidence break. When AI-derived intermediates cannot be reconstructed, the post-mortem cannot be performed honestly. The organization loses the ability to learn from its own failures, which is the most expensive form of institutional damage.

Third, decision velocity outrunning evidence integrity. AI compresses cycle times, which is the point. But the same compression means provenance debt accrues faster than humans can repair it. A team can make twelve AI-assisted decisions in the time it used to make one, with one-twelfth the lineage discipline. The danger is not that any one decision is wrong. It is that the organization can no longer tell which decisions rest on which evidence.

The fix is not more logging. Logging is observational and lossy. Deterministic provenance has to be a property of the system that produces results, not a record kept alongside it — data, code, and computation tracked in the same structure, with versioning and lineage that are queryable and reproducible. Designed in, not bolted on.

Leaders evaluating AI tooling should ask a single question of every vendor and every internal team: if a regulator, a partner, or a future post-mortem asks how this result was produced, can we rebuild it from first inputs, without depending on the person who ran it? If the honest answer is no, the institution is not adopting AI. It is borrowing against its own credibility.

The organizations that win the next decade of scientific AI will not be the ones with the largest models. They will be the ones whose AI outputs are still defensible three years after the analyst has left.

This is the conviction behind how we built the DataJoint platform: provenance as a first-class property of scientific computation, not a logbook kept beside it.

May 12, 2026

Insights & Ideas
Neuropixels, Plainly Explained
A 1-Minute Overview of Neuropixels

What if you could listen to hundreds of brain cells at once and still keep the data trustworthy and easy to share? This is what Neuropixels makes possible and how the community turns those signals into reliable results.

Quick hits: what Neuropixels make possible

Neuropixels are silicon probes that continuously detect tiny voltage changes from nearby neurons. Neuropixels probes, first released in 2017, set a new gold standard for large-scale electrophysiology [1]. The goal was simple to state and hard to do: record many single neurons in freely behaving animals with light, flexible hardware and many recording channels. Earlier probe stacks, such as passive NeuroNexus shanks with INTAN headstages, offered only tens of channels and relied on long analog cables that picked up electrical noise.

Neuropixels take a different path. Using complementary metal oxide semiconductor (CMOS) fabrication, the probe integrates amplification, multiplexing, and digitization on the device itself. Placing the electronics next to the recording sites shortens analog paths, reduces noise, and enables hundreds of addressable channels. This was not practical earlier because fabrication yields on long slender shanks were low, power and heat budgets were tight, and packaging had to remain biocompatible. Process advances and coordinated investment made it feasible. The result was a step change in experimental capability: stable, simultaneous recordings from hundreds of neurons across multiple brain regions in freely moving animals [1].

Adoption is broad. Our conservative estimate is that more than 850 laboratories have used Neuropixels since 2017*.

Who built it

The Neuropixels project is an international collaboration led by IMEC (Interuniversity Microelectronics Centre, Belgium) with partners including the Wellcome Trust, Howard Hughes Medical Institute (HHMI) Janelia Research Campus (Timothy Harris and colleagues), University College London (Matteo Carandini, Nicholas Steinmetz, and collaborators), the Allen Institute for Brain Science (Christof Koch and colleagues), Gatsby Charitable Foundation, the Sainsbury Wellcome Centre (John O’Keefe and colleagues), among others acknowledged in the original publications [1-3]. Their collaboration transformed what was once a technological aspiration into an accessible community resource. Neuropixels 2.0, co-developed with Cambridge NeuroTech, built upon this foundation with improved reusability and chronic stability [4,5].

From 1.0 to Opto: The Probe Line-Up

The development of Neuropixels has progressed through a series of significant releases, each enhancing its capabilities and resolution. This evolution highlights the ongoing advancements in neural recording technology.

Most Neuropixels models allow you to select up to 384 channels to record simultaneously from many more physical sites on the shank. The variants below differ mainly in site geometry, density, shank layout, and added capabilities (primate length, ultra-dense mapping, or built-in light for optogenetics).
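The distinction between sites and channels can be sketched in a few lines. The example below uses the figures given in this article for Neuropixels 1.0 (960 sites, 384 simultaneous channels). The `select_sites` helper is hypothetical, and it ignores any hardware wiring rules about which sites can be combined, so treat it as a simplified illustration rather than a description of the probe's actual configuration interface.

```python
N_SITES = 960        # physical sites on a Neuropixels 1.0 shank (from the text)
MAX_CHANNELS = 384   # signals that can be read simultaneously (from the text)


def select_sites(first_site, count=MAX_CHANNELS):
    """Pick a contiguous block of sites to route to the readout channels.

    Simplifying assumption: any contiguous block is allowed.
    """
    if count > MAX_CHANNELS:
        raise ValueError("cannot read more than %d channels at once" % MAX_CHANNELS)
    if first_site < 0 or first_site + count > N_SITES:
        raise ValueError("selection runs past the end of the shank")
    return list(range(first_site, first_site + count))


tip_block = select_sites(0)      # the 384 sites nearest the tip
upper_block = select_sites(576)  # the 384 sites farthest from the tip
print(len(tip_block), upper_block[0], upper_block[-1])
```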

Illustration showing the evolution of Neuropixels variants. Together, these releases extend Neuropixels from rodent studies to larger brains and from recording alone to tightly coupled recording and control. ‘Sites’ refer to individual electrode contact points, while ‘channels’ are the signals that can be read simultaneously.

Since its debut in 2017, the Neuropixels series has revolutionized neuroscience research. The original Neuropixels 1.0 featured 960 recording sites, setting a new standard for high-density neural data collection in rodents. Building on that success, Neuropixels 2.0 arrived in 2021 with 1,280 recording sites per shank, suited to chronic experiments in freely moving animals. In 2023, the lineup expanded to include Neuropixels NHP for primate and clinical research. The same year, Neuropixels Ultra raised site density further, improving spatial resolution for more precise spike detection and cell-type identification. Most recently, the 2025 introduction of Neuropixels Opto combined electrophysiology with neurophotonics, enabling simultaneous electrical recording and optical manipulation of neuronal activity through optogenetics.

These innovations represent a significant leap forward in neural research, evolving from small-scale rodent studies to detailed, large-scale analyses of the brain system. This includes ultra-high-resolution mapping and multimodal data integration, opening an exciting new chapter in neural exploration.

The Three-Step Guide to Choosing a Probe

Different probe designs serve different needs. Your choice should follow the question, species, surgery plan, recording duration, and budget.

Neuropixels integrates on-probe CMOS electronics with very high site counts per shank. Helpful for dense, stable, multi-region recordings in freely moving animals.

NeuroNexus offers a broad range of passive silicon probes in many geometries for acute and chronic in vivo recordings. Probes pair with external or headstage electronics, which keep them slender and flexible in configuration. Compared with Neuropixels, per-shank site counts are lower, with 32 sites per shank, and digitization occurs off-probe. Useful when you need custom layouts or thinner shanks.

INTAN Technologies offers low-noise amplification and digitization on the headstage used across many passive probes, including NeuroNexus and Cambridge NeuroTech. Adds headstage weight and analog cable length compared with on-probe designs. Flexible and cost-aware.

Cambridge NeuroTech focuses on passive silicon probes and implant hardware for chronic recordings, and is a partner on Neuropixels 2.0. Its probes pair with external electronics and come in diverse geometries. Strong for long-term stability and surgical accessories.

NeuroSeeker (EU-funded project) was an earlier initiative involving the Harris Lab at Janelia, IMEC, and UCL that advanced dense CMOS concepts that informed today’s designs.

The Masmanidis Lab at UCLA creates silicon probes with custom geometries, often distributed through collaborations with NeuroNexus or Cambridge NeuroTech. Good for specialised layouts.

Diagnostic Biochips builds high-density, small-form-factor probes designed for chronic rodent work. Emphasizes compact headstages and implant practicality.

How to choose:
1. Need highest per-shank site density and fewer external cables? Consider on-probe-CMOS designs.
2. Need unusual geometries or ultra-thin shanks? Consider passive silicon lines with external headstages.
3. Long-term chronic implant with specific surgical hardware? Choose vendors with mature chronic ecosystems (e.g., Cambridge NeuroTech, Diagnostic Biochips) and proven accessories.

From Acquisition to Sorting: The Open Toolchain at a Glance

Open tools made Neuropixels practical at scale:
- SpikeGLX for high-performance data acquisition optimized for Neuropixels. It enables stable, low-latency streaming of hundreds of channels. Developed by Bill Karsh (HHMI Janelia) and colleagues.
- Open Ephys for modular, open-source acquisition hardware and software for real-time electrophysiology. While compatible with multiple probes, it has been widely adopted for Neuropixels acquisition. Created by Josh Siegle (Allen Institute) and Jakob Voigts (MIT, Open Ephys company).
- Kilosort for fast, GPU-based spike sorting tuned for dense data produced by Neuropixels. Developed by Marius Pachitariu and collaborators (originally at UCL and Janelia).
- SpikeInterface for standardized preprocessing, spike sorting, validation, and comparison across multiple algorithms, including Kilosort. It integrates smoothly into Neuropixels workflows. Led by Alessio Buccino and Samuel Garcia with an international community.
- The Allen Institute Ecephys pipeline for quality metrics and alignment practices, forming the backbone of the Allen Brain Observatory’s large-scale projects. Maintained by Josh Siegle and the Allen Institute team.

Together, these tools create an ecosystem that supports the acquisition, processing, and analysis of Neuropixels data with rigor and reproducibility. Their open-source nature has ensured broad adoption across labs worldwide, cementing Neuropixels as the foundation for modern systems neuroscience.

DataJoint in Action: What You Get

Running Neuropixels at scale is a data complexity problem, especially in multimodal experiments that must be synchronized with other instruments.

DataJoint addresses this challenge by managing and analyzing terabyte-scale Neuropixels datasets, together with data from other instruments, as reproducible, cloud-ready workflows.

Open-Source DataJoint Pipeline at Hand

Element Array Ephys, a component of the NIH U24-funded DataJoint Elements initiative, delivers validated Neuropixels workflows for acquisition and analysis:
1. Ingests from SpikeGLX and Open Ephys automatically
2. Runs spike sorting with Kilosort and other algorithms through SpikeInterface
3. Syncs and integrates with behavior, stimulation, and imaging
4. Writes results to Neurodata Without Borders (NWB) for sharing
5. Stores everything in structured, queryable databases for collaborative work

Labs apply these approaches to deliver rigorous, reproducible analyses on large datasets. DataJoint helps you spend less time on infrastructure and more time on experiments.

Rigor and Transparency

The DataJoint Platform is a computational environment for scaling modern neuroscience — on the cloud or on premises in research labs. It automates steps from acquisition to sorting and analysis, scales to terabytes, integrates electrophysiology with multiple modalities, including behavior, imaging, and stimulation, and ships as open, community-validated workflows. FAIR practices and NWB export are built in. Additional capabilities include manual spike curation, advanced quality metrics for spikes and units, event-aligned analysis (e.g., PSTHs), and efficient compression of raw data for faster uploads.
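For readers unfamiliar with event-aligned analysis, a peristimulus time histogram (PSTH) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the platform's implementation: the `psth` function, its parameters, and the sample spike and event times are all invented for this example.

```python
def psth(spike_times, event_times, window=(-0.5, 0.5), bin_size=0.25):
    """Return (bin_left_edges, firing_rate_hz) averaged across events.

    Times are in seconds. Each spike is counted once per event whose
    window contains it.
    """
    if not event_times:
        raise ValueError("need at least one event to align to")
    start, stop = window
    n_bins = int(round((stop - start) / bin_size))
    counts = [0] * n_bins
    for event in event_times:
        for spike in spike_times:
            offset = spike - event  # spike time relative to this event
            if start <= offset < stop:
                index = min(int((offset - start) / bin_size), n_bins - 1)
                counts[index] += 1
    # Convert counts to spikes per second, averaged over events.
    rates = [count / (len(event_times) * bin_size) for count in counts]
    edges = [start + i * bin_size for i in range(n_bins)]
    return edges, rates


spikes = [0.75, 1.05, 1.15, 3.05]  # hypothetical spike times for one unit
events = [1.0, 3.0]                # hypothetical stimulus onsets
edges, rates = psth(spikes, events)
print(edges, rates)
```

The nested loop keeps the logic easy to follow. For real recordings with many spikes, a vectorized histogram would be the more practical choice.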

Looking Ahead

Neuropixels continue to expand the boundaries of what is experimentally possible. As density and multimodal designs grow, workflows must keep pace. By working with the community and delivering scalable, cloud-based pipelines, DataJoint helps turn rich recordings into reliable, shareable science.

Upload. Analyze. Share. No local infrastructure required.
- Explore Element Array Ephys tutorials: https://github.com/datajoint/element-array-ephys
- Contact us for a demo of DataJoint SciOps for Neuropixels.
- See you at SfN 2025! We’re excited to showcase our latest research. Don’t miss our nanosymposium presentation titled “Parametric Stimuli Reveal Functional Subcircuits in Visual Cortex.” Be sure to visit our poster, #PSTR198, where we’ll present “A Principled Framework for Compression and Standardization of Multiphoton Data.” Excited to connect with you there!

* Estimated from PubMed papers mentioning ‘Neuropixels’ since 2017, de-duplicated by institution and counting consortium papers once. This is a conservative lower bound.

References

[1] Jun, J. J., Steinmetz, N. A., Siegle, J. H., et al. (2017). Fully integrated silicon probes for high-density recording of neural activity. Nature, 551(7679), 232–236.

[2] Simons Foundation (2017). ‘Neuropixels’ expand access to the brain

[3] The Brain Probe Consortium: Neuropixels silicon probes.

[4] Steinmetz, N. A., Koch, C., Harris, T. D., and Carandini, M., et al. (2021). Neuropixels 2.0: A miniaturized high-density probe for stable, long-term brain recordings. Science.

[5] Siegle, J. H., et al. (2017). Open Ephys: An open-source, plugin-based platform for multichannel electrophysiology. J Neural Eng, 14(4).

September 26, 2025

Insights & Ideas
AI and the Evolution of Relational Schemas

(Previously in this series: “Power of Schemas,” which detailed the theoretical foundations of structured data, and “The Great Data Debate,” which introduced schema-on-write vs. schema-on-read.)

The argument often surfaces that Artificial Intelligence thrives on unstructured data, framing the “rigidity” of schemas (as discussed in our second post) as a hindrance. However, this perceived rigidity is precisely what ensures data integrity—the accuracy, consistency, and reliability of data. And for AI to produce trustworthy results, integrity is paramount.

Why AI Still Needs a Backbone of Integrity

Key aspects of data integrity, often enforced by well-defined schemas, include:
- Entity Integrity: Ensuring each real-world entity is uniquely identified. Think of this as every citizen having a unique ID, preventing confusion.
- Referential Integrity: Guaranteeing that relationships between data remain valid. This ensures, for example, that lab results are correctly linked to the specific patient, preventing critical misattributions.
- Group (Compositional) Integrity: Treating entities composed of multiple essential parts as inseparable units. For instance, if an algorithm extracts several signal traces from one recording, group integrity ensures these are managed as a complete set; removing one arbitrarily would invalidate the analysis.
The rise of AI doesn’t fundamentally change the tradeoffs between schema-on-write and schema-on-read (explored in our first post). While AI can process unstructured “data soup,” an AI working with well-structured data is like a detective with neatly organized evidence logs—connections are clearer, verification is easier, and conclusions are far more reliable. In fact, an AI might even express its understanding of unstructured data by constructing a relational schema, offering a verifiable representation of its inferred findings.
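The three kinds of integrity listed above can be illustrated with Python's built-in sqlite3 module. The table and column names here are hypothetical, and the transaction-based handling of group integrity is one possible approximation chosen for this sketch, not a description of how any specific product enforces it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE patient (patient_id INTEGER PRIMARY KEY);
CREATE TABLE lab_result (
    result_id INTEGER PRIMARY KEY,
    patient_id INTEGER NOT NULL REFERENCES patient(patient_id),
    value REAL NOT NULL
);
CREATE TABLE recording (recording_id INTEGER PRIMARY KEY);
CREATE TABLE trace (
    recording_id INTEGER NOT NULL REFERENCES recording(recording_id),
    trace_index INTEGER NOT NULL,
    PRIMARY KEY (recording_id, trace_index)
);
""")


def rejected(sql, params=()):
    """Return True if the database refuses the statement."""
    try:
        with conn:  # commit on success, roll back on error
            conn.execute(sql, params)
    except sqlite3.IntegrityError:
        return True
    return False


def store_traces(recording_id, trace_indices):
    """Insert a recording and all of its traces as one unit."""
    with conn:  # either every row is committed or none are
        conn.execute("INSERT INTO recording VALUES (?)", (recording_id,))
        conn.executemany(
            "INSERT INTO trace VALUES (?, ?)",
            [(recording_id, i) for i in trace_indices],
        )


with conn:
    conn.execute("INSERT INTO patient VALUES (1)")

# Entity integrity: a second patient with the same ID is refused.
duplicate_rejected = rejected("INSERT INTO patient VALUES (1)")

# Referential integrity: a lab result for an unknown patient is refused.
orphan_rejected = rejected(
    "INSERT INTO lab_result (patient_id, value) VALUES (?, ?)", (99, 4.2)
)

# Group integrity: a faulty trace set (duplicate index) leaves nothing behind.
try:
    store_traces(1, [0, 1, 1])
except sqlite3.IntegrityError:
    pass
partial_rows = conn.execute("SELECT COUNT(*) FROM trace").fetchone()[0]

store_traces(2, [0, 1, 2])
complete_rows = conn.execute("SELECT COUNT(*) FROM trace").fetchone()[0]
print(duplicate_rejected, orphan_rejected, partial_rows, complete_rows)
```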

Evolving the Model for Modern Data Challenges

Traditional relational implementations do face challenges with modern data:
1. Handling Large Objects: Efficiently storing and querying massive objects like videos or raw instrument outputs can be impractical in classic relational structures.
2. Schema Evolution: Modifying schemas in large, live databases can be cumbersome, hindering agility.
3. Integrating Computation: Deeply embedding complex computations (often in Python) and managing their dependencies within the data model requires extensions beyond standard relational frameworks.
Addressing these means evolving the relational approach. New models need to support large objects more natively, allow schemas to adapt without sacrificing integrity, and treat computation as an integral part of the data pipeline. AI itself could aid this, potentially inferring relational structures from unstructured data, providing a verifiable hypothesis about its underlying organization.

DataJoint: A Modern Example of Structured, Computable Data Management

The DataJoint framework exemplifies such an evolved, structured approach, especially for scientific AI applications. It refines the relational model by integrating computational dependencies directly into the schema. This treats computations as first-class citizens, allowing entire scientific workflows—from data acquisition and processing to analysis—to be represented as a unified, integrity-checked data pipeline. Imagine a digital lab notebook combined with an automated assistant, where every experiment (computation) is precisely linked to its data inputs and methods, ensuring results are traceable, verifiable, and reproducible.
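The idea of treating a computation as part of the schema can be sketched with the standard sqlite3 module. This is not DataJoint's API. The tables, the toy threshold-crossing "analysis", and the `populate` helper are assumptions made for illustration. The point is that each result row is keyed by the exact input and method that produced it, missing results can be found by query, and removing an input removes everything derived from it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # needed for cascading deletes in SQLite
conn.executescript("""
CREATE TABLE recording (recording_id INTEGER PRIMARY KEY, samples TEXT NOT NULL);
CREATE TABLE method (method_id INTEGER PRIMARY KEY, threshold REAL NOT NULL);
CREATE TABLE detection (
    recording_id INTEGER NOT NULL
        REFERENCES recording(recording_id) ON DELETE CASCADE,
    method_id INTEGER NOT NULL
        REFERENCES method(method_id) ON DELETE CASCADE,
    n_events INTEGER NOT NULL,
    PRIMARY KEY (recording_id, method_id)
);
""")


def populate():
    """Compute every detection whose inputs exist but whose result is missing."""
    missing = conn.execute("""
        SELECT r.recording_id, r.samples, m.method_id, m.threshold
        FROM recording AS r CROSS JOIN method AS m
        WHERE NOT EXISTS (
            SELECT 1 FROM detection AS d
            WHERE d.recording_id = r.recording_id AND d.method_id = m.method_id
        )
    """).fetchall()
    with conn:
        for recording_id, samples, method_id, threshold in missing:
            values = [float(v) for v in samples.split(",")]
            n_events = sum(1 for v in values if v > threshold)  # toy analysis
            conn.execute(
                "INSERT INTO detection VALUES (?, ?, ?)",
                (recording_id, method_id, n_events),
            )
    return len(missing)


with conn:
    conn.execute("INSERT INTO recording VALUES (1, '0.1,0.9,0.4,1.2')")
    conn.execute("INSERT INTO method VALUES (1, 0.5)")
    conn.execute("INSERT INTO method VALUES (2, 1.0)")

first_run = populate()   # computes both missing results
second_run = populate()  # nothing left to compute
results = dict(conn.execute("SELECT method_id, n_events FROM detection").fetchall())

# Deleting the input removes every result derived from it.
with conn:
    conn.execute("DELETE FROM recording WHERE recording_id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM detection").fetchone()[0]
print(first_run, second_run, results, remaining)
```

Because results cannot exist without their inputs and parameters, a stale or orphaned result is a state the database refuses to hold, not something a reviewer has to catch.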

The Enduring Need for Structure

Ultimately, the choice between structured, unstructured, or hybrid data strategies depends on specific needs. Where rapid ingestion of diverse data is key and some inconsistency is tolerable, schema-on-read holds advantages. However, for systems demanding high data integrity, consistency, and provable relationships—especially when AI is involved in critical decision-making—the mathematical rigor and enforcement capabilities of well-defined schemas remain essential. AI is a powerful analytical tool, but it doesn’t negate the foundational need for structure when trustworthiness and reliability are non-negotiable.
August 26, 2025

Insights & Ideas
Insight Entrepreneurship: A New Vision for Science
This article is Part 3 of our three-part series, Entrepreneurs of Insight. In Part 1, we traced the breakdown of the old compact between science and society. In Part 2, we examined the promise and peril of the Executive Order on “Restoring Gold Standard Science.” Here, we introduce a forward-looking model: Insight Entrepreneurship.

Introducing: Insight Entrepreneurship

This new vision, Insight Entrepreneurship, reframes the role of scientists and research entities. It calls for them to become proactive stewards and developers of knowledge, taking greater ownership of their intellectual endeavors and the insights they produce. “Insight” is the core currency – encompassing not only discoveries with commercial potential but also fundamental breakthroughs that answer deep questions and satisfy broad human curiosity. “Entrepreneurship” here signifies a mindset of innovation, strategic resource management, value creation (intellectual, societal, and economic), and accountability for the integrity and impact of one’s work.

Core Tenets of Insight Entrepreneurship:

The Scientist as an Empowered Entrepreneur of Insight: Individual researchers and their teams are viewed as primary engines of discovery and “entrepreneurs of knowledge.” This vision empowers them with significant autonomy and encourages a form of ownership over their core intellectual contributions – including methodologies, experimental designs, software, curated datasets, and the unique insights generated. This stewardship extends to having substantial strategic input into how research funding is deployed and leveraged, enabling them to build and sustain their “intellectual capital” for long-term, impactful inquiry. The goal is to foster an environment where brave, independent, and creative endeavors can thrive.

Accountability through Radical Integrity and Verifiability: This enhanced autonomy and ownership are inextricably bound to a profound and demonstrable commitment to radical integrity. Researchers and their institutions must champion and adhere to the highest standards of rigor, transparency in methodology, and verifiability of findings. While the immediate, unconditional release of all raw data from publicly funded projects may not always be the optimal strategy for building intellectual capital or ensuring data quality and context, verifiability remains paramount. The scientific community must develop and adopt robust mechanisms to ensure that all claims are traceable, methodologies are transparent (even if access to underlying raw datasets is managed), and findings can be independently scrutinized and validated under clear, ethical guidelines. This demonstrable integrity is the bedrock of trustworthiness.

Strategic Management of Knowledge Assets for Diverse Value Creation: All outputs of the research process – from foundational data and sophisticated workflows to transformative insights – are treated as valuable knowledge assets. The scientists and institutions who create these assets must strategically manage them to maximize their diverse forms of value. This includes establishing clear frameworks for knowledge ownership, data governance, responsible licensing, and ethical commercialization where appropriate. The aim is to ensure that insights are not only generated but are also effectively translated into broader intellectual advancements, societal benefits, or economic opportunities, with benefits flowing back to support further research and innovation.

A Diversified, Agile, and Sustainable Funding Ecosystem: While core public funding for foundational (“blue sky”) research remains essential, Insight Entrepreneurship fosters a more diversified and resilient funding landscape. This includes promoting more dynamic and transparent collaborations with industry, encouraging philanthropic investment in bold ideas, and exploring models where the value generated from knowledge assets (e.g., through licensing or spin-offs) can create sustainable revenue streams to support ongoing research, reducing sole reliance on fluctuating government appropriations.

Cultivating Intellectual Freedom and Robust Scientific Discourse: A core goal of Insight Entrepreneurship is to foster an environment resilient to ideological capture and conducive to healthy, rigorous, and open debate on all scientific questions, including those that are socially or politically sensitive. By empowering individual researchers with greater ownership of their intellectual trajectory and demanding accountability based on evidence and integrity rather than conformity, the system becomes more decentralized. This decentralization, coupled with a renewed commitment from the scientific community itself to uphold principles of free inquiry and viewpoint diversity (within the bounds of ethical and evidence-based practice), can provide a stronger defense against the imposition of orthodoxy, “runaway ideologies,” or self-censorship. The “market” for insights, in this broader sense, should ultimately favor those that are most robust, evidence-based, and generative, regardless of their alignment with prevailing dogma.

Principled and Impactful Communication: Research communication must be didactic, aiming to inform and educate both the scientific community and the public with clarity and unimpeachable honesty. A distinction is maintained between the often messy, complex reality of the research process and the focused, verifiable claims made in its communication. In an era of Insight Entrepreneurship, where the perceived value of insights is critical, ensuring that communications are both compelling and scrupulously honest is fundamental to building and maintaining trust.

The Role of Advanced Research Platforms in Enabling Insight Entrepreneurship

The vision of Insight Entrepreneurship, with its emphasis on researcher ownership, demonstrable integrity, and the strategic management of complex knowledge assets, necessitates advanced infrastructure. DataJoint emerges as a principal tool and foundational platform designed to empower individual researchers and teams to thrive in this new era. It enables them to meticulously create, preserve, and demonstrate their key capabilities, methods, and outputs under conditions of operational excellence and data integrity.

Specifically, DataJoint facilitates:
- Creation and Preservation of Intellectual Assets: It provides a robust framework for systematically defining, executing, and evolving complex research workflows, treating the logic, the acquired data, and analytical results as interlinked, versioned assets. This inherently documents the research process, securing intellectual property and ensuring clear provenance for every insight.
- Operational Excellence and Data Integrity: Through standardized data models, automated processing pipelines, and support for FAIR (Findable, Accessible, Interoperable, Reusable) principles, DataJoint underpins the operational excellence required for modern, data-intensive science. This systematic approach is fundamental to ensuring the integrity and reliability of the knowledge assets generated.
- Demonstrable Integrity and Verifiability: DataJoint’s architecture enables communicated findings to be directly and transparently traced back to their origins within the managed research pipeline, including specific data versions, analytical methods, and computational steps. This traceability is crucial for upholding the principle of radical integrity and allowing for independent verification under agreed-upon protocols. DataJoint enables research teams to efficiently implement and certify “Gold Standard Science.”
- Strategic Management and Collaboration: It allows for granular control over data and workflow components, enabling researchers and institutions to strategically manage their intellectual capital – sharing elements with global partners to foster collaboration while protecting core proprietary assets. This supports the clear delineation and enforcement of governance and ownership structures vital for both academic and commercial translation of insights.
By providing such comprehensive capabilities, platforms like DataJoint empower insight entrepreneurs to not only generate groundbreaking discoveries but also to manage them as durable, verifiable, and valuable contributions, thereby meeting the highest standards of scientific rigor and public accountability.

The Goal: A Revitalized Scientific Enterprise

Insight Entrepreneurship aims to revitalize the scientific enterprise by fostering a culture rooted in individual initiative, profound integrity, and the strategic management of valuable knowledge. It seeks to build a system that is more agile, more accountable, sustainably funded, and deeply impactful. By empowering scientists as entrepreneurs of insight, responsible for both the creation and stewardship of knowledge assets, this vision endeavors to restore a nuanced public trust, ensure science’s intellectual leadership, and unlock new frontiers of discovery for the benefit of all.

This concludes the three-part series, Entrepreneurs of Insight. It begins with Part 1, A New Course for Scientific Discovery.

August 21, 2025

Insights & Ideas
The Power of Schemas

This article is Part 2 of our three-part series, AI Needs Data Discipline. In Part 1, we explored schema-on-write vs. schema-on-read and how hybrid systems emerged. Here we turn to the mathematical foundations: how relational models, entity-relationship diagrams, and schemas express relationships more powerfully than metadata alone. In Part 3, we’ll examine how these foundations must evolve for modern AI-driven challenges.

The structured data approaches, particularly the schema-on-write philosophy we discussed in our previous post, weren’t born out of a desire for corporate rigidity. Their origins are deeply rooted in mathematical rigor and the quest for expressive, provable methods of managing data.

The Mathematical Bedrock of Order

The intellectual lineage of structured data traces back to 19th-century mathematicians like De Morgan, Boole, and Cantor, who formalized logic and set theory. These mathematical tools laid the groundwork for the relational data model, which Edgar F. Codd formalized in the late 1960s and early 1970s. Codd’s model was a direct application of set theory and predicate logic to data management. Designing schemas was no longer like intuitively nailing boards together; it became akin to using precise engineering principles (physics, materials science, geometry) to design a bridge, ensuring its stability and longevity through provable calculations.

Before the relational model, data systems like hierarchical and network models often embedded relationships directly within data records. While functional, they could be complex to query and lacked a strong theoretical basis for data independence and integrity. Codd’s innovation was to represent data as mathematical “relations” (visualized as tables), where relationships are expressed through shared values (keys) rather than physical pointers. This offered a clear, declarative way to define data structures (schemas), enforce constraints, and query data using logical operations. The goal was precision and consistency, not arbitrary inflexibility.

Building on this, Peter Chen introduced the Entity-Relationship Model (ERM) in 1976. While Codd provided the mathematical underpinnings, Chen’s ERM offered a more intuitive, conceptual way to design databases. ERM focuses on identifying “entities” (e.g., ‘Customers,’ ‘Products’) and the “relationships” between them (e.g., a ‘Customer’ places an ‘Order’). Entity-Relationship Diagrams (ERDs) became a standard graphical tool to visualize these, acting as a blueprint before database implementation. It’s important to note that the relational model and ERM are the foundational principles, while SQL (Structured Query Language) is the common language used to implement these principles in databases.

Metadata Implies Relationships whereas Schemas Express and Enforce Them

So, how do we truly “understand” the relationships within data? One could argue this understanding is crystallized through the act of constructing a schema.

Metadata, or “data about data,” is incredibly valuable. It provides context, aids discoverability, and tracks provenance. For instance, metadata is like tagging a passenger with her destination and her luggage with her name. This provides useful context for her journey.

A formal schema, on the other hand, expresses and enforces these relationships as an intrinsic, verifiable part of the data system that supports an enterprise. Continuing our travel analogy, the schema is what guarantees the passenger her assigned seat on the correct flight and ensures her luggage makes the correct flight transfers. Foreign key constraints within a schema don’t just describe a link; they actively prevent operations that would violate that link, ensuring referential integrity is maintained by the database itself. This active enforcement provides a far stronger guarantee of consistency than descriptive metadata alone.
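The difference between describing a link and enforcing it can be sketched with Python's built-in sqlite3 module. The flight and bag tables below are hypothetical, chosen to echo the travel analogy. Note that SQLite enforces foreign keys only when the foreign_keys pragma is switched on for the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.executescript("""
    CREATE TABLE flight (
        flight_no TEXT PRIMARY KEY
    );
    CREATE TABLE bag (
        bag_id    INTEGER PRIMARY KEY,
        flight_no TEXT NOT NULL REFERENCES flight (flight_no)
    );
    INSERT INTO flight VALUES ('DJ100');
    INSERT INTO bag VALUES (1, 'DJ100');
""")

def try_sql(statement):
    """Run one statement and report whether the database accepted it."""
    try:
        with conn:  # commit on success, roll back on error
            conn.execute(statement)
        return "ok"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"

results = [
    try_sql("INSERT INTO bag VALUES (2, 'DJ999')"),           # no such flight
    try_sql("DELETE FROM flight WHERE flight_no = 'DJ100'"),  # bag 1 still refers to it
    try_sql("INSERT INTO bag VALUES (3, 'DJ100')"),           # valid link
]
for line in results:
    print(line)
```

The first two statements are refused by the database itself, without any application-side checking. A metadata tag naming the flight would have accepted both.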

While the relational model provides a powerful foundation, how does it fare against the scale and complexity of modern data, especially with the rise of AI? In Part 3 of AI Needs Data Discipline, “AI and the Evolution of Relational Schemas,” we’ll explore these challenges and why the need for structure persists.

August 19, 2025

Insights & Ideas
Restoring Gold Standard Science
This article is Part 2 of our three-part series, Entrepreneurs of Insight. In Part 1, we explored how the post-WWII compact between science and society has collapsed. Here we examine the Executive Order on “Restoring Gold Standard Science” – its rationale, benefits, and risks. In Part 3, we introduce a new vision: Insight Entrepreneurship.

“Restoring Gold Standard Science”

A significant governmental response to the perceived crisis in science arrived with the Executive Order issued on May 23, 2025, “Restoring Gold Standard Science.” Understanding its rationale, approach, and likely impacts is crucial for charting a more effective path forward for the scientific enterprise.

Stated Rationale and Aims of the Executive Order: The EO directly cites a significant fall in public confidence in scientists, a reproducibility crisis acknowledged by researchers themselves, and high-profile data falsifications. It argues that the Federal Government has contributed to this loss of trust, providing examples such as allegedly misleading COVID-19 school guidance, flawed environmental projections (National Marine Fisheries Service), the controversial use of certain climate change scenarios, and the politicization of science through initiatives like Diversity, Equity, and Inclusion (DEI) in science planning under the prior administration. Its stated purpose is to restore a “gold standard” for science, ensuring federally funded research is transparent, rigorous, and impactful, and that Federal decisions are informed by credible, reliable, and impartial scientific evidence. The EO aims to restore scientific integrity policies of a previous administration and thereby rebuild the American people’s faith in the scientific enterprise.

Critique of the “Restoration” Premise and its Vision: A core tenet of the EO is the “restoration” of a supposed former ideal of scientific practice. However, this premise is problematic. The “gold standard” it seeks to reinstate may be an idealized or romanticized view of a past that never quite existed in such a pristine form, or one that is ill-suited to the complexities of 21st-century science. Science’s relationship with society has always been dynamic and often contested. More importantly, the EO, by its very nature as a corrective and restorative measure, does not offer a new, forward-looking vision for science. It is primarily a framework of control and compliance, focused on rectifying perceived past errors through prescribed standards and procedures. While principles like rigor and transparency are vital, the EO’s approach risks defining them too narrowly or instrumentally, potentially leading to a bureaucratic and defensive posture within the scientific community rather than fostering a proactive culture of innovation and intellectual leadership.

Although the EO aims for positive outcomes such as enhanced reproducibility and transparency, its impact is likely to be mixed, and it carries significant risks:

Potential Benefits:

If implemented judiciously, a heightened focus on data transparency, clear articulation of uncertainties, and rigorous peer review could address some valid concerns about scientific practice, particularly in regulatory science.

Significant Risks and Limitations:
- Stifling Innovation: The prescriptive nature of “Gold Standard Science,” with its detailed mandates, could lead to a compliance-driven research environment that discourages novel, high-risk, or unconventional avenues of inquiry – the very approaches that often lead to breakthroughs.
- Politicization of Science Policy: The EO itself is a political document, explicitly reversing policies of a “prior Administration” and targeting specific scientific examples through a particular lens. This sets a precedent for science policy to oscillate with changing administrations, undermining the stable, long-term frameworks necessary for scientific progress. The mechanisms for defining and enforcing “impartiality” or identifying “highly unlikely assumptions” could themselves be wielded for political ends, subtly directing research away from disfavored topics (e.g., certain aspects of climate science or public health).
- Increased Bureaucracy: Implementing and monitoring compliance with the EO’s detailed requirements across all agencies could create significant administrative burdens, diverting resources and time from research itself.
- Focus on Process Over Outcome: An overemphasis on adherence to prescribed processes might not, by itself, guarantee better scientific outcomes or restore deep public trust, which is also built on demonstrated societal benefit and engagement.
- Narrowing of Scientific Inquiry: The EO’s critique of certain scenarios or its discouragement of DEI considerations could inadvertently narrow the scope of scientific inquiry or discourage research into complex societal challenges where such factors are relevant.

Underlying Social and Scientific Trends: The emergence of such an Executive Order is not an isolated event but is underpinned by broader social and internal scientific trends. These include a documented decline in public trust in many institutions (including science and academia), heightened political polarization where scientific findings are often weaponized, and the rapid spread of misinformation that can erode the authority of scientific consensus. Internally, the scientific enterprise has faced valid criticisms regarding reproducibility, transparency, and, as discussed previously, instances where academic communities have struggled to maintain healthy, open debate on contentious issues, sometimes appearing insular or ideologically uniform to segments of the public. These internal vulnerabilities and the external climate of skepticism can create an environment where directive interventions like the 2025 EO are seen by some as necessary or politically opportune.

This Executive Order, therefore, represents a critical juncture. While it responds to some legitimate concerns, its top-down, regulatory, and potentially politicized approach is unlikely to provide the adaptive, resilient, and genuinely empowering framework that science needs to thrive and effectively serve society in the future. This underscores the urgent need for a more fundamental rethinking of the scientific enterprise, driven from within the community itself.

Continue reading in Part 3 of the series, Insight Entrepreneurship – A New Vision for Science.

August 14, 2025

Insights & Ideas
A New Course for Scientific Discovery

This article is Part 1 of a three-part series, Entrepreneurs of Insight. In this opening installment, we examine how the old compact between science and society has broken down. In Part 2, we look at the promise and peril of the recent Executive Order on Restoring Gold Standard Science.

The historic compact that defined the post-World War II scientific era – characterized by broad public trust in a largely autonomous academic enterprise sustained by generous, relatively unfettered public support – has demonstrably run its course.

Shifting social trends, evolving public expectations, questions regarding the reliability and utility of some academic outputs, and the sheer scale and expense of modern scientific challenges have eroded this old consensus. Science’s principal role is not merely to respond to societal requests or fulfill pre-defined national priorities, but to proactively generate new insights, chart new intellectual directions, and fundamentally expand the horizons of understanding. Simultaneously, the prospect of replacing the old model with new forms of restrictive control, potentially prioritizing narrow agendas over foundational inquiry, threatens to stifle this very creativity, independence, and intellectual leadership that allow science to flourish and genuinely serve society.

Neither a return to an idealized past nor the imposition of overly restrictive external controls offers a viable path forward. A new vision is required – one that empowers the scientific community to proactively address these challenges, rebuild trust through demonstrable integrity and value creation, and ensure science remains a dynamic engine of human understanding and progress.

Continue reading in Part 2 of the series, Restoring Gold Standard Science.

August 7, 2025

Insights & Ideas
The Great Data Debate

This article is Part 1 of our three-part series, AI Needs Data Discipline. In this opening installment, we explore the enduring debate between schema-on-write and schema-on-read, and how modern systems blend the two in data lakes, warehouses, and lakehouses. In Part 2, we’ll trace the mathematical foundations of structure and why they matter for AI.

The Enduring Tension: Schema-on-Write vs. Schema-on-Read

Long before AI became a ubiquitous topic, the data management world was already grappling with a fundamental question: when and how should we define the structure of our data? This debate centers on two primary philosophies: “schema-on-write” and “schema-on-read.”

The traditional schema-on-write approach, familiar to anyone who has worked with relational databases, demands that we define the data’s blueprint—its fields, data types, and relationships—before any data is stored. Imagine meticulously designing an architectural plan before laying a single brick. This ensures consistency and predictability, as every piece of data has a designated place. However, this upfront design effort can make it challenging and costly to adapt if data formats need to change rapidly.

Conversely, schema-on-read postpones defining structure until the data is actually queried. This approach gained significant traction with the NoSQL movement of the late 2000s and 2010s, driven by the need to handle the massive, diverse datasets (“Big Data”) generated by web applications. Schema-on-read allows for the quick ingestion of varied data types, offering great agility. Think of this as gathering all your building materials on-site and figuring out the assembly each time you need a structure. While flexible, this can sometimes lead to inconsistencies, as the interpretation of structure happens at the point of use.
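The contrast can be sketched in a few lines of Python using only the standard library. The sensor records, the reading table, and the read_temps helper are hypothetical, invented for illustration. The same two records are handled both ways: the schema-on-write store rejects the malformed one at insertion, while the schema-on-read store accepts everything and leaves interpretation to each reader.

```python
import json
import sqlite3

# Hypothetical sensor readings; the second record has a malformed temperature.
raw_records = [
    '{"sensor": "a1", "temp_c": 21.5}',
    '{"sensor": "a2", "temp_c": "warm"}',
]

# Schema-on-write: structure and a type check are declared before any data is stored.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE reading (
        sensor TEXT NOT NULL,
        temp_c REAL NOT NULL CHECK (typeof(temp_c) = 'real')
    )
""")
stored, rejected_on_write = 0, 0
for line in raw_records:
    rec = json.loads(line)
    try:
        with conn:
            conn.execute("INSERT INTO reading VALUES (?, ?)", (rec["sensor"], rec["temp_c"]))
        stored += 1
    except sqlite3.IntegrityError:
        rejected_on_write += 1

# Schema-on-read: everything is kept as-is; each reader interprets it at query time.
lake = list(raw_records)  # ingestion never fails

def read_temps(lines):
    """Interpret records at read time, skipping values that are not numeric."""
    temps = []
    for line in lines:
        value = json.loads(line).get("temp_c")
        if isinstance(value, (int, float)):
            temps.append(float(value))
    return temps

print(stored, rejected_on_write)
print(len(lake), read_temps(lake))
```

In the second half, the rule for what counts as a valid temperature lives inside read_temps. Another reader could apply a different rule to the same data, which is where the inconsistencies mentioned above come from.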

The choice often reflects a project’s ambition: unstructured approaches might let you build many simple cabins quickly, but only a planned, structured approach can yield a Burj Khalifa.

The Hybrid Middle Ground: Data Lakes, Warehouses, and Lakehouses

In today’s complex data landscape, many organizations don’t rigidly stick to one extreme. Instead, a hybrid approach is common. Raw data in its myriad formats (structured, semi-structured, unstructured) is often rapidly ingested into a data lake, a schema-on-read staging area. This “capture everything” strategy is great for speed and flexibility.

However, when it comes to reliable analytics, reporting, and especially AI applications that demand high data quality, a transformation often occurs. Selected data from the lake is cleaned, validated, and loaded into a structured, schema-on-write system like a data warehouse.
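A rough sketch of that transformation step follows, assuming a hypothetical lake of loosely shaped records and a hypothetical subject_weight warehouse table (all names and values are invented for illustration). Records are cleaned and validated in Python, then loaded into a table whose schema rejects anything the cleaning step missed.

```python
import sqlite3

# Hypothetical raw records as they might sit in a data lake: mixed shapes, no guarantees.
lake = [
    {"subject": "m01", "weight_g": "23.4"},
    {"subject": "m02", "weight": 25.1},        # different field name
    {"subject": "m01", "weight_g": "23.4"},    # duplicate
    {"subject": None, "weight_g": "n/a"},      # unusable
]

def clean(record):
    """Return a (subject, weight_g) tuple, or None if the record cannot be validated."""
    subject = record.get("subject")
    weight = record.get("weight_g", record.get("weight"))
    try:
        weight = float(weight)
    except (TypeError, ValueError):
        return None
    if not subject or weight <= 0:
        return None
    return subject, weight

warehouse = sqlite3.connect(":memory:")
warehouse.execute("""
    CREATE TABLE subject_weight (
        subject  TEXT PRIMARY KEY,
        weight_g REAL NOT NULL CHECK (weight_g > 0)
    )
""")

quarantined = []
for record in lake:
    row = clean(record)
    if row is None:
        quarantined.append(record)  # kept aside for inspection, not silently dropped
        continue
    with warehouse:
        # A repeat of an already-loaded subject is ignored rather than stored twice.
        warehouse.execute("INSERT OR IGNORE INTO subject_weight VALUES (?, ?)", row)

loaded = warehouse.execute("SELECT * FROM subject_weight ORDER BY subject").fetchall()
print(loaded)
print(len(quarantined))
```

The raw records stay in the lake untouched. Only the validated subset reaches the structured table that downstream analytics would query.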

The more recent concept of the data lakehouse attempts to merge the best of both worlds. It aims to provide the flexible storage of a data lake with the robust data management, governance, and query performance features of a data warehouse, all within a single platform. Even in these advanced hybrid systems, a crucial principle often holds: while initial ingestion might be flexible (leaning towards schema-on-read), a subsequent step typically applies structure (schema-on-write) to ensure integrity for critical downstream applications, including AI.

Why did the structured approach gain prominence in the first place? Part 2 of AI Needs Data Discipline, “The Power of Schemas,” describes the historical and theoretical foundations that make structured data so powerful and why it’s more than just a preference for order.

August 5, 2025

Insights & Ideas
The New Scientific Enterprise
DataJoint and the Future of Knowledge Creation

The world of scientific discovery is in flux, undergoing a transformation that redefines not only how research is conducted but also how its intellectual fruits are managed on a global scale. To understand the current juncture, it is instructive to consider the historical trajectory of basic research across leading scientific nations.

Before World War II, while universities globally, particularly in Europe, possessed strong traditions of fundamental inquiry, a significant portion of pioneering fundamental science also emerged from industrial laboratories in technologically advancing countries. In the United States, for instance, Bell Labs, an industrial research powerhouse, was the birthplace of the transistor in 1947—a discovery that, while occurring just post-war, was the fruit of pre-war and wartime research trajectories typical of an era where industry played a leading role in certain types of foundational science. Similarly, DuPont’s development of nylon in the 1930s, a groundbreaking synthetic material, stemmed from its corporate commitment to long-term chemical research. These examples illustrate an era where industrial investment was critical for certain capital-intensive, foundational breakthroughs, often in fields aligned with their commercial interests, yet profoundly expanding scientific frontiers.

The landscape experienced a seismic shift following World War II, with different nations adapting their science policies. In the United States, this was famously catalyzed by influential reports like Vannevar Bush’s 1945 “Science, the Endless Frontier.” This advocated for massive, sustained federal investment in basic research, primarily through universities, as an engine for national progress. The subsequent establishment of agencies like the National Science Foundation (NSF) and the expansion of the National Institutes of Health (NIH) exemplified this strategy. This model, emphasizing publicly funded, university-based research, led to an explosion of academic discoveries, such as the elucidation of the DNA double helix structure in 1953 by Watson and Crick (supported by a mix of academic, philanthropic, and Medical Research Council funding in the UK), or the development of the laser and maser by Charles Townes and others (with university research often backed by government contracts). Other leading nations also significantly increased state funding for science, developing their own models that shaped the global scientific ecosystem and intensified both international collaboration and competition.

As a result, universities and dedicated public research institutions worldwide became principal theaters for early-stage investigation. While leading corporations globally didn’t entirely abandon basic research—many maintained formidable research divisions pursuing strategic long-term goals—their relative proportion of national basic research efforts often changed. The focus of major public investment shifted towards academia for broad, non-directed inquiry.

Today, academic and public research institutions remain the cornerstones of basic research. However, this model is profoundly tested. Public funding, while substantial, often struggles to keep pace with the escalating costs, complexity, and sheer scale required for cutting-edge science. A stark illustration of this trend is the development of modern large AI models. The foundational algorithmic breakthroughs and initial advancements in deep learning often emerged from academic settings, driven by researchers like Geoffrey Hinton, Yann LeCun, and Yoshua Bengio. However, the journey to the current state of powerful large language and generative AI models required computational resources, engineering teams, and datasets at a scale far beyond typical academic grant capabilities. It took trillion-dollar valuations driven by venture backing and massive corporate investment to achieve this, demonstrating that some areas of what is effectively basic research now demand financial and infrastructural scale previously unimagined for academic settings.

This challenge is not unique to AI. For instance, discussions at pivotal meetings, such as the November 2024 NIH NeuroAI Workshop, which I had the privilege to join, have highlighted how advancing neuroscience to understand the brain’s immense complexity requires scalability in data acquisition, processing, and model-building not unlike that seen in leading AI companies. Participants noted that the traditional federal grant system is often poorly adapted to support the massive, collaborative, and computationally intensive joint studies of neuroscience and AI needed for transformative breakthroughs.

This evolving global context necessitates a “new knowledge capitalism.” Basic science will likely continue to germinate primarily within academic labs, but their modes of integration with industry and, critically, their knowledge management approaches, must evolve. This new model involves recognizing and strategically managing all outputs of research—not just publications, but the intricate scientific workflows and the vast, curated acquired data—as valuable research assets.

This upcoming transformation demands deliberate focus on knowledge ownership, licensing, and comprehensive data governance. As research becomes more collaborative and its outputs more complex and valuable, clarity on these issues is paramount. Academic institutions, departments, individual labs, and the researchers themselves will require and expect clear definitions of their respective ownership shares and usage rights for both intellectual property (methods, software, inventions) and the data assets generated. Robust data governance frameworks covering both the data itself (its quality, security, accessibility, and provenance) and the methods used to produce and analyze it will become central to the scientific enterprise.

Research Communications in the New Knowledge Capitalism

The imperatives of this new scientific enterprise also reshape our approach to research communication. It’s crucial to recognize that the ideal methods for conducting science often differ significantly from those for effectively communicating it. The process of discovery is frequently complex and iterative, involving vast amounts of data, numerous analytical paths explored, and many supporting details. In contrast, research communication (in publications, presentations, or public discourse) thrives on simplicity, a clear focus on the main findings, and meticulously presented controls that substantiate these claims. Extraneous detail, while vital during the research process, can detract from the core message in its communication.

The primary purpose of research communication is didactic; it aims to inform, educate, and allow the broader scientific community and society to understand and build upon new knowledge. Paramount to this is scientific integrity. Findings must be presented transparently and honestly, allowing for scrutiny and verification.

This is where a platform like DataJoint plays a dual role. While it manages the full complexity of research conducted by teams—the intricate pipelines, diverse datasets, and varied computational explorations—it also provides the tools to support clear and responsible communication. DataJoint allows researchers to precisely extract specific cross-sections of data, analyses, and workflow components that directly underpin the scientific findings being communicated. This enables the creation of focused narratives supported by verifiable evidence.

However, this capability must be wielded with profound scientific caution. The ability to select specific data or analytical approaches for presentation carries an inherent risk of bias, such as p-hacking (selectively reporting statistically significant results while ignoring non-significant ones) or cherry-picking data or methods that inappropriately inflate findings. In the new knowledge capitalism, where the perceived value of research assets is critical, maintaining the highest standards of integrity in communication is more important than ever.

DataJoint can contribute to mitigating these risks, not by policing thought, but by enabling more transparent and accountable practices:
- Traceability: Communicated findings can be directly traced back to their origins within the managed DataJoint pipeline, including specific data versions and computational steps.
- Contextual Record: The full pipeline, potentially encompassing a broader range of analyses than what is presented in a concise communication, can serve as a comprehensive record. While not always fully public due to IP considerations, this underlying structure can support deeper dives by collaborators, reviewers (under appropriate agreements), or for internal validation.
- Support for Rigor: By structuring the research process, DataJoint encourages systematic exploration and documentation, which, when combined with ethical research practices and pre-registration of analysis plans where appropriate, can help ensure that communicated results are robust and represent the science faithfully.

Ultimately, while tools like DataJoint provide powerful capabilities for both conducting and communicating science, the responsibility for ethical conduct and transparent reporting rests with the researchers and the scientific community. In this new era, where scientific assets have explicit value, ensuring that communications are both compelling and rigorously honest is fundamental to building trust and fostering genuine progress.

DataJoint: Enabling the Transformation

DataJoint is at the forefront of this evolution, providing the framework for this new global scientific enterprise. We understand that for research entities to thrive, they need tools that facilitate both robust international collaboration and stringent protection and management of their unique assets:
- Develop, Extend, and Preserve Research Workflows and Data Assets with Secure IP Retention: DataJoint provides a robust framework for creating, managing, and evolving complex data pipelines, treating both the workflow logic and the acquired data as first-class, interlinked assets. This systematic approach inherently documents the research process. DataJoint’s architecture enables granular control, allowing labs to define what is shared and with whom—whether collaborators are domestic or international—ensuring core intellectual property and sensitive data remain secure and properly attributed.
- Achieve Seamless Compatibility with Global Partners (Data Ops, Research Ops, SciOps): DataJoint’s standardized data models and reproducible analyses align with rigorous operational demands worldwide, facilitating smoother international collaborations.
- Champion Reproducible Science While Strategically Managing Know-How and Data Assets: DataJoint enables transparency for validation while empowering institutions and researchers to protect core methods and valuable datasets, balancing openness with the strategic management of assets in a competitive global environment.
- Manage Knowledge Assets and Uphold FAIR Principles with Comprehensive Governance: DataJoint supports FAIR principles for both workflows and data. Crucially, it provides the infrastructure for implementing detailed data governance and defining knowledge ownership structures. Its relational database foundation and workflow management capabilities allow for precise tracking of contributions, data provenance, and processing steps, which is essential for delineating ownership and licensing rights among institutions, departments, labs, and individual researchers. This facilitates the clear articulation and enforcement of policies regarding data and IP sharing, usage, and commercialization.

Empowering the Future: International Collaboration, Governance, Security, and Global Competitiveness

The current scientific era is defined by unprecedented opportunities for international collaboration to tackle grand challenges. DataJoint provides a common operational language and robust data management infrastructure that allows geographically dispersed teams to work together seamlessly, share insights responsibly, and accelerate discovery.

However, this occurs within a landscape of intense global competition. Nations and institutions vie for scientific leadership. DataJoint supports this by enabling rigorous data governance and IP management that helps protect a research entity’s competitive edge while still allowing for controlled collaboration. Its capabilities are crucial for implementing clear knowledge ownership and licensing frameworks, ensuring that contributions are properly recognized and benefits fairly distributed according to predefined agreements between universities, labs, and researchers.

DataJoint is engineered to provide fine-grained access controls, auditable data and computation trails, and a secure environment. This allows entities to collaborate confidently, sharing necessary components of data and workflows while protecting proprietary elements critical to their strategic interests.

The future of science is global, collaborative, and competitive, with an increasing emphasis on the strategic management of knowledge and data assets, and a renewed focus on the integrity of its communication. DataJoint is committed to empowering researchers and institutions worldwide to navigate this new landscape successfully, providing the tools to conduct world-class science, strategically manage their intellectual assets with clear governance and ownership structures, and forge a more impactful, secure, and sustainable scientific future for all.

August 2, 2025

Insights & Ideas
You Only Need Five Query Operators
Clarity in Complexity: Why DataJoint’s Five Query Operators Are All You Need

Navigating complex data demands tools that offer both power and clarity. DataJoint is designed for building and managing scientific data pipelines. The upcoming release of DataJoint Specs 2.0 marks the first time DataJoint will be developed against a formal, open specification document, embodying the philosophy of an open standard and capturing its core theoretical concepts to ensure consistent implementations across different platforms.

While many are familiar with SQL, the lingua franca of relational databases, DataJoint’s query language, as defined in these new specs, employs a remarkably concise set of just five core operators. This naturally raises the question: in a world accustomed to SQL’s extensive vocabulary, can just five operators truly be enough? This article argues an emphatic “yes” – not despite their small number, but precisely because of their rigorous design and unwavering commitment to fundamental relational principles.

The Theoretical Bedrock, ERM’s Vision, and SQL’s Journey

To appreciate DataJoint’s approach, it helps to understand the foundations. Relational database theory, pioneered by Edgar F. Codd in the late 1960s/early 1970s, is built on rigorous mathematics. Codd introduced relational algebra, a procedural language where operators like selection, projection, and join manipulate tables (relations) to produce new tables. He also defined relational calculus, a declarative language allowing users to specify what data they want. Codd proved these two formalisms were equivalent in power, establishing the concept of relational completeness.
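To make the algebraic view concrete, here is a minimal pure-Python sketch of three of Codd's operators. Relations are represented as lists of dictionaries without duplicate rows. The function names (select, project, natural_join) and the sample data are our own illustrative choices, not part of any library.

```python
# Relations as lists of dicts with no duplicate rows; data are invented for illustration.
students = [
    {"student_id": 1, "name": "Ada"},
    {"student_id": 2, "name": "Grace"},
]
enrollments = [
    {"student_id": 1, "course": "Logic"},
    {"student_id": 1, "course": "Sets"},
    {"student_id": 2, "course": "Logic"},
]

def dedupe(rows):
    """Relations are sets: drop duplicate rows while keeping first-seen order."""
    seen, out = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

def select(rows, predicate):
    """Selection: keep the rows that satisfy a condition."""
    return [row for row in rows if predicate(row)]

def project(rows, attrs):
    """Projection: keep only the named attributes, then remove duplicates."""
    return dedupe([{a: row[a] for a in attrs} for row in rows])

def natural_join(left, right):
    """Natural join: pair rows that agree on every attribute name they share."""
    out = []
    for l in left:
        for r in right:
            shared = l.keys() & r.keys()
            if all(l[a] == r[a] for a in shared):
                out.append({**l, **r})
    return dedupe(out)

# Operators compose because each one takes relations and returns a relation.
logic_names = project(
    select(natural_join(students, enrollments), lambda row: row["course"] == "Logic"),
    ["name"],
)
courses = project(enrollments, ["course"])
print(logic_names)
print(courses)
```

The nested call spells out an explicit sequence of operations (join, then select, then project), which is the procedural flavor of the algebra. A calculus-style query would state only the condition the result must satisfy.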

In 1976, a pivotal moment in conceptual database modeling arrived with Peter Chen’s introduction of the Entity-Relationship Model (ERM). Chen proposed a way to look at the world, and thus model data, in terms of “entities” (distinguishable “things” like a student or a course) and “relationships” between them (like a student “enrolling” in a course). The ERM provided a powerful visual language—ER diagrams—that became incredibly influential for database schema design and for communication between database designers, domain experts, and stakeholders. It offered an intuitive framework for translating real-world scenarios into structured data models, naturally leading to well-normalized schemas.

However, a significant disconnect emerged. While ERM became a standard for conceptualizing and designing databases, its elegant, entity-centric syntax and explicit relationship constructs were never directly mirrored in SQL’s Data Definition Language (DDL for creating tables) or its Data Query Language (DQL for retrieving data). SQL’s CREATE TABLE statements, while defining columns and foreign keys (which implement ERM relationships), don’t speak the direct language of “entity sets” and “relationship sets” in the way ERM diagrams do. Similarly, SQL’s JOIN syntax, while powerful, doesn’t inherently guide users to join tables based on the semantically defined relationships from an ERM perspective. This left a gap between the clarity of the conceptual design and the often more intricate, attribute-level syntax of SQL implementation and querying.
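The gap can be illustrated with Python's built-in sqlite3 module and a hypothetical student, course, and enrollment schema (names and rows are invented for illustration). The “enrolls in” relationship appears in the DDL only as a pair of foreign keys, and SQL accepts a join that ignores it just as readily as one that follows it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE course  (course_id  INTEGER PRIMARY KEY, title TEXT NOT NULL);
    -- The "enrolls in" relationship exists only as a pair of foreign keys.
    CREATE TABLE enrollment (
        student_id INTEGER NOT NULL REFERENCES student (student_id),
        course_id  INTEGER NOT NULL REFERENCES course (course_id),
        PRIMARY KEY (student_id, course_id)
    );
    INSERT INTO student VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO course  VALUES (1, 'Logic'), (2, 'Sets');
    INSERT INTO enrollment VALUES (1, 2), (2, 1);
""")

# A join that follows the declared relationship.
intended = conn.execute("""
    SELECT s.name, c.title
    FROM student AS s
    JOIN enrollment AS e ON e.student_id = s.student_id
    JOIN course AS c ON c.course_id = e.course_id
    ORDER BY s.student_id
""").fetchall()

# A join that ignores it: the two ids merely happen to be comparable integers.
accidental = conn.execute("""
    SELECT s.name, c.title
    FROM student AS s
    JOIN course AS c ON c.course_id = s.student_id
    ORDER BY s.student_id
""").fetchall()

print(intended)
print(accidental)
```

Both queries run without complaint and return plausible-looking rows, but only the first reflects who is actually enrolled in what. Nothing in the JOIN syntax distinguishes the two.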

SQL itself emerged as a practical implementation drawing from both relational algebra and calculus. Its SELECT...FROM...WHERE structure has a declarative feel whereas JOIN is a relational algebra operator. A fascinating part of SQL’s early vision was its aspiration to be a natural language interface for databases, aiming for queries that read like English prose. While admirable, this came at the cost of the explicit operator sequencing and rigorous composability found in more formal algebraic systems.

SQL’s widespread adoption, fueled by successful standardization efforts, has been immensely beneficial. However, through its evolution, SQL accumulated “conceptual baggage”—layers of complexity and ambiguity that can obscure the underlying simplicity of relational operations.

The Cornerstone: Well-Defined Query Results

A central tenet of the DataJoint philosophy, crystallized in the new Specs 2.0, is that all data, whether stored in base tables or derived through queries, must represent well-formed entity sets (or relations). What does this mean in practice? It means that every table, including any intermediate or final result of a query, must:
- Clearly represent a single, identifiable type of entity (e.g., “Students,” “Experiments,” “MeasurementEvents”).
- Have a well-defined primary key – a set of attributes whose values uniquely identify each entity (row) within that set.
- Ensure that all its attributes properly describe the entity identified by that primary key.

This commitment is upheld through what the DataJoint Specs refer to as algebraic closure. Each of DataJoint’s query operators is designed such that if you give it well-formed relations as input, it will always produce another well-formed relation as output, complete with its own clear primary key and entity type.

This brings us to a critical question to keep in mind when working with SQL: Can you always tell, just by looking at the SQL statement, what real-world entity each row in the result is meant to represent, and what makes each row unique? Often, the answer becomes murky.

DataJoint’s “Fab Five”: A Modern Interface to Relational Power

With deep reverence for Codd’s foundational genius, it’s fair to ask if the original set of algebraic operators, defined over half a century ago, still constitutes the optimal user-facing interface for today’s data challenges. DataJoint, guided by its new Specs 2.0, proposes a refined, modern set of five operators designed for clarity and power:
1. Restriction (&, -): This is your precision filter. It selects a subset of rows from a table based on specified conditions without altering the table’s structure or primary key. The resulting table contains the same type of entities and the same primary key.
2. Projection (.proj()): This operator reshapes your view of a table by selecting specific attributes, renaming them, or computing new attributes from existing ones. Crucially, the primary key of the original table is preserved, ensuring the identity of the entities remains intact.
3. Join (*): This operator combines information from two tables. But it’s not just any combination; it’s a “semantic join” (more on this later) that ensures the resulting table represents a meaningful fusion of entities, with a clearly defined primary key derived from its operands.
4. Aggregation (.aggr()): This operator can be seen as an advanced form of projection. For a table A, A.aggr(B, ...) can perform the same functions as A.proj(...) by selecting, renaming, and calculating attributes. In addition, it can calculate new attributes for each entity in A by summarizing related data from table B. The beauty is that the resulting table still has A’s primary key and represents entities of type A, now augmented with new information. Despite their similarity, we treat .proj and .aggr as distinct operators.
5. Union (+): This operator combines rows from two tables, A and B. For this to be valid, A and B must represent the same type of entity and share the same primary key structure; the result inherits this structure.
These five operators, through their strict adherence to producing well-defined results, form the backbone of DataJoint’s expressive power.
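To make the closure idea tangible, here is a minimal pure-Python sketch of a toy entity set. It is not DataJoint's implementation: the `Relation` class and its `restrict` and `proj` methods are hypothetical names invented for this illustration. The point is only that both operations return a new relation carrying the same primary key as their input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """Toy entity set (hypothetical, not DataJoint's API): rows plus a primary key."""
    primary_key: tuple  # names of the identifying attributes
    rows: tuple         # tuple of dicts, one per entity

    def __post_init__(self):
        # A well-formed entity set has no two rows with the same key values.
        keys = [tuple(r[k] for k in self.primary_key) for r in self.rows]
        if len(keys) != len(set(keys)):
            raise ValueError("primary key does not uniquely identify rows")

    def restrict(self, predicate):
        # Restriction: fewer rows, same attributes, same primary key.
        return Relation(self.primary_key, tuple(r for r in self.rows if predicate(r)))

    def proj(self, *attrs):
        # Projection: always keep the primary key, plus the requested attributes.
        keep = list(self.primary_key) + [a for a in attrs if a not in self.primary_key]
        return Relation(self.primary_key, tuple({k: r[k] for k in keep} for r in self.rows))


student = Relation(("student_id",), (
    {"student_id": 1, "name": "Ada", "year": 2},
    {"student_id": 2, "name": "Bo", "year": 1},
    {"student_id": 3, "name": "Cy", "year": 2},
))

# Operators compose, and every intermediate result is still a set of students.
second_years = student.restrict(lambda r: r["year"] == 2).proj("name")
print(second_years.primary_key, second_years.rows)
```

Because every method returns another `Relation` that passes the same uniqueness check, a chain of calls never leaves the world of well-formed entity sets, which is the property the Specs call algebraic closure.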

Untangling SQL: Where Simplicity Meets a Wall of Operators (Illustrated)

Let’s look at how DataJoint’s approach contrasts with SQL in practice.

SQL’s Operator Count – A Fuzzy Number

How many “operators” does SQL effectively have? It’s notoriously hard to quantify because many distinct logical operations are bundled into the complex SELECT statement. A single SELECT can perform filtering (like DataJoint’s restriction), column selection and computation (projection), table combination (join), grouping, and ordering, all intertwined.

Furthermore, seemingly simple modifiers in SQL can act like entirely new, transformative operators. Adding DISTINCT to a SELECT query doesn’t just remove duplicate rows; it fundamentally changes the resulting relation, implying a new primary key based on all the selected columns. Similarly, aggregate functions like COUNT() or AVG() within a SELECT statement, with or without a GROUP BY clause, transform the output into a new type of entity (e.g., “summary per department” instead of “employees”). With GROUP BY, the grouping columns form the new primary key; without it, the result collapses to a single summary row. If we were to “unroll” every distinct transformation SQL can perform, the operator count would be vastly larger and far more entangled than DataJoint’s explicit five.
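A small experiment shows how quietly the entity type shifts. The sketch below uses Python's built-in `sqlite3` module with a made-up `Employee` table (the table, columns, and values are assumptions for illustration only) and runs four variations of a SELECT over the same data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employee (
    employee_id INTEGER PRIMARY KEY,
    department  TEXT NOT NULL,
    salary      REAL NOT NULL
);
INSERT INTO Employee VALUES (1, 'Chem', 100.0), (2, 'Chem', 120.0), (3, 'Bio', 90.0);
""")

# One row per employee: the key is employee_id.
employees = conn.execute(
    "SELECT employee_id, department FROM Employee ORDER BY employee_id"
).fetchall()

# DISTINCT: one row per department; the implied key is now `department`.
departments = conn.execute(
    "SELECT DISTINCT department FROM Employee ORDER BY department"
).fetchall()

# GROUP BY: again one row per department, a "summary per department" entity.
per_department = conn.execute(
    "SELECT department, AVG(salary) FROM Employee GROUP BY department ORDER BY department"
).fetchall()

# Aggregate without GROUP BY: a single row describing the whole table.
overall = conn.execute("SELECT COUNT(*) FROM Employee").fetchall()

print(employees)       # [(1, 'Chem'), (2, 'Chem'), (3, 'Bio')]
print(departments)     # [('Bio',), ('Chem',)]
print(per_department)  # [('Bio', 90.0), ('Chem', 110.0)]
print(overall)         # [(3,)]
```

All four statements begin with the same keyword and read from the same table, yet their rows stand for employees, departments, departments, and the table as a whole, respectively.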

The SELECT Statement’s Hidden Logic

The order in which SQL clauses are written (SELECT, FROM, WHERE, etc.) doesn’t reflect their logical execution order. This “hidden logic” often confuses users, particularly regarding the scope of aliases defined in the SELECT list. DataJoint’s explicit, sequential application of operators avoids this ambiguity entirely.

The Labyrinth of SQL Joins vs. DataJoint’s Semantic Precision

SQL offers various join implementations like INNER JOIN, LEFT/RIGHT/FULL OUTER JOIN, and CROSS JOIN with various modifiers such as NATURAL, USING, and ON <condition>. Classical relational algebra also defined foundational concepts such as equijoin (joining based on equality, a specific type of a more general “theta join” which allows any comparison – though these terms are more academic than direct SQL syntax) and natural join, which have influenced SQL’s join logic. However, SQL’s NATURAL JOIN (matching on identically named columns) can be treacherous, as it may join attributes that share a name but have completely different meanings.
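The NATURAL JOIN hazard is easy to reproduce. In this sketch (run against SQLite through Python's `sqlite3` module; the `Student` and `Department` tables and their contents are assumptions made up for the demonstration), both tables happen to have a `name` column, one holding a person's name and the other a department's name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Department (dept_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Student (student_id INTEGER PRIMARY KEY, name TEXT, dept_id INTEGER);
INSERT INTO Department VALUES (10, 'Biology'), (20, 'Physics');
INSERT INTO Student VALUES (1, 'Alice', 10), (2, 'Bob', 20);
""")

# NATURAL JOIN matches on every shared column name: dept_id AND name.
# No student is named after their department, so nothing matches.
natural = conn.execute("SELECT * FROM Student NATURAL JOIN Department").fetchall()

# Naming the intended column explicitly gives the expected pairing.
explicit = conn.execute("""
    SELECT s.student_id, s.name, d.name
    FROM Student AS s JOIN Department AS d USING (dept_id)
    ORDER BY s.student_id
""").fetchall()

print(natural)   # []
print(explicit)  # [(1, 'Alice', 'Biology'), (2, 'Bob', 'Physics')]
```

The NATURAL JOIN raises no error and no warning; it simply returns an empty result because the two unrelated `name` columns were silently added to the join condition.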

The ERM holds that meaningful joins should follow the foreign keys between related tables. DataJoint institutionalizes this with Semantic Matching for its one and only join (*) operator. For attributes to be matched, they must not only share the same name but also trace their lineage through an uninterrupted chain of foreign keys to the same original attribute definition. If identically named attributes don’t meet this criterion, it’s a “collision,” and DataJoint raises an error, compelling the user to explicitly rename attributes using projection before the join. This rigor, tied to the foreign key relationships typically visualized in a schema diagram, means the validity of a DataJoint join is often apparent from the schema structure itself.
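The matching rule can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not DataJoint's code: each attribute is tagged with a hypothetical lineage pair `(origin_table, origin_attribute)`, and the `matched_attributes` function and `JoinCollisionError` exception are names invented for this example.

```python
class JoinCollisionError(Exception):
    """Raised when same-named attributes do not share a common origin."""


def matched_attributes(left, right):
    """Return the attribute names on which a toy semantic join would match.

    `left` and `right` map attribute name -> lineage, where lineage is a
    hypothetical (origin_table, origin_attribute) pair reached by following
    foreign keys back to the attribute's original definition.
    """
    matched = []
    for name in sorted(left.keys() & right.keys()):
        if left[name] != right[name]:
            raise JoinCollisionError(
                f"attribute {name!r} has different origins: {left[name]} vs {right[name]}"
            )
        matched.append(name)
    return matched


student = {"student_id": ("Student", "student_id"), "name": ("Student", "name")}
enroll = {"student_id": ("Student", "student_id"), "course_id": ("Course", "course_id")}
department = {"dept_id": ("Department", "dept_id"), "name": ("Department", "name")}

# Enroll.student_id traces back to Student.student_id, so the join is valid.
matched = matched_attributes(student, enroll)

# Student.name and Department.name share a name but not an origin: collision.
try:
    matched_attributes(student, department)
    collision_detected = False
except JoinCollisionError:
    collision_detected = True

print(matched, collision_detected)
```

In this toy model the remedy mirrors the one described above: rename one of the colliding attributes (for example, `name` to `dept_name`) before joining, so that the intent is stated rather than inferred from a shared label.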

Semijoin and Antijoin: The Illegitimate “Joins”

Speaking of joins, relational algebra textbooks often discuss semijoin and antijoin.
- A semijoin (A⋉B) returns rows from table A for which there is at least one matching row in table B (based on common attributes), but it only includes columns from table A.
- An antijoin (A▹B) returns rows from table A for which there are no matching rows in table B, again only including columns from table A.
While these are powerful, the “join” in their names is quite a misnomer. True joins combine attributes from both participating tables to form a new, wider entity and create new entities by pairing rows from the joined tables. Semijoins and antijoins, however, don’t do this. They fundamentally act as filters on table A based on the existence (or non-existence) of related records in table B. The structure of table A (its attributes and primary key) remains unchanged; you merely get a subset of its rows. This is precisely the definition of a restriction operation.

In SQL, these operators are implemented in a wide variety of ways, typically using a subquery with EXISTS, NOT EXISTS, IN, and NOT IN operators, or as an inner join followed by a GROUP BY or by using a DISTINCT modifier.
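To see that these formulations really are filters on table A, the following sketch runs several of them against SQLite via Python's `sqlite3` module. The `Student` and `Enroll` tables and their rows are assumptions invented for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Student (student_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Enroll (
    student_id INTEGER, course_id TEXT,
    PRIMARY KEY (student_id, course_id)
);
INSERT INTO Student VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Cara');
INSERT INTO Enroll VALUES (1, 'BIO101'), (1, 'CHEM101'), (2, 'BIO101');
""")

# Three spellings of the same semijoin: students with at least one enrollment.
semijoin_exists = conn.execute("""
    SELECT s.* FROM Student AS s
    WHERE EXISTS (SELECT 1 FROM Enroll AS e WHERE e.student_id = s.student_id)
    ORDER BY s.student_id
""").fetchall()
semijoin_in = conn.execute("""
    SELECT * FROM Student
    WHERE student_id IN (SELECT student_id FROM Enroll)
    ORDER BY student_id
""").fetchall()
semijoin_distinct = conn.execute("""
    SELECT DISTINCT s.* FROM Student AS s JOIN Enroll AS e USING (student_id)
    ORDER BY s.student_id
""").fetchall()

# Antijoin: students with no enrollment at all.
antijoin = conn.execute("""
    SELECT s.* FROM Student AS s
    WHERE NOT EXISTS (SELECT 1 FROM Enroll AS e WHERE e.student_id = s.student_id)
""").fetchall()

print(semijoin_exists)  # [(1, 'Alice'), (2, 'Bob')]
print(antijoin)         # [(3, 'Cara')]
```

Every result has exactly the columns of `Student` and at most one row per student, even though Alice has two enrollments. Nothing from `Enroll` is carried into the output, which is what one expects from a restriction rather than a join.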

The DataJoint Specs 2.0 acknowledge the true restriction-like nature of these operators directly: when performing a restriction by a subquery (which is conceptually how one table filters another), “The restriction acts as a semijoin (for &) or an antijoin (for -)”. The earlier DataJoint manuscript (Yatsenko et al., 2018) also explicitly deprecated the terms “semijoin” and “antijoin” as misleading and confusing. DataJoint thus categorizes these operations under its versatile Restriction operator. This avoids the indirection of SQL, where the same effect requires EXISTS, NOT EXISTS, IN, or NOT IN subqueries.

SQL’s OUTER JOINs: A Mix of Meanings

SQL’s OUTER JOIN variants (like LEFT JOIN) often create results that are a jumble of entity types. Some rows might represent a complete pairing, while others represent only one entity, padded with NULLs. After such an operation, can you confidently tell what real-world entity each row in the result is meant to represent?
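The mixed result is easy to see on a tiny dataset. The sketch below uses Python's `sqlite3` module with assumed `Student` and `Enroll` tables (names and rows are made up for illustration) and prints what a LEFT JOIN returns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Student (student_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Enroll (
    student_id INTEGER, course_id TEXT,
    PRIMARY KEY (student_id, course_id)
);
INSERT INTO Student VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Cara');
INSERT INTO Enroll VALUES (1, 'BIO101'), (1, 'CHEM101'), (2, 'BIO101');
""")

rows = conn.execute("""
    SELECT s.student_id, s.name, e.course_id
    FROM Student AS s LEFT JOIN Enroll AS e USING (student_id)
    ORDER BY s.student_id, e.course_id
""").fetchall()

for row in rows:
    print(row)
# (1, 'Alice', 'BIO101')
# (1, 'Alice', 'CHEM101')
# (2, 'Bob', 'BIO101')
# (3, 'Cara', None)

# Rows whose course_id is NULL describe a student, not an enrollment.
padded = [r for r in rows if r[2] is None]
```

The first three rows are enrollments, identified by `(student_id, course_id)`. The last row is a bare student with a NULL where the course should be, so no single primary key describes every row of the result.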

DataJoint’s Specs 2.0 clearly state its approach: it effectively has no direct “outer join” operator because such an operation typically violates the principle of yielding a single, well-defined entity set with a consistent primary key. Instead, DataJoint’s aggr operator cleanly achieves the common goal of augmenting one entity set with summaries from another, preserving the primary entity’s type and identity.

Redundancy in Restriction in SQL

SQL uses multiple clauses for what amounts to filtering: WHERE, ON (in joins), HAVING (for groups), and LIMIT / OFFSET (for result sets). DataJoint streamlines this with its single, powerful Restriction operator (& and its complement -).

Illustrative Examples: DataJoint vs. SQL

Let’s make these differences more concrete. (Imagine a simplified university database with Student, Course, Section, Enroll, StudentMajor, and Grade tables.)

1. Finding Students Enrolled in Any Class
- DataJoint:
```
Student & Enroll 
```
Result: A well-defined set of Student entities.
- SQL:
```
SELECT *
FROM Student
WHERE student_id IN (SELECT student_id FROM Enroll);
```
Result: Rows from the Student table, but the logic is more verbose.

2. Counting Enrolled Students per Section
- DataJoint:
```
Section.aggr(Enroll, n_students='COUNT(*)')
```
Result: Section entities, augmented with n_students.
- SQL:
```
SELECT sec.*, COUNT(e.student_id) AS n_students
FROM Section AS sec
LEFT JOIN Enroll AS e USING (course_id, section_id)
GROUP BY course_id, section_id;
```
Result: Requires explicit join and grouping by all parts of Section’s primary key.
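For readers who want to try the SQL version of this example, here is a self-contained sketch that runs it in SQLite through Python's `sqlite3` module. The column layout of `Section` and `Enroll` (including the `room` attribute) and the sample rows are assumptions, since the article does not spell out the schema; an ORDER BY is added only to make the output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Section (
    course_id TEXT, section_id INTEGER, room TEXT,
    PRIMARY KEY (course_id, section_id)
);
CREATE TABLE Enroll (
    student_id INTEGER, course_id TEXT, section_id INTEGER,
    PRIMARY KEY (student_id, course_id, section_id)
);
INSERT INTO Section VALUES ('BIO101', 1, 'A1'), ('BIO101', 2, 'A2'), ('CHEM101', 1, 'B1');
INSERT INTO Enroll VALUES (1, 'BIO101', 1), (2, 'BIO101', 1), (3, 'CHEM101', 1);
""")

counts = conn.execute("""
    SELECT sec.*, COUNT(e.student_id) AS n_students
    FROM Section AS sec
    LEFT JOIN Enroll AS e USING (course_id, section_id)
    GROUP BY course_id, section_id
    ORDER BY sec.course_id, sec.section_id
""").fetchall()

for row in counts:
    print(row)
# ('BIO101', 1, 'A1', 2)
# ('BIO101', 2, 'A2', 0)
# ('CHEM101', 1, 'B1', 1)
```

The result has one row per section, including the empty section BIO101/2 with a count of zero. Getting there requires the author of the query to choose LEFT JOIN, to count a column from `Enroll` rather than the rows, and to list every primary key attribute of `Section` in the GROUP BY.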

The DataJoint Advantage: Why These Five Excel

DataJoint’s design philosophy demonstrates that true power doesn’t come from a multitude of overlapping commands, but from a concise set of orthogonal, well-defined operators that compose reliably.
- Consistently Well-Defined Results (Algebraic Closure): Every operation yields a predictable, valid table with a defined primary key and entity type.
- Semantic Precision: Binary operations like join are based on meaningful relational links, not just coincidental name matches.
- Composability: Simple, reliable steps can be combined to build sophisticated queries.
- Interpretability: The nature of the data remains clear at every stage of the query.
- Entity-Oriented Focus: The operators encourage thinking in terms of whole entities and their relationships, aligning well with conceptual modeling principles championed by the ERM as opposed to the attribute-oriented focus of SQL.

Conclusion: A Clearer Lens for Data Discovery

SQL’s position as a foundational data language is secure, and its contributions are undeniable. However, for the complex, high-stakes data work found in scientific research and other demanding domains, a query interface that prioritizes conceptual clarity, predictability, and semantic integrity can be transformative.

DataJoint, as guided by its new Specs 2.0, isn’t about minimalism for its own sake. It’s about providing a complete and conceptually sound set of query operators that empower users. By ensuring every operation results in a well-defined entity set and by enforcing semantic integrity in operations like joins, DataJoint aims to strip away ambiguity and allow researchers to interact with their data with greater confidence and insight. It’s a compelling case that sometimes, to see further, we need not more tools, but clearer lenses.

June 6, 2025
