Data Is the New Oil: How to License Datasets for AI Without Losing Control (Part 2 of 2)
Guest Article by Jim W. Ko of Ko IP & AI Law PLLC
In Part 1 of this article, we explained the differences between local and cloud licensing models and why cloud access is ideal from the licensor’s perspective but often unrealistic given standard industry practices. We ended by highlighting the need for data licensors to mitigate the security risks that come with providing local copies of datasets to their clients. In Part 2, we discuss security architecture and legal contracting strategies for mitigating the risk of misappropriation of your licensed datasets.
III. Security Architecture for Onsite Access: What Licensors Should Require
To mitigate the risk of post-license data misuse when providing a copy of the dataset to the licensee, licensors should incorporate a multi-layered security architecture. Each layer makes it incrementally harder for a licensee to retain or misuse the dataset and increases the likelihood of detection and legal enforcement. Key components for consideration include:
Encryption at Rest and in Transit: All dataset files should be encrypted, and encryption keys should be revocable upon termination.
Access Control and Audit Logging: Strict user-level permissions, immutable logging of data access, and real-time monitoring should be enforced.
Watermarking and Dataset Fingerprinting: Embed signals in your data that persist through training and can later be used to detect whether a model memorized or retained your dataset. These techniques, including methods like radioactive data,[1] inject subtle, statistically engineered patterns that leave detectable traces in model outputs—even after fine-tuning or augmentation. (A toy sketch of this mark-and-detect pattern appears after this list.)
Privacy-Preserving Model Training: When licensees are permitted to train models on the dataset, licensors may require the use of differential privacy or related techniques to reduce the risk that trained models memorize or reveal raw data, making it harder for malicious actors to extract sensitive information.[2] (A minimal sketch of one differentially private update step also follows the list.)
Trusted Execution Environments (TEEs): Hardware-based confidential computing enclaves isolate sensitive data during training, even from system administrators.[3]
Model Unlearning Protocols: While still emerging, unlearning algorithms may allow partial “reversals” of training effects when data use is revoked.[4]
Model Behavior Monitoring: Tools are emerging to test whether models regurgitate memorized data, helping detect misuse even if direct data extraction isn’t attempted.[5]
Blockchain for Licensing and Access Logging: Utilize blockchain or distributed ledger systems to create tamper-evident access logs, smart contract-enforced usage terms, and cryptographic proofs of dataset ownership and compliance.[6] (The hash-chain sketch after this list illustrates the core tamper-evidence primitive.)
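To make the watermarking idea concrete, here is a minimal, illustrative Python sketch of the general mark-and-detect pattern. It is not the radioactive-data algorithm from note 1, and the names (mark_rows, alignment_score) and parameter values are hypothetical; the perturbation strength is exaggerated so the effect is visible in a toy linear model. The licensor nudges some records along a secret direction before delivery, and later tests whether a suspect model’s learned parameters are unusually aligned with that secret direction.

```python
import numpy as np

rng = np.random.default_rng(0)

def mark_rows(X: np.ndarray, rows: np.ndarray, mark: np.ndarray,
              strength: float) -> np.ndarray:
    """Return a copy of X with the chosen rows nudged along the secret mark direction."""
    X = X.copy()
    X[rows] += strength * mark
    return X

def alignment_score(weights: np.ndarray, mark: np.ndarray) -> float:
    """Cosine similarity between fitted model weights and the secret mark.
    Near zero for a model trained on clean data; training on marked data
    pulls the weights toward the mark and raises this score."""
    return float(weights @ mark /
                 (np.linalg.norm(weights) * np.linalg.norm(mark)))

# --- Toy demonstration ----------------------------------------------------
dim = 64
X = rng.normal(size=(5000, dim))
true_w = rng.normal(size=dim)                  # "real" signal in the data
y = (X @ true_w > 0).astype(float)

# Secret mark, kept orthogonal to the true signal so the clean baseline
# stays near zero in this toy.
raw = rng.normal(size=dim)
raw -= (raw @ true_w) / (true_w @ true_w) * true_w
mark = raw / np.linalg.norm(raw)

# Mark half of the positive-class rows with an exaggerated strength.
pos_rows = np.flatnonzero(y == 1)
marked_rows = rng.choice(pos_rows, size=len(pos_rows) // 2, replace=False)
X_marked = mark_rows(X, marked_rows, mark, strength=2.0)

# Least-squares fits stand in for real model training.
w_marked, *_ = np.linalg.lstsq(X_marked, y, rcond=None)
w_clean, *_ = np.linalg.lstsq(X, y, rcond=None)
print("alignment, model trained on marked data:", alignment_score(w_marked, mark))
print("alignment, model trained on clean data: ", alignment_score(w_clean, mark))
```

The marked fit should show a noticeably higher alignment score than the clean fit, which is the kind of statistical “trace” a licensor can point to when arguing that its data was used. Production schemes use far subtler perturbations and rigorous statistical tests.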
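Similarly, the core mechanic behind differentially private training (note 2) is easy to state even if production deployments are not: each example’s gradient is clipped to a fixed norm and calibrated Gaussian noise is added before the parameter update, which bounds how much any single record can influence the trained model. Below is a minimal sketch of one such update step under those assumptions; the function name and hyperparameters are hypothetical, and a real system would also track the cumulative privacy budget.

```python
import numpy as np

rng = np.random.default_rng(1)

def dp_sgd_step(weights, per_example_grads, lr=0.1,
                clip_norm=1.0, noise_multiplier=1.1):
    """One DP-SGD-style update: clip each example's gradient, sum,
    add Gaussian noise scaled to the clipping norm, average, then step.
    (Illustrative only; omits privacy accounting.)"""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    clipped = np.stack(clipped)
    noise = rng.normal(scale=noise_multiplier * clip_norm, size=weights.shape)
    noisy_mean = (clipped.sum(axis=0) + noise) / len(clipped)
    return weights - lr * noisy_mean

# Toy usage with random gradients standing in for real backprop output.
w = np.zeros(8)
grads = rng.normal(size=(32, 8))   # one gradient per example in the batch
w = dp_sgd_step(w, grads)
```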
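Finally, the tamper evidence promised by blockchain-style access logging (note 6) rests on a simple primitive: each log entry commits to the hash of the entry before it, so any after-the-fact edit breaks the chain. The sketch below shows that primitive in isolation, without a distributed ledger or smart contracts; the function names and event fields are hypothetical.

```python
import hashlib
import json
import time

def append_entry(log, event: dict) -> None:
    """Append an access event whose hash covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    body = {"ts": time.time(), "event": event, "prev_hash": prev_hash}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    log.append({**body, "hash": digest})

def verify_chain(log) -> bool:
    """Recompute every hash; any edited or deleted entry breaks the chain."""
    prev_hash = "0" * 64
    for entry in log:
        body = {k: entry[k] for k in ("ts", "event", "prev_hash")}
        if entry["prev_hash"] != prev_hash:
            return False
        digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if digest != entry["hash"]:
            return False
        prev_hash = entry["hash"]
    return True

log = []
append_entry(log, {"user": "analyst-42", "action": "read", "object": "dataset/v3"})
append_entry(log, {"user": "analyst-42", "action": "export", "object": "dataset/v3"})
print(verify_chain(log))                   # True
log[0]["event"]["action"] = "read-only"    # tamper with history...
print(verify_chain(log))                   # ...and verification fails: False
```

A shared or distributed ledger adds the further property that no single party, including the licensee, can quietly rewrite the whole chain.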
Each of these technologies comes (or will come) with its own cost, implementation burden, and legal implications. None are robust or mature enough to claw back data or force model forgetting, but they serve as valuable post-termination detection and enforcement mechanisms. Think of them not as containment tools, but as forensic triggers: they can’t prevent misuse, but they can help identify and prove it—especially when paired with strong contractual remedies and audit rights.
For licensors navigating the power asymmetry of data licensing in the GenAI era, the principle remains the same: trust but verify. That means designing the architecture—and the agreement—to assume good faith while still preparing for bad behavior.
IV. Legal Strategies: Contracts as Containment
Technology can slow down misuse. But ultimately, it’s the contract that draws the red lines—and the willingness to enforce them that keeps licensees honest.
If you're licensing high-value datasets for AI training, your agreement shouldn’t resemble a boilerplate software end-user license agreement (EULA); what you’re licensing is the crown jewel of your AI value proposition. The agreement should be purpose-built for the realities of data leakage, model memorization, and post-termination risk—and tailored to the architecture you choose.
Here are some key strategic considerations:
Purpose-limited use: Define exactly what the licensee is permitted to do—and nothing more.
Post-termination obligations: Set clear requirements for deletion, certification, and (where applicable) model handling.
Reverse engineering and misuse: Prohibit any attempt to extract or replicate the dataset from trained models.
Audit rights: Even if never exercised, they serve as a powerful deterrent.
Remedies with bite: Strong agreements support injunctive relief, damages, and attorneys’ fees.
Security expectations: Especially for onsite access, set baseline standards for encryption, access control, and breach notification.
Watermark acknowledgment: If you're embedding detection mechanisms, reserve the right to use them—and prohibit tampering.
Blockchain-recorded compliance: Incorporate blockchain-based verification of access logs, license conditions, and dataset provenance as enforceable elements of the agreement.
Bottom line: Your architecture and your license should speak the same language—or risk talking past each other when it matters most. A contract without technical controls is wishful thinking. Technical controls without a contract are an invitation for exploitation.
V. Conclusion: Control Is a Design Choice—and a Legal Strategy
Licensing datasets for AI training is no longer a simple matter of “give and get.” In a world where models can memorize, regenerate, and distribute the data they’re trained on, the risk isn’t just that your dataset might be copied—it’s that it might be embedded.
That’s why dataset control in the AI era isn’t just about infrastructure. It’s about intentionality. Every decision—cloud vs. onsite access, watermarking, audit rights, model retention—is a tradeoff between usability and security, speed and certainty, collaboration and containment.
There is no one-size-fits-all architecture, and no contract clause that can substitute for good design. But with the right mix of legal terms, technical controls, and business judgment, licensors can protect their data assets without stifling innovation.
And if your dataset really is the new oil—then you need to treat it like a natural resource: valuable, finite, and worth protecting with everything you've got.
[1] See Alexandre Sablayrolles, et al., Radioactive Data: Tracing Through Training, Proceedings of the 37th International Conference on Machine Learning, PMLR 119:8326 (2020), available at https://proceedings.mlr.press/v119/sablayrolles20a.html.
[2] See Joseph P. Near, et al., Guidelines for Evaluating Differential Privacy Guarantees, National Institute of Standards and Technology, NIST Special Publication 800-226 (Mar. 2025), available at https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-226.pdf.
[3] See supra note 1.
[4] See Ken Ziyu Liu, Machine Unlearning in 2024, Stanford AI Lab Blog (May 2024), available at https://ai.stanford.edu/~kzliu/blog/unlearning.
[5] See Buse Gul Atli Tekgul & N. Asokan, On the Effectiveness of Dataset Watermarking in Adversarial Settings, arXiv:2202.12506 (Feb. 2022), available at https://arxiv.org/abs/2202.12506.
[6] See Primavera De Filippi & Aaron Wright, Blockchain and the Law: The Rule of Code (Harvard Univ. Press 2018).
About the Author
Jim W. Ko is a patent attorney who focuses his practice on counseling businesses on all the ways that intellectual property and artificial intelligence (AI) issues can and will impact them. He lives in Chandler, Arizona.
At Outrigger Group, we provide fractional executive support to help you achieve your version of success. Whether you're scaling, pivoting, or refining your strategy, our experienced team is here to offer support without slowing you down. Reach out to info@outrigger.group if you want to start a conversation.