Near AI x HZN – The Importance of Data

Global Coin Research Team

July 18, 2024

Insights Partnership

In our last article, we outlined Near’s vision to become the hub for User-Owned AI and the steps it is taking to develop an AI ecosystem. In this piece, we begin to dive into how Near aims to do this with the Near AI x HZN program, its first-ever all-AI accelerator cohort.

Near was very intentional in choosing the make-up of this first AI cohort. It was important to seed the ecosystem with what Near considers to be foundational pieces of the AI stack.

One of the key components of Near’s AI ecosystem is data. Reliable, high-quality data is necessary to train effective AI models. To address this need for training data, Near chose two projects: Mizu, which tackles synthetic data, and Ringfence, which tackles user-generated content.

Mizu

Mizu is building the first blockchain-based synthetic AI data layer, enabling large-scale, verifiable datasets to overcome AI training data limitations. Mizu envisions a world where developers use community-curated data repos as the fundamental building blocks for developing applications, much like the existing paradigm where devs collaborate on code repos on platforms such as GitHub.

Big tech may run out of real-world data to train AI models by 2026. Synthetic data, artificially created to mimic real-world data characteristics, offers a solution. It can be generated quickly, with specific properties, and addresses privacy concerns. Leading AI models like GPT-5 and Tesla FSD V12 already rely heavily on synthetic data. Gartner predicts that synthetic data will be used more than real data in AI models by 2030.

While Web2 synthetic data companies like Gretel.ai and Tonic.ai have secured significant funding, Mizu is pioneering the first Web3 synthetic data solution. Blockchain technology enhances transparency, aligns stakeholder incentives, and allows developers to build new pipelines on existing datasets.

Product

The Mizu Data Network enables developers to deploy and run LLM-driven data workflows transparently and securely.

Data Repo

Many open-source datasets are available today, e.g., from Common Crawl and Hugging Face. However, these datasets are static and can become stale. Mizu transforms static datasets into permissionless data repos where anyone can contribute. Each repo includes:

Smart Account: Each data repo is managed by a smart contract, allowing it to hold tokens and giving users permissions to manage the data and rules. The smart account serves as the foundation for the repo’s governance and access control.
Datasets: This is the actual data committed to the repo. The datasets can be in various formats and cover a wide range of domains, providing the raw material for AI applications and models.
Data Index: The data index provides specific views of the data, which can be subsets of the data or aggregated results. A validation rule can be applied to either the data or the data index, ensuring the quality and consistency of the information accessible through the index.
Validation Rules: The validation rules define what data can be committed to the repo. These rules can be descriptive, which will be validated by Large Language Models (LLMs), or program-based, which will be validated by pre-compiled validators. The validation rules ensure the integrity and reliability of the data within the repo.
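
To make the relationship between these four pieces more concrete, here is a minimal in-memory sketch in Python. It is our own illustration and not Mizu's actual API: the `DataRepo` class, its `commit` and `index` methods, and the two example rules are all hypothetical names. It models only program-based validation rules; descriptive rules checked by an LLM are left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Record = Dict[str, object]
Validator = Callable[[Record], bool]


@dataclass
class DataRepo:
    """Hypothetical stand-in for a data repo (not Mizu's real interface)."""
    owner: str  # stands in for the smart account that manages the repo
    rules: List[Validator] = field(default_factory=list)
    datasets: List[Record] = field(default_factory=list)

    def commit(self, record: Record) -> bool:
        """Accept a record only if every validation rule passes."""
        if all(rule(record) for rule in self.rules):
            self.datasets.append(record)
            return True
        return False

    def index(self, predicate: Validator) -> List[Record]:
        """A data index is modeled as a filtered view over the datasets."""
        return [r for r in self.datasets if predicate(r)]


def has_text(record: Record) -> bool:
    """Program-based rule: the record must carry non-empty text."""
    text = record.get("text")
    return isinstance(text, str) and len(text.strip()) > 0


def has_known_language(record: Record) -> bool:
    """Program-based rule: the language tag must be on an allow-list."""
    return record.get("lang") in {"en", "es", "zh"}


repo = DataRepo(owner="alice.near", rules=[has_text, has_known_language])
accepted = repo.commit({"text": "Hello world", "lang": "en"})
rejected = repo.commit({"text": "   ", "lang": "en"})
english_view = repo.index(lambda r: r["lang"] == "en")
```

In this sketch the blank-text record is turned away at commit time, so the index only ever sees data that has already passed the rules.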

The Mizu Data Network

The Mizu Data Network is a decentralized network where all nodes run a DataVM (Data Virtual Machine). The DataVM is similar to the Ethereum Virtual Machine (EVM) but is designed specifically for deploying and running data workflows instead of smart contracts. Developers can deploy and run customized data workflows, which are compiled into atomic data operations, or data tasks. These tasks are then picked up and executed by Mizu data nodes within the network.
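
The flow from workflow to tasks to nodes can be sketched as follows. This is a simplified, single-process illustration under our own assumptions: the `DataTask` type, the `OPS` table, and the round-robin assignment of tasks to nodes are hypothetical and do not describe how the DataVM actually schedules work.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class DataTask:
    task_id: int
    op: str       # name of one atomic data operation
    payload: str  # the item the operation is applied to


# Hypothetical atomic operations a node knows how to execute.
OPS: Dict[str, Callable[[str], str]] = {
    "strip": str.strip,
    "lower": str.lower,
}


def run_workflow(
    steps: List[str], inputs: List[str], node_ids: List[str]
) -> Tuple[List[str], List[Tuple[str, str, int]]]:
    """Compile each step into one task per item and let nodes take turns."""
    items = list(inputs)
    log: List[Tuple[str, str, int]] = []
    for op in steps:
        # "Compile" this step into atomic tasks, one per item.
        queue = deque(DataTask(i, op, item) for i, item in enumerate(items))
        results: Dict[int, str] = {}
        turn = 0
        while queue:
            task = queue.popleft()
            node = node_ids[turn % len(node_ids)]  # assumed round-robin pickup
            results[task.task_id] = OPS[task.op](task.payload)
            log.append((node, task.op, task.task_id))
            turn += 1
        # Outputs of this step become inputs of the next one.
        items = [results[i] for i in range(len(items))]
    return items, log


outputs, log = run_workflow(
    ["strip", "lower"], ["  Hello ", " WORLD"], ["node-a", "node-b"]
)
```

The log records which node executed which task, which hints at why running workflows this way can be made transparent: every atomic step leaves an auditable trace.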

Trusted Data Engine

The Data Engine performs the actual data generation and validation work. It will support two types of engines: rule-based and LLM-based.

Rule-based Simulator: This enables developers to generate synthetic data that follows predefined rules. This feature is particularly useful for creating simulated environments and datasets. For example, game engines like Unity and Unreal Engine can be used to generate simulated street view images or 3D models. Tesla’s Full Self-Driving (FSD) V12 relies heavily on simulators to generate street view pictures for training its autonomous driving models.
LLM-based Engine: Developers can also call Large Language Models (LLMs) to generate, process, or annotate data directly. One example model/agent that can be used for this purpose is MetaGPT. Multiple projects are working on decentralized inference networks (e.g., Ritual) that Mizu could integrate with.
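
As a toy illustration of the rule-based approach, the sketch below generates synthetic driving-scene descriptions from a small set of rules. It is far simpler than a game-engine simulator, and the rule constants and the `generate_scenes` function are hypothetical names chosen for this example. A fixed seed makes the output reproducible, which is one plausible way to let others re-run and check a generation step.

```python
import random
from typing import Dict, List

# Hypothetical rules: each field may only take values from these sets.
WEATHER_OPTIONS = ["clear", "rain", "fog"]
LANE_RANGE = (1, 4)  # inclusive range of lane counts
SPEED_LIMITS_KMH = [30, 50, 80]


def generate_scenes(n: int, seed: int) -> List[Dict[str, object]]:
    """Generate n synthetic scene records that all satisfy the rules."""
    rng = random.Random(seed)  # local generator, so results depend only on seed
    low, high = LANE_RANGE
    scenes: List[Dict[str, object]] = []
    for _ in range(n):
        scenes.append(
            {
                "weather": rng.choice(WEATHER_OPTIONS),
                "lanes": rng.randint(low, high),
                "speed_limit_kmh": rng.choice(SPEED_LIMITS_KMH),
            }
        )
    return scenes


scenes = generate_scenes(5, seed=7)
```

Calling `generate_scenes` again with the same seed returns the same records, while a different seed yields a different but still rule-compliant dataset.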

Ringfence

Ringfence integrates blockchain technology to enable licensing, IP protection, and monetization for art and music clients, aiming to be the “DALL-E 3” for Web3. Generative AI tools, trained on web-scraped images without creator consent, pose challenges for copyright, authenticity, and monetization. Ringfence combines AI and blockchain to address these issues through Ringfence NFTs (rNFTs).

rNFTs offer a new approach to digital asset management, enabling copyright compliance, ownership verification, and monetization. They set the standard for what Ringfence calls Complex Digital Assets (CDAs), which represent a portfolio of licensable assets.

The Ringfence platform aims to address three key challenges posed by the unauthorized use of source content in AI-generated content:

Content Verification: Screens uploads against existing IP to prevent infringement.
Digital Provenance & Metadata Tagging: Tags assets with permanent metadata for transparent history.
Enforceable End-User Agreements: Uses smart contracts to set usage terms and compensation.
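
The first two items can be illustrated with a deliberately simple sketch. We assume, purely for illustration, that content is screened by exact SHA-256 fingerprint and that provenance is a small metadata record; the `IPRegistry` class is hypothetical and is not Ringfence's implementation. Exact hashing only catches byte-identical copies, so a real screening system would need some form of similarity matching on top.

```python
import hashlib
from typing import Dict, Optional


class IPRegistry:
    """Hypothetical registry that screens uploads by content fingerprint."""

    def __init__(self) -> None:
        self._owners: Dict[str, str] = {}  # fingerprint -> first registrant

    @staticmethod
    def fingerprint(content: bytes) -> str:
        """SHA-256 hex digest of the raw content bytes."""
        return hashlib.sha256(content).hexdigest()

    def register(self, content: bytes, creator: str) -> Optional[Dict[str, str]]:
        """Return a provenance tag, or None if the content is already claimed."""
        digest = self.fingerprint(content)
        if digest in self._owners:
            return None  # identical content was registered before: reject
        self._owners[digest] = creator
        return {"sha256": digest, "creator": creator}


registry = IPRegistry()
first_tag = registry.register(b"original artwork bytes", "artist.near")
duplicate_tag = registry.register(b"original artwork bytes", "copycat.near")
```

Here the second upload is rejected because its fingerprint matches content already registered by another creator, and the tag returned for the first upload acts as its provenance record.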

rNFTs offer flexible management, allowing temporary removal of component NFTs (cNFTs) while preserving their identity. This enhances user control within the Ringfence ecosystem.

cNFTs have varying values based on their origin, with original uploads generally worth more than AI-generated content. Users can grant one-time permission for AI training, with key metadata remaining immutable post-minting. rNFTs support various content types and include detailed metadata. Their dynamic nature allows creators to add new components over time.
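
The behaviors described above (temporary removal, immutable metadata, one-time training permission, and adding components over time) can be modeled with a small data structure. The sketch below is our own reading of the description, not Ringfence's contract code; every class and method name in it is hypothetical.

```python
from dataclasses import dataclass, field
from types import MappingProxyType
from typing import Dict, List, Mapping, Set


@dataclass(frozen=True)
class ComponentNFT:
    token_id: str
    origin: str                  # e.g. "original_upload" or "ai_generated"
    metadata: Mapping[str, str]  # read-only view, fixed at mint time


def mint_component(token_id: str, origin: str, metadata: Dict[str, str]) -> ComponentNFT:
    """Copy the metadata and wrap it so it cannot be changed after minting."""
    return ComponentNFT(token_id, origin, MappingProxyType(dict(metadata)))


@dataclass
class RingfenceNFT:
    owner: str
    components: Dict[str, ComponentNFT] = field(default_factory=dict)
    removed: Set[str] = field(default_factory=set)          # temporarily removed ids
    training_grants: Set[str] = field(default_factory=set)  # unused one-time grants

    def add(self, component: ComponentNFT) -> None:
        """Creators can add new components over time."""
        self.components[component.token_id] = component

    def remove_temporarily(self, token_id: str) -> None:
        """Hide a component without destroying it or changing its identity."""
        if token_id not in self.components:
            raise KeyError(token_id)
        self.removed.add(token_id)

    def restore(self, token_id: str) -> None:
        self.removed.discard(token_id)

    def active(self) -> List[ComponentNFT]:
        return [c for tid, c in self.components.items() if tid not in self.removed]

    def grant_training(self, model_id: str) -> None:
        self.training_grants.add(model_id)

    def use_for_training(self, model_id: str) -> bool:
        """Consume a grant; a second call for the same grant fails."""
        if model_id in self.training_grants:
            self.training_grants.remove(model_id)
            return True
        return False


rnft = RingfenceNFT(owner="artist.near")
rnft.add(mint_component("c1", "original_upload", {"title": "Sketch"}))
rnft.grant_training("model-x")
```

A temporarily removed component stays in `components` under the same `token_id`, which is how this sketch preserves its identity while it is inactive.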

rNFTs serve two key functions:

Data Assets: Help build neural networks with authorized data, enabling fair compensation for contributors.
IP Assets: Carry all IP associated with a collection, simplifying licensing.

Monetization options for rNFTs include:

Neural network training rewards
Licensing for commercial use
Branding and advertising
Future marketplace trading

Ringfence’s rNFTs protect creators in the AI era, offering them versatile tools for managing their digital IP and enabling fair monetization and simpler licensing across digital industries.

Seeding the Near AI ecosystem with Quality Data

With Mizu and Ringfence as key components of the Near AI ecosystem, other AI projects will have access to rich sources of data to drive their models. As with most things, the rule is garbage in, garbage out, and AI is no different in this respect. As AI models become more complex and target more diverse use cases, access to the niche datasets that Mizu can provide will become more important. In addition, Ringfence’s solution incentivizes content creators to allow their IP to be used for model training, which should lead to better generative models.

As we mentioned earlier, Near is focused on creating a hub for User-Owned AI. Data is just one component of this journey. In future pieces, we will cover other key pillars of the AI stack that the Near AI x HZN cohort brings to the ecosystem.

This article has been written and prepared by the GCR Research team in collaboration with Near Foundation/Near Horizon. Committed to staying current with industry developments and providing accurate and valuable information, GlobalCoinResearch.com is a trusted source for insightful news, research, and analysis.

Disclaimer: Investing carries with it inherent risks, including but not limited to technical, operational, and human errors, as well as platform failures. The content provided is purely for educational purposes and should not be considered as financial advice. The authors of this content are not professional or licensed financial advisors and the views expressed are their own and do not represent the opinions of any organization they may be affiliated with.