Chinese Internet Corpus 3.0 Released: 120GB for AI Development
Kunming, September 18, 2025 — The Chinese Internet Basic Corpus 3.0, a massive 120GB dataset, was officially unveiled during the AI Security Governance Forum at the 2025 National Cybersecurity Awareness Week. This release marks a significant milestone in providing high-quality data for large-scale AI model training and advancing artificial intelligence research.
Collaborative Development Under Government Guidance
The corpus was developed jointly by the China Cybersecurity Association and the National Internet Emergency Center, under the guidance of the Central Cyberspace Administration. Enterprises, universities, and research institutions contributed through the corpus co-construction and sharing mechanism established by the AI Security Governance Committee.
Enhanced Data Quality and Scope
Compared to previous versions, Corpus 3.0 features:
- Expanded data sources for broader coverage.
- Strict processing measures, including source screening, content filtering, and deduplication.
- Improved reliability by filtering out illegal or harmful content.
The dataset aims to create a healthier environment for AI research and applications, ensuring developers have access to clean, credible data.
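The article does not describe how these processing steps are implemented. As a rough illustration only, the three named measures (source screening, content filtering, and deduplication) could be chained as in the Python sketch below. The domain allowlist, the blocked-term list, and all function names are hypothetical placeholders, and exact-match hashing stands in for whatever deduplication method the corpus team actually uses.

```python
import hashlib
import re
from urllib.parse import urlparse

# Hypothetical values for illustration; the real screening rules are not public in the article.
TRUSTED_DOMAINS = {"example.org", "example.edu"}
BLOCKED_TERMS = {"blocked-term"}


def is_trusted_source(url: str) -> bool:
    """Source screening: keep only URLs whose host is on the allowlist (or a subdomain of it)."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in TRUSTED_DOMAINS)


def passes_content_filter(text: str) -> bool:
    """Content filtering: reject text containing any blocked term (case-insensitive)."""
    lowered = text.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)


def fingerprint(text: str) -> str:
    """Deduplication key: SHA-256 of the text after collapsing whitespace and lowercasing."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def clean_corpus(records):
    """Apply screening, filtering, and deduplication to (url, text) pairs, preserving order."""
    seen = set()
    kept = []
    for url, text in records:
        if not is_trusted_source(url):
            continue
        if not passes_content_filter(text):
            continue
        fp = fingerprint(text)
        if fp in seen:
            continue  # a normalized copy of this text was already kept
        seen.add(fp)
        kept.append((url, text))
    return kept
```

Calling `clean_corpus` on a list of `(url, text)` pairs returns only the records that pass all three stages. A production pipeline would likely replace the keyword check with trained classifiers and the exact hash with near-duplicate detection, but the stage ordering shown here is one plausible arrangement.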
Accessibility and Future Plans
Researchers and developers can access the corpus by registering on the China Cybersecurity Association website and then visiting the Chinese Internet Corpus Resource Platform. The project lead said that work to strengthen the corpus will continue, in support of AI innovation and industrial growth.
The release underscores China's commitment to advancing AI technology through collaborative, high-quality data infrastructure.
Key Points:
- 120GB dataset: Designed for large-scale AI model training.
- Government-backed: Developed under Central Cyberspace Administration guidance.
- Enhanced quality: Rigorous processing ensures reliable data.
- Access by registration: Available to researchers and developers who register on the China Cybersecurity Association website.