The matlok dataset collection on Hugging Face is built for training multimodal Python copilots. It contains ~2.3M unique source coding rows, ~1.1M instruct alpaca yaml text rows, ~923K png knowledge graph images, and ~334K mp3s, and requires 1.5 TB of storage. It is intended as a resource for creating and sharing large datasets for AI development, and it documents the dataset composition, schema design, and usage examples across source code, text, image, and audio data. For further details, see the Hugging Face dataset page.
Here's the summary (everything is in parquet files):
~2.3M unique source coding rows
~1.1M instruct alpaca yaml text rows
~923K png knowledge graph images with alpaca text description
~334K mp3s with alpaca text and a different speaker for questions vs answers
requires 1.5 TB storage on disk
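
Because the full collection needs about 1.5 TB on disk, it can help to check free space before downloading anything. The following is a minimal sketch using only the Python standard library. The function name `has_room_for_dataset` and the constant `REQUIRED_BYTES` are hypothetical names introduced here for illustration, and the code assumes "1.5 TB" means decimal terabytes (1.5 × 10^12 bytes), which the summary does not specify.

```python
import shutil

# Assumption: "1.5 TB" is read as decimal terabytes (10**12 bytes each).
REQUIRED_BYTES = int(1.5 * 10**12)


def has_room_for_dataset(path=".", required_bytes=REQUIRED_BYTES):
    """Return True if the filesystem holding `path` has at least
    `required_bytes` of free space available."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_bytes


if __name__ == "__main__":
    # Check the current working directory's filesystem.
    if has_room_for_dataset("."):
        print("Enough free space for the full dataset.")
    else:
        print("Not enough free space; consider downloading a subset.")
```

If the check fails, one option is to fetch only the parquet files for the modality you need (source code, text, image, or audio) rather than the whole collection.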