Large generative models (e.g., large language models, diffusion models) have shown remarkable performance, but their enormous scale demands significant computation and memory resources. Improving their efficiency is crucial to making them more accessible. This course introduces efficient deep learning computing techniques that enable powerful deep learning applications on resource-constrained devices. Topics include model compression, pruning, quantization, neural architecture search, distributed training, data/model parallelism, gradient compression, and on-device fine-tuning. The course also covers application-specific acceleration techniques for large language models, diffusion models, video recognition, and point clouds, as well as quantum machine learning. Students will get hands-on experience deploying large language models (e.g., LLaMA 2) on a laptop.