Why now? AI's transformative potential has advanced dramatically since Vaswani et al. (2017) introduced the transformer architecture, originally designed for machine translation. That innovation ushered in the era of generative AI, characterized by diffusion models for image creation and large language models (LLMs): deep learning networks that can be programmed with natural language. Unlike traditional ML models, LLMs can perform many tasks without task-specific training or tuning, heralding a new era of technological deployment and innovation. This practice of programming language models with natural language to accomplish specific tasks is known as prompt engineering.
In 2023, Meta released the Llama 2 family of language models, including Llama Chat, Code Llama, and Llama Guard. These models are among the strongest openly available general-purpose LLMs and come in several sizes:
Llama 2 Models
- llama-2-7b: Base pretrained 7 billion parameter model
- llama-2-13b: Base pretrained 13 billion parameter model
- llama-2-70b: Base pretrained 70 billion parameter model
- Fine-tuned chat and code variants of the above (e.g., Llama Chat, Code Llama)
Getting an LLM
Large language models can be deployed through self-hosting, cloud hosting, or hosted APIs, each with its own trade-offs. Self-hosting maximizes privacy and control over your data, cloud hosting offers customization without managing your own hardware, and hosted APIs are the easiest way to get started.
Hosted APIs are the simplest starting point for using LLMs. Key endpoints include:
- completion: Generates a response to a given prompt.
- chat_completion: Generates the next message in a conversation, using the list of prior messages as context for applications like chatbots.
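The shape of these two endpoints can be sketched with a stand-in model function; `fake_llm` below is a placeholder, and a real deployment would call the hosted API instead:

```python
# Sketch of the two endpoint shapes. `fake_llm` is a stand-in for a real
# model call (e.g., a hosted Llama 2 API).

def fake_llm(text: str) -> str:
    return f"[model response to: {text[:40]}]"

def completion(prompt: str) -> str:
    """Single-turn: one prompt string in, one generated string out."""
    return fake_llm(prompt)

def chat_completion(messages: list) -> dict:
    """Multi-turn: a list of {role, content} messages in, the next message out."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    return {"role": "assistant", "content": fake_llm(transcript)}

print(completion("The typical color of the sky is "))
print(chat_completion([
    {"role": "user", "content": "What is the capital of France?"},
]))
```

The key difference is the input contract: completion takes raw text, while chat_completion takes structured messages so the model can track conversational context.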
LLMs process text in chunks called tokens, which roughly correspond to words. Each model has its own tokenization scheme and a maximum context length that your prompt and generated response together cannot exceed.
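A rough budget check might look like the sketch below. Accurate counts require the model's own tokenizer (e.g., via the `transformers` library); the 1.3-tokens-per-word ratio used here is only a loose rule of thumb for English text, and the reserved-output figure is an arbitrary example.

```python
# Rough token-budget check. The 1.3 tokens-per-word ratio is a loose
# heuristic, not the real Llama 2 tokenization.

MAX_CONTEXT_TOKENS = 4096  # Llama 2's context window

def estimate_tokens(text: str) -> int:
    return int(len(text.split()) * 1.3)

def fits_in_context(prompt: str, reserved_for_output: int = 512) -> bool:
    # The prompt and the generated response share the same window.
    return estimate_tokens(prompt) + reserved_for_output <= MAX_CONTEXT_TOKENS

print(fits_in_context("Summarize the following article: ..."))
```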
The guide includes a practical example using the Llama 2 chat model with Replicate and LangChain to set up a chat completion API.
Llama 2 models tend to be verbose, explaining their rationale. We'll explore how to manage response length effectively.
Chat Completion APIs
A chat completion request sends the LLM a list of structured messages, each with a role and content, giving the model the context or history it needs to continue the conversation.
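Under the hood, a message list is flattened into a single prompt string. The sketch below follows the published Llama 2 chat template with its `[INST]` and `<<SYS>>` markers; hosted chat_completion endpoints typically apply this formatting for you.

```python
# Flatten a structured conversation into Llama 2's chat prompt format.

def build_llama2_prompt(system: str, turns: list) -> str:
    """turns is a list of (user_message, assistant_reply) pairs;
    the final assistant_reply may be None for the turn to be generated."""
    prompt = ""
    for i, (user, assistant) in enumerate(turns):
        content = user
        if i == 0:
            # The system prompt is folded into the first user turn.
            content = f"<<SYS>>\n{system}\n<</SYS>>\n\n{user}"
        prompt += f"<s>[INST] {content} [/INST]"
        if assistant is not None:
            prompt += f" {assistant} </s>"
    return prompt

print(build_llama2_prompt(
    "Answer concisely.",
    [("What is the capital of France?", "Paris."),
     ("And of Italy?", None)],
))
```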
Temperature and top_p are two crucial hyperparameters that control how creative or deterministic the output is: temperature rescales the token probability distribution (lower values are more deterministic, higher values more random), while top_p (nucleus sampling) restricts sampling to the smallest set of tokens whose cumulative probability reaches p.
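How the two knobs interact can be shown with a toy sampler over a tiny vocabulary; the logit values are made up for illustration.

```python
import math
import random
from typing import Optional

# Toy temperature scaling + top-p (nucleus) sampling over a small vocabulary.

def sample_next_token(logits: dict, temperature: float = 0.7,
                      top_p: float = 0.9,
                      rng: Optional[random.Random] = None) -> str:
    rng = rng or random.Random()
    # Temperature rescales logits: low values sharpen the distribution
    # (more deterministic), high values flatten it (more random).
    scaled = {t: v / temperature for t, v in logits.items()}
    z = sum(math.exp(v) for v in scaled.values())
    probs = {t: math.exp(v) / z for t, v in scaled.items()}
    # Top-p keeps the smallest set of highest-probability tokens whose
    # cumulative probability reaches top_p, then samples from that set.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    r, acc = rng.random() * total, 0.0
    for token, p in kept:
        acc += p
        if r <= acc:
            return token
    return kept[-1][0]

logits = {"blue": 3.0, "gray": 1.5, "green": 0.5}
print(sample_next_token(logits, temperature=0.1, top_p=0.9))  # near-greedy: "blue"
```

At temperature 0.1 the distribution is so sharp that "blue" holds essentially all the probability mass, so top-p keeps only it; at higher temperatures "gray" and "green" re-enter the nucleus.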
Prompting Techniques
- Explicit Instructions: Detailed instructions yield better results.
- Stylization and Formatting: Adjusting the style or format of the response.
- Restrictions: Limiting sources or types of information.
- Zero-Shot and Few-Shot Prompting: Techniques using examples to guide the model's response.
- Role Prompting: Assigning a specific role to the model for more consistent responses.
- Chain-of-Thought: Encouraging step-by-step reasoning.
- Self-Consistency: Enhancing accuracy by aggregating multiple responses.
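Of the techniques above, self-consistency is the most mechanical: sample several chain-of-thought responses at a nonzero temperature, extract each final answer, and take the majority vote. The hardcoded `sampled_responses` below stand in for repeated LLM calls, and the "Answer:" suffix is an assumed response format.

```python
from collections import Counter

# Self-consistency sketch: majority vote over several sampled
# chain-of-thought answers.

def majority_answer(responses: list) -> str:
    # Assume each response ends with "Answer: <value>".
    answers = [r.rsplit("Answer:", 1)[-1].strip() for r in responses]
    return Counter(answers).most_common(1)[0][0]

# Stand-ins for three temperature-sampled model responses; one reasoning
# path made an order-of-operations mistake.
sampled_responses = [
    "15 - 6 = 9, then 9 / 3 = 3. Answer: 3",
    "First divide: 6 / 3 = 2, then 15 - 2 = 13. Answer: 13",
    "(15 - 6) / 3 = 9 / 3 = 3. Answer: 3",
]
print(majority_answer(sampled_responses))  # "3"
```

The vote suppresses the one faulty reasoning path, which is exactly the accuracy gain self-consistency aims for.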
Retrieval-Augmented Generation (RAG)
Incorporating external information into prompts for more accurate, up-to-date responses.
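A toy retrieval-augmented flow might look like the sketch below, scoring documents by keyword overlap and prepending the best match to the prompt. Production systems would use embedding similarity and a vector store instead; the document store here is a made-up example.

```python
# Minimal RAG sketch: keyword-overlap retrieval plus prompt augmentation.

DOCS = [
    "Llama 2 was released by Meta in July 2023.",
    "The Eiffel Tower is 330 metres tall.",
]

def retrieve(question: str, docs: list) -> str:
    # Naive relevance score: count of shared lowercase words.
    q = set(question.lower().split())
    return max(docs, key=lambda d: len(q & set(d.lower().split())))

def build_rag_prompt(question: str) -> str:
    context = retrieve(question, DOCS)
    return (f"Use the context to answer the question.\n"
            f"Context: {context}\nQuestion: {question}")

print(build_rag_prompt("When was Llama 2 released?"))
```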
Program-Aided Language Models (PAL)
Combining LLMs with code generation for tasks like calculations.
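The idea can be sketched as follows: instead of asking the model for a final number, ask it to emit Python, then execute that code to get the answer. `generated_code` below is a hardcoded stand-in for real model output (the word problem and its numbers are invented for illustration), and executing untrusted LLM output requires sandboxing in practice.

```python
# PAL sketch: run model-generated code to compute the answer, offloading
# arithmetic the model might otherwise get wrong.

# Stand-in for code the LLM would generate for a word problem such as:
# "A bakery made 200 loaves, sold 93 in the morning and 39 in the
# afternoon, and 6 were returned. How many loaves are left?"
generated_code = """
loaves_baked = 200
loaves_sold_morning = 93
loaves_sold_afternoon = 39
loaves_returned = 6
answer = loaves_baked - loaves_sold_morning - loaves_sold_afternoon + loaves_returned
"""

namespace = {}
exec(generated_code, namespace)  # in practice: sandbox untrusted code
print(namespace["answer"])  # 74
```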
Limiting Extraneous Tokens
Techniques to minimize unnecessary content in model responses.
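Two common tactics are to constrain the output format in the prompt (e.g., "respond in JSON only, no other text") and to post-process the response when the model adds preamble anyway. The regex extraction below is one such fallback; the verbose response string is a fabricated example of typical model chatter.

```python
import json
import re

# Post-processing fallback: pull the JSON object out of a verbose response.

def extract_json(response: str) -> dict:
    match = re.search(r"\{.*\}", response, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in response")
    return json.loads(match.group(0))

# Example of a verbose model reply despite a "JSON only" instruction.
verbose_response = 'Sure! Here is the zip code you asked for: {"zip_code": "94025"}'
print(extract_json(verbose_response))  # {'zip_code': '94025'}
```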
Additional References
- Lil'Log Prompt Engineering Guide