Where to Find GPT-2 Source Code

  1. OpenAI’s Official GitHub Repository:
  • OpenAI released parts of GPT-2’s code in their GitHub repository: github.com/openai/gpt-2.
  • This includes the original TensorFlow implementation of the model, along with utilities for downloading pre-trained weights and generating text.
  • Note: The training code and the WebText dataset were never released, and the 1.5B-parameter weights were only published in stages (the full model arrived in November 2019), but all model sizes and the sampling code are available today.
  • Usage example:
    git clone https://github.com/openai/gpt-2.git
    cd gpt-2
    pip install -r requirements.txt
    python download_model.py 124M
    python src/interactive_conditional_samples.py --model_name=124M
  2. Hugging Face Transformers Library:
  • Hugging Face provides PyTorch and TensorFlow implementations of GPT-2 that are widely used and well documented: huggingface.co/models?filter=gpt2.
  • You can load a pre-trained GPT-2 model in a few lines (a pipeline-based variant is sketched just after this list):

    from transformers import GPT2Tokenizer, GPT2LMHeadModel

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    input_ids = tokenizer.encode("Hello, world!", return_tensors="pt")
    output = model.generate(input_ids, max_length=50)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
  • Source code is in their GitHub: github.com/huggingface/transformers, under src/transformers/models/gpt2.
  3. Open-Source Replications:
  • OpenGPT-2: A community replication of the 1.5B-parameter model released in August 2019; search GitHub for "OpenGPT-2" to find the code and related forks.
  • nanoGPT by Andrej Karpathy: A minimal, educational PyTorch implementation (the model definition is roughly 300 lines) that can reproduce GPT-2 (124M) training: github.com/karpathy/nanoGPT. It's great for understanding the architecture.
  4. Minimal Implementations:
  • Projects like github.com/iVishalr/GPT offer clean, minimal PyTorch implementations of GPT-like models, including GPT-2’s architecture.
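As a complement to the snippet in item 2, the same transformers library also offers a high-level pipeline API that wraps tokenization and generation in one call. This is a minimal sketch, assuming the transformers package and PyTorch are installed; the prompt and generation parameters are illustrative only:

from transformers import pipeline

# Load GPT-2 (124M) into a text-generation pipeline; weights are downloaded on first use
generator = pipeline("text-generation", model="gpt2")

# Generate one continuation of the prompt
result = generator("Hello, world!", max_length=50, num_return_sequences=1)
print(result[0]["generated_text"])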

GPT-2 Architecture Overview

GPT-2 is a decoder-only transformer language model with up to 1.5 billion parameters (in its largest version), trained on WebText (roughly 8 million web pages, about 40 GB of text) to predict the next token in a sequence. Key components:

  • Embedding Layer: Converts input tokens to dense vectors.
  • Positional Embeddings: Learned position vectors added to the token embeddings, since the transformer itself has no inherent sense of order.
  • Transformer Blocks: Stacked layers (e.g., 12 for the 124M model, 48 for the 1.5B model; see the configuration summary after this list), each containing:
    • Masked Multi-Head Self-Attention: Each token attends only to previous tokens (causal masking).
    • Feed-Forward Network: A position-wise MLP applied to each token independently.
    • Layer Normalization: Applied before the attention and feed-forward sublayers to stabilize training.
  • Output Layer: Maps the final hidden states to vocabulary probabilities; the output projection shares weights with the token embedding.
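For reference, the four released GPT-2 sizes differ only in depth and width; all share a 50,257-token vocabulary and a 1,024-token context window. The published configurations (parameter counts approximate) are collected below as a small Python dictionary:

# Published GPT-2 model configurations (approximate parameter counts)
GPT2_CONFIGS = {
    "gpt2 (124M)":        {"n_layer": 12, "n_embd": 768,  "n_head": 12},
    "gpt2-medium (355M)": {"n_layer": 24, "n_embd": 1024, "n_head": 16},
    "gpt2-large (774M)":  {"n_layer": 36, "n_embd": 1280, "n_head": 20},
    "gpt2-xl (1.5B)":     {"n_layer": 48, "n_embd": 1600, "n_head": 25},
}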

Simplified Conceptual C++ Implementation

Below is a basic outline of a GPT-2-like model in C++. This isn’t the full 1.5B-parameter model (which would require massive memory and optimized libraries like PyTorch/TensorFlow) but illustrates the core structure. You’d need a linear algebra library (e.g., Eigen) and significant optimization for a real implementation.

#include <vector>
#include <cmath>
#include <algorithm>
#include <iostream>

// Simplified matrix class (in practice, use Eigen or similar)
class Matrix {
public:
    std::vector<std::vector<float>> data;
    int rows, cols;
    Matrix(int r, int c) : rows(r), cols(c) {
        data.resize(r, std::vector<float>(c, 0.0));
    }
    // Naive O(n^3) matrix multiply: (rows x cols) * (other.rows x other.cols)
    Matrix matmul(const Matrix& other) const {
        Matrix result(rows, other.cols);
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < other.cols; j++)
                for (int k = 0; k < cols; k++)
                    result.data[i][j] += data[i][k] * other.data[k][j];
        return result;
    }
};

// Self-Attention (simplified single-head version; real GPT-2 uses multi-head attention)
Matrix selfAttention(const Matrix& query, const Matrix& key, const Matrix& value) {
    // Scaled dot-product scores: Q * K^T / sqrt(d_k)
    Matrix attentionScores(query.rows, key.rows);
    float scale = 1.0f / std::sqrt(static_cast<float>(query.cols));
    for (int i = 0; i < query.rows; i++) {
        for (int j = 0; j < key.rows; j++) {
            float score = 0.0f;
            for (int k = 0; k < query.cols; k++) {
                score += query.data[i][k] * key.data[j][k];
            }
            attentionScores.data[i][j] = score * scale;
        }
    }
    // Apply mask (causal: no future tokens)
    for (int i = 0; i < attentionScores.rows; i++) {
        for (int j = i + 1; j < attentionScores.cols; j++) {
            attentionScores.data[i][j] = -1e9; // Negative infinity
        }
    }
    // Softmax over each row, then take the weighted sum of the value vectors
    Matrix output(query.rows, value.cols);
    for (int i = 0; i < attentionScores.rows; i++) {
        float maxScore = attentionScores.data[i][0];
        for (int j = 1; j < attentionScores.cols; j++) {
            maxScore = std::max(maxScore, attentionScores.data[i][j]);
        }
        float sum = 0.0f;
        std::vector<float> weights(attentionScores.cols);
        for (int j = 0; j < attentionScores.cols; j++) {
            weights[j] = std::exp(attentionScores.data[i][j] - maxScore);
            sum += weights[j];
        }
        for (int j = 0; j < attentionScores.cols; j++) {
            float w = weights[j] / sum;
            for (int k = 0; k < value.cols; k++) {
                output.data[i][k] += w * value.data[j][k];
            }
        }
    }
    return output;
}

class GPT2Layer {
public:
    Matrix Wq, Wk, Wv; // Weights for Q, K, V
    GPT2Layer(int d_model) : Wq(d_model, d_model), Wk(d_model, d_model), Wv(d_model, d_model) {
        // Initialize weights (random or pre-trained)
    }
    Matrix forward(const Matrix& input) {
        Matrix query = input.matmul(Wq); // Project input to queries
        Matrix key   = input.matmul(Wk); // Project input to keys
        Matrix value = input.matmul(Wv); // Project input to values
        return selfAttention(query, key, value);
    }
};

class GPT2 {
public:
    std::vector<GPT2Layer> layers;
    int vocab_size, d_model;
    GPT2(int num_layers, int vocab_size, int d_model) 
        : vocab_size(vocab_size), d_model(d_model) {
        for (int i = 0; i < num_layers; i++) {
            layers.emplace_back(d_model);
        }
    }
    Matrix forward(const std::vector<int>& tokens) {
        Matrix input(static_cast<int>(tokens.size()), d_model); // Placeholder embedding: one zero row per token (real GPT-2 looks up learned token + position embeddings)
        Matrix hidden = input;
        for (auto& layer : layers) {
            hidden = layer.forward(hidden);
        }
        return hidden; // A real model would apply a final layer norm and project to vocab-size logits here
    }
};

int main() {
    GPT2 model(12, 50257, 768); // 12 layers, GPT-2 vocab size, embedding dim
    std::vector<int> tokens = {1, 2, 3}; // Dummy input
    Matrix output = model.forward(tokens);
    std::cout << "Model ran successfully!\n";
    return 0;
}

Explanation of Simplified Code

  • Matrix Class: A minimal dense-matrix container with a naive O(n³) multiply (real implementations use optimized, GPU-accelerated tensor libraries).
  • Self-Attention: Computes scaled dot-product scores, applies a causal mask so no token attends to a future position, and takes a softmax-weighted sum of the value vectors; multi-head splitting is omitted for brevity.
  • GPT2Layer: One transformer block with Q/K/V projections and attention (feed-forward network and layer normalization are skipped).
  • GPT2: Stacks layers, processes token embeddings, and outputs hidden states.
  • Limitations: This lacks training logic, real embeddings, tokenization, and any performance optimization; the full GPT-2 has up to 1.5 billion parameters and realistically requires GPU support (the logits sketch after this list shows what the real output layer produces).
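To see what the missing output layer produces in the real model, the Hugging Face implementation exposes per-position vocabulary logits directly. A minimal greedy next-token sketch, assuming the transformers and torch packages are installed:

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Logits over the 50,257-token vocabulary for every input position
inputs = tokenizer("Hello, world", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits           # shape: (1, seq_len, 50257)

# Greedy choice of the next token from the last position's logits
next_token_id = int(logits[0, -1].argmax())
print(tokenizer.decode([next_token_id]))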

Getting the Real Deal

For the actual GPT-2 source:

  • Hugging Face: Clone transformers and explore src/transformers/models/gpt2/modeling_gpt2.py for the PyTorch details (a quick way to inspect the model's hyperparameters is sketched after this list).
  • OpenAI Repo: Use the TensorFlow code in gpt-2/src/model.py for the original structure.
  • Training Data: WebText isn’t public, but OpenWebText (a replication) is available via community efforts.
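If you mainly want the architecture hyperparameters rather than the full modeling code, the transformers configuration object exposes them directly. A minimal sketch, assuming transformers and PyTorch are installed:

from transformers import GPT2Config, GPT2LMHeadModel

# The default GPT2Config matches the 124M model
config = GPT2Config()
print(config.n_layer, config.n_head, config.n_embd, config.vocab_size)  # 12 12 768 50257

# The same fields are available on any loaded model via model.config
model = GPT2LMHeadModel.from_pretrained("gpt2")
print(model.config.n_positions)  # context window: 1024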

This gives you access to functional GPT-2 code or a starting point to build your own.