Where to Find GPT-2 Source Code

  1. OpenAI’s Official GitHub Repository:
  • OpenAI released parts of GPT-2’s code in their GitHub repository: github.com/openai/gpt-2.
  • This includes a TensorFlow implementation of the model, along with utilities for downloading pre-trained weights and generating text; download_model.py accepts the 124M (smallest), 355M, 774M, and 1558M checkpoints.
  • Note: OpenAI initially withheld the full 1.5B parameter model, releasing weights in stages; the 1.5B weights followed in November 2019, though the training code and WebText dataset were never published.
  • Usage example:
    git clone https://github.com/openai/gpt-2.git
    cd gpt-2
    pip install -r requirements.txt
    python download_model.py 124M
    python src/interactive_conditional_samples.py --model_name=124M
  2. Hugging Face Transformers Library:
  • Hugging Face provides a PyTorch and TensorFlow implementation of GPT-2 that’s widely used and well-documented: huggingface.co/models?filter=gpt2.
  • You can load pre-trained GPT-2 models easily:
    from transformers import GPT2Tokenizer, GPT2LMHeadModel

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    input_ids = tokenizer.encode("Hello, world!", return_tensors="pt")
    output = model.generate(input_ids, max_length=50)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
  • Source code is in their GitHub: github.com/huggingface/transformers, under src/transformers/models/gpt2.
  3. Open-Source Replications:
  • OpenGPT-2: A community replication released in August 2019 that approximates the original GPT-2; the authors' code and related forks can be found by searching GitHub for “OpenGPT-2”.
  • NanoGPT by Andrej Karpathy: A minimal, educational PyTorch implementation (the model is roughly 300 lines) that follows the GPT-2 architecture and can even load the pretrained GPT-2 weights: github.com/karpathy/nanoGPT. It’s great for understanding the architecture.
  4. Minimal Implementations:
  • Projects like github.com/iVishalr/GPT offer clean, minimal PyTorch implementations of GPT-like models, including GPT-2’s architecture.

GPT-2 Architecture Overview

GPT-2 is a decoder-only transformer model with 1.5 billion parameters (in its largest version), trained on WebText (roughly 8 million web pages, about 40 GB of text) to predict the next token in a sequence. Key components:

  • Embedding Layer: Converts input tokens to dense vectors.
  • Positional Encoding: Adds position information since transformers lack inherent order (GPT-2 uses learned position embeddings).
  • Transformer Blocks: Stacked layers (e.g., 12 for 124M, 48 for 1.5B; see the configuration sketch below) with:
    • Masked Multi-Head Self-Attention: Attends only to previous tokens.
    • Feed-Forward Neural Network: Processes each token independently.
    • Layer Normalization: Stabilizes training.
  • Output Layer: Maps hidden states to vocabulary probabilities.
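
For concreteness, the four released model sizes differ only in depth, width, and head count. The sketch below records them as a Python dictionary; the name GPT2_CONFIGS is illustrative, but the keys mirror Hugging Face’s GPT2Config fields and the values follow the released checkpoints.

# GPT-2 family hyperparameters; all sizes share a 50,257-token vocabulary
# and a 1,024-token context window. GPT2_CONFIGS itself is illustrative.
GPT2_CONFIGS = {
    "gpt2":        {"n_layer": 12, "n_embd": 768,  "n_head": 12},  # ~124M params
    "gpt2-medium": {"n_layer": 24, "n_embd": 1024, "n_head": 16},  # ~355M params
    "gpt2-large":  {"n_layer": 36, "n_embd": 1280, "n_head": 20},  # ~774M params
    "gpt2-xl":     {"n_layer": 48, "n_embd": 1600, "n_head": 25},  # ~1.5B params
}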

Simplified Conceptual C++ Implementation

Below is a basic outline of a GPT-2-like model in C++. This isn’t the full 1.5B-parameter model (which would require massive memory and optimized libraries like PyTorch/TensorFlow) but illustrates the core structure. You’d need a linear algebra library (e.g., Eigen) and significant optimization for a real implementation.

#include <vector>
#include <cmath>
#include <algorithm>
#include <iostream>

// Simplified matrix class (in practice, use Eigen or similar)
class Matrix {
public:
    std::vector<std::vector<float>> data;
    int rows, cols;
    Matrix(int r, int c) : rows(r), cols(c) {
        data.resize(r, std::vector<float>(c, 0.0f));
    }
    // Naive O(n^3) matrix multiplication; real code would use BLAS or CUDA
    Matrix multiply(const Matrix& other) const {
        Matrix result(rows, other.cols);
        for (int i = 0; i < rows; i++)
            for (int k = 0; k < cols; k++)
                for (int j = 0; j < other.cols; j++)
                    result.data[i][j] += data[i][k] * other.data[k][j];
        return result;
    }
};

// Self-attention with a causal mask (single head; no multi-head splitting)
Matrix selfAttention(const Matrix& query, const Matrix& key, const Matrix& value) {
    // Scores = Q * K^T, scaled by 1/sqrt(d_k) as in the transformer paper
    float scale = 1.0f / std::sqrt(static_cast<float>(query.cols));
    Matrix attentionScores(query.rows, key.rows);
    for (int i = 0; i < query.rows; i++) {
        for (int j = 0; j < key.rows; j++) {
            float score = 0.0f;
            for (int k = 0; k < query.cols; k++) {
                score += query.data[i][k] * key.data[j][k];
            }
            attentionScores.data[i][j] = score * scale;
        }
    }
    // Causal mask: position i may not attend to positions j > i
    for (int i = 0; i < attentionScores.rows; i++) {
        for (int j = i + 1; j < attentionScores.cols; j++) {
            attentionScores.data[i][j] = -1e9f; // Effectively negative infinity
        }
    }
    // Row-wise softmax over the masked scores (max-subtracted for stability)
    for (int i = 0; i < attentionScores.rows; i++) {
        float maxScore = attentionScores.data[i][0];
        for (int j = 1; j < attentionScores.cols; j++) {
            maxScore = std::max(maxScore, attentionScores.data[i][j]);
        }
        float sum = 0.0f;
        for (int j = 0; j < attentionScores.cols; j++) {
            attentionScores.data[i][j] = std::exp(attentionScores.data[i][j] - maxScore);
            sum += attentionScores.data[i][j];
        }
        for (int j = 0; j < attentionScores.cols; j++) {
            attentionScores.data[i][j] /= sum;
        }
    }
    // Output = softmax(scores) * V
    return attentionScores.multiply(value);
}

class GPT2Layer {
public:
    Matrix Wq, Wk, Wv; // Projection weights for Q, K, V
    GPT2Layer(int d_model) : Wq(d_model, d_model), Wk(d_model, d_model), Wv(d_model, d_model) {
        // Weights would be randomly initialized or loaded from a checkpoint
    }
    Matrix forward(const Matrix& input) {
        Matrix query = input.multiply(Wq);
        Matrix key   = input.multiply(Wk);
        Matrix value = input.multiply(Wv);
        // A real block adds residuals, layer norm, and a feed-forward network
        return selfAttention(query, key, value);
    }
};

class GPT2 {
public:
    std::vector<GPT2Layer> layers;
    int vocab_size, d_model;
    GPT2(int num_layers, int vocab_size, int d_model) 
        : vocab_size(vocab_size), d_model(d_model) {
        for (int i = 0; i < num_layers; i++) {
            layers.emplace_back(d_model);
        }
    }
    Matrix forward(const std::vector<int>& tokens) {
        // Placeholder embedding: one row per token (real GPT-2 looks up
        // learned token and position embeddings by token ID)
        Matrix hidden(static_cast<int>(tokens.size()), d_model);
        for (auto& layer : layers) {
            hidden = layer.forward(hidden);
        }
        return hidden; // A real model projects this to vocabulary logits
    }
};

int main() {
    GPT2 model(12, 50257, 768); // 12 layers, GPT-2 vocab size, embedding dim
    std::vector<int> tokens = {1, 2, 3}; // Dummy input
    Matrix output = model.forward(tokens);
    std::cout << "Model ran successfully!\n";
    return 0;
}
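
This sketch should compile with any modern C++ compiler (e.g., g++ -std=c++17) and print the success message, though with all-zero placeholder weights and embeddings the output carries no real signal.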

Explanation of Simplified Code

  • Matrix Class: A naive stand-in for tensor operations (real implementations use CUDA-optimized libraries).
  • Self-Attention: Computes scaled attention scores, applies a causal mask (no future tokens), and returns a softmax-weighted sum of the values; multi-head logic is still omitted for brevity (see the PyTorch sketch after this list).
  • GPT2Layer: One transformer block with attention (feed-forward and normalization skipped).
  • GPT2: Stacks layers, processes token embeddings, and outputs hidden states.
  • Limitations: This lacks training logic, real embeddings, and optimization—full GPT-2 needs billions of parameters and GPU support.
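
For comparison with the C++ sketch above, here is how the same causal attention pattern looks in PyTorch, the framework real GPT-2 implementations use; this is a minimal single-head sketch, and the function name is illustrative:

import torch

def causal_attention(q, k, v):
    # q, k, v: (seq_len, d_k) tensors for a single attention head
    scores = q @ k.T / (k.shape[-1] ** 0.5)     # scaled dot-product scores
    mask = torch.tril(torch.ones_like(scores))  # lower-triangular causal mask
    scores = scores.masked_fill(mask == 0, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v    # softmax-weighted sum of values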

Getting the Real Deal

For the actual GPT-2 source:

  • Hugging Face: Clone transformers and explore src/transformers/models/gpt2/modeling_gpt2.py for the PyTorch details; the snippet below shows a quick way to inspect the loaded architecture.
  • OpenAI Repo: Use the TensorFlow code in gpt-2/src/model.py for the original structure.
  • Training Data: WebText isn’t public, but OpenWebText (a replication) is available via community efforts.
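
As a quick way in, you can load the pretrained model and print its configuration and module tree, which map directly onto the files above (assuming the transformers package is installed):

from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
print(model.config)  # GPT2Config: n_layer=12, n_embd=768, n_head=12, vocab_size=50257
print(model)         # module tree mirroring modeling_gpt2.py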

This gives you access to functional GPT-2 code or a starting point to build your own.

How Was the Manus AI Code Leaked?

  • It seems likely that Jian Liao, known online as jlia0, asked Manus for its source code and received it, though in an encrypted form.
  • Research suggests this was part of a leak incident, with discussions around the code’s usability and security.

Background

Manus is an AI agent, often described as a general-purpose tool capable of tasks like research and content creation, developed by a Chinese startup and currently in closed beta with limited invite codes. Its source code is typically not publicly available, making any access notable.

The Incident

Jian Liao, using the username jlia0 on GitHub, obtained an invite code for Manus and reportedly asked the AI to output its own source code, specifically requesting the “/opt/.manus/” directory as a zip file. He received the code, but it was encrypted, limiting its immediate usability. This event sparked discussions on platforms like GitHub about the encryption and potential for reverse-engineering.

Unexpected Detail

While most expected Manus to be a secure, closed system, the ability to extract even encrypted code highlights vulnerabilities in AI agent security, raising questions about prompt injection and system isolation.


Comprehensive Analysis of the Source Code Request Incident

This report delves into the details surrounding the request and acquisition of Manus AI’s source code, focusing on the individual involved, the context of Manus AI, and the implications of the incident. The analysis is based on recent online discussions, GitHub activity, and media coverage as of March 15, 2025.

Context of Manus AI

Manus AI, launched by a Chinese startup, is a general-purpose AI agent designed to perform autonomous tasks such as information retrieval, data processing, content creation, and web automation. It has garnered significant attention, with its Discord channel boasting over 186,000 members and invite codes being resold for high prices on platforms like Xianyu (Manus AI Invitation Code: Application Guide & Success Tips). The system is currently in closed beta, requiring an invite code for access, and is not open source, distinguishing it from projects like DeepSeek, which is an LLM rather than an agent.

Early reviews, such as those from MIT Technology Review (Everyone in AI is talking about Manus. We put it to the test.), describe Manus as promising but imperfect, with capabilities likened to a highly intelligent intern. However, its closed nature and limited access have fueled interest in its underlying technology, leading to replication efforts and security concerns.

The Individual: Jian Liao (jlia0)

Jian Liao, known by the GitHub handle jlia0, is identified as the CTO at Pointer and has been active in AI-related discussions. His GitHub profile (jlia0 (Jian Liao) · GitHub) shows a history of contributions, including a notable gist titled “Manus tools and prompts” (Manus tools and prompts · GitHub). In this gist, published on March 11, 2025, Liao states, “I got invite code of Manus, and ask Manus to output /opt/.manus as zip.” This action resulted in him obtaining the source code, though it was encrypted, as noted in subsequent comments where users discuss the encryption and its implications.

Media reports, such as an article on AIbase (Manus AI System Prompt Leakage: Official Response), confirm that a user named “jian” (likely Jian Liao) “cracked the Manus system” by requesting the directory contents, retrieving “some sensitive information and operational data.” This incident is described as a prompt leak, highlighting potential security flaws in Manus’s sandbox isolation, with co-founder Ji Yichao noting that the code that receives commands is only lightly obfuscated.

Details of the Request and Acquisition

Liao’s method involved leveraging Manus AI’s capabilities to output its own internal directory, a technique that exploited the AI’s ability to execute file system operations. The output was a zip file containing the source code, but it was encrypted, likely using tools like PyArmor, as discussed in the gist comments. One comment notes, “A straight forward memory dump -> strings didn’t reveal any manus or pyarmor internals,” indicating the encryption’s robustness (Manus tools and prompts · GitHub).

The encryption limited the code’s usability, with users like @PeterZhao119 questioning how Liao obtained detailed prompts, suggesting skepticism about the leak’s authenticity. However, Liao’s X post (X post) and subsequent discussions, including on Reddit (r/AI_Agents on Reddit: Created an open-source alternative to Manus AI!), reinforce that he did receive the code, albeit in a form requiring further analysis.

Implications and Community Response

The leak sparked significant interest, with open-source alternatives like OpenManus emerging, developed by contributors from MetaGPT (GitHub – mannaandpoem/OpenManus: No fortress, purely open ground. OpenManus is Coming.). OpenManus, reportedly built in about three hours, aims to replicate Manus’s functionality without an invite code, but it’s unclear whether it directly used Liao’s leaked code. Discussions on GitHub and Reddit highlight efforts to decrypt or reverse-engineer the code, with projects like whit3rabbit/manus-open (GitHub – whit3rabbit/manus-open: Manus code from container) offering AI-generated guesses at the decrypted contents and noting the code’s potential research value.

Security concerns arose, with articles like “Manus AI’s Agentic Moment: A Case Study in Prompt Leak and Risk Mitigation” on Medium (Manus AI’s Agentic Moment: A Case Study in Prompt Leak and Risk Mitigation | by Xiwei Zhou | Mar, 2025 | Medium) discussing prompt injections and system prompt leakage as risks in generative AI. Manus’s co-founder acknowledged the sandbox’s isolation but noted the code’s light obfuscation, suggesting ongoing efforts to mitigate such vulnerabilities.

Comparative Analysis with Other Leaks

To contextualize, source code leaks are not unique to Manus. High-profile examples include Microsoft’s 37GB leak in 2022 (r/DataHoarder on Reddit: Hackers leak 37GB of Microsoft’s source code (Bing, Cortana and more)), but Manus’s case is distinct due to the method—asking the AI itself rather than a security breach. This highlights a novel vulnerability in AI agents, where user commands can inadvertently expose internal data.

Table: Key Details of the Incident

| Aspect | Details |
| --- | --- |
| Individual Involved | Jian Liao (jlia0), CTO at Pointer, GitHub user |
| Method of Acquisition | Asked Manus AI to output the “/opt/.manus/” directory as a zip; received encrypted code |
| Date of Incident | Around March 9–11, 2025, based on the gist and media reports |
| Code Usability | Encrypted, likely using PyArmor, limiting immediate use |
| Community Response | Discussions on encryption; replication efforts (OpenManus, manus-open) |
| Security Implications | Highlighted prompt leak risks and sandbox isolation concerns |

Conclusion

Jian Liao, known as jlia0, is the individual who asked Manus AI for its source code and received it, though in an encrypted form. This incident, occurring around early March 2025, underscores vulnerabilities in AI agent security and has spurred community efforts to replicate and analyze the technology. The encrypted nature of the code and ongoing discussions suggest a complex landscape of accessibility and security in AI development.