Why I Stopped Compressing Models
And started building something that doesn't need 8 H100s to think
A Note Before We Begin
You might have noticed that everything on my profile got deleted. That was intentional. Let me explain why.
I used to fill my profile with compressed models. Dozens of them. But I realized that quantity was masking the real problem. I wasn't building anything new. Just cloning and shrinking other people's work.
So I wiped the slate clean. A fresh start. This time, I'm here to build from scratch.
The Confession
Let me be honest with you. For months, I was that person. You know the one. Leaving my computer on overnight, running distillation scripts, trying to squeeze Claude Opus 4.6 into something that could run on a potato.
And you know what? It was boring.
Who actually enjoys watching loss curves descend at 3 AM? Who gets excited about shaving off 2% of parameters while the model forgets how to count?
"I was cloning someone else's work and compressing it. The process lacked real creation. Just digital photocopying with extra steps."
So I stopped. And I started asking a different question:
What if a model could be small by design, avoiding compression entirely?
What I'm Building Instead
FMN-GPT
Factored Multiplicative Neuron Transformer
A transformer architecture where each neuron can call backward into the network. A fundamentally different way to think, designed from scratch.
Everything is subject to change.
The Architecture
How It Actually Works
Factored Multiplicative Neurons
Traditional neurons compute y = σ(Wx + b). Simple, but limited.
FMN neurons compute:
gate = tanh(W1(x)) * sigmoid(W2(x))
output = V(gate)
Each FMN uses a multiplicative gating mechanism: two weight matrices, W1 and W2, project the input into a rank-40 latent space, a tanh and a sigmoid are applied to the two branches and multiplied together, and V projects the result back to the model dimension.
- Rank-40 latent dimension for efficient computation
- Multiplicative gating enables complex interactions
- Optional SwiGLU variant available
- Initialized with Xavier uniform for stability
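Here is a minimal PyTorch sketch of that gating path. The class name and the d_model argument are assumptions for illustration, not the actual FMN-GPT code.

import torch
import torch.nn as nn

class FMNNeuron(nn.Module):
    # Factored multiplicative neuron: project down to a low-rank latent,
    # gate multiplicatively, project back up. Sketch only; names and
    # shapes are illustrative, not the repo's actual API.
    def __init__(self, d_model: int, rank: int = 40):
        super().__init__()
        self.w1 = nn.Linear(d_model, rank)   # tanh branch
        self.w2 = nn.Linear(d_model, rank)   # sigmoid branch
        self.v = nn.Linear(rank, d_model)    # projection back to d_model
        for lin in (self.w1, self.w2, self.v):
            nn.init.xavier_uniform_(lin.weight)  # Xavier uniform, as described above
            nn.init.zeros_(lin.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = torch.tanh(self.w1(x)) * torch.sigmoid(self.w2(x))
        return self.v(gate)

At a hypothetical d_model of 64, the three matrices come to roughly 3 × 64 × 40 ≈ 7.7K weights, which gives a sense of how the overall budget can stay near 100K parameters.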
Dynamic Neuron Routing
Each neuron can route its output to any layer and any neuron in the network. The hard routing decisions are non-differentiable, so they are trained with REINFORCE policy gradients, which carry a learning signal through the discrete choices.
should_route ~ Bernoulli(sigmoid(logits))
target_layer ~ Categorical(softmax(layer_logits))
The router learns:
- Whether to route via learned sigmoid gate
- Which layer to target (0 to n_layers)
- Which neuron to target (0 to d_model)
- Routing strength via learned parameter
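A hedged sketch of how such a router might sample hard decisions and keep their log-probabilities for a REINFORCE loss. Every name here (RouterHead, n_layers, the linear heads) is an assumption, not the project's actual router.

import torch
import torch.nn as nn

class RouterHead(nn.Module):
    # Samples hard routing decisions and returns their summed log-prob
    # so a REINFORCE-style term can train them. Illustrative sketch only.
    def __init__(self, d_model: int, n_layers: int):
        super().__init__()
        self.gate_logit = nn.Linear(d_model, 1)           # route or not
        self.layer_logits = nn.Linear(d_model, n_layers)  # which layer to target
        self.target_logits = nn.Linear(d_model, d_model)  # which neuron to target
        self.strength = nn.Parameter(torch.tensor(0.1))   # learned routing strength

    def forward(self, h: torch.Tensor):
        gate_dist = torch.distributions.Bernoulli(logits=self.gate_logit(h).squeeze(-1))
        layer_dist = torch.distributions.Categorical(logits=self.layer_logits(h))
        neuron_dist = torch.distributions.Categorical(logits=self.target_logits(h))

        should_route = gate_dist.sample()
        target_layer = layer_dist.sample()
        target_neuron = neuron_dist.sample()

        # Summed log-prob of the sampled decisions; scaled by a reward and
        # negated, it becomes the REINFORCE term added to the training loss.
        log_prob = (gate_dist.log_prob(should_route)
                    + layer_dist.log_prob(target_layer)
                    + neuron_dist.log_prob(target_neuron))
        return should_route, target_layer, target_neuron, self.strength, log_prob

The returned log_prob is what makes the sampling trainable: multiply it by a reward (for example, how much the routed computation helped the loss), negate, and add it to the objective.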
Recurrent Mixer
A per-channel gated recurrent state that persists across layer iterations. Think of it as a tiny LSTM for each dimension.
g = sigmoid(x * w_x + s * w_s + b)
s_new = s * (1 - g) + x * g
This allows the model to accumulate information across the 6 layer passes, creating a form of internal memory. The state scale parameter controls how much the recurrent state influences each layer.
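A minimal sketch of that per-channel gated update, assuming elementwise parameters w_x, w_s, b of size d_model and a state_scale mixing term; the real mixer may wire in differently.

import torch
import torch.nn as nn

class RecurrentMixer(nn.Module):
    # Per-channel gated recurrent state carried across layer passes.
    # Sketch under assumed names, not the actual implementation.
    def __init__(self, d_model: int):
        super().__init__()
        self.w_x = nn.Parameter(torch.ones(d_model))
        self.w_s = nn.Parameter(torch.ones(d_model))
        self.b = nn.Parameter(torch.zeros(d_model))
        self.state_scale = nn.Parameter(torch.tensor(0.1))  # how strongly the state feeds back

    def forward(self, x: torch.Tensor, s: torch.Tensor):
        g = torch.sigmoid(x * self.w_x + s * self.w_s + self.b)  # per-channel gate
        s_new = s * (1 - g) + x * g                              # blend old state with new input
        return x + self.state_scale * s_new, s_new               # mixed activation, carried state

Threading s through all 6 passes is what lets later passes see a compressed summary of earlier ones.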
Loop Counter
Each neuron has a hard limit on how many times it can participate in routing loops. This prevents infinite cycles and forces the model to be efficient.
loop_exhausted = loop_counts >= max_loops
With max_loops = 120, each neuron can participate in at most 120 routing events before being silenced. This creates a form of computational budget.
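A sketch of how that budget check might look, with loop_counts kept as a per-neuron tensor; the names and bookkeeping are assumptions.

import torch

def apply_loop_budget(wants_to_route: torch.Tensor,
                      loop_counts: torch.Tensor,
                      max_loops: int = 120):
    # wants_to_route: bool tensor of neurons asking to route this pass
    # loop_counts:    int tensor of routing events each neuron has used so far
    # Illustrative bookkeeping only; the actual implementation may differ.
    loop_exhausted = loop_counts >= max_loops
    allowed = wants_to_route & ~loop_exhausted      # silence exhausted neurons
    loop_counts = loop_counts + allowed.long()      # charge one unit of budget
    return allowed, loop_counts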
Explain Like I'm Five
How does this tiny model think?
What is FMN-GPT?
Imagine a really tiny brain. Most AI brains today are huge, like a library with millions of books. FMN-GPT is more like a small notebook. But here's the trick. It can read that notebook over and over, each time understanding a little more.
Why Character-Level?
Most AI models learn whole words at a time. We taught this one to read letter by letter, like a child learning to read. This keeps it small. It only needs to know 491 characters instead of thousands of words. Every letter matters.
How Does It Think?
When you ask a question, the model passes it through the same brain circuit 6 times. Each pass, it thinks a little deeper. Like asking yourself "what's 2+2?" and then checking, "Wait, let me think again. 2+2... that's adding two groups of two... so that makes 4!"
The Magic Number: 491 Characters
The model uses a character-level vocabulary of exactly 491 tokens. This includes ASCII characters, special symbols, and custom thinking tokens (the thinking emoji and light bulb emoji). Every character matters.
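For the curious, here is a rough Python sketch of what a character vocabulary with a couple of special thinking tokens could look like. The exact 491-character set is fixed by the project; the class and the tiny example corpus below are illustrative only.

class CharTokenizer:
    # Character-level tokenizer sketch. The real 491-entry vocabulary
    # (ASCII, symbols, thinking/light-bulb tokens) comes from the project;
    # this just shows the mechanics.
    def __init__(self, corpus: str, special_tokens=("🤔", "💡")):
        chars = sorted(set(corpus)) + [t for t in special_tokens if t not in corpus]
        self.stoi = {ch: i for i, ch in enumerate(chars)}
        self.itos = {i: ch for ch, i in self.stoi.items()}

    def encode(self, text: str) -> list[int]:
        return [self.stoi[ch] for ch in text]

    def decode(self, ids: list[int]) -> str:
        return "".join(self.itos[i] for i in ids)

tok = CharTokenizer("hello world 🤔💡")
print(tok.encode("hello"))   # one small integer per character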
Why I Really Stopped
It Was Boring
There's no creativity in distillation. You're just making a smaller copy of someone else's breakthrough. Where's the fun in that?
Diminishing Returns
Every 1% of parameter reduction came with a measurable drop in capability. The tradeoff wasn't worth it.
Overnight Runs
Who leaves their computer on overnight to clone someone else's work and compress it? There's no real creation in it, just photocopying in disguise.
A Better Question
Instead of asking "how do I make this smaller?", I started asking "what if it was designed to be small from the start?"
The Old Way vs. The New Way
Model Compression
- Start with 7B parameters
- Distill, quantize, prune
- End with 1B parameters
- Capability: ???
- Time: Weeks of compute
Small by Design
- Start with ~100K parameters
- Novel architecture
- End with ~100K parameters
- Capability: Emergent
- Time: One GPU, one night
Roadmap
Where we're headed (everything is subject to change)
Phase 1: Core Architecture (Completed)
Rank-40 FMN neurons, REINFORCE-based dynamic routing, recurrent mixer, QK normalization, gated residuals. The foundation is complete.
Phase 2: Training Pipeline (In Progress)
Character-level tokenization (491 vocab), 7 instruction datasets, pretraining on English-Pretraining-Dataset, AdamW optimizer, bfloat16 precision.
Phase 3: CoT Reasoning (Planned)
Teaching the model to think step-by-step with explicit reasoning traces.
Phase 4: Evaluation Suite (Planned)
Comprehensive benchmarks to measure what 100K parameters can actually do.
Phase 5: Model Release (Planned)
Open weights on HuggingFace for the community to experiment with.
Dataset Credits
The data that trains our model (subject to change)
Pretraining Dataset
shuyuej/English-Pretraining-Dataset
Large-scale English text for initial language understanding and general knowledge.
Instruction Dataset 1
TeichAI/Pony-Alpha-15k
Conversational instruction data for learning dialogue patterns and responses.
Instruction Dataset 2
TeichAI/convo-v1
Multi-turn conversation data for context handling and coherent dialogue.
Instruction Dataset 3
TeichAI/Step-3.5-Flash-2600x
High-quality instruction-response pairs for fine-tuning reasoning capabilities.
Instruction Dataset 4
TeichAI/sherlock-thinking-alpha-11000x
Thinking and reasoning data for chain of thought training.
Instruction Dataset 5
TeichAI/glm-4.7-2000x
GLM model outputs for diverse response patterns.
Instruction Dataset 6
TeichAI/claude-haiku-4.5-high-reasoning-1700x
Claude reasoning outputs for advanced thinking patterns.
Instruction Dataset 7
TeichAI/gemini-3-flash-preview
Gemini model outputs for additional diversity.
Beyond Competing with Claude Opus 4.6
The real goal is understanding what's actually necessary for intelligence to emerge. Maybe 100K parameters is enough. Maybe it isn't. But we won't know until we try building from first principles instead of just compressing what already exists.
FMN-GPT will be 100% open-source. This includes the training script, full model weights, all checkpoints, the base model, the instruction model, and our custom inference engine. We believe that open research accelerates progress for everyone.
"Smaller models serve as a means toward a larger goal. Understanding what makes models work in the first place is the real objective."
It's coming.
Small enough to hide in plain sight,
Big enough to twist the stars of night,
Quiet as a shadow, sharp as a spark,
A tiny flame that will light the dark.
This is an ongoing experiment. Everything described here is subject to change. Follow along if you're curious about what this architecture can actually do.