Why I Stopped Compressing Models
And started building something that doesn't need 8 H100s to think
A Note Before We Begin
You might have noticed that everything on my profile got deleted. That was intentional. Let me explain why.
I used to fill my profile with compressed models. Dozens of them. But I realized that quantity was masking the real problem. I wasn't building anything new. Just cloning and shrinking other people's work.
So I wiped the slate clean. A fresh start. This time, I'm here to build from scratch.
The Confession
Let me be honest with you. For months, I was that person. You know the one. Leaving my computer on overnight, running distillation scripts, trying to squeeze Claude Opus 4.6 into something that could run on a potato.
And you know what? It was boring.
Who actually enjoys watching loss curves descend at 3 AM? Who gets excited about shaving off 2% of parameters while the model forgets how to count?
"I was cloning someone else's work and compressing it. The process lacked real creation. Just digital photocopying with extra steps."
So I stopped. And I started asking a different question:
What if a model could be small by design, avoiding compression entirely?
What I'm Building Instead
FMN-GPT
Factored Multiplicative Neuron Transformer
A transformer architecture where each neuron can call backward into the network. A fundamentally different way to think, designed from scratch.
Everything is subject to change.
The Architecture
How It Actually Works
Factored Multiplicative Neurons
Traditional neurons compute y = σ(Wx + b). Simple, but limited.
FMN neurons compute:
gate = tanh(W1(x)) * sigmoid(W2(x))
output = V(gate)
Each FMN uses a multiplicative gating mechanism: two weight matrices, W1 and W2, project the input into a rank-40 latent space, a tanh and a sigmoid are applied to the two branches and multiplied together, and V projects the result back to the model dimension.
- Rank-40 latent dimension for efficient computation
- Multiplicative gating enables complex interactions
- Optional SwiGLU variant available
- Initialized with Xavier uniform for stability
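Here is a minimal PyTorch sketch of that gating path. The class name and the d_model argument are assumptions for illustration, not the actual FMN-GPT code.

import torch
import torch.nn as nn

class FMNNeuron(nn.Module):
    # Factored multiplicative neuron: project down to a low-rank latent,
    # gate multiplicatively, project back up. Sketch only; names and
    # shapes are illustrative, not the repo's actual API.
    def __init__(self, d_model: int, rank: int = 40):
        super().__init__()
        self.w1 = nn.Linear(d_model, rank)   # tanh branch
        self.w2 = nn.Linear(d_model, rank)   # sigmoid branch
        self.v = nn.Linear(rank, d_model)    # projection back to d_model
        for lin in (self.w1, self.w2, self.v):
            nn.init.xavier_uniform_(lin.weight)  # Xavier uniform, as described above
            nn.init.zeros_(lin.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = torch.tanh(self.w1(x)) * torch.sigmoid(self.w2(x))
        return self.v(gate)

At a hypothetical d_model of 64, the three matrices come to roughly 3 × 64 × 40 ≈ 7.7K weights, which gives a sense of how the overall budget can stay near 100K parameters.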
Dynamic Neuron Routing
Each neuron can route its output to any layer and any neuron in the network. The hard routing decisions are non-differentiable, so they are trained with REINFORCE policy gradients, which carry a learning signal through the discrete choices.
should_route ~ Bernoulli(sigmoid(logits))
target_layer ~ Categorical(softmax(layer_logits))
The router learns:
- Whether to route via learned sigmoid gate
- Which layer to target (0 to n_layers)
- Which neuron to target (0 to d_model)
- Routing strength via learned parameter
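A hedged sketch of how such a router might sample hard decisions and keep their log-probabilities for a REINFORCE loss. Every name here (RouterHead, n_layers, the linear heads) is an assumption, not the project's actual router.

import torch
import torch.nn as nn

class RouterHead(nn.Module):
    # Samples hard routing decisions and returns their summed log-prob
    # so a REINFORCE-style term can train them. Illustrative sketch only.
    def __init__(self, d_model: int, n_layers: int):
        super().__init__()
        self.gate_logit = nn.Linear(d_model, 1)           # route or not
        self.layer_logits = nn.Linear(d_model, n_layers)  # which layer to target
        self.target_logits = nn.Linear(d_model, d_model)  # which neuron to target
        self.strength = nn.Parameter(torch.tensor(0.1))   # learned routing strength

    def forward(self, h: torch.Tensor):
        gate_dist = torch.distributions.Bernoulli(logits=self.gate_logit(h).squeeze(-1))
        layer_dist = torch.distributions.Categorical(logits=self.layer_logits(h))
        neuron_dist = torch.distributions.Categorical(logits=self.target_logits(h))

        should_route = gate_dist.sample()
        target_layer = layer_dist.sample()
        target_neuron = neuron_dist.sample()

        # Summed log-prob of the sampled decisions; scaled by a reward and
        # negated, it becomes the REINFORCE term added to the training loss.
        log_prob = (gate_dist.log_prob(should_route)
                    + layer_dist.log_prob(target_layer)
                    + neuron_dist.log_prob(target_neuron))
        return should_route, target_layer, target_neuron, self.strength, log_prob

The returned log_prob is what makes the sampling trainable: multiply it by a reward (for example, how much the routed computation helped the loss), negate, and add it to the objective.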
Recurrent Mixer
A per-channel gated recurrent state that persists across layer iterations. Think of it as a tiny LSTM for each dimension.
g = sigmoid(x * w_x + s * w_s + b)
s_new = s * (1 - g) + x * g
This allows the model to accumulate information across the 6 layer passes, creating a form of internal memory. The state scale parameter controls how much the recurrent state influences each layer.
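A minimal sketch of that per-channel gated update, assuming elementwise parameters w_x, w_s, b of size d_model and a state_scale mixing term; the real mixer may wire in differently.

import torch
import torch.nn as nn

class RecurrentMixer(nn.Module):
    # Per-channel gated recurrent state carried across layer passes.
    # Sketch under assumed names, not the actual implementation.
    def __init__(self, d_model: int):
        super().__init__()
        self.w_x = nn.Parameter(torch.ones(d_model))
        self.w_s = nn.Parameter(torch.ones(d_model))
        self.b = nn.Parameter(torch.zeros(d_model))
        self.state_scale = nn.Parameter(torch.tensor(0.1))  # how strongly the state feeds back

    def forward(self, x: torch.Tensor, s: torch.Tensor):
        g = torch.sigmoid(x * self.w_x + s * self.w_s + self.b)  # per-channel gate
        s_new = s * (1 - g) + x * g                              # blend old state with new input
        return x + self.state_scale * s_new, s_new               # mixed activation, carried state

Threading s through all 6 passes is what lets later passes see a compressed summary of earlier ones.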
Loop Counter
Each neuron has a hard limit on how many times it can participate in routing loops. This prevents infinite cycles and forces the model to be efficient.
loop_exhausted = loop_counts >= max_loops
With max_loops = 120, each neuron can participate in at most 120 routing events before being silenced. This creates a form of computational budget.
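A sketch of how that budget check might look, with loop_counts kept as a per-neuron tensor; the names and bookkeeping are assumptions.

import torch

def apply_loop_budget(wants_to_route: torch.Tensor,
                      loop_counts: torch.Tensor,
                      max_loops: int = 120):
    # wants_to_route: bool tensor of neurons asking to route this pass
    # loop_counts:    int tensor of routing events each neuron has used so far
    # Illustrative bookkeeping only; the actual implementation may differ.
    loop_exhausted = loop_counts >= max_loops
    allowed = wants_to_route & ~loop_exhausted      # silence exhausted neurons
    loop_counts = loop_counts + allowed.long()      # charge one unit of budget
    return allowed, loop_counts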
Explain Like I'm Five
How does this tiny model think?
What is FMN-GPT?
Imagine a really tiny brain. Most AI brains today are huge, like a library with millions of books. FMN-GPT is more like a small notebook. But here's the trick. It can read that notebook over and over, each time understanding a little more.
Why Character-Level?
Most AI models learn whole words at a time. We taught this one to read letter by letter, like a child learning to read. This keeps it small. It only needs to know 491 characters instead of thousands of words. Every letter matters.
How Does It Think?
When you ask a question, the model passes it through the same brain circuit 6 times. Each pass, it thinks a little deeper. Like asking yourself "what's 2+2?" and then checking, "Wait, let me think again. 2+2... that's adding two groups of two... so that makes 4!"
The Magic Number: 491 Characters
The model uses a character-level vocabulary of exactly 491 tokens. This includes ASCII characters, special symbols, and custom thinking tokens (the thinking emoji and light bulb emoji). Every character matters.
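For the curious, here is a rough Python sketch of what a character vocabulary with a couple of special thinking tokens could look like. The exact 491-character set is fixed by the project; the class and the tiny example corpus below are illustrative only.

class CharTokenizer:
    # Character-level tokenizer sketch. The real 491-entry vocabulary
    # (ASCII, symbols, thinking/light-bulb tokens) comes from the project;
    # this just shows the mechanics.
    def __init__(self, corpus: str, special_tokens=("🤔", "💡")):
        chars = sorted(set(corpus)) + [t for t in special_tokens if t not in corpus]
        self.stoi = {ch: i for i, ch in enumerate(chars)}
        self.itos = {i: ch for ch, i in self.stoi.items()}

    def encode(self, text: str) -> list[int]:
        return [self.stoi[ch] for ch in text]

    def decode(self, ids: list[int]) -> str:
        return "".join(self.itos[i] for i in ids)

tok = CharTokenizer("hello world 🤔💡")
print(tok.encode("hello"))   # one small integer per character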
Why I Really Stopped
It Was Boring
There's no creativity in distillation. You're just making a smaller copy of someone else's breakthrough. Where's the fun in that?
Diminishing Returns
Every 1% of parameter reduction came with a measurable drop in capability. The tradeoff wasn't worth it.
Overnight Runs
Who leaves their computer on overnight to clone someone else's work and compress it? There's no real creation in it, just photocopying in disguise.
A Better Question
Instead of asking "how do I make this smaller?", I started asking "what if it was designed to be small from the start?"
The Old Way vs. The New Way
Model Compression
- Start with 7B parameters
- Distill, quantize, prune
- End with 1B parameters
- Capability: ???
- Time: Weeks of compute
Small by Design
- Start with ~100K parameters
- Novel architecture
- End with ~100K parameters
- Capability: Emergent
- Time: One GPU, one night
Roadmap
Where we're headed (everything is subject to change)
Phase 1: Core Architecture (Completed)
Rank-40 FMN neurons, REINFORCE-based dynamic routing, recurrent mixer, QK normalization, gated residuals. The foundation is complete.
Phase 2: Training Pipeline (In Progress)
Character-level tokenization (491 vocab), 7 instruction datasets, pretraining on English-Pretraining-Dataset, AdamW optimizer, bfloat16 precision.
Phase 3: CoT Reasoning (Planned)
Teaching the model to think step-by-step with explicit reasoning traces.
Phase 4: Evaluation Suite (Planned)
Comprehensive benchmarks to measure what 100K parameters can actually do.
Phase 5: Model Release (Planned)
Open weights on HuggingFace for the community to experiment with.
Dataset Credits
The data that trains our model (subject to change)
Pretraining Dataset
shuyuej/English-Pretraining-Dataset
Large-scale English text for initial language understanding and general knowledge.
Instruction Dataset 1
TeichAI/Pony-Alpha-15k
Conversational instruction data for learning dialogue patterns and responses.
Instruction Dataset 2
TeichAI/convo-v1
Multi-turn conversation data for context handling and coherent dialogue.
Instruction Dataset 3
TeichAI/Step-3.5-Flash-2600x
High-quality instruction-response pairs for fine-tuning reasoning capabilities.
Instruction Dataset 4
TeichAI/sherlock-thinking-alpha-11000x
Thinking and reasoning data for chain of thought training.
Instruction Dataset 5
TeichAI/glm-4.7-2000x
GLM model outputs for diverse response patterns.
Instruction Dataset 6
TeichAI/claude-haiku-4.5-high-reasoning-1700x
Claude reasoning outputs for advanced thinking patterns.
Instruction Dataset 7
TeichAI/gemini-3-flash-preview
Gemini model outputs for additional diversity.
Beyond Competing with Claude Opus 4.6
The real goal is understanding what's actually necessary for intelligence to emerge. Maybe 100K parameters is enough. Maybe it isn't. But we won't know until we try building from first principles instead of just compressing what already exists.
FMN-GPT will be 100% open-source. This includes the training script, full model weights, all checkpoints, the base model, the instruction model, and our custom inference engine. We believe that open research accelerates progress for everyone.
"Smaller models serve as a means toward a larger goal. Understanding what makes models work in the first place is the real objective."
It's coming.
Small enough to hide in plain sight,
Big enough to twist the stars of night,
Quiet as a shadow, sharp as a spark,
A tiny flame that will light the dark.
This is an ongoing experiment. Everything described here is subject to change. Follow along if you're curious about what this architecture can actually do.