You Will Hit the Token Limit

Spend enough time reading AI workflow advice and you start to notice a promise hiding under the tips. Set things up this way, the story goes, and you will never hit the token limit again. The newer version drops even the pretense: context windows don’t matter anymore. Both are wrong, and the second one is worse, because it sounds like progress.

I want to be fair to the techniques, because most of them are good. Compression, retrieval, prompt caching, vector stores, MCP memory servers, and session summaries all genuinely extend how far you get before the window fills, and I use them. But somewhere between a useful practice and a confident headline, “this helps” turned into “this makes the limit disappear.” It doesn’t. A context window is a finite budget of computation. Whether it holds two hundred thousand tokens or two million, it is still a ceiling. None of those tricks creates infinite memory. They only change what the model carries forward when it can’t carry everything.

The sleight of hand is the gap between storage and attention. A vector database can hold years of material. A filesystem can keep every document you have ever written. An MCP server can expose all of it elegantly. None of that is the context the model is actually reasoning over right now. That active window is small, and everything outside it has to be summarized, retrieved, or dropped to get back in. Persistent storage is not working memory, and pretending otherwise is how projects end up architecturally fragile.
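
To make that gap concrete, here is a minimal Python sketch of the selection step that sits between storage and the active window. Everything in it is an assumption for illustration: the `Snippet` type, the relevance `score` (which would come from whatever retrieval you use), and the four-characters-per-token estimate, which is a crude stand-in and not a real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    text: str
    score: float  # hypothetical relevance score from your retrieval step


def rough_token_count(text: str) -> int:
    # Crude estimate (about four characters per token); not a real tokenizer.
    return max(1, len(text) // 4)


def pack_window(snippets: list[Snippet], budget: int) -> tuple[list[Snippet], list[Snippet]]:
    """Greedily keep the highest-scoring snippets that fit in the token budget.

    Returns (kept, dropped). Storage can hold every snippet; only `kept`
    reaches the model.
    """
    kept: list[Snippet] = []
    dropped: list[Snippet] = []
    used = 0
    for snippet in sorted(snippets, key=lambda s: s.score, reverse=True):
        cost = rough_token_count(snippet.text)
        if used + cost <= budget:
            kept.append(snippet)
            used += cost
        else:
            dropped.append(snippet)
    return kept, dropped
```

However large the store behind it, the function always returns a `dropped` list once the material outgrows the budget. A bigger budget only moves the line between the two lists.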

More memory was never the goal

Here is the part the “never lose context” advice skips: more context is not better context. Past a point, dragging the whole history forward actively hurts the work. Old assumptions contaminate new thinking. Ideas you killed three sessions ago wander back in. Two conflicting instructions sit in the buffer and the model quietly splits the difference. I have watched a long thread go slow and vague not because it ran out of room, but because it was carrying too many ghosts.

Forgetting is not the bug here. The same holds for people: a team that never closes a loop loses the plot, and a project that never archives anything becomes impossible to navigate. Selection is what keeps a system usable, for models and for people.

Tokens are not what runs out first

Tokens are easy to count, so they get the worry. On real projects, other things run dry sooner: the patience to re-explain the same context, the momentum to keep moving after a reset, the coherence to know which of three drafts is the real one. The friction you actually feel — “didn’t we decide this already?”, “which version is current?”, “why is it pushing the idea we killed?” — is not a token problem. It is a continuity problem. And you do not fix continuity by making the transcript longer.

Design for continuity, not immortality

Once you stop trying to keep one conversation alive forever, the goal gets smaller and clearer. The important parts of the work should live outside the chat. Resets should happen on your terms, not as a surprise truncation. Coming back should feel like resuming a project, not exhuming one.

In practice that means treating the chat as a workspace and keeping the real memory in a few plain files that outlive any session. For most work, three are enough.

plan.md is the charter: what we are building, for whom, what is in and out of scope, what “done” means. I write it before serious work starts and only touch it when something structural changes.

# Plan — CSV import for the contacts module

## Objective
Let users bulk-import contacts from a CSV, with a preview step.

## Scope
- In:  parse, validate, map columns, preview, commit
- Out: scheduled/recurring imports (later)

## Constraints
- Handle 50k rows without freezing the UI
- No new dependencies; reuse the existing parser

## Done when
- A bad file fails loudly before a single contact is written

handoff.md is where you are standing: what got done, what you decided and why, what is still broken, and the one next step. I write it at the end of a working block, before the window gets crowded enough to trigger compaction, because the summary comes out sharper while the detail is still fresh.

# Handoff — 24 May, end of session

## Where we are
- Phase: build. Files: server/import.ts, components/ColumnMap.tsx

## Decided
- Dedupe by email, not name. Name collisions were too common to trust.

## Still open
- Preview renders, but bad rows aren't highlighted yet
- Rollback on partial failure is stubbed, not wired up

## Next
- Hook highlightErrors() into the preview table, then test the 50k sample
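
If you would rather generate that file than type it, a small helper is enough. The following is a sketch under my own assumptions, not a standard tool: the `Handoff` class, its field names, and `render_handoff` are hypothetical, and the title line uses a colon as a separator. The one rule it enforces comes from the description above, which asks for a single next step.

```python
from dataclasses import dataclass, field


@dataclass
class Handoff:
    title: str
    where: list[str] = field(default_factory=list)
    decided: list[str] = field(default_factory=list)
    still_open: list[str] = field(default_factory=list)
    next_step: str = ""


def render_handoff(h: Handoff) -> str:
    """Render the handoff as markdown with the four sections shown above."""
    if not h.next_step.strip():
        # A handoff without a next step does not tell you where to resume.
        raise ValueError("a handoff needs one next step")
    sections = [
        ("Where we are", h.where),
        ("Decided", h.decided),
        ("Still open", h.still_open),
        ("Next", [h.next_step]),
    ]
    lines = [f"# Handoff: {h.title}", ""]
    for heading, items in sections:
        lines.append(f"## {heading}")
        lines.extend(f"- {item}" for item in items)
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"
```

Writing the result to disk is one line, for example `Path("handoff.md").write_text(render_handoff(h), encoding="utf-8")` with `pathlib.Path` imported.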

log.md is optional, and it is for you more than the model: a running note of what you tried, what you kept, and what you threw out. You don’t paste it into the chat. You keep it so you don’t re-run a dead experiment in three weeks because the thread that proved it failed is long gone.

# Log — contacts import

## 24 May
- Tried client-side parsing for speed. Dropped it; 50k rows froze the tab.
- Kept server-side stream parse instead.

Two habits make these files earn their place. The first is to stop treating one chat as the unit of work and think in phases instead — frame the problem, sketch the structure, build the draft, polish — each one ending with an updated handoff and a deliberate close rather than a context error. The second is to write the handoff on human triggers, not when the model complains: when you catch yourself re-explaining a constraint, when a real chunk of work just landed, or right before you paste a large document that will balloon the context. Those are the moments to checkpoint, then decide whether to keep going or start clean from the charter, the artifact, and the handoff.
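
Starting clean can be mechanical. As a minimal sketch, assuming plan.md and handoff.md sit in one project folder, the function below assembles the opening context for a fresh session from the charter, an optional artifact, and the handoff, in that order. The function name, the `=== name ===` separators, and the `artifact` parameter are my own illustrative choices. It deliberately leaves log.md out, since that file is for you and not for the chat.

```python
from pathlib import Path
from typing import Optional


def build_restart_context(project_dir: str, artifact: Optional[str] = None) -> str:
    """Concatenate the charter, an optional artifact, and the handoff, in that order."""
    root = Path(project_dir)
    names = ["plan.md"]
    if artifact is not None:
        names.append(artifact)
    names.append("handoff.md")

    parts = []
    for name in names:
        path = root / name
        if not path.is_file():
            # Fail loudly: a restart without these files is a reconstruction.
            raise FileNotFoundError(f"{name} not found in {root}; cannot start clean without it")
        body = path.read_text(encoding="utf-8").strip()
        parts.append(f"=== {name} ===\n{body}")
    return "\n\n".join(parts) + "\n"
```

A call such as `build_restart_context(".", artifact="server/import.ts")` returns one string to paste as the first message of the new session. The missing-file error is the useful part: it tells you the checkpoint was skipped before the old thread is gone.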

You will still hit the limit, and that’s fine

Do all of this and you will still run out of room sometimes. A project grows faster than you expected. A collaborator dumps a large corpus mid-stream. A provider changes its window or its truncation rules. None of that is failure. When it happens, the only questions that matter are whether you lost anything that lived only in the chat, and whether you can find your place again in under a minute. If the answer is a quick restart instead of a reconstruction, the token limit has become what it always should have been — an ordinary constraint, like a timebox or a sprint length.

This is a leadership problem, not a prompting one

I keep arriving at this conclusion from a different direction than most AI writing takes. This is not really about prompts. Once a model becomes shared infrastructure for a team, someone has to decide what deserves to survive a transition, and that is a leadership job. It looks like ending phases on purpose, writing decisions down where everyone can find them, naming the canonical artifact, and refusing to accept “the model forgot” as an explanation when the real problem is that nobody captured what mattered. The teams that get the most out of AI won’t be the ones with the biggest context windows. They’ll be the ones that treat continuity as something they own.

So yes, save tokens. Cut the repetition, drop the dead history, keep the live context focused; it is cheaper and the answers are usually better for it. But that is the side effect, not the point. The point is that when a session ends, the project still makes sense, the decisions are still recoverable, and picking it back up feels like continuation instead of starting over. Never hitting the limit was never the goal. Surviving it intact is.