on the edge of HOPE and whatever comes next after transformers
i'm in a week where exams, new architectures, and my brain all collapsed at the same time
every exam season feels like a cosmic joke. i’m trying to force myself to memorize notes that already feel obsolete while the ai world keeps dropping papers that feel like previews from a future i want to be building, not reading. my boards are around the corner, and everything is chaotic. that weird split between being a founder and being a student hits especially hard right now. one minute i’m revising a chapter on something i’ll forget in three weeks, the next minute i’m staring at a new architecture that rewrites half of what we thought we knew.
this week, google dropped HOPE. and almost as if the universe wanted to punch me harder, i revisited the baby dragon hatchling paper, which already had me doubting the entire transformer worldview. transformers feel like they’re held together by duct tape sometimes. bd hatchling just throws the duct tape in the trash and says, what if we built the model like actual neurons. synapses firing, wiring, storing, adapting. no tokens. no chunking. no kv cache. no context windows. infinite memory context because memory isn’t a window, it’s the architecture.
the entire field feels like it’s tiptoeing toward AGI. not sprinting. not even walking properly. more like stumbling steps in the right direction. embarrassingly clumsy but undeniably forward.
and while all this was happening, sagea had its own small breakthrough. we finally got multimodality into SAGE running efficiently on very low compute without breaking anything. it’s not the highlight of this blog because i’m not trying to make this a sagea update, but i’ll say this: seeing multimodal reasoning work on tiny budgets feels surreal. like watching a small team punch way above its size. but that’s a story for later. right now, my brain is stuck in this swirl where exams and future-of-ai papers coexist in the same overloaded prefrontal cortex.
anyway. let me get into the real reason my focus died this week.
when i read the nested learning paper from google again as prep for HOPE, i realized i’d never appreciated how absurdly incomplete our mental model of optimizers is. we’ve been treating them like external tools that tweak weights. static model, external optimizer. but the google paper basically says: hey, you idiots, optimizers are actually memory systems with their own dynamics. and once you see the math, you can’t unsee it.
the thing that hit me the hardest is how they show that even plain sgd with momentum is a two level nested optimization system. not conceptually. literally.
the outer loop updates the model:
w_{t+1} = w_t - η * m_t

this we all know. but the momentum update, which we treat like some convenient historical smoothing trick, is actually solving an optimization problem at each step.
instead of writing the usual assignment
m_t = β * m_{t-1} + (1 - β) * ∇L_t(w_t)

the paper rewrites it as:
m_t = argmin_m || m - ( β * m_{t-1} + (1 - β) * ∇L_t(w_t) ) ||^2

that’s insane if you think about it. this means momentum is not a buffer. it is literally the solution to a minimization problem. a tiny solver running inside your main solver. learning inside learning. an optimizer nested inside an optimizer.
if you expand the quadratic expression, you get:
m_t = argmin_m (m - A_t)^T (m - A_t)

with

A_t = β * m_{t-1} + (1 - β) * ∇L_t(w_t)

and then of course the minimizer is

m_t = A_t

but the point isn’t the result. it’s the fact that the momentum update is actually the system’s way of compressing information across time. a learned averaging over a hidden temporal dimension.
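to convince myself this isn’t just notation games, i wrote a tiny numpy sketch. this is my own toy, not code from the paper: `inner_solve` literally runs gradient descent on || m - A_t ||^2 instead of using the closed-form answer, so the nested optimizer becomes an actual inner loop you can watch converge to the usual momentum buffer.

```python
import numpy as np

def inner_solve(target, steps=50, lr=0.2):
    # solve m = argmin_m ||m - target||^2 by gradient descent,
    # making the "nested optimizer" explicit instead of just
    # writing down the closed-form minimizer m = target
    m = np.zeros_like(target)
    for _ in range(steps):
        grad = 2.0 * (m - target)  # d/dm ||m - target||^2
        m = m - lr * grad
    return m

def sgd_momentum_step(w, m_prev, grad, eta=0.1, beta=0.9):
    # one outer-loop step of sgd with momentum, where the momentum
    # buffer is produced by the inner optimization problem above
    target = beta * m_prev + (1.0 - beta) * grad  # A_t from the text
    m = inner_solve(target)                        # inner loop
    w_next = w - eta * m                           # outer loop
    return w_next, m

# toy check: the inner solver recovers the ordinary momentum buffer
w = np.array([1.0, -2.0])
m_prev = np.array([0.5, 0.5])
grad = np.array([2.0, -1.0])
w_next, m = sgd_momentum_step(w, m_prev, grad)
print(np.allclose(m, 0.9 * m_prev + 0.1 * grad))  # True
```

the inner loop converges to A_t because the quadratic has a unique minimum there, which is exactly the point: the assignment we all memorize is the solution of a problem, not a rule.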
and suddenly the layers of a transformer feel like a distraction. depth isn’t about stacking matrices. depth is about stacking timescales. memory inside memory. updates nested inside updates.
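a toy way i convinced myself of the “depth is stacking timescales” picture. this sketch is entirely mine, not from either paper: chain a few EMAs so each level smooths the one below it, feed in a single spike, and watch deeper levels hold onto the memory longer.

```python
import numpy as np

def nested_emas(signal, betas):
    # a chain of exponential moving averages: level k averages the
    # output of level k-1, so each level lives on a slower timescale
    levels = np.zeros(len(betas))
    history = []
    for s in signal:
        inp = s
        for k, b in enumerate(betas):
            levels[k] = b * levels[k] + (1 - b) * inp
            inp = levels[k]  # the next level averages this level
        history.append(levels.copy())
    return np.array(history)

# a single spike at t=0: deeper levels respond more slowly
# and forget more slowly
signal = np.zeros(100)
signal[0] = 1.0
hist = nested_emas(signal, betas=[0.5, 0.9, 0.99])
```

by the last timestep the deepest level still carries a visible trace of the spike while the shallow one has forgotten it. depth here is doing nothing but stacking memories of memories.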
this is where the connection to baby dragon hatchling nearly broke my brain.
bd hatchling does not care about transformers. it does not care about layers. it does not care about kv caches, sliding windows, rotary embeddings, or any of that. it builds an architecture where memory is synaptic. where the update rule is literally:
Δw_ij = η * x_i * y_j

with x_i being presynaptic activation and y_j being postsynaptic activation.
a direct hebbian update. neurons that fire together wire together. the architecture becomes the memory.
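for my own sanity i wrote the rule out in code. this is just the textbook hebbian outer product from the equation above, with a tanh nonlinearity i’m assuming for the postsynaptic activations. it is not bd hatchling’s actual update, which wraps a lot more machinery around this core:

```python
import numpy as np

def hebbian_step(w, x, eta=0.01):
    # Δw_ij = η * x_i * y_j: presynaptic activation x meets
    # postsynaptic activation y, and the weight between them grows.
    # tanh is an assumed nonlinearity, not anything from the paper.
    y = np.tanh(w @ x)         # postsynaptic activations
    w += eta * np.outer(y, x)  # neurons that fire together wire together
    return w, y

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(4, 3))  # 3 presynaptic -> 4 postsynaptic
x = np.array([1.0, 0.0, 1.0])
w, y = hebbian_step(w, x)
```

notice there is no loss, no backprop, no optimizer object anywhere. the weights change because activity flowed through them. the architecture becomes the memory.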
reading this after nested learning felt like someone showing me a mathematical theory of learning, then someone else showing me the biological implementation.
nested learning is the theoretical skeleton. bd hatchling is the architectural muscle.
i kept thinking: wow. we’re watching the transformer era slowly crack open. not explode. not collapse dramatically. just tiny fractures showing up every month as researchers realize transformers don’t scale to intelligence, only to performance.
transformers are powerful, but they’re hacks on hacks. tokenization. softmax. kv caches. positional encodings. attention scaling. everything glued together like a high budget speedrun.
then these new architectures show up whispering, hey maybe the brain had a reason to be structured the way it is.
no tokens. no windows. infinite memory because memory is distributed, not stored in a sliding buffer.
and then google drops HOPE right in the middle of all this. i haven’t even finished all the supplementary material because, again, exams exist, but it’s clear they’re also exploring alternatives to transformer dominance.
the field is restless. you can feel it.
meanwhile, i’m revising for exams i’m not emotionally prepared for. trying to prepare for something that feels so distant from the world i’m operating in. and yet i know i have to do it. finish this phase. clear the noise. then i can dive into all this properly.
there’s a strange comfort in knowing that while the field evolves in clumsy but correct steps, i can take a temporary pause without falling behind. not because the world will slow down for me, but because these breakthroughs take years to mature anyway. bd hatchling is experimental. nested learning is theoretical. HOPE is fresh. even sagea’s multimodality work feels like a tiny spark in a long journey.
so yeah. everything is chaotic. everything is moving too fast. everything is exciting. and i’m stuck temporarily studying for exams. god help me.
i’ll probably write the next blog after all this exam nonsense ends. then i can think clearly again. for now, this is the best i can squeeze out while my brain is juggling neural synapses on one side and exam syllabus and brainrot on the other.
ciao, basab