Eight Google researchers published 'Attention Is All You Need.' Within five years it reorganized the tech industry around a single architecture and gave OpenAI the building blocks for ChatGPT.
On 12 June 2017, eight researchers at Google Brain and Google Research uploaded a paper to arXiv titled 'Attention Is All You Need.' It described a new neural-network architecture, the Transformer, that replaced the recurrent and convolutional layers powering language models with a single mechanism: self-attention. The paper was technical, modest in its claims, and competed for attention with hundreds of other ML papers that month. Five years later, every major AI breakthrough (GPT-3, GPT-4, Gemini, Claude, LLaMA, Stable Diffusion, AlphaFold 2) was built on the Transformer. By 2023 the paper had over 100,000 citations, making it one of the most-cited computer-science papers ever published. The eight authors had nearly all left Google by then, founding or joining OpenAI, Cohere, Adept, Character.ai, Sakana AI, Inceptive, Essential AI, and NEAR Protocol, companies that between them raised over $20 billion. The paper that remade the world's most valuable industry came from a team that no longer worked together.
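To make 'self-attention' concrete, here is a minimal single-head sketch of the scaled dot-product attention the paper is built around, written in plain NumPy. The function name, matrix sizes, and toy example are illustrative choices, not code from the paper; a real Transformer adds multiple heads, masking, feed-forward layers, and positional encodings.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (illustrative sketch)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # project each token into query, key, value vectors
    scores = q @ k.T / np.sqrt(k.shape[-1])          # every token scores every other token: (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax turns scores into attention weights
    return weights @ v                               # each output is a weighted mix of all value vectors

# Toy usage: 4 tokens, 8-dimensional embeddings (sizes chosen only for the demo)
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)        # -> (4, 8)
```

The key property is visible in the `scores` line: every token compares itself against every other token in one matrix multiply, which is what lets training parallelize across the whole sequence (unlike an RNN) and what makes the cost grow quadratically with sequence length.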
The paper credits eight authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Łukasz Kaiser, and Illia Polosukhin, listed in 'random order' per a footnote. The actual work happened over about six months in 2017 inside Google Brain. Uszkoreit had been pushing self-attention since 2014; Shazeer (later co-founder of Character.ai) was the architecture's main optimizer; Vaswani led the experiments; Parmar handled the language-model implementation; Jones suggested the title, a nod to the Beatles; Polosukhin (later co-founder of NEAR Protocol) worked on the earliest prototypes before leaving Google that year. The internal codename was 'Transformer', as in 'transforming sequences'. The paper was submitted to NeurIPS 2017 (the field's top conference), accepted, and presented in December 2017 in Long Beach. Within a year, BERT, the first widely deployed Transformer, was already in training at Google.
By late 2023, all eight original authors had left Google. Vaswani and Parmar co-founded Adept (which raised $415M) and then Essential AI. Shazeer co-founded Character.ai (which raised over $190M before Google licensed back its technology in 2024 for roughly $2.7B). Aidan Gomez founded Cohere (over $1.1B raised). Llion Jones founded Sakana AI in Tokyo (a Series A of $200M+). Polosukhin co-founded NEAR Protocol. Kaiser joined OpenAI in 2021 and worked on GPT-4. Uszkoreit founded Inceptive, an mRNA-design startup that uses Transformers for vaccine design. The diaspora is the story: a striking share of the prominent LLM companies founded since 2020 count a Transformer co-author among their founders.
In 2017, the Transformer paper described a base model with 65 million parameters trained on a machine-translation dataset. State-of-the-art language AI meant recurrent networks, statistical methods, and narrow, task-specific models. ChatGPT did not exist. In 2026, the Transformer architecture underlies models with trillions of parameters that write code, draft legal documents, generate images, compose music, and assist in drug discovery. Microsoft has committed $80 billion to AI data centers in 2025 alone. India has 400+ AI startups building on Transformer-based models. The cost of inference (running a trained model) has fallen roughly a thousandfold in five years, putting AI within reach of small businesses and solo developers. What changed was not just capability but accessibility: an Indian startup in 2026 can build on the same architectural foundation as OpenAI, at a fraction of the cost.
The Transformer's dominance is starting to crack at the seams. Mamba (Gu and Dao, 2023) and other state-space models are showing competitive results with linear, rather than quadratic, complexity in sequence length: the Transformer's attention mechanism scales as O(n²) with context, so growing the context from a thousand tokens to a million multiplies the attention cost roughly a million-fold. Mixture of Experts (MoE), used openly in Mixtral and reportedly in GPT-4, keeps the Transformer skeleton but activates only a fraction of the parameters for each token, cutting inference cost. Diffusion language models (Inception Labs' Mercury, 2025) are exploring whether the next architectural step is non-autoregressive entirely. Whether one of these dethrones the Transformer, or they all merge into hybrids, is the most-watched architecture question in the field. But every contender still uses attention somewhere: the original paper's core idea has outlived the exact architecture it introduced.
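A back-of-envelope calculation shows why that quadratic term is the pressure point. The constants below are illustrative assumptions (one layer, one head, a head dimension of 128, a small recurrent state), not measurements of any real model; they only show how the two cost curves diverge as context grows.

```python
# Rough cost scaling for one layer, one head (illustrative constants, not benchmarks).
D = 128  # assumed head dimension

def attention_flops(n, d=D):
    # QK^T and the weighted sum over values each cost ~n^2 * d multiply-adds: quadratic in context length.
    return 2 * n * n * d

def linear_scan_flops(n, d=D, state=16):
    # A state-space / recurrent-style scan touches each token once: linear in context length.
    return n * d * state

for n in (1_000, 100_000, 1_000_000):
    print(f"n={n:>9,}  attention~{attention_flops(n):.2e}  linear~{linear_scan_flops(n):.2e}")

# Growing the context 1000x (1k -> 1M tokens) multiplies the attention term ~1,000,000x,
# while the linear-scan term grows only ~1,000x.
```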
The 2017 Transformer paper is the clearest modern example of how a single scientific contribution can rearrange an entire industry's competitive landscape. The paper itself was Google research, published openly. Google had every advantage: the talent, the compute, the data, the deployment surfaces. And yet the company that built the first breakout consumer LLM (OpenAI, with ChatGPT in 2022) was a competitor that read the same paper. This is the inverse of the 1990s 'first-mover advantage' lesson: when the underlying breakthrough is published openly, execution and risk tolerance matter more than a research lead. For India and other countries trying to enter the AI race, the implication is sharp: training frontier models from scratch needs hundreds of millions of dollars in compute, but the architectural ideas are already public. The next breakthrough, the one that displaces the Transformer, is likely already on arXiv. Whoever recognizes it first wins the next decade. The future of AI belongs to those who act on public knowledge faster than anyone else.
Chronology
1997: Hochreiter and Schmidhuber publish Long Short-Term Memory (LSTM), the recurrent architecture that becomes the dominant language-model building block for the next 20 years.
2014: Bahdanau, Cho, and Bengio publish 'Neural Machine Translation by Jointly Learning to Align and Translate', the first practical attention mechanism, bolted onto an RNN.
2017: Vaswani et al. submit the Transformer paper. The title is deliberately mild; the architecture replaces recurrence with self-attention and parallelizes training across the full sequence.
2018: Google releases BERT, and within months every major NLP benchmark is broken. By late 2020, almost every English Google Search query passes through a BERT model.
2020: OpenAI publishes GPT-3 with 175B parameters. The model exhibits zero-shot and few-shot capabilities not present at smaller scales: the first hint that scale alone produces qualitatively new behavior.
2022: ChatGPT, OpenAI's chat product built on GPT-3.5, hits 100 million users in two months, the fastest consumer-product adoption seen to that point. The Transformer paper, five years old, is suddenly part of household conversation.
2023: OpenAI releases GPT-4, with an estimated training cost of $50-100M. The Transformer architecture has scaled more than four orders of magnitude in parameter count in six years; every modern frontier system is still a Transformer.
2024: Noam Shazeer, one of the original eight authors, returns to Google as part of a roughly $2.7B deal to license Character.ai's technology. The diaspora has come full circle: Google paid billions to bring back an author of the architecture it had published for free.