The Next 1000x Cost Saving of LLM

LLM inference costs have dropped dramatically over the past three years. At quality comparable to early ChatGPT, average per-token prices are roughly 1000x lower, as observed in an a16z blog post. That drop has been driven by advances across the stack: better GPUs, quantization, software optimizations, better models and training methods, and open-source competition driving down profit margins.

A Dive into LLM Quantization

Quantization has been adopted by many model providers to reduce inference cost. In my previous blog post on LLM systems, I briefly covered quantization. This article is a more detailed review of modern quantization formats and methods, plus an informal analysis of how leading AI labs apply quantization.
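
To fix intuition before diving into formats, here is a minimal sketch of per-tensor symmetric int8 quantization, the simplest scheme in this family. This is a generic illustration, not code from the article; the function names are my own.

```python
import numpy as np

def quantize_int8(w):
    """Per-tensor symmetric quantization: w ~= scale * q, with q in int8."""
    # Map the largest magnitude to 127; guard against an all-zero tensor.
    scale = max(np.abs(w).max(), 1e-12) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4096).astype(np.float32)   # stand-in weight tensor
q, scale = quantize_int8(w)
err = np.abs(w - dequantize_int8(q, scale)).max()
print(f"max abs rounding error: {err:.6f} (bounded by scale/2 = {scale/2:.6f})")
```

Storage drops from 4 bytes to 1 byte per weight plus a single float scale; the modern formats the article reviews mostly refine how scales are grouped and how many bits each value gets.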

Beyond AI Taking Jobs: When Economy Needs No Human Consumer

When people consider AI’s threats, unemployment tops the list. The concern is valid, but it fails to distinguish AI from past technological advances, such as mechanized textile production displacing hand-weavers or ATMs replacing bank tellers. The AI revolution is not “yet another technology advance”. To explain why, we first examine the role of human consumers in today’s economy.

Psychology of Intelligence Analysis

I have always wondered how information analysis works in government intelligence agencies and corporate business-intelligence departments. A few weeks ago, I saw someone recommend Psychology of Intelligence Analysis, so I read through it with the help of LLM tools. It’s an interesting but somewhat verbose book that, in my opinion, could be summarized well in two or three articles.

Understanding LLM System with 3-layer Abstraction

Performance optimization of LLM systems requires a thorough understanding of the full software stack. I couldn’t find a comprehensive article that covers the big picture, so instead of waiting for one, I decided to write this article. It is not a comprehensive review or a best-practice guide, but rather my overall perspective on the current LLM system landscape.

Develop Hardware-Efficient AI without being a Hardware Expert

Disclaimer: This blog was originally published on OmniML’s website in November 2022. Since the release of ChatGPT, the ML industry has shifted its priorities dramatically, and many of the earlier assumptions no longer hold. Nevertheless, I have kept the blog for historical reference as a pre-LLM-era view of ML systems.

Understand Autograd - A Bottom-up Tutorial

You may have wondered how autograd actually works in imperative programming. In this post, I explain it with step-by-step examples. Unlike other tutorials, this post does not borrow a single line of code from PyTorch or MXNet, but instead builds everything from scratch.
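
To preview the core idea (a minimal sketch under my own naming, not code from the tutorial), reverse-mode autograd can be built on scalars by recording, for each operation, its inputs and local gradients, then walking the graph backward with the chain rule:

```python
class Var:
    """A scalar that records how it was computed, for reverse-mode autograd."""
    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self.parents = parents  # pairs of (input Var, local gradient d(self)/d(input))

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self, upstream=1.0):
        # Chain rule: accumulate the upstream gradient, then push it to inputs.
        self.grad += upstream
        for parent, local in self.parents:
            parent.backward(upstream * local)

x, y = Var(2.0), Var(3.0)
z = x * y + x          # dz/dx = y + 1, dz/dy = x
z.backward()
print(x.grad, y.grad)  # -> 4.0 2.0
```

This per-path recursion is correct by the linearity of the chain rule, but it revisits shared subgraphs; real engines topologically sort the graph and visit each node once.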