References
Salykov, A. (2025, January 12). Advanced Matrix Multiplication Optimization on NVIDIA GPUs. salykova. https://salykova.github.io/sgemm-gpu
Ansel, J., Yang, E., He, H., Gimelshein, N., Jain, A., Voznesensky, M., Bao, B., Bell, P., Berard, D., Burovski, E., Chauhan, G., Chourdia, A., Constable, W., Desmaison, A., DeVito, Z., Ellison, E., Feng, W., Gong, J., Gschwind, M., & Hirsh, B. (2024). PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation. Proceedings of ASPLOS '24. ACM. https://doi.org/10.1145/3620665.3640366
Bach, F. (2024). Learning Theory from First Principles. MIT Press. https://www.di.ens.fr/~fbach/ltfp_book.pdf
Boehm, S. (2022, December 31). How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: A Worklog. siboehm.com. https://siboehm.com/articles/22/CUDA-MMM
Bright, P., Edelman, A., & Johnson, S. G. (2025). Matrix Calculus (for Machine Learning and Beyond). arXiv. https://arxiv.org/abs/2501.14787
Chan, S. H. (2021). Introduction to Probability for Data Science. Michigan Publishing. https://probability4datascience.com/
Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Cowan, M., Shen, H., Wang, L., Hu, Y., Ceze, L., Guestrin, C., & Krishnamurthy, A. (2018). TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. arXiv. https://doi.org/10.48550/arXiv.1802.04799
Dao, T. (2023). FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. arXiv. https://arxiv.org/abs/2307.08691
Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arXiv. https://doi.org/10.48550/arXiv.2205.14135
Darve, E., & Wootters, M. (2021). Numerical Linear Algebra with Julia. SIAM. https://ericdarve.github.io/NLA
Demmel, J. W. (1997). Applied Numerical Linear Algebra. Society for Industrial and Applied Mathematics.
Dongarra, J., Du Croz, J., Hammarling, S., & Duff, I. S. (1990). A Set of Level 3 Basic Linear Algebra Subprograms. ACM Transactions on Mathematical Software, 16(1), 1–17. https://doi.org/10.1145/77626.79170
Dongarra, J., Du Croz, J., Hammarling, S., & Hanson, R. J. (1988a). Algorithm 656: An Extended Set of Basic Linear Algebra Subprograms: Model Implementation and Test Programs. ACM Transactions on Mathematical Software, 14(1), 18–32. https://doi.org/10.1145/42288.42292
Dongarra, J., Du Croz, J., Hammarling, S., & Hanson, R. J. (1988b). An Extended Set of FORTRAN Basic Linear Algebra Subprograms. ACM Transactions on Mathematical Software, 14(1), 1–17. https://doi.org/10.1145/42288.42291
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. The MIT Press. https://www.deeplearningbook.org/
Gordić, A. (2025, October 29). Inside NVIDIA GPUs: Anatomy of High-Performance Matmul Kernels. aleksagordic.com. https://www.aleksagordic.com/blog/matmul
Goto, K., & van de Geijn, R. A. (2008a). Anatomy of High-Performance Matrix Multiplication. ACM Transactions on Mathematical Software, 34(3), 1–25. https://doi.org/10.1145/1356052.1356053
Goto, K., & van de Geijn, R. A. (2008b). High-Performance Implementation of the Level-3 BLAS. ACM Transactions on Mathematical Software, 35(1), 1–14. https://doi.org/10.1145/1377603.1377607
Baydin, A. G., Pearlmutter, B. A., Radul, A. A., & Siskind, J. M. (2018). Automatic Differentiation in Machine Learning: A Survey. Journal of Machine Learning Research, 18(153), 1–43. https://www.jmlr.org/papers/volume18/17-468/17-468.pdf
Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M. H., Brett, M., Haldane, A., del Río, J. F., Wiebe, M., Peterson, P., & Gérard-Marchant, P. (2020). Array Programming with NumPy. Nature, 585(7825), 357–362. https://doi.org/10.1038/s41586-020-2649-2
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer. https://hastie.su.domains/ElemStatLearn/
Hennessy, J. L., Patterson, D. A., & Kozyrakis, C. (2025). Computer Architecture: A Quantitative Approach. Morgan Kaufmann.
Hwu, W.-M. W., Kirk, D. B., & El Hajj, I. (2022). Programming Massively Parallel Processors: A Hands-on Approach (4th ed.). Morgan Kaufmann.
James, G., Witten, D., Hastie, T., Tibshirani, R., & Taylor, J. (2023). An Introduction to Statistical Learning. Springer. https://www.statlearning.com/
Jurafsky, D., & Martin, J. H. (2026). Speech and Language Processing (3rd ed.). https://web.stanford.edu/~jurafsky/slp3/
Klein, P. N. (2013). Coding the Matrix: Linear Algebra Through Applications to Computer Science. Newtonian Press. https://codingthematrix.com/
Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., & Stoica, I. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. Proceedings of the 29th Symposium on Operating Systems Principles (SOSP '23). https://doi.org/10.1145/3600006.3613165
Lambert, N. (2026). RLHF Book. rlhfbook.com. https://rlhfbook.com/
Lawson, C. L., Hanson, R. J., Kincaid, D. R., & Krogh, F. T. (1979). Basic Linear Algebra Subprograms for Fortran Usage. ACM Transactions on Mathematical Software, 5(3), 308–323. https://doi.org/10.1145/355841.355847
Deisenroth, M. P., Faisal, A. A., & Ong, C. S. (2020). Mathematics for Machine Learning. Cambridge University Press. https://mml-book.github.io/book/mml-book.pdf
Minsky, M., & Papert, S. A. (2017). Perceptrons: An Introduction to Computational Geometry (with a foreword by L. Bottou). The MIT Press.
Mitchell, T. M. (1997). Machine Learning. McGraw-Hill. https://www.cs.cmu.edu/~tom/files/MachineLearningTomMitchell.pdf
Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press.
Nakatsukasa, Y. (n.d.). Numerical Linear Algebra. Retrieved March 17, 2026, from https://courses.maths.ox.ac.uk/pluginfile.php/105965/mod_resource/content/35/NLA_lecture_notes.pdf
Ng, A., & Ma, T. (2023). CS229 Lecture Notes. https://cs229.stanford.edu/main_notes.pdf
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., & Bai, J. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. Advances in Neural Information Processing Systems 32 (NeurIPS 2019). https://papers.nips.cc/paper_files/paper/2019/hash/bdbca288fee7f92f2bfa9f7012727740-Abstract.html
Ragan-Kelley, J., Barnes, C., Adams, A., Paris, S., Durand, F., & Amarasinghe, S. (2013). Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines. Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '13). https://doi.org/10.1145/2491956.2462176
Raschka, S. (2024). Build a Large Language Model (From Scratch). Manning.
Raschka, S. (2026). Build a Reasoning Model (From Scratch). Manning.
Roberts, D. A., Yaida, S., & Hanin, B. (2022). The Principles of Deep Learning Theory: An Effective Theory Approach to Understanding Neural Networks. Cambridge University Press. https://deeplearningtheory.com/
seb-v. (2025, January 20). Optimizing Matrix Multiplication on RDNA3: 50 TFlops and 60% Faster Than rocBLAS. seb-v.github.io. https://seb-v.github.io/optimization/update/2025/01/20/Fast-GPU-Matrix-multiplication.html
Shah, J., Bikshandi, G., Zhang, Y., Thakkar, V., Ramani, P., & Dao, T. (2024). FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision. arXiv. https://arxiv.org/abs/2407.08608
Shalizi, C. R. (n.d.). Advanced Data Analysis from an Elementary Point of View. https://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/ADAfaEPoV.pdf
Shalizi, C. R. (2015). Modern Regression Lecture Notes. Carnegie Mellon University. https://www.stat.cmu.edu/~cshalizi/mreg/15/
Shankhdhar, P. (2024, November 29). Outperforming cuBLAS on H100: A Worklog. cudaforfun.substack.com. https://cudaforfun.substack.com/p/outperforming-cublas-on-h100-a-worklog
Smith, T. M., van de Geijn, R. A., Smelyanskiy, M., Hammond, J. R., & Van Zee, F. G. (2014). Anatomy of High-Performance Many-Threaded Matrix Multiplication. 2014 IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS). https://doi.org/10.1109/IPDPS.2014.110
Spector, B. F., Arora, S., Singhal, A., Fu, D. Y., & Ré, C. (2024). ThunderKittens: Simple, Fast, and Adorable AI Kernels. arXiv. https://arxiv.org/abs/2410.20399
Spector, B., Singhal, A., Arora, S., & Ré, C. (2024, May 12). GPUs Go Brrr. Hazy Research. https://hazyresearch.stanford.edu/blog/2024-05-12-tk
Strang, G. (2023). Introduction to Linear Algebra (6th ed.). Wellesley-Cambridge Press. https://math.mit.edu/~gs/linearalgebra/
Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). The MIT Press. http://incompleteideas.net/book/the-book-2nd.html
Tillet, P., Kung, H.-T., & Cox, D. G. (2019). Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations. Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (MAPL 2019). https://doi.org/10.1145/3315508.3329973
Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra. Society for Industrial and Applied Mathematics.
Valiant, L. (2014). Probably Approximately Correct: Nature's Algorithms for Learning and Prospering in a Complex World. Basic Books.
Zadouri, T., Hoehnerbach, M., Shah, J., Liu, T., Thakkar, V., & Dao, T. (2026). FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling. arXiv. https://arxiv.org/abs/2603.05451
Zheng, L., Yin, L., Xie, Z., Sun, C., Huang, J., Yu, C. H., Cao, S., Kozyrakis, C., Stoica, I., Gonzalez, J. E., Barrett, C., & Sheng, Y. (2023). SGLang: Efficient Execution of Structured Language Model Programs. arXiv. https://arxiv.org/abs/2312.07104