Tags adam1 adamw1 ai1 analysis1 article1 batch-norm1 bayesian1 bengio1 book-review4 books1 building1 deep learning2 discussion-group1 dropout1 embeddings1 essays1 gradient descent1 hotwheels1 humanities1 knowledge1 language models1 layer-norm1 literature1 llm1 lstm1 marketplace1 momentum1 neural networks4 nlp2 normalization1 optimization1 overfitting1 poem1 regularization1 research1 rms-norm1 rnn1 sequence modeling1 sgd1 short-stories1 startups1 training1 training-stability1 uncertainty1 vanishing gradient1 year-in-books4