What does "batch" refer to during training? For inference I can picture how batching works (running several inputs through the network at once), but I don't see how the reverse pass, i.e. backpropagation, would work over a batch. Do frameworks use some hack to make it work, and is that why people argue that a large batch size is bad?
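
To make my question concrete, here's a toy sketch of what I *think* happens (all names, dimensions, and the single-layer setup are made up by me, just for illustration): the forward pass runs the whole batch in one matmul, and the backward pass seems to just aggregate per-sample gradients into one weight gradient.

```python
import numpy as np

# Toy example: one linear layer y = x @ W, squared-error loss,
# a batch of B samples processed together.
rng = np.random.default_rng(0)
B, D_in, D_out = 4, 3, 2              # batch size and layer dimensions
x = rng.normal(size=(B, D_in))        # a batch of inputs
W = rng.normal(size=(D_in, D_out))    # weights
t = rng.normal(size=(B, D_out))       # targets

# Forward pass: the whole batch goes through in a single matmul.
y = x @ W
loss = ((y - t) ** 2).mean()

# Backward pass: per-sample gradients are summed into one gradient
# for W -- is this all there is, or is there a hack I'm missing?
grad_y = 2 * (y - t) / (B * D_out)    # dLoss/dy
grad_W = x.T @ grad_y                 # dLoss/dW, aggregated over the batch

# Sanity check: matches summing each sample's gradient individually.
grad_W_per_sample = sum(
    np.outer(x[i], 2 * (y[i] - t[i]) / (B * D_out)) for i in range(B)
)
print(np.allclose(grad_W, grad_W_per_sample))  # -> True
```

If the backward pass really is just this aggregation, where does the "large batches are bad" argument come from?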