My team and I are working on one of the topics proposed in the mini-project PDF: "For training deep nets with very large SGD mini-batches: when does the scalability gain disappear? Is it influenced by other properties of the optimizer? For example, what is the effect of first slowly growing the learning rate and then later decreasing it?" We have several questions about the project and its scope.
What exactly is meant by scalability here? Does it relate to distributed training of neural networks and the number of worker nodes that take part in SGD? If so, do you have any advice on how we should start, especially regarding the architecture needed to investigate such questions?
The question in the PDF includes the part "For example, what is the effect of first slowly growing the learning rate and then later decreasing it?". In relation to this, we started looking at Goyal et al. (2017) [https://arxiv.org/abs/1706.02677] for large-batch training schemes. Do you think it is a good idea to start from that paper and then look at ways to produce different learning-rate schemes? Is that what we are supposed to do?
Thank you in advance
Let me start by emphasizing that those project ideas are just suggestions. It's up to you to define the precise question you want to explore. I think you can study it with any kind of model (it doesn't have to be a deep neural network).
What do you want "scalability gain" to mean? If you consider distributed training, you can find out very practically how far you can scale the number of workers before communication becomes a bottleneck. If you consider mini-batch SGD on a single GPU, you can measure the cost of computing gradients at various batch sizes, and find the point where the computational gain from a larger batch is offset by the extra gradient updates (and hence extra samples) needed to converge to full accuracy.
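To make the second option concrete, here is a minimal sketch of such an experiment, not a prescribed setup. It assumes a synthetic linear regression problem in NumPy, a learning rate that grows linearly with the batch size up to a cap, and a fixed target loss; the function name `updates_to_target` and all constants are made up for illustration. It counts updates and samples processed rather than wall-clock time, so you would still need to add timing on your own hardware.

```python
import numpy as np

# Synthetic least-squares problem (an assumption, chosen to keep the sketch small).
data_rng = np.random.default_rng(0)
N, D = 4096, 20
X = data_rng.normal(size=(N, D))
w_true = data_rng.normal(size=D)
y = X @ w_true + 0.1 * data_rng.normal(size=N)


def updates_to_target(batch_size, lr_per_sample=0.01, lr_cap=1.0,
                      target_mse=0.02, max_steps=5000, seed=1):
    """Return the number of SGD updates needed to reach target_mse, or None."""
    rng = np.random.default_rng(seed)
    # Learning rate grows linearly with the batch size, but is capped.
    lr = min(lr_per_sample * batch_size, lr_cap)
    w = np.zeros(D)
    for step in range(1, max_steps + 1):
        idx = rng.integers(0, N, size=batch_size)      # sample a mini-batch
        residual = X[idx] @ w - y[idx]
        grad = X[idx].T @ residual / batch_size        # mini-batch gradient
        w -= lr * grad
        if np.mean((X @ w - y) ** 2) <= target_mse:    # full-data loss check
            return step
    return None


results = {b: updates_to_target(b) for b in (1, 8, 64, 512)}
for b, steps in results.items():
    samples = None if steps is None else steps * b
    print(f"batch={b:4d}  updates={steps}  samples processed={samples}")
```

If the number of updates stops shrinking in proportion to the batch size, the total number of samples processed goes up, and that is one way to define the point where the scalability gain disappears.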
In deep learning (and in the paper you shared), you sometimes cannot reach full accuracy at all with large batches, even if you train for longer. You can also view the question from this angle. The paper is very well known and seems like a great starting point for a project.
In summary: up to you :) Just make sure you agree with your group on which aspect of 'scalability' you want to consider, and be consistent.
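For the "slowly grow, then decrease" part of the question, a schedule in the spirit of that paper could be sketched as below. The defaults (base learning rate 0.1 per 256 samples, 5 warmup epochs, 10x decay at epochs 30, 60 and 80) are how I recall the paper's ImageNet setup; treat them as assumptions and check them against the paper. The function name `lr_at_epoch` is just illustrative.

```python
def lr_at_epoch(epoch, batch_size, base_lr=0.1, base_batch=256,
                warmup_epochs=5, milestones=(30, 60, 80), gamma=0.1):
    """Linear warmup to a batch-size-scaled learning rate, then step decay."""
    # Linear scaling: the peak learning rate is proportional to the batch size.
    peak_lr = base_lr * batch_size / base_batch
    if epoch < warmup_epochs:
        # Warmup: ramp linearly from base_lr to peak_lr.
        return base_lr + (peak_lr - base_lr) * epoch / warmup_epochs
    # Decay: multiply by gamma once for every milestone already passed.
    n_decays = sum(epoch >= m for m in milestones)
    return peak_lr * gamma ** n_decays


for epoch in (0, 2, 5, 30, 60, 80):
    print(epoch, lr_at_epoch(epoch, batch_size=8192))
```

Changing `warmup_epochs`, the shape of the ramp, or the decay rule gives you different schemes to compare at each batch size.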