theory question complexity
Hello,
I understand that we consider D different values W_i for each dimension, but how is the × N factor obtained? If N is the number of samples, I'm not sure why that holds, because for every W combination, evaluating the loss isn't a single-step process.
Thanks for your help.
Hi, the assumption is that calculating the loss for a single sample is O(1), so doing it for all N samples is O(N). The O(N) matters because SGD does not require O(N) work for each step, whereas grid search does.
Best,
Semih
Thank you for your quick response. I understood that was the assumption, but I am not sure why we can say that calculating the loss has O(1) complexity.
It is more a matter of choice. Here we choose to say evaluating the loss has complexity O(N) instead of O(1). We chose O(N) to stress that some algorithms (like SGD) do not look at the whole dataset to make a single decision. An example may make this clearer. Suppose we define a new algorithm, batched grid search, which samples B examples at random from the training set and uses only those to evaluate a single grid location. If we assumed loss calculation were O(1), we could not really show that batched grid search is faster than full grid search. However, if we assume calculating the loss over the whole dataset is O(N), then the batched evaluation has O(B) complexity, and the reduction becomes visible.
Here we choose O(N) because we will later compare grid search with SGD, and SGD indeed does batched computation.
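To make the accounting concrete, here is a minimal sketch of the two variants for a single parameter w with an MSE loss. This is purely illustrative (the model, the grid, and the names `grid_search` / `batched_grid_search` are my own assumptions, not from the lecture): each grid evaluation of `grid_search` touches all N samples, while `batched_grid_search` touches only B.

```python
import random

def loss(w, data):
    # Mean squared error of a 1-D linear model y = w * x; O(len(data)) per call.
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def grid_search(grid, data):
    # Evaluates the loss on ALL N samples for each of the D grid values:
    # O(D * N) total work.
    return min(grid, key=lambda w: loss(w, data))

def batched_grid_search(grid, data, B):
    # Evaluates each grid value on a random batch of B samples instead:
    # O(D * B) total work, independent of N.
    batch = random.sample(data, B)
    return min(grid, key=lambda w: loss(w, batch))

data = [(x, 2.0 * x) for x in range(1000)]  # N = 1000 noiseless samples, true w = 2
grid = [i / 10 for i in range(41)]          # D = 41 candidate weights in [0, 4]
print(grid_search(grid, data))              # → 2.0
print(batched_grid_search(grid, data, B=32))  # → 2.0 (data is noiseless)
```

With noisy data the batched estimate would of course be approximate; the point is only that its per-step cost is O(B) rather than O(N), which is exactly the distinction that matters when we compare grid search with SGD.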