This post presents a detailed walkthrough of my solutions to the assignments of
Stanford CS231n: Convolutional Neural Networks for Visual Recognition (Spring 2025).
The goal is not only to provide working solution code, but also to explain the underlying concepts, step by step,
so you can build a deeper understanding of how and why each solution works.
You can find the complete solution code on my GitHub repo → click this link.
This post covers only the harder parts of each assignment. Please keep in mind that some parts of the solutions may not be fully correct or easy to understand. If you spot any mistakes or areas for improvement, feel free to let me know. I'd truly appreciate it.
Index
- Q1: k-Nearest Neighbor classifier
- Q2: Implement a Softmax Classifier
- Q3: Two-Layer Neural Network
- Q4: Higher Level Representations: Image Features (__skip__)
- Q5: Training a fully connected network (__skip__)
- Q1: Batch Normalization
- Q2: Dropout (__skip__)
- Q3: Convolutional Neural Networks
- Q4: PyTorch on CIFAR-10 (__skip__)
- Q5: Image Captioning with Vanilla RNNs (__skip__)
- Q1: Image Captioning with Transformers
- Q2: Self-Supervised Learning for Image Classification
Assignment1
Q1: k-Nearest Neighbor classifier
Implementing no-loop distances

The squared L2 distance between a test sample x and a training sample y expands as (x - y)^2 = x^2 + y^2 - 2xy, summed over the feature dimension. The two squared terms can be computed with np.sum() and np.square(). For the cross term, since it involves the sum of element-wise multiplications, it can be converted to a dot product. However, be careful to transpose one of the matrices to match alignment (see picture below).
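Putting this together, a minimal sketch of the no-loop version might look like the following, where I assume X_test has shape (num_test, D) and X_train has shape (num_train, D) (variable names are mine, not the assignment's):

test_sq = np.sum(np.square(X_test), axis=1, keepdims=True)   # (num_test, 1)
train_sq = np.sum(np.square(X_train), axis=1)                # (num_train,)
cross = X_test.dot(X_train.T)                                # (num_test, num_train)
dists = np.sqrt(test_sq + train_sq - 2 * cross)              # broadcasting combines all three terms

keepdims=True on the test term lets the two squared-norm vectors broadcast into the full (num_test, num_train) grid.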
Q2: Implement a Softmax Classifier
Numeric stability

Exponentiating each score with the natural constant e can produce very large values and cause numeric instability. To address this, subtract the maximum score from every score before exponentiating. This doesn't affect the loss, because dividing both the numerator and the denominator of the softmax by the same value leaves the result unchanged. Additionally, add a very small epsilon inside the logarithm to avoid taking the log of zero.
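For example, a stabilized loss computation for a single sample might look like this sketch, where scores is the raw score vector for one sample and y is its correct class index (names are mine):

shifted = scores - np.max(scores)                  # subtract the max score for stability
probs = np.exp(shifted) / np.sum(np.exp(shifted))
loss = -np.log(probs[y] + 1e-12)                   # tiny epsilon avoids log(0)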
Computing gradient

X = np.array([3.0, -2.6, 2.0])   # one sample with 3 features
W = np.random.randn(3, 3)        # one row of W per class, so scores = W.dot(X)
prb = np.array([0.2, 0.6, 0.2])  # softmax probabilities for the 3 classes
Y = np.array([1, 0, 0])          # one-hot label: class 0 is the correct class
dW = np.zeros_like(W)
dW[0] = -X + X * 0.2             # correct class: gradient is X * (p - 1)
dW[1] = X * 0.6                  # wrong classes: gradient is X * p
dW[2] = X * 0.2
Implementing no-loop gradient

We can implement the multiplication between the input data and the corresponding probabilities as a single dot product between the input batch and the probability matrix. This matrix contains the predicted probability of each class for every sample. To implement the subtraction for the correct class in the gradient, we slightly modify the probability matrix: we subtract 1 at the position corresponding to the correct class label of each sample.
As shown in the picture above, if the correct label of the first sample is class 2, then -1 is added (not replaced) at that position in the matrix.
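In code, the whole no-loop gradient is only a few lines. A minimal sketch, assuming X is the (N, D) input batch, probs is the (N, C) probability matrix, y holds the correct class index of each sample, and W has shape (D, C) so that scores = X.dot(W):

N = X.shape[0]
probs[np.arange(N), y] -= 1          # add -1 at each sample's correct-class position
dW = X.T.dot(probs) / N              # (D, N) x (N, C) -> (D, C), averaged over the batch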
Q3: Two-Layer Neural Network
Computing gradient
The best way to understand a fully-connected layer is to write down all the computations by hand. Let's walk through a simple example with a 2x3 input matrix and a 3x2 weight matrix. Assume a 2x2 dout
comes from the upstream layer in the network.

The remaining part mainly involves assembling the components and just setting up the training loop, which are relatively straightforward, so I'll skip over them here.
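Still, for reference, the gradients worked out above collapse into three matrix products. A minimal sketch using the shapes from the example (x is 2x3, w is 3x2, b has 2 entries, dout is 2x2):

dx = dout.dot(w.T)         # (2,2) x (2,3) -> (2,3), same shape as x
dw = x.T.dot(dout)         # (3,2) x (2,2) -> (3,2), same shape as w
db = np.sum(dout, axis=0)  # (2,), same shape as b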
Assignment2
Q1: Batch Normalization
Getting gradient of gamma and beta
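Because gamma and beta only scale and shift the normalized input, their gradients are plain sums over the batch dimension. A minimal sketch, assuming dout is the upstream gradient and x_hat is the normalized input cached during the forward pass (the cache names are my assumption):

dgamma = np.sum(dout * x_hat, axis=0)   # gradient of the scale parameter
dbeta = np.sum(dout, axis=0)            # gradient of the shift parameter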

Getting gradient of input
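The input gradient is trickier because x affects the output through the normalized value, the batch mean, and the batch variance. After simplifying the chain rule, everything collapses into the expression below. A minimal sketch, assuming x_hat and inv_std (i.e. 1 / sqrt(var + eps)) were cached in the forward pass and gamma is the scale parameter (again, the names are my assumption):

N = dout.shape[0]
dx_hat = dout * gamma
dx = inv_std / N * (N * dx_hat
                    - np.sum(dx_hat, axis=0)
                    - x_hat * np.sum(dx_hat * x_hat, axis=0))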

Q3: Convolutional Neural Networks
im2col (Image to Column)
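The idea is to unroll every receptive field into a column so that the convolution becomes a single matrix multiplication between the reshaped filters and that column matrix. A minimal loop-based sketch for one image, with padding already applied and all names my own:

import numpy as np

def im2col(x, fh, fw, stride=1):
    # x: one image of shape (C, H, W); returns (C*fh*fw, out_h*out_w),
    # where each column is a flattened receptive field
    C, H, W = x.shape
    out_h = (H - fh) // stride + 1
    out_w = (W - fw) // stride + 1
    cols = np.zeros((C * fh * fw, out_h * out_w))
    idx = 0
    for i in range(out_h):
        for j in range(out_w):
            patch = x[:, i*stride:i*stride+fh, j*stride:j*stride+fw]
            cols[:, idx] = patch.reshape(-1)
            idx += 1
    return cols

With cols in hand, the forward pass for filters w of shape (F, C, fh, fw) is just w.reshape(F, -1).dot(cols), reshaped back to (F, out_h, out_w).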

Getting gradient dW
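For dW, each output position contributes its receptive field weighted by the corresponding value of the upstream gradient, summed over all positions (and over all samples in the batch). A minimal loop-based sketch for one sample and one filter, with my own names:

import numpy as np

def conv_dw_single(x_pad, dout_f, fh, fw, stride=1):
    # x_pad: zero-padded input of shape (C, Hp, Wp)
    # dout_f: upstream gradient for one filter, shape (out_h, out_w)
    C = x_pad.shape[0]
    out_h, out_w = dout_f.shape
    dw = np.zeros((C, fh, fw))
    for i in range(out_h):
        for j in range(out_w):
            patch = x_pad[:, i*stride:i*stride+fh, j*stride:j*stride+fw]
            dw += dout_f[i, j] * patch   # receptive field weighted by the upstream gradient
    return dw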

Getting gradient dX

The gradient with respect to the input X is similar, except that the weight acts as the filter and dilation is applied to the upstream gradient. Additionally, an extra one-pixel zero-padding is added around the upstream gradient. The illustration above demonstrates how to compute dX for the stride-2 case with a 2x2 weight filter. Note that the resulting dX will include regions corresponding to the zero-padding added in the forward pass. These padded areas are not part of the original input and must be explicitly trimmed to obtain the correct gradient.
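The picture describes the dilation-plus-padding view; an equivalent way to obtain the same result in code is to scatter each upstream gradient value back onto the receptive field it came from, then trim the padded border at the end. A minimal sketch for one sample, with my own names:

import numpy as np

def conv_dx_single(w, dout, x_pad_shape, pad, stride=1):
    # w: filters of shape (F, C, fh, fw); dout: upstream gradient of shape (F, out_h, out_w)
    F, C, fh, fw = w.shape
    _, out_h, out_w = dout.shape
    dx_pad = np.zeros(x_pad_shape)       # gradient w.r.t. the zero-padded input
    for f in range(F):
        for i in range(out_h):
            for j in range(out_w):
                dx_pad[:, i*stride:i*stride+fh, j*stride:j*stride+fw] += w[f] * dout[f, i, j]
    # drop the rows/columns that correspond to the zero-padding added in the forward pass
    return dx_pad[:, pad:-pad, pad:-pad] if pad > 0 else dx_pad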
Assignment3
Q1: Image Captioning with Transformers
Implementing Multihead Self-Attention Layer
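A full implementation also needs the attention mask and dropout; here is a minimal numpy sketch of just the core computation for a single sequence (all names are mine):

import numpy as np

def multihead_self_attention(x, Wq, Wk, Wv, Wo, num_heads):
    # x: (T, D) sequence of token embeddings; Wq, Wk, Wv, Wo: (D, D) projection matrices
    T, D = x.shape
    d_head = D // num_heads
    # project, then split the embedding dimension into heads: (num_heads, T, d_head)
    q = x.dot(Wq).reshape(T, num_heads, d_head).transpose(1, 0, 2)
    k = x.dot(Wk).reshape(T, num_heads, d_head).transpose(1, 0, 2)
    v = x.dot(Wv).reshape(T, num_heads, d_head).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (num_heads, T, T) scaled dot products
    scores -= scores.max(axis=-1, keepdims=True)          # same stability trick as in the softmax section
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)              # softmax over the key dimension
    out = attn @ v                                        # (num_heads, T, d_head)
    out = out.transpose(1, 0, 2).reshape(T, D)            # concatenate the heads back together
    return out.dot(Wo)                                    # final output projection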

Q2: Self-Supervised Learning for Image Classification
Implementing vectorized SimCLR loss
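The NT-Xent loss can be computed for all 2N augmented samples at once by building the full similarity matrix and masking out the diagonal. A minimal numpy sketch, assuming z1 and z2 are the (N, D) projections of the two augmented views of the same N images (names and the temperature default are mine):

import numpy as np

def simclr_loss(z1, z2, tau=0.5):
    N = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)                    # (2N, D): both views stacked
    z = z / np.linalg.norm(z, axis=1, keepdims=True)        # unit vectors, so dot product = cosine similarity
    sim = z.dot(z.T) / tau                                  # (2N, 2N) scaled similarity matrix
    np.fill_diagonal(sim, -np.inf)                          # a sample is never compared with itself
    pos = np.concatenate([np.arange(N) + N, np.arange(N)])  # index of each sample's positive pair
    row_max = np.max(sim, axis=1, keepdims=True)            # max trick for a stable log-sum-exp
    log_denom = row_max[:, 0] + np.log(np.sum(np.exp(sim - row_max), axis=1))
    return np.mean(log_denom - sim[np.arange(2 * N), pos])  # average of -log(pos / all) over all 2N samples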
