Suppose we want to classify movie review text as (1) either positive or negative sentiment, and (2) either action, comedy, or romance movie genre. To perform these two related classification tasks, we use a neural network that shares the first layer but branches into two separate layers to compute the two classifications. The loss is a weighted sum of the two cross-entropy losses.
\[
h = \text{ReLU}(W_0x + b_0)
\]
\[
\hat{y}_1 = \text{softmax}(W_1h + b_1)
\]
\[
\hat{y}_2 = \text{softmax}(W_2h + b_2)
\]
\[
J = \alpha \text{CE}(y_1, \hat{y}_1) + \beta \text{CE}(y_2, \hat{y}_2)
\]
Here, input \(x \in \mathbb{R}^{10}\) is a vector encoding of the input text, label \(y_1 \in \mathbb{R}^2\) is a one-hot vector encoding the true sentiment, label \(y_2 \in \mathbb{R}^3\) is a one-hot vector encoding the true movie genre, \(h \in \mathbb{R}^{10}\) is the hidden layer, and \(W_0 \in \mathbb{R}^{10 \times 10}\), \(W_1 \in \mathbb{R}^{2 \times 10}\), \(W_2 \in \mathbb{R}^{3 \times 10}\).
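As a concrete illustration (not part of the original problem statement), here is a minimal PyTorch sketch of this shared-trunk, two-head architecture and the weighted loss. The batch size and the label tensors are placeholders, and \(\alpha, \beta\) are treated as hyperparameters; note that `F.cross_entropy` applies the softmax internally, so the heads return logits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTaskClassifier(nn.Module):
    """Shared first layer, then two task-specific output heads."""
    def __init__(self, in_dim=10, hidden_dim=10, n_sentiment=2, n_genre=3):
        super().__init__()
        self.shared = nn.Linear(in_dim, hidden_dim)      # W_0, b_0
        self.head1 = nn.Linear(hidden_dim, n_sentiment)  # W_1, b_1
        self.head2 = nn.Linear(hidden_dim, n_genre)      # W_2, b_2

    def forward(self, x):
        h = F.relu(self.shared(x))
        # Return logits; softmax is folded into the cross-entropy loss below.
        return self.head1(h), self.head2(h)

model = TwoTaskClassifier()
x = torch.randn(32, 10)          # batch of 32 encoded reviews (placeholder)
y1 = torch.randint(0, 2, (32,))  # sentiment labels as class indices
y2 = torch.randint(0, 3, (32,))  # genre labels as class indices

alpha, beta = 1.0, 1.0           # loss weights (hyperparameters)
logits1, logits2 = model(x)
J = alpha * F.cross_entropy(logits1, y1) + beta * F.cross_entropy(logits2, y2)
J.backward()
```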
When we train this model, we find that it underfits the training data. Consider why underfitting might be happening and which of the following modifications could reduce it.
- Increasing the dimension of the hidden layer.
- Adding more layers to the neural network.
- Splitting the model into two separate models with more overall parameters.
- Reducing the training data.
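The first two options both amount to increasing model capacity. As a hedged sketch (the wider hidden dimension of 64 and the extra shared layer are illustrative choices, not values given in the problem), the trunk could be modified like this:

```python
import torch.nn as nn

# Illustrative higher-capacity trunk: widen the hidden layer (10 -> 64)
# and add one extra shared layer; the specific sizes are assumptions.
shared = nn.Sequential(
    nn.Linear(10, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
)
head1 = nn.Linear(64, 2)  # sentiment head
head2 = nn.Linear(64, 3)  # genre head
```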