L2 regularization only makes weights small, not zero
Feature crosses lead to lots of input nodes, so having zero weights is especially important
Minimizing the L0 norm (the count of non-zero weights) is an NP-hard, non-convex optimization problem
The L1 norm (sum of absolute values of the weights) is convex and efficient to optimize; it tends to encourage sparsity in the model
There are many possible choices of norms
Elastic nets combine the feature selection of L1 regularization with the generalizability of L2 regularization
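To make the norms above concrete, here is a minimal sketch of the penalties in plain Python. The function names and the two strength parameters are illustrative assumptions, not part of any particular library.

```python
def l0_count(weights):
    # L0 "norm": how many weights are non-zero.
    return sum(1 for w in weights if w != 0.0)


def l1_penalty(weights):
    # L1 norm: sum of absolute values; encourages exact zeros.
    return sum(abs(w) for w in weights)


def l2_penalty(weights):
    # Squared L2 norm: sum of squares; shrinks weights but rarely to zero.
    return sum(w * w for w in weights)


def elastic_net_penalty(weights, l1_strength, l2_strength):
    # Elastic net: a weighted sum of the L1 and L2 penalties.
    # l1_strength and l2_strength are hypothetical hyperparameter names.
    return l1_strength * l1_penalty(weights) + l2_strength * l2_penalty(weights)


weights = [0.5, -2.0, 0.0, 0.0]
print(l0_count(weights), l1_penalty(weights), l2_penalty(weights))
print(elastic_net_penalty(weights, l1_strength=0.1, l2_strength=0.01))
```

The penalty is added to the training loss, so larger strengths trade training fit for smaller or sparser weights.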
Logistic regression: transform the output of linear regression with a sigmoid activation function
The output of Logistic Regression is a calibrated probability estimate
Useful because we can cast binary classification problems into probabilistic problems: Will customer buy item? becomes Predict the probability that customer buys item
Typically, use cross-entropy (related to Shannon's information theory) as the error metric
Less emphasis on errors where the output is relatively close to the label.
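A minimal sketch of the sigmoid transform and the cross-entropy loss for a single example, in plain Python. The helper names and the eps clamp are assumptions made for illustration.

```python
import math


def sigmoid(z):
    # Numerically stable sigmoid: avoids overflow for large negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_probability(weights, bias, features):
    # Linear regression output passed through the sigmoid.
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def cross_entropy(label, probability, eps=1e-12):
    # Binary cross-entropy for one example; label is 0 or 1.
    # The clamp (an assumption of this sketch) keeps log() finite.
    p = min(max(probability, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


# The loss grows quickly as the prediction moves away from the label.
for p in (0.9, 0.6, 0.1):
    print(p, cross_entropy(1, p))
```

Predictions close to the label contribute little loss, while confident wrong predictions are penalized heavily.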
Regularization is important in logistic regression because driving the loss to zero is difficult and dangerous
Weights will be driven to -inf and +inf the longer we train
Near the asymptotes, gradient is really small
Often we do both regularization and early stopping to counteract overfitting
In many real-world problems, the probability is not enough; we need to make a binary decision
Choice of threshold is important and can be tuned
Use the ROC curve to choose the decision threshold based on decision criteria
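Each threshold gives one point on the ROC curve. The sketch below computes that point for a given threshold; the function name and the toy labels and scores are made up for illustration.

```python
def roc_point(labels, scores, threshold):
    # Returns (false positive rate, true positive rate) when every
    # score >= threshold is classified as positive.
    tp = fp = fn = tn = 0
    for y, s in zip(labels, scores):
        predicted_positive = s >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return fpr, tpr


labels = [1, 1, 0, 1, 0, 0]                # toy data
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
for threshold in (0.2, 0.5, 0.75):
    print(threshold, roc_point(labels, scores, threshold))
```

Sweeping the threshold and comparing the resulting points against the cost of false positives and false negatives is one way to pick the operating point.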
The Area Under the Curve (AUC) provides an aggregate measure of performance across all possible classification thresholds
AUC helps you choose between models when you don’t know what decision threshold is going to be ultimately used.
“If we pick a random positive and a random negative, what’s the probability my model scores them in the correct relative order?”
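The ranking interpretation above can be computed directly by comparing every positive with every negative. This is a small O(n^2) sketch for illustration, with a hypothetical function name; counting ties as half is a common convention assumed here.

```python
def pairwise_auc(labels, scores):
    # Fraction of (positive, negative) pairs where the positive example
    # receives the higher score. Ties count as half a correct ordering.
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


labels = [1, 1, 0, 1, 0, 0]                # toy data
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
print(pairwise_auc(labels, scores))
```

No threshold appears anywhere in the computation, which is why AUC is useful before the decision threshold has been chosen.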
Logistic Regression predictions should be unbiased
average of predictions == average of observations
Look for bias in slices of data; this can guide improvements
Use calibration plots of bucketed bias to find slices where your model performs poorly
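A minimal sketch of the bias check and a bucketed calibration table. Bucketing by predicted probability into equal-width buckets is an assumption of this sketch; the function names are hypothetical.

```python
def overall_bias(labels, predictions):
    # Average prediction minus average observation; near 0 when unbiased.
    return sum(predictions) / len(predictions) - sum(labels) / len(labels)


def bucketed_bias(labels, predictions, num_buckets=10):
    # Group examples by predicted probability, then compare the mean
    # prediction with the mean label inside each non-empty bucket.
    buckets = [[] for _ in range(num_buckets)]
    for y, p in zip(labels, predictions):
        index = min(int(p * num_buckets), num_buckets - 1)
        buckets[index].append((y, p))
    rows = []
    for index, items in enumerate(buckets):
        if not items:
            continue
        mean_prediction = sum(p for _, p in items) / len(items)
        mean_label = sum(y for y, _ in items) / len(items)
        rows.append((index, mean_prediction, mean_label,
                     mean_prediction - mean_label))
    return rows


labels = [0, 0, 1, 1]                      # toy data
predictions = [0.1, 0.1, 0.9, 0.9]
print(overall_bias(labels, predictions))
for row in bucketed_bias(labels, predictions, num_buckets=2):
    print(row)
```

Plotting mean prediction against mean label per bucket gives the calibration plot; buckets far from the diagonal mark slices worth investigating.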
Feature crosses help linear models work in nonlinear problems
But there tends to be a limit…
Combine features as an alternative to feature crossing
Structure the model so that features are combined; then the combinations may be combined
How to choose the combinations? Get the model to learn them
A Linear Model can be represented as nodes and edges
Adding a Non-Linearity
Our favorite non-linearity is the Rectified Linear Unit (ReLU)
There are many different ReLU variants
Neural Nets can be arbitrarily complex
Hidden layer - Training done via BackProp algorithm: gradient descent in very non-convex space
To increase hidden dimension, I can add neurons
To increase function composition, I can add layers
To predict multiple labels per example, I can add outputs
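The sketch below shows a tiny feed-forward network as nodes and edges, with a ReLU between two linear layers. The weights are hand-picked (not learned) so that the network computes |x1 - x2|, something a purely linear model cannot represent. All names are illustrative.

```python
def relu(x):
    # Rectified Linear Unit: passes positive values, zeroes out the rest.
    return max(0.0, x)


def dense(inputs, weights, biases, activation=None):
    # One fully connected layer: each row of weights feeds one neuron.
    outputs = []
    for row, b in zip(weights, biases):
        z = b + sum(w * x for w, x in zip(row, inputs))
        outputs.append(activation(z) if activation else z)
    return outputs


def forward(inputs, layers):
    # Function composition: the output of each layer feeds the next.
    values = inputs
    for weights, biases, activation in layers:
        values = dense(values, weights, biases, activation)
    return values


# Hand-picked weights: hidden neurons compute relu(x1 - x2) and relu(x2 - x1).
layers = [
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0], relu),  # hidden layer, 2 neurons
    ([[1.0, 1.0]], [0.0], None),                     # linear output, 1 node
]
print(forward([3.0, 1.0], layers))
print(forward([1.0, 3.0], layers))
```

Adding rows to a weight matrix adds neurons, adding tuples to layers adds depth, and adding rows to the last layer adds outputs.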
Training Neural Networks
DNNRegressor usage is similar to LinearRegressor
Use adaptive optimizers, e.g. Adagrad (the default) or Adam, which also uses momentum.
Specify number of hidden nodes.
Optionally, can also regularize using dropout
Three common failure modes for gradient descent
There are benefits if feature values are small numbers
Roughly zero-centered, [-1, 1] range often works well
Small magnitudes help gradient descent converge and avoid the NaN trap
Avoiding outlier values helps with generalization
We can use standard methods to make feature values scale to small numbers
Hard cap (clipping) to max, min
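A minimal sketch of two such methods: linear (min-max) scaling to [-1, 1] and hard clipping. Min-max scaling is one common choice assumed here; the function names are hypothetical.

```python
def min_max_scale(values, low=-1.0, high=1.0):
    # Linearly map the observed [min, max] range onto [low, high].
    smallest, largest = min(values), max(values)
    if largest == smallest:
        # A constant feature carries no information; map it to the midpoint.
        return [(low + high) / 2.0 for _ in values]
    span = largest - smallest
    return [low + (v - smallest) * (high - low) / span for v in values]


def clip(values, minimum, maximum):
    # Hard cap: anything outside [minimum, maximum] is pulled to the edge.
    return [min(max(v, minimum), maximum) for v in values]


print(min_max_scale([0.0, 5.0, 10.0]))
print(clip([-5.0, 0.5, 99.0], -1.0, 1.0))
```

In practice the scaling constants should be computed on the training data and reused unchanged at evaluation and serving time.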
Dropout layers are a form of regularization
Dropout simulates ensemble learning
Typical values for dropout are between 20 and 50 percent
The more dropout, the stronger the regularization
0.0 = no dropout regularization
Intermediate values are more useful; a value of dropout=0.2 is typical
1.0 = drop everything out! learns nothing
Dropout acts as another form of Regularization. It forces data to flow down multiple paths so that there is a more even spread. It also simulates ensemble learning. Don’t forget to scale the dropout activations by the inverse of the keep probability. We remove dropout during inference.
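A minimal sketch of "inverted" dropout in plain Python, showing the scaling by the inverse of the keep probability and the no-op at inference. The function signature is an assumption for illustration, not a library API.

```python
import random


def dropout(activations, drop_rate, training, rng=random):
    # drop_rate is the fraction of activations to zero out.
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    if not training or drop_rate == 0.0:
        # Inference: dropout is removed, activations pass through unchanged.
        return list(activations)
    keep_prob = 1.0 - drop_rate
    # Kept activations are scaled by 1 / keep_prob so the expected value
    # of each activation is the same in training and at inference.
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)  # seeded only to make the sketch repeatable
print(dropout([1.0, 2.0, 3.0, 4.0], drop_rate=0.2, training=True, rng=rng))
print(dropout([1.0, 2.0, 3.0, 4.0], drop_rate=0.2, training=False))
```

Because a different random subset of units is dropped on every step, each step trains a slightly different sub-network, which is where the ensemble analogy comes from.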
Multi-class Neural Networks
Logistic regression provides useful probabilities for binary-class problems
There are lots of multi-class problems
How do we extend the logits idea to multi-class classifiers?
Idea: Use separate output nodes for each possible class
Add an additional constraint: the outputs must sum to 1.0
Use one softmax loss for all possible classes
Use softmax only when classes are mutually exclusive
“Multi-Class, Single-Label Classification”
An example may be a member of only one class.
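A minimal sketch of softmax over one output node per class, with the matching cross-entropy loss for a single true class. This is plain Python for illustration, not the TensorFlow implementation; the function names are hypothetical.

```python
import math


def softmax(logits):
    # Subtracting the max logit keeps exp() from overflowing and does
    # not change the result.
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_cross_entropy(logits, true_class):
    # Loss for single-label classification: negative log probability
    # assigned to the one correct class (given as an index).
    return -math.log(softmax(logits)[true_class])


logits = [2.0, 1.0, 0.1]                   # toy logits for three classes
probabilities = softmax(logits)
print(probabilities, sum(probabilities))
print(softmax_cross_entropy(logits, true_class=0))
```

Because the probabilities are forced to sum to 1.0, raising one class necessarily lowers the others, which is only appropriate when each example belongs to exactly one class.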
Are there multi-class settings where examples may belong to more than one class?
If you have hundreds or thousands of classes, loss computation can become a significant bottleneck
Need to evaluate every output node for every example
Approximate versions of softmax exist
Candidate Sampling calculates for all the positive labels, but only for a random sample of negatives: tf.nn.sampled_softmax_loss
Noise-contrastive estimation approximates the denominator of softmax by modeling the distribution of outputs: tf.nn.nce_loss
For our classification output, if we have both mutually exclusive labels and probabilities, we should use tf.nn.softmax_cross_entropy_with_logits_v2.
If the labels are mutually exclusive, but the probabilities aren’t, we should use tf.nn.sparse_softmax_cross_entropy_with_logits.
If our labels aren’t mutually exclusive, we should use tf.nn.sigmoid_cross_entropy_with_logits.
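To illustrate the last case, here is a plain-Python sketch of per-class sigmoid cross-entropy for labels that are not mutually exclusive. It is not the TensorFlow implementation; it uses the numerically stable form max(x, 0) - x * z + log(1 + exp(-|x|)), and the function name is hypothetical.

```python
import math


def sigmoid_cross_entropy(logits, labels):
    # One independent binary loss per class, so several classes can be
    # "on" for the same example. Labels are 0.0 or 1.0 per class.
    losses = []
    for x, z in zip(logits, labels):
        loss = max(x, 0.0) - x * z + math.log1p(math.exp(-abs(x)))
        losses.append(loss)
    return losses


# Toy multi-label example: classes 0 and 2 both apply.
logits = [3.0, -2.0, 1.5]
labels = [1.0, 0.0, 1.0]
print(sigmoid_cross_entropy(logits, labels))
```

Unlike softmax, each class is scored independently here, so the per-class probabilities do not have to sum to 1.0.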