Precision - TP / (TP + FP)

  • Ability to reliably reject non-relevant documents
  • Important when it is costly for a non-relevant item to be mistakenly accepted (e.g., job applications)

Recall - TP / (TP + FN)

  • Ability to reliably find all relevant documents
  • Important when it is costly for relevant items to be missed (safety-critical applications like self-driving or medicine)
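A minimal sketch of both metrics computed from raw binary predictions; the labels below are made up for illustration (1 = relevant):

```python
# Hypothetical ground-truth labels and model predictions (1 = relevant).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)  # of everything accepted, how much was relevant?
recall = tp / (tp + fn)     # of everything relevant, how much was found?
print(precision, recall)    # 0.75 0.75
```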

Harmonic Mean - for an arithmetic average to be valid, the quantities need the same scaled units (denominators); precision and recall have different denominators, so they are averaged harmonically

  • F1 score - reciprocal of the arithmetic mean of the reciprocals of precision and recall

Precision-Recall (PR) Curve - plot precision against recall at all top-K values
  • Average Precision (AP) - area under the PR curve
  • mean Average Precision (mAP) - AP averaged across dataset
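A quick sketch of F1 (the harmonic mean of precision and recall) and of AP, using the common "mean precision at each relevant rank" form of AP over a made-up ranked result list:

```python
# F1 as the harmonic mean of a hypothetical precision/recall pair.
precision, recall = 0.75, 0.5
f1 = 2 * precision * recall / (precision + recall)  # = 0.6

# AP over a hypothetical top-5 ranking (1 = relevant, in rank order).
ranked = [1, 0, 1, 1, 0]
precisions_at_hits = []
hits = 0
for k, rel in enumerate(ranked, start=1):
    if rel:
        hits += 1
        precisions_at_hits.append(hits / k)  # precision@k at each relevant hit
ap = sum(precisions_at_hits) / hits

print(f1, ap)
```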

Shannon Information I(X=x) - intuitively, the level of “surprise” of a probabilistic realization

  • Defined as the negative log of the probability: I(X=x) = -log P(X=x)
  • The higher the likelihood of an outcome, the lower its Shannon information

Shannon Entropy H(X) - expected value of the Shannon information

  • If H(X) = 0, the variable always takes the same value, so the data encodes no information
  • If H(X) is moderate, there are underlying patterns that allow information to be learned
  • If H(X) is high, there are many possible values; many bits are needed to capture the data
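Both quantities are short to compute directly; the distributions below are made-up examples covering the three regimes:

```python
import math

def information(prob):
    # Bits of "surprise" for a single outcome with this probability.
    return -math.log2(prob)

def entropy(p):
    # Expected information over a discrete distribution (0 * log 0 := 0).
    return sum(prob * information(prob) for prob in p if prob > 0)

uniform = [0.25, 0.25, 0.25, 0.25]   # maximally uncertain over 4 symbols
skewed  = [0.97, 0.01, 0.01, 0.01]   # strong pattern, little surprise
certain = [1.0, 0.0, 0.0, 0.0]       # always the same value

print(entropy(uniform))  # 2.0 -- need 2 bits per symbol
print(entropy(skewed))   # low: an underlying pattern to learn
print(entropy(certain))  # 0 bits: no information encoded
```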

Cross Entropy H(P,Q) - intuitively, average number of total bits required to encode data coming from a “true” distribution P when we use model Q

  • Cross entropy is minimized when P = Q, where it equals the entropy H(P)
  • Same as negative log-likelihood, just interpreted differently - also known as log loss
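A small sketch with made-up distributions, showing that H(P, Q) bottoms out at H(P) when the model matches the true distribution:

```python
import math

def cross_entropy(p, q):
    # Average bits to encode data from P using a code optimized for Q.
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p      = [0.5, 0.25, 0.25]   # hypothetical "true" distribution
q_good = [0.5, 0.25, 0.25]   # model matches P exactly
q_bad  = [0.1, 0.1, 0.8]     # model mismatch

print(cross_entropy(p, q_good))  # 1.5 -- equals H(P), the minimum
print(cross_entropy(p, q_bad))   # larger: extra bits from the mismatch
```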

KL Divergence D(P||Q) - average number of EXTRA bits required to encode data from P when we use Q

  • Characterizes how much more likely x is drawn from P vs Q
  • Essentially, cross entropy is KL divergence plus the entropy of P: H(P,Q) = D(P||Q) + H(P)
  • If P is fixed, then minimizing cross entropy is the same as minimizing KL divergence
  • KL has a forward and reverse interpretation
    • Forward KL - mode-covering behavior
    • Reverse KL - mode-seeking behavior
  • Knowledge distillation - training a student to replicate the behavior of a teacher - use KL divergence instead of cross entropy, since the teacher distribution's entropy is non-zero
    • Non-zero entropy means “optimal” loss is non-zero and constantly fluctuates depending on batch
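The identity H(P,Q) = D(P||Q) + H(P) can be checked numerically with made-up distributions:

```python
import math

def kl(p, q):
    # Extra bits paid for encoding P with a code optimized for Q.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

print(kl(p, q))  # 0.25 extra bits
# The identity H(P,Q) = D(P||Q) + H(P) holds exactly:
assert abs(cross_entropy(p, q) - (kl(p, q) + entropy(p))) < 1e-12
```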

MSE Loss (Mean squared error) - typically used for regression while cross entropy usually used for classification

  • Possible to use MSE for classification, but MSE composed with softmax is not convex
  • MSE of an estimator decomposes into squared bias plus variance
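A Monte Carlo sketch of the decomposition, using a deliberately shrunk (and therefore biased) sample-mean estimator as a made-up example:

```python
import random

random.seed(0)
mu, n, trials = 2.0, 10, 20000  # hypothetical true mean, sample size, repetitions

# Repeatedly estimate mu with a shrunk sample mean (biased toward 0).
estimates = []
for _ in range(trials):
    sample = [random.gauss(mu, 1.0) for _ in range(n)]
    estimates.append(0.9 * sum(sample) / n)

mean_est = sum(estimates) / trials
bias = mean_est - mu                                         # ~ -0.2 here
variance = sum((e - mean_est) ** 2 for e in estimates) / trials
mse = sum((e - mu) ** 2 for e in estimates) / trials

# The decomposition MSE = bias^2 + variance holds (up to float error):
print(mse, bias ** 2 + variance)
```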

Bias - the bias of a parameter estimator is the difference between its expected value and the underlying parameter

  • Underfitting if bias is high

Variance - how consistent the estimator is across different training datasets
  • Overfitting if variance is high

Double descent phenomenon - challenges the bias-variance tradeoff

  • An overly complex model that overfits might perform better than a well-parameterized model
  • Theory: allows model to find simpler solution that generalizes better, or is better able to fit and distinguish noise from signal

Curse of dimensionality - the larger the dimension, the more sparse the data space given a certain amount of data

  • Points become nearly equidistant from one another, which breaks distance-based methods
  • Amount of data needed to support high dimensionality grows exponentially

Blessing of Dimensionality - advantages to working with high dimensionality

  • A high dimension can make points separable when a lower dimension cannot
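The "equally spaced" effect can be seen directly by sampling random points and comparing the nearest and farthest pairwise distances as dimension grows:

```python
import random

random.seed(0)

def dist_ratio(dim, n_points=50):
    # Ratio of the smallest to the largest pairwise Euclidean distance
    # among random points in the unit hypercube.
    pts = [[random.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [
        sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
        for i, p in enumerate(pts) for q in pts[i + 1:]
    ]
    return min(dists) / max(dists)

print(dist_ratio(2))     # small: near and far points are clearly distinct
print(dist_ratio(1000))  # closer to 1: everything is roughly equally far
```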

Discriminative vs Generative Models

  • Discriminative models probability of Y given X, P(Y|X) - given feature X, what is the probability of it being in one class or the other?
  • Generative models learn the distribution of individual classes - the PDF of X for each class of Y, i.e. P(X|Y) - and classify via Bayes' rule
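A minimal generative-classifier sketch with made-up 1-D data: fit a Gaussian p(x|y) per class, then obtain P(Y|X) via Bayes' rule (a discriminative model would instead fit P(Y|X) directly):

```python
import math

# Hypothetical 1-D training data per class label.
data = {0: [1.0, 1.2, 0.8, 1.1], 1: [3.0, 3.2, 2.9, 3.1]}

# Fit the class-conditional Gaussians p(x | y).
params = {}
for y, xs in data.items():
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    params[y] = (mu, var)

def likelihood(x, y):
    mu, var = params[y]
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior(x):
    # Bayes' rule: P(y | x) proportional to p(x | y) * P(y).
    prior = {y: len(xs) for y, xs in data.items()}
    joint = {y: likelihood(x, y) * prior[y] for y in data}
    z = sum(joint.values())
    return {y: j / z for y, j in joint.items()}

print(posterior(1.0))  # class 0 dominates
print(posterior(3.0))  # class 1 dominates
```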