Coins

Probability of {{ kkGeometricFailures }} tail(s) before a head: {{ geometricPMF(pp, kkGeometricFailures) | percent4 }}


Probability of needing {{ rrNegBinFailures + kkNegBinSuccesses }} coin tosses to get {{ kkNegBinSuccesses }} heads: {{ negativeBinomialPMF(pp,rrNegBinFailures,kkNegBinSuccesses) | percent4 }}



Probability of {{ kkBinomialSuccesses }} successes in {{ nnBinomialTrials }} coin tosses: {{ binomialPMF(pp, nnBinomialTrials, kkBinomialSuccesses) | percent4 }}.



Probability of drawing r red objects (without replacement) from a mixed bag of red and green objects: {{ hypergeometricPMF(pp, nnHyperGeometricTotal, mmHyperGeometricSuccesses, nnHyperGeometricDraws) | percent4 }}.




Math
To be completed soon...

A fair coin has odds of 1:1, or a probability of ½, but a biased coin could have any probability from 0 to 1.

The Bernoulli probability distribution

I think of this as the coin toss distribution. It models the probability of outcomes that are binary, such as yes-no or true-false questions.

The random variable \(X \in \{0,1\}\). $$P(X=1) = p \text{ and } P(X=0) = q = (1-p)$$
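
For reference (not stated in the original), the mean and variance follow directly from the definition: $$E[X] = 1\cdot p + 0\cdot q = p$$ $$Var(X) = E[X^2] - E[X]^2 = p - p^2 = pq$$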

Binomial probability distribution


About

The Binomial distribution models the number of successes k in a given number of trials n.

The trials are independent with a constant probability of success, i.e. like sampling with replacement ...

$$P(k;n,p) = C(n,k)p^kq^{n-k}$$ $$q = (1-p)$$ Where the number of combinations is given by: $$C(n,k) = {}_n \mathrm{ C }_k = \frac{n!}{k!(n-k)!} \text{ where } k\le n $$ $$(1+X)^n= \sum_{k\ge 0} C(n,k) X^k$$ Number of permutations: $$P(n,k) = {}_n \mathrm{ P }_k= \frac{n!}{(n-k)!}$$
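
A minimal sketch of what a binomialPMF(p, n, k) like the one used in the interactive examples above might look like; the combinations helper is my own illustration, not taken from the page's code, and is reused in the negative binomial sketch below.

    // C(n, k) computed multiplicatively to avoid large factorials
    function combinations(n: number, k: number): number {
      let c = 1;
      for (let i = 1; i <= k; i++) {
        c *= (n - k + i) / i;
      }
      return c;
    }

    // P(k; n, p) = C(n, k) p^k (1-p)^(n-k)
    function binomialPMF(p: number, n: number, k: number): number {
      return combinations(n, k) * Math.pow(p, k) * Math.pow(1 - p, n - k);
    }

    // e.g. probability of 3 heads in 5 tosses of a fair coin: 0.3125
    console.log(binomialPMF(0.5, 5, 3));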

Negative binomial probability distribution


About

The negative binomial (or Pascal) distribution models the number \(n\) of repeated trials needed to produce \(r\) successes. Examples include flipping a coin until we have a certain number of heads, or a sales quota where one has to keep selling until a certain number of successful sales are made.

The probability of success in any individual trial is given by \(p\).

$$P(n;r,p) = C(n-1,r-1)p^rq^{n-r}$$ Where the number of combinations is given by: $$C(n,k) = {}_n \mathrm{ C }_k = \frac{n!}{k!(n-k)!} \text{ where } k\le n $$ $$(1+X)^n= \sum_{k\ge 0} C(n,k) X^k$$ Number of permutations: $$P(n,k) = {}_n \mathrm{ P }_k= \frac{n!}{(n-k)!}$$
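
A minimal sketch, assuming a negativeBinomialPMF(p, failures, successes) with the same argument order as the call in the coin example near the top of the page, and reusing the combinations helper sketched under the binomial distribution above.

    // Probability of seeing `failures` tails before the `successes`-th head,
    // i.e. the last of the failures + successes tosses is the final head.
    function negativeBinomialPMF(p: number, failures: number, successes: number): number {
      const n = failures + successes; // total number of trials
      // P(n; r, p) = C(n-1, r-1) p^r q^(n-r) with r = successes
      return combinations(n - 1, successes - 1)
        * Math.pow(p, successes) * Math.pow(1 - p, failures);
    }

    // e.g. fair coin, 2 tails before the 2nd head (4 tosses): C(3,1) * 0.5^4 = 0.1875
    console.log(negativeBinomialPMF(0.5, 2, 2));
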
  • TODO why the name negative binomial, to do with negative coefficients.
  • TODO average length of stay.
  • Geometric Probability Distribution

    A special case of the negative binomial distribution where the number of successes \(r = 1\). Given a series of trials where each trial can only succeed or fail, with the same success probability for each trial, the geometric probability distribution gives the probability of the number of failures before the first success.

    Consider a sequence of coin tosses where the coin might be biased (so long as it is consistently biased). If \(p\) is the probability of success and \(q=(1-p)\) is the probability of failure, then the probability of \(k\) failures before the first success is: $$P(Y = k) = q^kp \text{ where } k \in \mathbb{N}_{\ge0}$$ This can be seen as a sequence of Bernoulli trials.
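
    A minimal sketch of what a geometricPMF(p, k) like the one used in the coin examples at the top of the page might look like (the implementation here is my own illustration):

        // P(Y = k) = (1-p)^k * p : k failures, each with probability (1-p), then one success
        function geometricPMF(p: number, k: number): number {
          return Math.pow(1 - p, k) * p;
        }

        // e.g. exactly 2 tails before the first head with a fair coin: 0.125
        console.log(geometricPMF(0.5, 2));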

    PGF of the Geometric Probability Distribution

    The definition of the probability generating function (PGF) of a discrete random variable is: $$G(z) = \sum_{x\ge 0} p_X(x)z^x$$ Definition of the geometric distribution: $$p_X(k) = pq^k \text{ where } q=(1-p), k \in \mathbb{N}_{\ge0}$$ $$G(z) = \sum_{k\ge 0} pq^kz^k = p\sum_{k\ge 0} q^kz^k = p\frac{1}{1-qz}$$ Since, by the sum of the infinite geometric series: $$ \frac{1}{1-v} = \sum_{k\ge 0} v^k \text{ for } |v|\lt1$$
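
    As an aside (this step is not in the original), differentiating the PGF and evaluating at \(z=1\) gives the expected number of failures before the first success: $$G'(z) = \frac{pq}{(1-qz)^2} \implies E[Y] = G'(1) = \frac{pq}{(1-q)^2} = \frac{q}{p}$$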

    Prosecutor's fallacy and Bayes' theorem

    If a defendant has the same looks as the perpetrator and only 1% of the population have those looks, then the prosecutor would say the likelihood of guilt is 99%.

    This is not true at all ...

    If, say, 10,000 people in a city of a million have those looks, then there are 9,999 other suspects. In this case the incriminating evidence is a person's looks.

    P(guilty) = 1/1,000,000 = .000001
    P(looks) = 10,000/1,000,000 = 1%
    P(looks|guilty) = 100%
    P(guilty|looks) = P(looks|guilty) * P(guilty)/P(looks) = 100% * .000001 / 1% = {{ (.000001/.01) | percent2 }}
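
    The same calculation as a small sketch (the variable names are illustrative, not the page's own):

        const pGuilty = 1 / 1_000_000;      // prior: one perpetrator in a city of a million
        const pLooks = 10_000 / 1_000_000;  // 1% of the population share the looks
        const pLooksGivenGuilty = 1.0;      // the perpetrator certainly has those looks

        // Bayes' theorem: P(guilty|looks) = P(looks|guilty) * P(guilty) / P(looks)
        const pGuiltyGivenLooks = pLooksGivenGuilty * pGuilty / pLooks;
        console.log(pGuiltyGivenLooks);     // 0.0001, i.e. 0.01%, not 99%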

    This is the prosecutor's fallacy, which involves mixing up the probability that someone is guilty given the incriminating evidence with the probability of the incriminating evidence given that someone is guilty. This has really happened, for example in the Sally Clark case.

    P(E|G) is the probability that damning evidence would be observed if the accused were guilty, a true positive. P(G|E) is the probability of guilt given damning evidence.

    Use the following to show that P(E|G) can be quite different from P(G|E). Note that P(G AND E) is never greater than either P(E) or P(G).

    P(E|G) {{ percent(pEvidenceGivenGuilt) }} $$P(E|G) = \frac{P(G\ \mathit{and}\ E)}{P(G)}$$
    P(G|E) {{ percent(pGuiltGivenEvidence) }} $$P(G|E) = \frac{P(G\ \mathit{and}\ E)}{P(E)}$$
    P(G AND E) {{ percent(pEvidenceAndGuilt) }}
    P(E) {{ percent(pEvidence) }}
    P(G) {{ percent(pGuilt) }}

    Math

    $$P(\text{accusation}|\text{evidence}) \ne P(\text{evidence}|\text{accusation})$$

    Bayes: $$P(A|B) = \frac{P(A\ and\ B)}{P(B)}$$ $$P(B|A) = \frac{P(A\ and\ B)}{P(A)}$$

    Divide one by the other: $$\frac{P(A|B)}{P(B|A)} = \frac{P(A)}{P(B)}$$

    So $$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$ $$P(B|A) = \frac{P(A|B)P(B)}{P(A)}$$

    Since conditional probabilities are never greater than 1, the definitions above also imply: $$P(A\ and\ B) \le P(A) \text{ and } P(A\ and\ B) \le P(B)$$

    The base rate is the unconditional probability; always ask: out of how many?

    Contingency table

    Rows give the predicted condition (the evidence); columns give the actual condition (guilty or not).
    Prevalence {{ prevelance | percent2 }}   Accuracy {{ accuracy | percent2 }}

    Predicted True (evidence present):
        Actually True:  True positives (TP) {{ tp }}
        Actually False: False positives (FP) {{ fp }}
        Row metrics:    PPV {{ ppv | percent2 }}   FDR {{ fdr | percent2 }}

    Predicted False (evidence absent):
        Actually True:  False negatives (FN) {{ fn }}
        Actually False: True negatives (TN) {{ tn }}
        Row metrics:    FOR {{ FOR | percent2 }}   NPV {{ npv | percent2 }}

    Column metrics (actually True):  TPR {{ tpr | percent2 }}   FNR {{ fnr | percent2 }}
    Column metrics (actually False): FPR {{ fpr | percent2 }}   TNR {{ tnr | percent2 }}
    Whole-table metrics: LR+ {{ lrp | round2 }}   LR- {{ lrn | round2 }}   DOR {{ dor | round2 }}   F1Score {{ f1Score | round2 }}
    Details
    ROC curve

    True positives and false positives

    In any test there will be errors. True positives (TP) are those that are true and test true, the convicted guilty; true negatives (TN) are those that are false and test false, the innocent acquitted; false positives (FP) are false but test true, the convicted innocents; and false negatives (FN) are true but test false, the guilty acquitted.

    A False Positive is known as a Type 1 error and a False Negative is known as a Type 2 error.

    True Positive Rate (TPR) = TP/P where P=FN+TP

    False Positive Rate (FPR) = FP/N where N=FP+TN, α type 1 error

    True Negative Rate (TNR) = TN/N

    False Negative Rate (FNR) = FN/P, β type 2 error

    Accuracy and prevalence

    Accuracy = (TP+TN)/total where total=TP+TN+FP+FN

    Prevalence = (TP+FN)/total where total=TP+TN+FP+FN

    Specificity and sensitivity

    Specificity is TNR and Sensitivity is TPR ...

    Sensitivity is the percentage of positives correctly identified, e.g. the proportion of sick people correctly identified as such. Specificity is the percentage of negatives correctly identified, e.g. the proportion of healthy people correctly identified as such.

    A more sensitive test will have fewer Type 2 errors (missed positives), usually at the cost of more Type 1 errors ...

    Positive likelihood ratio (LR+) = TPR/FPR ... = Sensitivity/(1-Specificity)

    Negative likelihood ratio (LR-) = FNR/TNR ... = (1- Sensitivity)/Specificity

    Diagnostic odds ratio (DOR) = LR+/LR- ...

    Relevance: Precision and Recall

    Precision or Positive Predictive Value (PPV) = TP/(TP+FP)

    Precision is PPV and recall is TPR ...

    F1 score = 2 * precision*recall / (precision+recall)

    In information retrieval and web search, precision is the percentage of relevant documents among all those retrieved, and recall is the percentage of all relevant documents that were retrieved.

    ...

    Negative Predictive Value (NPV) = TN/(TN+FN).

    Complements of PPV and NPV are: False omission rate (FOR) = 1-NPV and False discovery rate (FDR) = 1-PPV.
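
    A minimal sketch of how all of the above metrics can be derived from the four counts in the contingency table (the function and field names are illustrative, not the page's own):

        function confusionMetrics(tp: number, fp: number, fn: number, tn: number) {
          const total = tp + fp + fn + tn;
          const tpr = tp / (tp + fn);   // sensitivity, recall
          const fpr = fp / (fp + tn);
          const tnr = tn / (fp + tn);   // specificity
          const fnr = fn / (tp + fn);
          const ppv = tp / (tp + fp);   // precision
          const npv = tn / (tn + fn);
          const lrPlus = tpr / fpr;
          const lrMinus = fnr / tnr;
          return {
            accuracy: (tp + tn) / total,
            prevalence: (tp + fn) / total,
            tpr, fpr, tnr, fnr, ppv, npv,
            fdr: 1 - ppv,
            falseOmissionRate: 1 - npv,
            lrPlus, lrMinus,
            dor: lrPlus / lrMinus,
            f1: 2 * ppv * tpr / (ppv + tpr),
          };
        }

        console.log(confusionMetrics(90, 10, 5, 895));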

    Diagnostic odds ratio (DOR)

    DOR ...

    Probability

    A random experiment is an activity with observable outcomes. A sample space is the set of all possible outcomes. An event is a subset of the sample space and each event has a probability.

    An event, E, is a subset of the sample space, Ω, which is the set of all possible outcomes of a random experiment.

    So an event E is a set of outcomes, and the event E occurs if the outcome of the random experiment is a member of E. $$outcome \in E \implies Event\ E\ Occurred$$

    Probability is always between 0 and 1:

    $$P(E) = \frac{n(E)}{n(\Omega)} = \frac{number\ of\ elements\ in\ E}{number\ of\ elements\ in\ \Omega}$$

    The complement of the set E: $$E^c = \Omega-E$$ $$P(E^c) = 1-P(E)$$

    De Morgan's Law: $$(A \cap B)^c = A^c \cup B^c$$

    The number of elements in the union of A and B is the number in A plus those in B minus the number in both: $$n(A\cup B) = n(A) + n(B) - n(A\cap B)$$ $$\therefore P(A\cup B) = P(A) + P(B) - P(A\cap B)$$ $$or\ P(A\ and\ B) = P(A) + P(B) - P(A\ or\ B)$$
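
    As a quick illustration (not in the original): drawing one card from a standard 52-card deck, $$P(heart\ or\ king) = \frac{13}{52} + \frac{4}{52} - \frac{1}{52} = \frac{16}{52} \approx 31\%$$ since the king of hearts would otherwise be counted twice.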

    ...

    Probability distributions

    A probability distribution is a function giving the probabilities of different events.

    A discrete probability distribution is defined by its probability mass function (PMF) while that of a continuous probability distribution is defined by its probability density function (PDF). Given a value, a PMF will return the probability of that value occurring.

    The PDF does not actually give a probability since a single value is not an event ... If you drop a dart onto a dart board, there might be 100% probability that the dart will hit the board, but effectively zero probability it will hit any particular spot ...

    The values of a random variable depend on a random phenomenon according to a probability distribution.

    Typically there are continuous random variables and discrete random variables. The difference between discrete and continuous can be seen as the difference between countable and non-countable. The English language makes the distinction between many and much, as in many stones but much water. Discrete random variables might take integer values and continuous random variables might take real values.

    The cumulative distribution function (CDF), aka distribution function, is the sum or integral of the probabilities over a range of possible values. The sum or integral over all values must of course add up to 100%, so the CDF will always return a value between 0 and 1.

    For a continuous random variable the CDF is: $$ F_X(x) = \int_{-\infty}^x f_X(t) dt $$

    A continuous probability distribution is also completely defined by its characteristic function, which is the Fourier transform of the PDF.

    A continuous probability distribution may also be defined by its moment-generating function, which is the Laplace transform of the PDF; this may not always exist (complex numbers ...). The mean is the first moment, the variance the second central moment, skewness the third standardized moment, kurtosis the fourth ...

    A discrete probability distribution may also be defined by its probability generating function (PGF), which is the z-transform of its PMF. Think of the z-transform as a discrete version of the Laplace transform.

    Odds

    Odds are another way to state a probability: for instance, odds of 1:1 is a probability of 50% and odds of 2:1 is a probability of ~67%: $$odds = \frac{p}{1-p}$$ $$(1-p)odds = p$$ $$odds - p\,odds = p$$ $$odds = p + p\,odds$$ $$odds = p(1+ odds)$$ $$\therefore p = \frac{odds}{1 + odds}$$
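
    The two conversions above as a small sketch (the function names are illustrative):

        function probabilityToOdds(p: number): number {
          return p / (1 - p);
        }

        function oddsToProbability(odds: number): number {
          return odds / (1 + odds);
        }

        console.log(probabilityToOdds(0.5)); // 1, i.e. odds of 1:1
        console.log(oddsToProbability(2));   // 0.666..., i.e. odds of 2:1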

    Odds ratio
    Likelihood

    The likelihood function is the probability of the evidence given the parameter, \(P(X|\theta)\), whereas the posterior probability is the probability of the parameter given the evidence, \(P(\theta|X)\). The prior probability distribution of the unknown parameter, \(P(\theta)\), is a belief about the parameter before seeing evidence. These are related by Bayes' theorem:

    $$ P(\theta|x) = P(x|\theta) \frac{P(\theta)}{P(x)}$$ ...

    The likelihood ratio involves a hypothesis test of whether the parameter θ is one of two values. Dividing the two likelihoods, the unknown \(P(x)\) plays no part: $$\Lambda(x) = \frac{P(x|\theta_0)}{P(x|\theta_1)}$$

    The probability that an event occurs is the fraction of times one expects to see the event over many trials.

    With probability we start by assuming models and parameters and infer outcomes. With statistics we start with outcomes and infer parameters and models. Statistics is rather like probability backwards. ...

    Traditional statistics might involve an expensive survey generating relatively small amounts of data, with collection errors, about very heterogeneous items like people. Machine learning involves massively larger volumes of data, with few collection errors, about very homogeneous items. ...

    Unsupervised learning is like statistical inference: one tries to infer a model from data. Supervised learning is more like regression, fitting to known data in order to predict new events. ...

    Sum of Geometric Series

    Sequence and series

    A sequence is an ordered list of numbers: 2, 4, 6, 8, ... and a series is a sum of the terms of a sequence: 2+4+6+8+... .

    Geometric Series

    A geometric series with common ratio \(\frac{1}{2}\): $$\frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \frac{1}{16} + ...$$

    With common ratio v: $$a + av + av^2 + av^3 + ...$$

    Sum of finite Geometric Series

    $$s_N = \sum_{n=0}^N v^n = \frac{1-v^{N+1}}{1-v}$$
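
    A one-line justification (not spelled out in the original): multiplying the partial sum by \(v\) and subtracting shifts the terms so that almost everything cancels, $$s_N - v s_N = (1 + v + \dots + v^N) - (v + v^2 + \dots + v^{N+1}) = 1 - v^{N+1}$$ and dividing by \(1-v\) gives the formula above.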

    Sum of infinite Geometric Series

    If \(|v|\lt1\) then \(\lim_{N\to \infty} v^{N+1} = 0\), so: $$s = \sum_{n\ge 0} v^n = \frac{1}{1-v}$$

    Gamma PDF

    Gamma function

    The factorial function says how many permutations there are of n distinct objects. If you have n books there are n! = n×(n-1)×(n-2)×...×1 ways to arrange them on a shelf. For instance the letters a, b and c can be arranged in 3! = 3×2×1 = 6 different ways: abc, acb, bac, bca, cab, cba. The Gamma function extends or interpolates the factorial function to real (and complex) numbers.
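
    For reference (the original does not give the formula), the Gamma function is defined by the integral $$\Gamma(x) = \int_0^\infty t^{x-1}e^{-t}\,dt$$ and satisfies \(\Gamma(n) = (n-1)!\) for positive integers \(n\).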

    The Gamma probability distribution is used to model variables that are positive, continuous and have skewed distributions. If the shape parameter is greater than one the distribution is mounded and skewed; if it is less than one there is no mound and the density is asymptotic to both axes. ...

    Linear Regression

    A linear equation: $$y = mx+c$$

    An observation at a point \((x_i, y_i)\) lies above, below or on the line, offset from it by a vertical distance (the error or residual) \(e_i\): $$y_i = mx_i+c + e_i$$

    Rearrange so that \(e_i\) is on the left hand side (LHS): $$e_i = y_i - mx_i -c $$

    Add up the squares of the errors over all 'n' observations (Σ means sum): $$\sum e_i^2 = \sum (y_i - mx_i - c)^2 $$

    Let \(E= \sum e_i^2\); this is the sum of squared errors or residuals. If we just added up the errors without squaring, some would be positive and some negative, and they would cancel out.

    We want values of \(m\) and \(c\) that minimize \(E\), so we take partial derivatives of E and set them to zero (the minimum occurs where the derivatives are zero): $$\frac{\partial E}{\partial m} = -2\sum(y_i - mx_i - c)x_i = 0$$ $$\frac{\partial E}{\partial c} = -2\sum(y_i - mx_i - c) = 0$$

    Dropping the factor of -2 and rearranging gives: $$\sum y_i x_i - m \sum x_i^2 - c \sum x_i = 0$$ $$\sum y_i - m \sum x_i - nc = 0$$

    This gives the normal equations for the least squares fit: $$\sum y_i x_i = m \sum x_i^2 + c \sum x_i$$ $$\sum y_i = m \sum x_i + nc$$
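
    A minimal sketch that solves the two normal equations for m and c (this is my own illustration, not the page's code):

        function leastSquaresFit(xs: number[], ys: number[]): { m: number; c: number } {
          const n = xs.length;
          let sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
          for (let i = 0; i < n; i++) {
            sumX += xs[i];
            sumY += ys[i];
            sumXY += xs[i] * ys[i];
            sumXX += xs[i] * xs[i];
          }
          // From { Σxy = mΣx² + cΣx, Σy = mΣx + cn }:
          const m = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
          const c = (sumY - m * sumX) / n;
          return { m, c };
        }

        // Points on y = 2x + 1 should recover m ≈ 2, c ≈ 1
        console.log(leastSquaresFit([0, 1, 2, 3], [1, 3, 5, 7]));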

    ...