Random Variables: One Point Post

A variable is something that can vary: it represents a value that may change. A random variable is not really a variable in that sense. It is better pictured as a closed box that we can draw values from without ever being able to predict the next one. This is a critical distinction to grasp, as our lives are full of random variables that we need to reason about.

A random variable can only describe the specific statistical distribution or selection logic it represents. We can talk about the specific values it takes only within the framework of probability.

Variables

In Python you would declare a variable and change its value at any time in the program:

x: int = 100
# some processing
x = 200

You can also have complex variable types such as lists that represent a group of values which can also be manipulated freely:

y: list[int] = [1, 2, 3, 5, 7, 11]
# some processing
y.append(13)

Then you have variables that have a value that is decided at runtime based on data that is fed to the program.

x = f(a, b)  # Value of x depends on the values of a and b, and on the nature of f.

In maths you can define a variable with ease:

Let x = 2 and y = 4 therefore x + y = 6
# some other statements
x + y = 20 <-- can never happen unless I reset x or y or both.

When dealing with variables we can test them for consistency by reusing a variable in different operations and checking that it holds the same value each time. As in the maths example above, a variable must retain its value until it is changed.

Imagine the chaos if this were to happen in Python:

x: int = 100
print(x+1) #101
print(x+1) #42 <-- what?

This brings us to an important point around variables:

Variables are bound to values when involved in any kind of processing (e.g., mathematical operations like addition or computing operations like filtering). A variable's value cannot change mid-processing.

This is why programming languages like Rust are careful about variable mutability, and why many languages and runtimes will complain if a collection such as a list changes while it is being iterated over.
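As a small illustration of that last point, here is a minimal sketch in Python. It relies on the built-in dict, which raises a RuntimeError when its size changes during iteration; the names counts and try_mutating_during_iteration are made up for this example.

```python
def try_mutating_during_iteration() -> str:
    """Add a key to a dict while looping over it and report what Python does."""
    counts = {"a": 1, "b": 2}
    try:
        for key in counts:
            # Adding a key changes the dict's size while it is being iterated.
            counts[key + "_copy"] = counts[key]
    except RuntimeError as err:
        return f"RuntimeError: {err}"
    return "no error"


print(try_mutating_during_iteration())
```

Not every container is guarded this way: a Python list modified while it is being looped over raises nothing, so the check is a convenience rather than a guarantee.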

Random Variables

A random variable in maths would be written as:

X ~ N(0,1)
Where N(0,1) represents the Standard Normal Distribution with mean = 0 and variance = 1

Note there is no ‘=’ between the left and right hand side. The ‘~’ is read as: ‘distributed as’.

Here X is not a variable bound to a value, it is a random variable bound to a value generating engine (defined by N(0,1)).

Once you start materialising values from a random variable you are collecting ‘samples’. So visualise this as running the engine in a loop – each loop gives you one value sampled from the distribution being used by the engine.

As this sample (shown as x below) grows (more loops, more values), you can start doing things with it, such as calculating the sample mean.

y = mean(x)

The above is how we plug the value-generating engine into the space of variables bound to values. ‘y’ is another variable that represents the sample mean (one of the common sample statistics, another being the sample variance).
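The engine, the loop and the sample mean can be sketched in a few lines. This is only an illustration: it uses random.gauss from Python's standard library as a stand-in for the N(0,1) engine, and the names engine and s2 are assumptions introduced here.

```python
import random
from statistics import mean, variance


def engine() -> float:
    """One run of the value-generating engine: a single draw from N(0, 1)."""
    return random.gauss(0.0, 1.0)  # arguments are mean and standard deviation


random.seed(42)  # fixed seed only so that reruns give the same sample

# Run the engine in a loop; each pass materialises one sampled value.
x = [engine() for _ in range(10_000)]

# x is now an ordinary list bound to values, so ordinary operations apply.
y = mean(x)       # sample mean
s2 = variance(x)  # sample variance (divides by n - 1)

print("sample mean:", round(y, 4))
print("sample variance:", round(s2, 4))
```

Once the loop has finished, x and y behave like any other variables: printing y twice gives the same number twice. The randomness lives in the engine, not in the values already drawn.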

We can still reason about the engine. We are not limited to sampling values and just working with them. For example, the following is a perfectly reasonable assertion:

E[X] = 0 where X ~ N(0,1)

In the above, ‘E’ is the Expected Value of the random variable: the probability-weighted average of all the values it can take. For X ~ N(0,1) this is 0, which is also the point about which the distribution is symmetric.
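For readers who want the definition behind that assertion, the expected value of a continuous random variable is the density-weighted average of its values. A sketch for the standard normal, where f denotes its probability density (a symbol introduced here, not used elsewhere in this post):

```latex
E[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx,
\qquad
f(x) = \frac{1}{\sqrt{2\pi}} \, e^{-x^{2}/2}
```

Because f(-x) = f(x), the contribution of every x to the integral is cancelled by the contribution of -x, which leaves E[X] = 0.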

We can relate sample statistics back to the engine, which gives evidence (though never proof) that the sample came from the given engine.

y = mean(x) where x is a sample of n values collected from X.
Therefore as n -> infinity, y -> E[X] = 0 where X ~ N(0,1)

The above snippet is also known as the Law of Large Numbers. As your sample size grows, your sample mean converges to the Expected Value of the distribution.

Code

What is life without code… the small snippet below brings the above to life.

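from scipy.stats import norm  # normal distribution engine

# the largest sample holds 100 million floats, roughly 800 MB of memory
for i in [100, 10000, 1000000, 100000000]:
    # sample generator for normal distribution engine, note N(0,1) in rvs below
    # (rvs takes the mean and the standard deviation, both as for N(0,1) here)
    x = norm.rvs(0, 1, size=i)
    print("Sample size:", i, "\t\tSample Mean:", round(x.mean(), 4))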

Output:

Sample size: 100 Sample Mean: -0.0772
Sample size: 10000 Sample Mean: 0.0046
Sample size: 1000000 Sample Mean: 0.0002
Sample size: 100000000 Sample Mean: -0.0001

Note the convergence to 0.0 in the above as the sample size increases.
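How quickly should the sample mean settle? A standard result, not derived in this post, is that the mean of n independent draws with standard deviation sigma itself has standard deviation sigma / sqrt(n). The sketch below simply evaluates that for the sample sizes used above, with sigma = 1; the function name standard_error is an assumption made for illustration.

```python
import math


def standard_error(sigma: float, n: int) -> float:
    """Typical distance between the sample mean and E[X] for n independent draws."""
    return sigma / math.sqrt(n)


for n in [100, 10_000, 1_000_000, 100_000_000]:
    print("Sample size:", n, "\t\tTypical deviation:", standard_error(1.0, n))
```

These come to 0.1, 0.01, 0.001 and 0.0001, the same order of magnitude as the sample means in the output above.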

Play

I will leave you with the following question:

Suppose that in the above code we changed X ~ N(0,1) to X ~ N(1,1), rewriting the norm.rvs line as:

x = norm.rvs(1,1, size=i)

What value will the sample mean converge to? Try to answer without running the code and then cross-check. The question to ask: given that the normal distribution is symmetric about a point, what is that point for the above?

Attempt to use other distributions in the scipy.stats package and see what happens to the sample mean. This is your open door to the world of thinking in probabilities and dealing with randomness.
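One possible starting point is sketched below. It uses Python's standard-library random module rather than scipy.stats, a substitution made here only to keep the example free of dependencies. The helper name sample_mean and the chosen parameters are illustrative assumptions.

```python
import random
from typing import Callable


def sample_mean(draw: Callable[[], float], n: int) -> float:
    """Run the engine `draw` n times and return the mean of the values produced."""
    return sum(draw() for _ in range(n)) / n


random.seed(7)  # fixed seed only so that reruns print the same numbers

# Each engine is a zero-argument function that returns one sampled value.
engines = {
    "Uniform on [0, 10]": lambda: random.uniform(0, 10),
    "Exponential with rate 0.5": lambda: random.expovariate(0.5),
}

for name, draw in engines.items():
    for n in [100, 10_000, 100_000]:
        print(name, "\tSample size:", n, "\tSample Mean:", round(sample_mean(draw, n), 4))
```

Before running it, try to predict each limit: the mean of a uniform distribution is the midpoint of its interval, and an exponential distribution with rate lambda has mean 1 / lambda.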