Portfolio Theory (Part 1)


Portfolio theory was introduced by Harry Markowitz in 1952 and later earned him the Nobel Memorial Prize in Economics (1990). Portfolio theory is a neat application of Lagrange multipliers: the objective function is second order and all constraints are first order, so after applying Lagrange multipliers the problem becomes a system of linear equations.

Problem Statement

Given N random variables with known expected values u[i], i=1..N, a covariance matrix v[i,j] for i,j=1..N, and a target return R, find weights w[i] such that
1. sum[i=1..N]( w[i] ) = W
2. sum[i=1..N]( w[i]*u[i] ) = R
3. minimize sum[i=1..N,j=1..N]( w[i]*v[i,j]*w[j] ) i.e. variance of portfolio
(W is the sum of weights, usually set to W = 1)


let L(w,p1,p2) = sum[i=1..N,j=1..N]( w[i]*v[i,j]*w[j] )
+ p1*(sum[i=1..N]( w[i] ) - W)
+ p2*(sum[i=1..N]( w[i]*u[i] ) - R)

Differentiating L(w,p1,p2) with respect to w[i] and setting it to zero, we have
2*v[i,i]*w[i] + sum[j=1..N,j!=i]( w[j]*v[j,i] + v[i,j]*w[j] ) + p1 + p2*u[i] = 0
Since the covariance matrix is symmetric, this simplifies to
sum[j=1..N]( 2*v[i,j]*w[j] ) + p1 + p2*u[i] = 0 for i=1..N — equation1
together with the 2 original constraints
1. sum[i=1..N]( w[i] ) = W
2. sum[i=1..N]( w[i]*u[i] ) = R
We now have N+2 independent linear equations in the N+2 unknowns w[1..N], p1, p2, i.e. a system of linear equations.

To find the assignment of w[i], we form the matrix of size (N+2)*(N+2) s.t.
M[i,j] = 2*v[i,j] for i,j=1..N
M[N+1,i] = M[i,N+1] = 1 for i=1..N
M[N+2,i] = M[i,N+2] = u[i] for i=1..N
else 0
(NOTE: this matrix is symmetric; the top-left block 2*v[i,j] is positive semidefinite, but the full bordered matrix is in general indefinite)

And then a vector vec s.t.

vec[N+1] = W
vec[N+2] = R
else 0

And then solve M*x = vec for the answer; the first N entries of x are the weights w[i]. Use any decent linear solver for the job; a dense solve costs O(N^3) in general (O(N^2) per additional right-hand side once M is factorized).


import numpy as np

def portfolio(u, v, R, W=1):
	# u: expected returns (length n), v: covariance matrix (n x n)
	u = np.asarray(u, dtype=float)
	cov = np.asarray(v, dtype=float)
	n = len(u)
	m = np.zeros((n + 2, n + 2))
	m[0:n, 0:n] = 2 * cov
	m[n, 0:n] = m[0:n, n] = 1
	m[n + 1, 0:n] = m[0:n, n + 1] = u
	vec = np.zeros(n + 2)  # right-hand side
	vec[n] = W
	vec[n + 1] = R
	# first n entries are the weights w; the last two are the multipliers p1, p2
	return np.linalg.solve(m, vec)[:n]
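A quick standalone check of the construction above, with three assets whose returns, covariances, and target are illustrative assumptions (not data from the text):

```python
import numpy as np

# Illustrative (assumed) inputs: 3 assets, target return 8%, total weight 1.
u = np.array([0.05, 0.08, 0.12])           # expected returns
v = np.array([[0.04, 0.01, 0.00],          # covariance matrix
              [0.01, 0.09, 0.02],
              [0.00, 0.02, 0.16]])
R, W = 0.08, 1.0
n = len(u)

# Build the bordered (N+2)x(N+2) system described above.
m = np.zeros((n + 2, n + 2))
m[:n, :n] = 2 * v
m[n, :n] = m[:n, n] = 1
m[n + 1, :n] = m[:n, n + 1] = u
vec = np.zeros(n + 2)
vec[n], vec[n + 1] = W, R

sol = np.linalg.solve(m, vec)
w = sol[:n]            # weights; sol[n], sol[n+1] are the multipliers p1, p2
print(w, w.sum(), w @ u)
```

Both constraints are recovered from the solution: the weights sum to W and the expected portfolio return equals R, because rows N+1 and N+2 of M encode exactly those two constraints.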

Special Case

If all random variables are independent, then v[i,j] = 0 for i != j, and we can find the solution in O(N) without a matrix solve.

Denote v[i] = v[i,i]
Simplify equation1, we have
2*v[i]*w[i] + p1 + p2*u[i] = 0
=> w[i] = -(p1+p2*u[i])/(2*v[i]) — equation2
(if some v[i] == 0, this method does not work)

put this into 1 & 2, we have
1. sum[i=1..N]( -(p1+p2*u[i])/(2*v[i]) ) = W
2. sum[i=1..N]( -(p1*u[i]+p2*u[i]^2)/(2*v[i]) ) = R

simplify, we have
let A = sum[i=1..N]( 1/v[i] )
let B = sum[i=1..N]( u[i]/v[i] )
let C = sum[i=1..N]( u[i]^2/v[i] )
1. A*p1 + B*p2 = -2*W
2. B*p1 + C*p2 = -2*R

let D = A*C - B^2, then
p1 = -2*(C*W-B*R)/D
p2 = -2*(A*R-B*W)/D
put them back to equation2
let t1 = C*W-B*R
let t2 = A*R-B*W
w[i] = (t1+t2*u[i])/D/v[i]


def portfolioIndependent(u, v, R, W=1):
	# u: expected returns, v: variances v[i] = v[i,i] (both length n)
	u = np.asarray(u, dtype=float)
	v = np.asarray(v, dtype=float)
	inv = 1. / v
	a = inv.sum()
	b = (u * inv).sum()
	c = (u * u * inv).sum()
	d = a * c - b * b
	t1 = c * W - b * R
	t2 = a * R - b * W
	return (t1 + t2 * u) / d / v
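As a sanity check, the O(N) closed form can be compared against the general bordered-system solver on a diagonal covariance matrix (the numbers below are illustrative assumptions):

```python
import numpy as np

# Illustrative (assumed) inputs: independent assets, so only variances matter.
u = np.array([0.05, 0.08, 0.12])    # expected returns
var = np.array([0.04, 0.09, 0.16])  # variances v[i] = v[i,i]
R, W = 0.08, 1.0

# O(N) closed form from equation2.
inv = 1.0 / var
a, b, c = inv.sum(), (u * inv).sum(), (u * u * inv).sum()
d = a * c - b * b
w = ((c * W - b * R) + (a * R - b * W) * u) / d / var

# General O(N^3) route: solve the bordered system with diag(var).
n = len(u)
m = np.zeros((n + 2, n + 2))
m[:n, :n] = 2 * np.diag(var)
m[n, :n] = m[:n, n] = 1
m[n + 1, :n] = m[:n, n + 1] = u
vec = np.zeros(n + 2)
vec[n], vec[n + 1] = W, R
w_full = np.linalg.solve(m, vec)[:n]

print(np.allclose(w, w_full))   # the two routes agree
```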

Lagrange Multiplier

The idea of Lagrange multipliers is to convert a constrained optimization problem into solving a system of equations in a higher dimension.


1. D[f(x),y](v) = differentiate f(x) with respect to y, then substitute v for x
2. sum[i=1..N]( f(i) ) = f(1)+f(2)+…+f(N)
3. min[x in X]( f(x) ) = minimum value of f(x) over all elements in set X.
4. s.t. is short for such that


A constrained optimisation problem is usually of the form
min[x in X]( f(x) ) s.t. g[i](x) = 0 for i=1..M
where x is N-dimensional and M is the number of constraints.

The Lagrangian L(x,m) of the above optimisation is
L(x,m) = f(x) + sum[i=1..M]( m[i]*g[i](x) )
where m is M-dimensional
the m[i] are called Lagrange multipliers

The theorem of Lagrange multipliers states that the constrained minima/maxima of f(x) are stationary points of L(x,m). Therefore a constrained min/max of f(x) is among the roots of the following N+M equations
D[L(x,m), x[i]](x,m) = 0 for i=1..N
g[i](x) = 0 for i=1..M
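As a tiny worked instance of these N+M equations (my own example, not from the text): minimize f(x) = x1^2 + x2^2 subject to g(x) = x1 + x2 - 1 = 0. Here stationarity and the constraint are all linear, so a single solve suffices:

```python
import numpy as np

# L(x, m) = x1^2 + x2^2 + m*(x1 + x2 - 1)
# dL/dx1 = 2*x1 + m = 0
# dL/dx2 = 2*x2 + m = 0
# g(x)   = x1 + x2 - 1 = 0
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0]])
b = np.array([0.0, 0.0, 1.0])
x1, x2, m = np.linalg.solve(A, b)
print(x1, x2, m)   # approximately 0.5, 0.5, -1.0
```

The closest point to the origin on the line x1 + x2 = 1 is (1/2, 1/2), as expected by symmetry.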

Proof of Lagrange Multiplier

Let L(x,m) = f(x) + sum[i=1..M]( m[i]*g[i](x) )


1. For all x satisfying g[i](x) = 0 for i=1..M, we have L(x,m) = f(x) for all m
2. All extrema of L, say x’, satisfy g[i](x’) = 0 for i=1..M (since D[L(x,m), m[i]] = g[i](x) must be zero at an extremum)


min[x] f(x) s.t. g[i](x) = 0 for i=1..M
= min[x,m] f(x) s.t. g[i](x) = 0 for i=1..M (f(x) is independent of m)
= min[x,m] L(x,m) s.t. g[i](x) = 0 for i=1..M (by lemma 1)
= min[x,m] L(x,m) (by lemma 2, the constraint is always satisfied at the extrema, so it can be omitted)

Also see

Portfolio Theory

Cox’s Theorem

Cox’s theorem is a celebrated result stating that any system satisfying Cox’s conditions is isomorphic to probability theory. But what is more surprising to me is that, in the proof of Cox’s theorem, he showed that every associative binary operation with a second-order derivative is isomorphic to multiplication.

Let f be an associative binary operator with a second-order derivative. Then there exist functions g and h such that
1. h(g(x)) = x for all x
2. f(a,b) = h(g(a)*g(b)) for all a, b

Two corollaries follow immediately:
1. Every associative binary operation with a second-order derivative is isomorphic to addition (compose the isomorphism with log/exp).
2. Every associative binary operation with a second-order derivative is commutative (because multiplication is).
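A concrete illustration (my own example): f(a,b) = a + b + a*b is associative, and taking g(x) = x + 1 with h(x) = x - 1 exhibits the isomorphism to multiplication, since (a+1)(b+1) - 1 = a + b + a*b:

```python
# g maps the operation onto plain multiplication; h inverts g.
g = lambda x: x + 1.0
h = lambda x: x - 1.0
f = lambda a, b: a + b + a * b           # an associative binary operation

a, b, c = 0.3, 1.7, -0.5
assert abs(h(g(a)) - a) < 1e-12                     # h(g(x)) = x
assert abs(f(a, b) - h(g(a) * g(b))) < 1e-12        # f(a,b) = h(g(a)*g(b))
assert abs(f(f(a, b), c) - f(a, f(b, c))) < 1e-12   # associativity
assert abs(f(a, b) - f(b, a)) < 1e-12               # commutativity
print("ok")
```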


Structure of Supervised Learning

Many textbooks start by classifying machine learning into unsupervised, supervised and reinforcement learning. Unfortunately, these categories are poorly defined, and the classification has little significance, except that supervised learning has a structure worth describing. Anyway, arguing over definitions is meaningless.

Unsupervised Learning

In unsupervised learning, we select a function f such that, given input x, we obtain the desired output y, i.e.

y = f(x)

Loosely speaking, any function f that you consider meaningful belongs to unsupervised learning, e.g. average, min/max, Graham scan, XOR… however, usually we are only interested in clustering and dimensionality reduction.

Supervised Learning

In supervised learning, instead of selecting the function f directly, we select a parameter t of f with respect to a cost function h, where h depends on data d, i.e.

y = f(x;t) where t = argmin[t’ in T]( h(d,t’) )

T is the set of all possible parameters
denote argmin[x in X]( f(x) ) as the x in X at which f(x) is minimum

Since d is usually a sequence of pairs (dx[i], dy[i]), i=1.., and h is usually a sum of a distance measure between f(dx[i]) and dy[i], this usually becomes

y = f(x;t) where t = argmin[t’ in T]( sum[i=1..]( distance( f(dx[i]; t’), dy[i]) ) )
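Least-squares line fitting is a canonical instance of this form (an example of my own: f(x;t) = t[0] + t[1]*x with squared error as the distance, and synthetic data standing in for d):

```python
import numpy as np

# Hypothetical training pairs (dx[i], dy[i]) generated from y = 2x + 1 + noise.
rng = np.random.default_rng(0)
dx = np.linspace(0.0, 1.0, 50)
dy = 2.0 * dx + 1.0 + 0.01 * rng.standard_normal(50)

# h(d, t') = sum_i ( f(dx[i]; t') - dy[i] )^2 ; the argmin over t' has a
# closed form: the least-squares solution of X t = dy.
X = np.column_stack([np.ones_like(dx), dx])
t, *_ = np.linalg.lstsq(X, dy, rcond=None)
print(t)   # close to [1, 2]
```

Once t is selected, prediction is just y = f(x;t) = t[0] + t[1]*x.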

You may also hear of semi-supervised learning, which belongs to the first form above.

Reinforcement Learning

Any algorithm that is neither unsupervised nor supervised belongs to reinforcement learning. There should be many forms of reinforcement learning, but textbooks mostly associate it solely with Markov chains.

(For example, synchronization like this can be thought of as mutual learning, which is neither supervised nor unsupervised. And it does not seem to map directly to a Markov chain.)

Guessing Game & Non-computability

#include <stdio.h>

int main(){
	int n;
	printf("==== Guessing Game ====\n\n");

	printf("1. You guess a number, either 0 or 1.\n");
	printf("2. then I will tell you the correct number.\n\n");

	printf("Please make a guess 0 or 1:\n");
	scanf("%d", &n);
	printf("The correct number is %d.\n", !n);
	return 0;
}

The rules of the guessing game are simple:
1. You guess a number, either 0 or 1.
2. then I will tell you the correct number.
This game is rather silly: the correct number is always the opposite of your guess. But that is exactly the argument behind non-computability.

Consider the question

Can you predict the number correctly?

You will find that any attempt to answer it either provides a wrong answer or provides no answer; this is the same diagonalization that underlies the halting problem.
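The argument can be written down directly (a toy sketch; the two lambda "predictors" are stand-ins for any algorithm you might try):

```python
def correct_number(guess):
    # The game flips whatever you guessed, as in the C program above.
    return 1 - guess

# Any deterministic predictor that outputs 0 or 1 is always wrong,
# because the "correct" number is defined in terms of its own output.
for predictor in (lambda: 0, lambda: 1):
    guess = predictor()
    assert correct_number(guess) != guess
print("every predictor loses")
```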

Lifespan of RAID-0 vs Single Hard Disk Theoretically

Assume the lifespan of hard disks are independent and identically distributed.

Let the PDF of the lifespan of a hard disk be P(t)
Let the CDF of P(t) be C(t)

The CDF of the minimum of 2 i.i.d. random variables is 1-(1-C(t))^2 (RAID-0 fails as soon as either disk fails).
Differentiating the above w.r.t. t, the
PDF of the RAID-0 lifespan is 2*(1-C(t))*P(t)

So, for any lifespan distribution P(t), RAID-0 is 2*(1-C(t)) times as likely as a single hard disk to fail at age t.


To illustrate the difference, let us further assume P(t) is a Normal distribution with mean = 5 years and s.d. = 1 year.
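Continuing that illustration numerically (a sketch; the 5-year/1-year figures are the assumption stated above, and the integration grid is my own choice):

```python
import math

MEAN, SD = 5.0, 1.0   # assumed lifespan distribution: Normal(5, 1), in years

def pdf(t):
    z = (t - MEAN) / SD
    return math.exp(-0.5 * z * z) / (SD * math.sqrt(2.0 * math.pi))

def cdf(t):
    return 0.5 * (1.0 + math.erf((t - MEAN) / (SD * math.sqrt(2.0))))

def raid0_pdf(t):
    # PDF of the minimum of two i.i.d. lifespans, as derived above.
    return 2.0 * (1.0 - cdf(t)) * pdf(t)

# Mean lifespan by numerical integration over [0, 10] years
# (essentially all the probability mass lies in this range).
dt = 0.001
ts = [i * dt for i in range(int(10.0 / dt))]
mean_single = sum(t * pdf(t) for t in ts) * dt
mean_raid0 = sum(t * raid0_pdf(t) for t in ts) * dt
print(round(mean_single, 3), round(mean_raid0, 3))
# single disk ~ 5.0 years; RAID-0 ~ 5 - 1/sqrt(pi) ~ 4.44 years
```

So under this assumed distribution, striping two disks shaves roughly half a year off the expected lifespan, in exchange for the capacity and speed of RAID-0.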