The first issue when examining the correlation between two variables is choosing the dataset. This is highly contentious, because spurious correlations can easily appear in datasets that have been excessively manipulated, either by cherry-picking the date range or by eliminating outliers. For that reason, I will select the largest reasonable dataset and do nothing to it.
I think the earliest reasonable starting point is somewhere around 1950. Much before then, one is looking at periods where the economy is so different from today’s that it would be inappropriate to try to draw any conclusions. In addition, it is not that helpful to produce a model going further back and have to say, when asked about a prediction, that “this is what happens when Hitler invades Poland…”
Here is a plot showing US GDP since 1950.
This immediately looks like some kind of power law growth with a kink for the global financial crisis. (We don’t really have enough data to look at COVID properly yet, but I expect it will produce another kink and not really disturb the trend line.)
This second plot is the S&P 500 index over the same timescale.
This also shows some sort of power law rise, though it is clearly much noisier than the GDP curve. The noise does not, however, obscure the direction of travel.
Here’s a brief aside about polynomials. This is skippable if you don’t care what curves I am going to fit to the above two lines. Polynomials are curves of a form such as y = ax^3 + bx^2 + cx + d. The order of the polynomial is the highest power of x that has a non-zero coefficient, so the example just given is third order as long as a is non-zero. A first order polynomial is just a straight line of the usual form y = mx + c. The reason this matters is that we do not want to overfit or underfit. If we underfit, we fail to capture information in the curve we are modelling. If we overfit, we capture all the information in the curve, but we may also add some features that are not really there. The best way to check for that is to see what happens if you extrapolate the curves beyond your training data. If you have overfitted, your predictions will often go insane as soon as you are off-piste, i.e. outside the training dataset.
Here is an illustration of what under and overfitting might look like.
The blue circles are some arbitrary test data we want to fit. The other curves are the various orders of polynomial returned by a fitting package. As you can see, the first order polynomial (just a straight line, as mentioned above) is totally inadequate and is a radical underfit of the data. The second order is not bad, but the third order is very good. I have run checks like this, plus the original test of asking what happens off-piste. I found that third order was fine to use.
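For anyone who wants to run this kind of check themselves, here is a minimal sketch in Python using NumPy's polyfit. The data are synthetic (a gentle cubic plus noise, chosen purely for illustration) and are not the points in the plot above, and this is not necessarily the fitting package used there.

```python
import numpy as np

# Synthetic stand-in data (an assumption, not the data in the plot):
# a gentle cubic trend plus a little random noise.
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 40)
y_train = (2.0 * x_train**3 - x_train**2 + 0.5 * x_train
           + rng.normal(0.0, 0.02, x_train.size))

def fit_and_score(order):
    """Fit a polynomial of the given order; return (coeffs, in-sample RMS error)."""
    coeffs = np.polyfit(x_train, y_train, order)
    residuals = y_train - np.polyval(coeffs, x_train)
    return coeffs, float(np.sqrt(np.mean(residuals**2)))

x_off_piste = 2.0  # twice the training range, i.e. well outside the data

results = {}
for order in (1, 2, 3):
    coeffs, rms = fit_and_score(order)
    prediction = float(np.polyval(coeffs, x_off_piste))
    results[order] = (rms, prediction)
    print(f"order {order}: in-sample RMS = {rms:.4f}, "
          f"prediction at x = {x_off_piste}: {prediction:.2f}")
```

The in-sample error can only fall as the order rises, so it cannot tell you when to stop on its own. The off-piste prediction is the more useful number to watch: adding higher orders to the tuple shows how quickly it can become unstable.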
Fitting a third order polynomial to US GDP gives me the following plot.
I have slightly adjusted the data to make it fittable. The x-axis is now the number of days since 03-Jan-1950. I also zeroed out the US GDP for 1950 which I don’t think matters too much. You can see that the curve is a good fit to the data apart from a rogue rise at low day counts which again I don’t think matters because we aren’t going to try to predict GDP in 1950 — we already know it.
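The adjustment described above can be sketched as follows. I am assuming that "zeroing out" means subtracting the first value from the whole series. The observation values below are placeholders rather than real GDP figures, and the function names are my own.

```python
from datetime import date

START = date(1950, 1, 3)  # origin of the x-axis, as stated in the text

def days_since_start(d):
    """Number of days between START and the date d."""
    return (d - START).days

def rebase(values):
    """Shift a series so that its first value is zero (assumed meaning of 'zeroed out')."""
    first = values[0]
    return [v - first for v in values]

# Placeholder observations: (date, value). Hypothetical numbers, not real GDP data.
observations = [
    (date(1950, 1, 3), 100.0),
    (date(1950, 4, 3), 104.0),
    (date(2020, 1, 3), 900.0),
]

x = [days_since_start(d) for d, _ in observations]
y = rebase([v for _, v in observations])
print(x)
print(y)
```

Working in day counts rather than calendar dates gives the fitting routine a plain numeric x-axis, which is all a polynomial fit needs.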
If you want to know the equation of that curve, it is: y = -5.265e-11 x^3 + 4.351e-05 x^2 - 0.2643 x + 414.1
If you think that the very small cubic coefficient means I don’t need third order, you are basically right but again I don’t think it matters as long as the curve extrapolates reasonably.
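One way to judge how much the cubic term matters is to look at what each term contributes at a realistic day count, because a tiny coefficient gets multiplied by a very large x^3. The sketch below uses the coefficients quoted above, assuming they belong to the cubic, quadratic, linear and constant terms in that order. The round figure of 25,000 days for "today" is also an assumption.

```python
# GDP fit coefficients as quoted in the text, highest power first.
# Assumption: the terms are cubic, quadratic, linear and constant, in that order.
GDP_COEFFS = (-5.265e-11, 4.351e-05, -0.2643, 414.1)

def term_contributions(coeffs, x):
    """Return the value of each term c * x**power, highest power first."""
    top = len(coeffs) - 1
    return [c * x ** (top - i) for i, c in enumerate(coeffs)]

x_today = 25_000  # roughly the number of days from 1950 to the time of writing
terms = term_contributions(GDP_COEFFS, x_today)
total = sum(terms)

for name, t in zip(("cubic", "quadratic", "linear", "constant"), terms):
    print(f"{name:>9}: {t:12.1f}  ({t / total:+.1%} of fitted value)")
print(f"{'total':>9}: {total:12.1f}")
```

On these numbers the cubic term is about 4% of the fitted value at 25,000 days, so it is a modest correction to a curve that is dominated by the quadratic term.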
Performing similar manipulations on the S&P500 (recalibrating to days since the start of 1950 and setting the initial value to zero) gives me a different fitted curve, as below.
Note that in both cases, there are around 25,000 days between the start of 1950 and today (70 * 365 = 25,550).
Here, the equation of the fitted line is: y = 2.726e-10 x^3 - 3.87e-06 x^2 + 0.02366 x - 10.68
Again, the cubic coefficient is very small. The notable point about this curve is that it is very noisy. But it still follows a clear trend line.
Now we get to the dangerous part. We need to see if these curves behave once we extrapolate them. Let’s look at what they do if we double the day count to 50,000. (This is equivalent to doubling the date range: we are currently 70 years on from 1950, and 50,000 days is about 137 years, which takes us to the late 2080s, or 2090 in round numbers.)
There are two reasons this is dangerous. As I said, if the curves blow up, we have achieved nothing. The other issue is that you can’t extrapolate this kind of growth curve forever, whether it is a power law or an exponential. I will now discuss a brief example of that not working.
You will recall that in the early stages of COVID, people were plotting case counts and fitting exponentials to them. That looked like it was panning out to begin with. But the plot below shows what happens if you fit an exponential to the Florida case count as of yesterday.
It can be seen that the case count profile is in no way fitted by the exponential. So is it the case that the US economy can continue growing on an exponential basis indefinitely? Probably not. When will it stop doing so? No-one knows, but it might be beyond your investing lifetime. Also, bear in mind that it will have looked like it couldn’t continue growing exponentially at most points since 1950. We don’t know what technology or quantum computing or unimaginable developments are going to do.
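The general failure mode is easy to reproduce with made-up numbers. The sketch below generates a synthetic S-shaped (logistic) cumulative count, fits an exponential to the early part only, and then compares the extrapolation with where the curve actually ends up. The logistic shape and all its parameters are assumptions for illustration; this is not the Florida data.

```python
import numpy as np

# Synthetic cumulative counts following a logistic (S-shaped) curve.
# Illustrative numbers only, not the Florida data in the plot.
days = np.arange(0, 120)
capacity, rate, midpoint = 1_000_000.0, 0.12, 60.0
cases = capacity / (1.0 + np.exp(-rate * (days - midpoint)))

# Fit an exponential y = A * exp(k * t) to the first 30 days only,
# by fitting a straight line to log(y).
early = days < 30
k, log_a = np.polyfit(days[early], np.log(cases[early]), 1)

def exponential(t):
    """Value of the fitted exponential at day t."""
    return np.exp(log_a + k * t)

last_day = int(days[-1])
predicted = float(exponential(last_day))
actual = float(cases[-1])
print(f"fitted growth rate k = {k:.3f} per day")
print(f"day {last_day}: exponential says {predicted:,.0f}, "
      f"curve is actually at {actual:,.0f}")
```

The exponential matches the early data closely and then overshoots by orders of magnitude once the underlying process starts to saturate. That is the risk with any growth curve extrapolated far beyond its data.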
So what happens if we extrapolate the curves we fitted? This is shown in the plot below.
This shows firstly and importantly that the curves continue to behave out to 50,000 days.
We can conclude that if the S&P 500 and US GDP continue to develop as they have done, then by around 2090, US GDP will have reached about $90tn and the S&P 500 will have reached about 26,000.
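These two figures can be checked directly from the fitted equations. The sketch below evaluates both polynomials at 50,000 days, assuming the quoted coefficients belong to the cubic, quadratic, linear and constant terms in that order. It also assumes the GDP series is in billions of dollars, which is my reading of the units rather than something stated above.

```python
# Coefficients as quoted in the text, highest power first.
# Assumption: each equation's terms are cubic, quadratic, linear, constant.
GDP_COEFFS = (-5.265e-11, 4.351e-05, -0.2643, 414.1)
SP500_COEFFS = (2.726e-10, -3.87e-06, 0.02366, -10.68)

def horner(coeffs, x):
    """Evaluate a polynomial (highest power first) at x using Horner's rule."""
    result = 0.0
    for c in coeffs:
        result = result * x + c
    return result

x_future = 50_000  # double today's day count
gdp = horner(GDP_COEFFS, x_future)
sp500 = horner(SP500_COEFFS, x_future)
print(f"GDP fit at {x_future} days:     {gdp:,.0f}")
print(f"S&P 500 fit at {x_future} days: {sp500:,.0f}")
```

The results come out at roughly 89,400 and 25,600, which round to the $90tn and 26,000 quoted above.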
There are various conclusions that I have not argued for. Firstly, I have said nothing about other countries. The above analysis would definitely not work for the FTSE-100, because that does not grow exponentially. It seems rather to exhibit a large sawtooth oscillation between 4000 and 7000 with a period of about a decade. That won’t correlate with anything. Similar points apply in Japan.
Secondly, I have not shown what happens if you try to fit curves to only more recent data. Shorter, more recent series are very noisy, and that explains why a lot of analysis does not show any correlation.
Thirdly, this does not really back up Portnoy when he says “stocks always go up.” Partly that is because of the huge noise in the S&P 500 curve. Partly it is because you might have to wait a long time. And partly it is because the curve only shows that the S&P 500 index goes up, which is not the same as every stock going up. But the trend is your friend here if you wait long enough.