Hierarchical Clustering of Stocks by Document Similarity

Corporate filings provide a wealth of information and are available for all major publicly traded corporations. Apart from financial statements, most of the information is in textual form. To make them usable in algorithmic trading strategies one has to preprocess them with tools from natural language processing. In the following, I’ll demonstrate a method to construct clusters of stocks based on similarity of documents. In addition, topic modeling by singular value decomposition (SVD) and latent Dirichlet allocation (LDA) of Blei et al. (2003) is used to estimate exposures to risk factors. 

Hierarchical Clustering by 10-K Item 1

10-K filings of the 500 largest US stocks are downloaded. Certain sections such as Item 1 and Item 1A are extracted based on regex rules. 387 documents for each section remain after filtering. Next, a machine readable term-document matrix is constructed via a tfidf-vectorizer. To cluster stocks according to the business description the cosine similarity matrix of the vectorized Item 1 is fed into a ward clustering algorithm and the resulting linkage matrix is visualized via a dendrogram in Figure 1.
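To make the vectorize-then-compare step concrete, here is a minimal sketch in plain Python. The three toy documents, the whitespace tokenizer, and the smoothed idf formula are assumptions for illustration; they are not taken from the repository, which may well rely on a library vectorizer instead.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (toy tf-idf)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            # Smoothed idf, one common variant (an assumption here).
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)


# Hypothetical stand-ins for Item 1 business descriptions.
docs = [
    "bank loans deposits interest credit",
    "bank deposits mortgage credit lending",
    "oil gas drilling exploration pipeline",
]
vecs = tfidf_vectors(docs)
sim = [[cosine(a, b) for b in vecs] for a in vecs]
for row in sim:
    print([round(x, 2) for x in row])
```

The two bank-like documents end up far more similar to each other than to the energy document. A matrix like `sim` is the kind of object that is handed to the Ward linkage step.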

Figure 1

Three main clusters with many sub clusters emerge. The cluster colored in red, for instance, groups banks together. 

Topic Modeling of 10-K Item 1

By singular value decomposition (SVD) of the tfidf-vectorized matrix, we can extract the most important topics. Figure 2 displays the 20 most important topics, each with its 10 most common terms. While not perfect (keep in mind we’re only using 387 documents), some clear patterns emerge. For example, the second topic relates to financial institutions, the third topic relates to energy companies, and the fourth topic relates to pharmaceuticals.
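As a rough illustration of how topics fall out of an SVD, the sketch below decomposes a tiny document-term matrix with NumPy and lists the highest-weighted terms of each right singular vector. The vocabulary and the weights are made up for this example; the real matrix has 387 rows and a far larger vocabulary.

```python
import numpy as np

# Hypothetical vocabulary and tf-idf-like weights (rows = documents).
terms = ["bank", "loan", "deposit", "oil", "gas", "drilling"]
X = np.array([
    [0.8, 0.5, 0.3, 0.0, 0.0, 0.0],
    [0.7, 0.6, 0.3, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.4, 0.3],
    [0.0, 0.0, 0.0, 0.4, 0.5, 0.2],
])

# Rows of Vt are the "topics": directions in term space,
# ordered by the singular values in s (largest first).
U, s, Vt = np.linalg.svd(X, full_matrices=False)


def top_terms(component, k=3):
    """Terms with the largest absolute loading on one component."""
    order = np.argsort(-np.abs(component))[:k]
    return [terms[i] for i in order]


topics = [top_terms(Vt[j]) for j in range(2)]
for j, words in enumerate(topics):
    print(f"topic {j}: {words}")
```

In this toy case the first component loads only on the banking terms and the second only on the energy terms. Real filings are much noisier, which fits the imperfect topics in Figure 2.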

Figure 2

Figure 3 displays the entropy heatmap from LDA topic distributions of all stocks.

Figure 3

Hierarchical Clustering by 10-K Item 1A

Instead of clustering stocks according to business description, it’s also interesting to group them according to risk factors. Doing the same procedure as described before for Item 1A, we obtain the dendrogram displayed in Figure 4.

Figure 4

Topic Modeling of 10-K Item 1A

Observing the topics extracted via SVD in Figure 5, one can identify several topics relating to, e.g., oil, real estate, automotive, and airline risks. Again, the topics are not perfectly clean and hyperparameter tuning and more data will probably improve results considerably. 

Figure 5


Use cases for these methods are, for example, statistical arbitrage strategies and portfolio optimization. One can construct very granular clusters of stocks belonging to similar businesses. The similarity matrix can also be input to a function that creates a positive-definite matrix which can then be used as a shrinkage target in covariance matrix estimation. 


Blei, David M., Andrew Y. Ng, and Michael I. Jordan. “Latent dirichlet allocation.” Journal of machine Learning research 3.Jan (2003): 993-1022.

Code available at https://github.com/jpwoeltjen/nlp_finance

Understanding Portfolio Optimization by Simulation

Portfolio optimization is of central importance to the field of quantitative finance. Under the assumption that returns follow a multivariate normal distribution with known mean and variance, mean-variance optimization, developed by Markowitz (1952), is mathematically the optimal procedure to maximize the risk-adjusted return. However, it has been shown, e.g., by DeMiguel, Garlappi, and Uppal (2007), that out-of-sample performance of optimized portfolios is worse than that of naively constructed ones. The question is why. Do the results break down if the normality assumption is violated? I will show Monte Carlo evidence under heavy tailed distributions that suggests the cause of this strange phenomenon is a different one.

The actual problem is that, in practice, mean and variance are not known but need to be estimated. Naturally, the sample moments are estimated with some amount of error. Unfortunately, the solution, i.e., the optimal weight vector, to the optimization problem is very sensitive to these estimation errors. The resulting instability of the solution tends to increase with the number of assets in the universe. 

To get an intuition why this would happen, imagine N uncorrelated assets. Hence, the corresponding (unknown) covariance matrix has only zeros off its diagonal. It is, however, very unlikely that a sample covariance matrix will estimate all off-diagonals to be exactly zero, given a finite sample. You can easily see that the probability that this would happen decreases with N. Now, suppose that the estimated sample covariance for some asset i and some other asset j is some positive number. The optimization algorithm now believes that it can reduce the portfolio risk resulting from longing i by shorting j. We know, however, that this would not be optimal since the true data generating processes of i and j are uncorrelated. 
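The thought experiment can be checked numerically. The sketch below draws truly uncorrelated Gaussian returns and reports the largest spurious sample correlation for growing universes; the sample length of 60 periods, the seed, and the universe sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 60, 100  # 60 observations of 100 truly uncorrelated assets
returns = rng.standard_normal((T, N))

# Sample correlation matrix; the true one is the identity.
corr = np.corrcoef(returns, rowvar=False)


def max_spurious(c):
    """Largest absolute off-diagonal entry of a correlation matrix."""
    off_diagonal = c - np.diag(np.diag(c))
    return float(np.abs(off_diagonal).max())


# Nested universes: the first n assets of the same sample.
for n in (5, 20, 100):
    print(n, round(max_spurious(corr[:n, :n]), 3))
```

Because the universes are nested, the largest spurious correlation can only grow with N, and an optimizer that trusts these numbers will find more and more apparent hedges.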

The solution to the instability problem is to prevent the optimization algorithm from considering unrelated assets as related. This can be done in two ways. Either we reduce noise in the covariance matrix or we allow hedging between instruments only in certain cases where we have a prior belief that future returns will indeed be correlated. For example, the latter solution implies that we may be comfortable asserting that two airlines expose the portfolio to similar kinds of risk, e.g., oil price, political risk, regulatory risk, travel demand, terror risk, etc., and by selling one of them short we may hedge out the risks resulting from holding the other long. In contrast, just because the sample covariance for an airline stock and some other, totally unrelated, stock is slightly positive, this does not mean we should sell short a large chunk of that second stock to hedge out the risks of the airline stock, although the mean-variance optimizer would regard both scenarios as the same.

For me, the best way to get a deeper understanding of the subject is to simulate sample paths, where I know the data generating process, and to observe the behavior of different optimization techniques. In the following, I will compare mean-variance optimization based on sample moments with the same based on a noise-reduced covariance matrix using principal component analysis (PCA). To demonstrate the other approach, I will introduce a hierarchical optimization algorithm, where I use the fact that stocks usually cluster into industries. As a lower bound benchmark, I show results for naive equal weighting. We will see that, out-of-sample, portfolios based on the sample covariance matrix underperform this benchmark substantially. The PCA and hierarchical methods, however, perform significantly better. These results are robust to heavy tailed noise distributions as evidenced by simulated Generalized Autoregressive Conditional Heteroscedasticity (GARCH) sample paths.

But first, let’s simulate a bunch of stocks belonging to different industries. Each stock is composed of a deterministic trend \mu , a loading on the market related stochastic trend \beta_{market} and a loading on the industry specific stochastic trend \beta_{industry}. For each stock, \mu, \beta_{market}, and \beta_{industry} are sampled from a uniform distribution. In the first simulation, I compare mean-variance optimization with a naive diversification approach where each asset is equally weighted with the sign of the expected return. I assume perfect knowledge of the expected return. Traditionally, the expected return is estimated by the in-sample first moment of the asset return. In my example, this would actually make sense since the mean is a consistent estimator of the deterministic trend. In practice, however, there is probably not a stationary deterministic trend. This is where the alpha model steps in, which is not the topic of this study and thus assumed given. 

The return of each stock is thus given by

R_{i t} = \mu_{i}+\beta_{i 1}f_{1 t}+\beta_{i 2} f_{2 t}+\cdots+\beta_{i k} f_{k t}+\epsilon_{i t} = \mu_{i}+\sum_{\ell=1}^{k} \beta_{i \ell} f_{\ell t}+\epsilon_{i t},

where

  • R_{i t} is the return of asset i at time t,
  • {f_{\ell t}} is the \ellth common factor at time t,
  • \beta_{i \ell} is the factor loading or factor beta of asset i with respect to factor \ell,
  • \epsilon_{i t} is the asset-specific factor or asset-specific risk,
  • i=1, \ldots, N,
  • \ell=1, \ldots, k.

The k \times k covariance matrix of the factors, \boldsymbol{f}_{t}=\left[f_{1 t}, f_{2 t}, \ldots, f_{k t}\right]^{\prime}, is \mathrm{Cov}\left(\boldsymbol{f}_{t}\right)=\boldsymbol{\Sigma}_{f}. Asset-specific noise is uncorrelated with the factors, i.e., \mathrm{Cov}\left(f_{\ell t}, \epsilon_{i t}\right)=0, for \ell=1, \ldots, k, i=1, \ldots, N, and \quad t=1, \ldots, T. The asset-specific noise is uncorrelated across assets and for each asset it’s serially uncorrelated, i.e.,

{\qquad \boldsymbol{\Sigma}_{\boldsymbol{\epsilon}}=\mathrm{Cov}\left(\boldsymbol{\epsilon}_{t}\right)=\left[\begin{array}{cccc}{\sigma_{\epsilon_{1}}^{2}} & {0} & {\cdots} & {0} \\ {0} & {\sigma_{\epsilon_{2}}^{2}} & {\cdots} & {0} \\ {\vdots} & {\vdots} & {\ddots} & {\vdots} \\ {0} & {0} & {\cdots} & {\sigma_{\epsilon_{N}}^{2}}\end{array}\right]}.

The diagonal covariance structure reflects the assumption that all correlation between assets is due to the factors.

Let’s write the factor model as

\boldsymbol{R}_{t}=\boldsymbol{\mu}+\boldsymbol{B} \boldsymbol{f}_{t}+\boldsymbol{\epsilon}_{t},

where in the N \times k matrix \boldsymbol{B},

{\qquad \boldsymbol{B}=\left[\begin{array}{c}{\boldsymbol{\beta}_{1}^{\prime}} \\ {\boldsymbol{\beta}_{2}^{\prime}} \\ {\vdots} \\ {\boldsymbol{\beta}_{N}^{\prime}}\end{array}\right]=\left[\begin{array}{cccc}{\beta_{11}} & {\beta_{12}} & {\cdots} & {\beta_{1 k}} \\ {\beta_{21}} & {\beta_{22}} & {\cdots} & {\beta_{2 k}} \\ {\vdots} & {\vdots} & {\ddots} & {\vdots} \\ {\beta_{N 1}} & {\beta_{N 2}} & {\cdots} & {\beta_{N k}}\end{array}\right]},

the \ellth column contains the beta coefficients associated with factor \ell. The covariance matrix of the returns \boldsymbol{R}_{t}=\left[R_{1 t}, R_{2 t}, \ldots, R_{N t}\right]^{\prime} implied by the factor model is

\boldsymbol{\Sigma}=\mathrm{Cov}\left(\boldsymbol{R}_{t}\right)=\boldsymbol{B} \boldsymbol{\Sigma}_{f} \boldsymbol{B}^{\prime}+\boldsymbol{\Sigma}_{\epsilon}.

Hence, the variance of the returns of asset i and the covariance of the returns of asset i and j are

\mathrm{Var}\left(R_{i}\right)=\beta_{i}^{\prime} \boldsymbol{\Sigma}_{f} \beta_{i}+\sigma_{\epsilon_{i}}^{2}


\mathrm{Cov}\left(R_{i}, R_{j}\right)=\beta_{i}^{\prime} \boldsymbol{\Sigma}_{f} \beta_{j},

respectively. Since the factors are uncorrelated in this simulation, the covariance matrix simplifies to

\boldsymbol{\Sigma}=\boldsymbol{B} \boldsymbol{\Sigma}_{f} \boldsymbol{B}^{\prime}+\boldsymbol{\Sigma}_{\epsilon}=\sum_{\ell=1}^{k} \beta_{\ell} \beta_{\ell}^{\prime} \sigma_{f_{\ell}}^{2}+\boldsymbol{\Sigma}_{\epsilon},

where \beta_{\ell} is the vector of loadings with respect to factor \ell, i.e., the \ellth column of matrix \boldsymbol{B}. Thus the variance of asset i’s returns is

\sigma_{i}^{2}=\sum_{\ell=1}^{k} \beta_{i \ell}^{2} \sigma_{f_{\ell}}^{2}+\sigma_{\epsilon_{i}}^{2}

and the covariance between the returns of assets i and j is

\sigma_{i j}=\sum_{\ell=1}^{k} \beta_{i \ell} \beta_{j \ell} \sigma_{f_{\ell}}^{2}.
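The matrix form and the element-wise formulas can be cross-checked with a few lines of NumPy. The dimensions, loadings, and variances below are arbitrary illustrative values, not the parameters of the simulation.

```python
import numpy as np

rng = np.random.default_rng(1)
N, k = 6, 2  # six assets, two uncorrelated factors

B = rng.uniform(0.5, 1.5, size=(N, k))         # factor loadings
sigma_f2 = np.array([0.04, 0.01])              # factor variances
sigma_eps2 = rng.uniform(0.01, 0.02, size=N)   # asset-specific variances

# Matrix form: Sigma = B Sigma_f B' + Sigma_eps
Sigma = B @ np.diag(sigma_f2) @ B.T + np.diag(sigma_eps2)

# Element-wise form for one asset and one pair of assets.
i, j = 0, 3
var_i = np.sum(B[i] ** 2 * sigma_f2) + sigma_eps2[i]
cov_ij = np.sum(B[i] * B[j] * sigma_f2)

print(np.isclose(Sigma[i, i], var_i), np.isclose(Sigma[i, j], cov_ij))
```

Both comparisons agree, and because every asset-specific variance is positive the implied matrix is positive definite.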

As an example, Figure 1 shows sample paths of 100 stocks belonging to the same industry.


Every stock’s return in this universe is partly driven by the market return (in proportion to its \beta_{market}) and the industry return (in proportion to its \beta_{industry}). Hence, one asset can be used to hedge out the factor risks of the other by selling it short. Doing this by mean-variance optimization based on the sample covariance matrix (and known expected return), the resulting portfolio equity under period-wise rebalancing develops as shown in Figure 2.

Figure 2

This very smooth equity curve can be achieved since we can hedge out the systematic factor risk. This makes our bets independent. Independence allows us to reason by the weak law of large numbers that the realized returns will converge in probability to the expected returns. This is only true if short sales are permitted. If we exclude shorting, we can only reduce the asset specific risk while exposing the portfolio to the irreducible systematic risk. The devastating effect this has on our portfolio can be seen by the much more rugged equity curve in Figure 3.

Figure 3
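A small numerical sketch shows how an unconstrained mean-variance solution hedges the common factor by shorting. It assumes a one-factor model with known parameters and weights proportional to \Sigma^{-1}\mu, scaled to unit gross exposure; the numbers are invented for illustration and the repository's optimizer may be set up differently.

```python
import numpy as np


def mean_variance_weights(mu, Sigma):
    """Unconstrained mean-variance direction, scaled to unit gross exposure."""
    raw = np.linalg.solve(Sigma, mu)
    return raw / np.abs(raw).sum()


def sharpe(w, mu, Sigma):
    """Ex-ante ratio of expected return to standard deviation."""
    return float(w @ mu / np.sqrt(w @ Sigma @ w))


# Hypothetical one-factor universe: four assets driven by the same factor.
beta = np.array([1.0, 1.2, 0.8, 1.1])
sigma_f2 = 0.04
Sigma = sigma_f2 * np.outer(beta, beta) + np.diag([0.01, 0.01, 0.01, 0.01])
mu = np.array([0.02, 0.01, 0.015, 0.005])

w_mv = mean_variance_weights(mu, Sigma)
w_eq = np.full(4, 0.25)  # long-only equal weighting

print("weights:", np.round(w_mv, 3))
print("factor exposure:", round(float(beta @ w_mv), 3), "vs", float(beta @ w_eq))
print("Sharpe:", round(sharpe(w_mv, mu, Sigma), 3), "vs", round(sharpe(w_eq, mu, Sigma), 3))
```

The optimizer shorts the two assets with the lowest expected return per unit of factor exposure, which removes most of the factor risk. The long-only portfolio keeps a factor exposure of about one.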

Hierarchical Optimization

In reality there isn’t just one industry but many. In the following, I will simulate a universe of stocks belonging to nine different industries. The corresponding sample paths are depicted in Figure 4.

Figure 4

Now, all stocks are still partly driven by the market, but industry returns are uncorrelated. Estimation errors may cause problems if one stock is falsely used as a hedging device for the industry exposure of another stock, and both stocks do not belong to the same industry. The overfit of Markowitz’s method to in-sample data can best be visualized by a boxplot. Figure 5 shows that the in-sample information ratio for the sample covariance optimization is way above the best possible portfolio using ground truth parameters. Unsurprisingly, this overfit hurts the out-of-sample performance, as it can’t even beat simple equal weighting.

Figure 5

If we believe the above characterization of the problem is correct, it seems like a good idea to encode prior knowledge of industry clustering by first optimizing weights within clusters and then optimizing the allocation across these clusters. This hierarchical approach results in the in-sample and out-of-sample performance being much closer to the ground truth. With this structure imposed, mean-variance optimization now beats equal weighting by a large margin.
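One plausible reading of the two-step procedure is sketched below: mean-variance weights are computed inside each cluster, the resulting cluster portfolios are treated as synthetic assets, and a second mean-variance step allocates across them. The helper names, the cluster labels, and the numbers are hypothetical; the repository's implementation may differ in its details.

```python
import numpy as np


def mv_weights(mu, Sigma):
    """Unconstrained mean-variance direction, scaled to unit gross exposure."""
    raw = np.linalg.solve(Sigma, mu)
    return raw / np.abs(raw).sum()


def hierarchical_weights(mu, Sigma, labels):
    """Optimize within clusters first, then across cluster portfolios."""
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    # Column c holds the within-cluster weights of cluster c (zero elsewhere).
    W = np.zeros((len(mu), len(clusters)))
    for c_idx, c in enumerate(clusters):
        members = np.flatnonzero(labels == c)
        W[members, c_idx] = mv_weights(mu[members], Sigma[np.ix_(members, members)])
    # Expected returns and covariance of the cluster portfolios.
    mu_c = W.T @ mu
    Sigma_c = W.T @ Sigma @ W
    return W @ mv_weights(mu_c, Sigma_c)


# Two hypothetical industries with two stocks each, no cross-industry covariance.
Sigma = np.array([
    [0.040, 0.010, 0.000, 0.000],
    [0.010, 0.050, 0.000, 0.000],
    [0.000, 0.000, 0.030, 0.012],
    [0.000, 0.000, 0.012, 0.060],
])
mu = np.array([0.02, 0.01, 0.015, 0.01])
w_h = hierarchical_weights(mu, Sigma, labels=[0, 0, 1, 1])
print(np.round(w_h, 3))
```

When the clusters really are uncorrelated, as in this toy matrix, the hierarchical weights coincide with the full mean-variance solution. The benefit appears with estimated covariances, because noisy cross-cluster entries never enter the within-cluster step.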

Principal Component Analysis

Principal Component Analysis (PCA) allows us to reduce the rank of the covariance matrix. Since the covariance matrix is symmetric positive semi-definite, we can, due to the spectral theorem, decompose the matrix into its real, non-negative eigenvalues and orthonormal eigenvectors.

S=Q \Lambda Q^{-1}=Q \Lambda Q^{\mathrm{T}} with Q^{-1}=Q^{\mathrm{T}}

In the simulation we have exposure to the market and several industries. Thus, it makes sense to reduce the rank to the number of these variables. Plotting the proportion of variance explained by the first n principal components in Figure 6 confirms this hypothesis.

Figure 6

Decomposing the covariance matrix into its eigenvalues and eigenvectors, we can de-noise it by eliminating all eigenvalues below some threshold. Since the sample covariance matrix is positive semi-definite, all its eigenvalues are non-negative. Reconstructing a matrix from non-negative eigenvalues will likewise result in a positive semi-definite matrix. We thus don’t have to worry about the resulting matrix not being a proper covariance matrix, which can be a headache with more flexible estimation approaches. A heatmap of the reconstructed covariance matrix from the largest 10 eigenvalues is shown in Figure 7.

Figure 7

This covariance matrix looks very similar to the sample covariance matrix shown in Figure 8, despite having only a rank of 10. We can infer that it captures the important linear relationships between assets with less noise.

Figure 8
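The eigenvalue truncation can be sketched in a few lines of NumPy. The simulated three-factor data and the function name `pca_denoise` are assumptions for this example; the post keeps 10 components for its own universe.

```python
import numpy as np


def pca_denoise(S, n_components):
    """Rebuild S from its n_components largest eigenpairs."""
    eigvals, eigvecs = np.linalg.eigh(S)  # eigenvalues in ascending order
    top_vals = eigvals[-n_components:]
    top_vecs = eigvecs[:, -n_components:]
    return top_vecs @ np.diag(top_vals) @ top_vecs.T


# Hypothetical data: 30 assets driven by 3 common factors plus noise.
rng = np.random.default_rng(2)
T, N, k = 250, 30, 3
B = rng.uniform(0.5, 1.5, size=(N, k))
factors = rng.standard_normal((T, k))
noise = 0.5 * rng.standard_normal((T, N))
returns = factors @ B.T + noise

S = np.cov(returns, rowvar=False)  # sample covariance, N x N
S_k = pca_denoise(S, k)

print("rank:", np.linalg.matrix_rank(S_k))
print("smallest eigenvalue:", float(np.linalg.eigvalsh(S_k).min()))
```

The reconstruction is symmetric and positive semi-definite by construction. A rank-k matrix is singular, though, so an optimizer that needs an inverse has to use a pseudo-inverse or add a diagonal term back; which of these the repository does is not stated here.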

Observing the information ratios depicted in Figure 9, the PCA de-noising method seems capable of restricting factor hedging attempts to reasonable candidates. It beats mean-variance optimization based on the sample covariance matrix in out-of-sample data by a large margin.

Figure 9

Robustness under heavy tailed distributions

Until now, we’ve assumed Gaussian returns. It is a well known fact, though, that real stock return distributions are not Gaussian but heavy tailed. In the following, we will look at simulation results when asset specific noise is modeled by a GARCH process.

Asset specific noise, \epsilon_{i, t}, is now defined by

\epsilon_{i, t}=\sigma_{i, t} \eta_{t},

where

  • \sigma_{i, t}^{2}=\omega+\sum_{n=1}^{q} \alpha_{n} \epsilon_{i, t-n}^{2}+\sum_{m=1}^{p} \beta_{m} \sigma_{i, t-m}^{2}
  • \eta_{t} \stackrel{\text {iid}}{\sim}(0,1)
  • \omega>0
  • \alpha_{n} \geq 0
  • \beta_{m} \geq 0
  • m=1, \ldots, p
  • n=1, \ldots, q

The unconditional variance of asset i’s specific noise distribution is

\mathrm{E}\left(\sigma_{\epsilon_{i, t}}^{2}\right)=\omega+\sum_{n=1}^{q} \alpha_{n} \mathrm{E}\left(\epsilon_{i, t-n}^{2}\right)+\sum_{m=1}^{p} \beta_{m} \mathrm{E}\left(\sigma_{\epsilon_{i, t-m}}^{2}\right) =\omega+\left(\sum_{n=1}^{q} \alpha_{n}+\sum_{m=1}^{p} \beta_{m}\right) \mathrm{E}\left(\sigma_{\epsilon_{i, t}}^{2}\right),

which, when solved for \mathrm{E}\left(\sigma_{\epsilon_{i, t}}^{2}\right), becomes

\mathrm{E}\left(\sigma_{\epsilon_{i, t}}^{2}\right) =  \frac{\omega}{1-\sum_{n=1}^{q} \alpha_{n}-\sum_{m=1}^{p} \beta_{m}}.

I use this result to compute the ground truth covariance matrix. Sample paths according to a GARCH(1,1) data generating process are shown in Figure 10.

Figure 10
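A GARCH(1,1) path and its unconditional variance can be simulated with the standard library alone. The parameter values and the Gaussian innovations below are assumptions for illustration; the post only requires the innovations to be iid with mean zero and unit variance.

```python
import math
import random


def simulate_garch11(omega, alpha, beta, n_steps, seed=0):
    """Simulate GARCH(1,1) noise; return the path and the unconditional variance."""
    rng = random.Random(seed)
    var_uncond = omega / (1.0 - alpha - beta)  # requires alpha + beta < 1
    sigma2 = var_uncond  # start the recursion at the long-run variance
    eps = []
    for _ in range(n_steps):
        shock = math.sqrt(sigma2) * rng.gauss(0.0, 1.0)
        eps.append(shock)
        # Next conditional variance from today's squared shock and variance.
        sigma2 = omega + alpha * shock * shock + beta * sigma2
    return eps, var_uncond


eps, var_uncond = simulate_garch11(omega=0.05, alpha=0.10, beta=0.85, n_steps=20000)
sample_var = sum(e * e for e in eps) / len(eps)
print("unconditional variance:", round(var_uncond, 4))
print("sample variance:", round(sample_var, 4))
```

With these parameters the formula gives an unconditional variance of one, and the sample variance of a long path should land close to it even though individual shocks cluster and occasionally become very large.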

It’s easy to see that returns are much more extreme than under Gaussian noise. Since Markowitz’s mean-variance optimization tends to over-concentrate and thus expose the portfolio to asset specific risks, which are now heavy tailed, it performs even worse than in the Gaussian case. As shown in Figure 11, the hierarchical method still outperforms equal weighting.

Figure 11

Noise filtering by PCA results in portfolio performances near the ground truth, as can be seen in Figure 12.

Figure 12


To maximize risk-adjusted returns, it is necessary to hedge out systematic risk. How this hedging is done is the subject of portfolio optimization. We’ve seen that it’s important to regularize portfolio optimization, since estimation errors of the covariance matrix would otherwise push out-of-sample performance below that of naive approaches. How to achieve the regularization is a research topic all on its own. We’ve seen two approaches that can be developed far beyond the basics shown here. In addition, correlations between assets tend to be non-stationary. This fact needs to be respected as well. Furthermore, asset returns are driven by multiple factors and business models spread further into sub-clusters within industries. Larger corporations may also belong to multiple clusters. There’s so much more to learn. The code of the simulation, every plot shown here, and all the methods used to compute the portfolio weights are available at https://github.com/jpwoeltjen/OptimizePortfolio.


DeMiguel, Victor, Lorenzo Garlappi, and Raman Uppal. “Optimal versus naive diversification: How inefficient is the 1/N portfolio strategy?.” The review of Financial studies 22.5 (2007): 1915-1953.

Markowitz, Harry. “Portfolio selection.” The journal of finance 7.1 (1952): 77-91.

A Recurrent Neural Network Framework for Predicting Asset Returns 


The investing profession is unlike most other professions. To achieve excellent performance you need to act differently from the majority. If a group of proficient engineers collaborates on building a light bulb, they’ll probably come up with a well-designed light bulb. This is because there is a solution to this problem that doesn’t depend on other light bulb designs. A light bulb is not a complex adaptive system. In the markets, a similar collaboration would not lead to an excellent outcome. Whatever the majority perceives to be right gets priced into the market. It is therefore necessary that the actions of the excellent professional run contrary to the opinion of the majority. Of course, this is not yet a sufficient condition. Importantly, the conclusions giving rise to the actions also have to be better in some way. I have no doubt the reader realizes that the first element of this argument is contained in the second one, as ‘better’ necessarily implies ‘different’. I think it is useful nonetheless to state the condition explicitly so as not to set oneself up for failure. In a strict sense, all the effort directed at reading books, blogs, news, analyses, and basically anything else written by other people would be wasted. On top of that, if these authors really had an informational edge, why would they sell it to you instead of acting on it themselves and gaining much more in the process? I think the argument in its strict form is not always true, however, since value investing, for example, capitalizes on superior emotional stamina instead of an informational advantage.

Instead of following well-known investing strategies, I’m focusing my efforts on developing systems that generate truly unique strategies. For that, it’s useful to state the fundamental principles that excellent investing strategies must abide by, and to reason upward from there.

  1. Diversification with uncorrelated (or even better negatively correlated) return streams
  2. Compounding
  3. Reducing the risk of catastrophic losses to practically zero
  4. Having an edge

The standard deviation of a portfolio’s return stream declines with the square root of the number of uncorrelated assets (n) within that equally weighted portfolio. It is critically important that these assets (or rather the income streams from trading these assets, which are not the same thing) are uncorrelated. If you had a portfolio of 30 equally weighted assets, the income streams of these assets were 60% correlated with each other, and each stream had a Sharpe ratio (the expected excess rate of return divided by its standard deviation) of 0.2, then the portfolio Sharpe ratio would be: Sharpe_c = n*0.2/(n + 2*n(n − 1)/2 *0.6)^0.5 = 0.255. Combining correlated assets only results in a small reduction in risk. Contrast this with the Sharpe ratio assuming uncorrelated assets: Sharpe_u = n*0.2/(n)^0.5 = n^0.5 * 0.2 = 1.095. Here the covariance part of the variance vanishes. The result is a much higher reduction in risk. Furthermore, asymptotically, as n gets large, the risk approaches zero in the uncorrelated case but not in the correlated case. To illustrate, if you included 100 assets in the portfolio instead of 30, you would get a Sharpe ratio of Sharpe_u = 100^0.5*0.2 = 10*0.2 = 2 in the scenario where the return streams are uncorrelated. Again assuming 60% pairwise correlation, the Sharpe ratio is a mere Sharpe_c = 0.257 for the n=100 portfolio, almost no improvement over the n=30 portfolio.
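The four numbers above follow from one formula, which is easy to verify. The sketch assumes n equally weighted streams with identical volatility, identical stand-alone Sharpe ratio, and one common pairwise correlation; the function name is made up for this example.

```python
import math


def portfolio_sharpe(n, sharpe_single, rho):
    """Sharpe ratio of n equally weighted streams with pairwise correlation rho.

    The variance of the sum of n unit-variance streams is n + n*(n-1)*rho,
    which is the n + 2*n(n-1)/2*rho term in the text.
    """
    return n * sharpe_single / math.sqrt(n + n * (n - 1) * rho)


for n in (30, 100):
    correlated = portfolio_sharpe(n, 0.2, 0.6)
    uncorrelated = portfolio_sharpe(n, 0.2, 0.0)
    print(n, round(correlated, 3), round(uncorrelated, 3))
```

As n grows, the correlated case approaches 0.2/0.6^0.5, roughly 0.258, which is why adding more 60% correlated streams barely helps.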

As I, and this blog, started out with the value investing philosophy, I will briefly run this philosophy by the aforementioned principles. The core principles of value investing match points 2, 3, and 4; point 1, however, is at odds with it. Since most value investors are long-only stock pickers, diversification is a serious challenge for them. First, since a comprehensive fundamental analysis is necessary to evaluate whether or not the investor has an edge, he will soon reach the limit of his capacity. It is simply not feasible to evaluate thousands of opportunities to come up with a selected basket of, say, 100 stocks without compromising quality. But even if he could, it would not result in true diversification, as stocks are correlated and, as n gets large, the total variance is dominated by the covariance part. This is the reason why many value investors declare diversification beyond a certain threshold (typically between 5 and 50 stocks) useless and instead focus their funds on their most cherished ideas. If that’s not enough to convince you that value investing, as it is practiced by most individuals, has a problem with diversification, consider the fact that Asness et al. (2013) found that the value premium is correlated across asset classes. So even if a more sophisticated value investor considered a long/short value strategy across otherwise uncorrelated asset classes, the transformation his investment strategy applies to the asset return streams would yield correlated portfolio constituents. To be clear, I don’t discount value investing as a valuable part of a portfolio. It just can’t be an optimal strategy on its own.

Let’s return to the fundamental principles. The second principle is compounding. As Albert Einstein reportedly said, “Compound interest is the eighth wonder of the world. He who understands it, earns it … he who doesn’t … pays it.” Simply put, 10% compounded for 100 periods is not 100*10% = 1,000% but 1.1^100-1 ≈ 1,378,000%.
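The gap between simple and compound growth in this example takes two lines to reproduce; the variable names are only for illustration.

```python
rate, periods = 0.10, 100

simple_growth = rate * periods               # 10.0, i.e. 1,000%
compound_growth = (1 + rate) ** periods - 1  # about 13,780, i.e. roughly 1,378,000%

print(f"simple:   {simple_growth:,.0%}")
print(f"compound: {compound_growth:,.0%}")
```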

The third principle can be summarized as ‘everything times zero is zero’. It doesn’t matter how stellar one’s track record was, if there is one devastating year, it was all for nought.

The fourth principle is to have an edge. What is meant by this is basically that one has some kind of advantage over one’s competitors such that the expected value of one’s efforts is positive. The easiest (but not the only) way to gain such advantages is to carefully select one’s competitors. The harsh truth about trading the secondary markets is that you have to take the money from someone. It is arguably more probable to have an edge against less sophisticated investors than against sophisticated professionals. Luckily for the individual trader, there are some obvious ways to get out of the way of the most sophisticated professionals. Historically, this blog focused on small, obscure stocks that provide opportunities that are not economical for the professional to exploit. Another way is to trade intraday, arguing that there is not enough liquidity in this timeframe for larger funds to employ similar strategies. I’m going to argue that the latter approach is more rewarding as it also allows for more frequent trades and thus more occasions for the magic of compounding to do its thing.

Where does this rumination lead us? We need a system that generates many uncorrelated trading strategies, ideally trading very often. These strategies are then bundled with proper risk management into a portfolio. To this end I’ve developed a program that takes any data (price, fundamental, sentiment, satellite image data, etc.) and generates a trading strategy for a specified trading frequency. The core of this program is a recurrent neural network (RNN). More specifically, I use Long Short Term Memory (LSTM) units. LSTMs are a specific type of RNN that provide a solution to the vanishing gradient problem, as shown by Hochreiter and Schmidhuber (1997). Basically, LSTM cells can look further into the past than traditional RNNs. LSTM networks are one important driver behind the recent advances in machine translation and speech recognition, to give only two examples. For more examples read: http://karpathy.github.io/2015/05/21/rnn-effectiveness/.

As a test, I provided the framework only OHLC and datetime information for the EURUSD forex pair. The model extracted a Sharpe ratio of >5 and a CAGR of 60%. I’m the first one to point out that this is not an outcome you should expect to achieve in reality. But the model only gets price data, data that many believe has no predictive power whatsoever. For more information about this project refer to my Github page: https://github.com/jpwoeltjen.

Ugly vs. Pretty Value Stocks

As Warren Buffett rightly says, value and growth are joined at the hip.[1] It seems like a perfectly sensible strategy to pay more for high-quality businesses than for low-quality deep value stocks. And it is… in theory. In practice, it is extraordinarily difficult to find the right trade-off. Some systematic research has tried to improve value strategies by including a quality component. Quality, here, means anything one should be willing to pay for (e.g., ROE, ROIC, growth, profitability, etc.). And some of these studies show very counter-intuitive results.

A prominent example of a strategy that supplements a pure value ranking with a quality measure is Joel Greenblatt’s Magic Formula. In their outstanding book “Quantitative Value”, Wesley Gray and Tobias Carlisle show that the quality component actually decreases the performance relative to a portfolio based on a value ranking alone.[2] The likely reason for this is the mean-reverting nature of return on capital, the quality measure used. Economic theory predicts increased competition if companies demonstrate high returns on capital, and exits of competitors if returns are poor. The new competition decreases returns for all suppliers. Exits of competitors increase returns for the remaining businesses. Betting on businesses with historically high returns seems like a bad idea on average, then.

As ambitious bargain hunters, we try to find high-quality businesses at low prices. Studies, however, show that (in competitive markets) valuation is far more important than quality. And in some cases, due to naïve extrapolation by noise traders, quality is actually associated with lower returns. Lakonishok, Shleifer and Vishny (1994) study exactly that. Their results fly in the face of many investors working hard to find the ‘best’ value stocks. Lakonishok et al. construct value portfolios not only based on current valuation ratios but also on past growth. They define the contrarian value portfolio as having a high Book-to-Market ratio (B/M), the inverse of P/B, and low past sales growth (GS). The reasoning behind this is that, by the definition of Lakonishok et al., value strategies exploit other investors’ failure to factor reversion to the mean into their forecasts. This is a form of base rate neglect, a tendency in intuitive decision-making found by Kahneman and Tversky (1982).[3] Lakonishok et al. thus identify stocks with low expected future growth (valuation ratio) and low past growth (GS) that indicate naïve extrapolation of poor performance. They show that this definition of value performs better than a simple definition based only on a valuation ratio (e.g., B/M). Another way of looking at this is by subdividing the high B/M portfolio further into high and low past growth. The low past growth stocks outperform the high past growth stocks by more than 4 percentage points p.a. (21.2% vs. 16.8% p.a.) while the B/M ratios of these sub-portfolios “are not very different.” [4]

In his excellent book “Deep Value”, Tobias Carlisle shows insightful statistics for these portfolios. The incredible insight is that even if valuation ratios are practically the same, stocks that rank low on quality (past sales growth) perform better than high-quality stocks. One likely reason is mean reversion in fundamentals.

[Figure: contrarian investment]

Source: Carlisle, Tobias: Deep Value: Why Activist Investors and Other Contrarians Battle for Control of “Losing” Corporations, 2014, John Wiley & Sons, Inc., Hoboken, New Jersey, pp. 131-132.

Similar results also show up in the deepest of value strategies: net-nets. Oppenheimer (1986) shows that loss-making net-nets outperformed profitable net-nets (36.2% p.a. vs. 33.1%), and non-dividend-paying net-nets outperformed dividend-paying net-nets (40.6% vs. 27.0%) from 1970 to 1983. Carlisle confirms these results out of sample from 1983 to 2010.[5] My own backtests confirm these results from 1999 to 2015. My results, at least, are mainly driven by the higher discount, however: profitable businesses don’t usually trade at large discounts to NCAV.

Whether you are a full quant or not, if you are trying to pick the ‘best’ stocks from a value screen you are likely making a systematic mistake, unless you are searching for businesses with moats (i.e., a sustainable competitive advantage that prevents a high return on capital from reverting to the mean). But good luck finding such a business in deep value territory consistently.

Regression to the mean is such a strong tendency, and so systematically underestimated by market participants, that just betting on historically poorly performing businesses outperforms the market.[6] Bannister (2013) finds that betting on “unexcellent” companies (ranking low on growth, return on capital, profitability) outperformed the market from 1972 to 2013 (13.74% p.a. vs. 10.59%). A portfolio constructed of stocks of “excellent” businesses, in turn, underperformed the market (9.77%).[7]

I still think good quality measures (i.e., measures that do not implicitly bet against regression to the mean in fundamentals) are a potent tool for improving a value ranking. It is, however, not as easy as blindly layering a quality screen over a value screen and thereby implying equal weights. A category of quality measures that is of special interest to me is distress/bankruptcy prediction. But even if such a measure is very good at identifying value traps, there is still the very serious issue of false positives, that is, excluding stocks that actually perform well on average. An overly sensitive measure will likely exclude all the ugliest stocks, which are the ones that perform the best. More research is needed to determine a sensible weighting mechanism. The merit of such a measure depends on the false negative rate, the false positive rate, the cost of false negatives, and, importantly, the cost of false positives. The cost of false positives may be very high for concentrated portfolios. Even if the quality measure improves performance in studies, that doesn’t mean it will improve a concentrated value portfolio (20-30 stocks). The reason is that these studies often hold a very diversified portfolio (e.g., a decile), which is quite a number of stocks. If the quality factor excludes 20 extremely cheap stocks from such a portfolio, it’s not a big deal. If you were to hold the 30 cheapest stocks in the universe, however, and the quality factor excludes 20 of them and the next cheapest stocks have 2 times the valuation ratio, it is very likely that the performance will suffer, because the screen dilutes the value factor too much. The important thing is to actually backtest your portfolio and not just rely on studies.
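The dilution effect in the 30-stock example can be checked with a few lines of Python. This is only a sketch under made-up assumptions: a universe of 100 hypothetical stocks, the 30 cheapest at a valuation ratio of 3, the rest at 6 (i.e., 2 times the ratio), and a hypothetical quality screen that flags 20 of the 30 cheapest names.

```python
def build_portfolio(ranked, size, excluded=frozenset()):
    """Take the `size` cheapest names from `ranked`, skipping excluded ones.

    `ranked` is a list of (name, valuation_ratio) sorted cheapest first.
    """
    kept = [(name, ratio) for name, ratio in ranked if name not in excluded]
    return kept[:size]


def average_ratio(portfolio):
    """Equal-weighted average valuation ratio of a portfolio."""
    return sum(ratio for _, ratio in portfolio) / len(portfolio)


# Hypothetical universe, sorted cheapest first:
# 30 stocks at a ratio of 3.0, then 70 stocks at 6.0.
ranked = [(f"S{i:03d}", 3.0) for i in range(30)]
ranked += [(f"S{i:03d}", 6.0) for i in range(30, 100)]

# Hypothetical quality screen that flags 20 of the 30 cheapest stocks.
flagged = {f"S{i:03d}" for i in range(20)}

plain = build_portfolio(ranked, 30)
screened = build_portfolio(ranked, 30, flagged)

plain_avg = average_ratio(plain)
screened_avg = average_ratio(screened)
print(plain_avg, screened_avg)  # 3.0 5.0
```

Under these assumptions the screened portfolio still holds 30 stocks, but only 10 of them come from the cheapest group, so its average valuation ratio rises from 3.0 to 5.0. Whether the quality screen is worth that much dilution is exactly what a backtest of the concentrated portfolio has to answer.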

Another interesting area of research lies in identifying moats that prevent mean reversion in high-return businesses. That, however, still leaves open the question of whether these businesses are systematically undervalued.


[1] Buffett, Warren: Letter to the Shareholders of Berkshire Hathaway Inc., 1992.

[2] Gray, Wesley and Carlisle, Tobias: Quantitative Value: A Practitioner’s Guide to Automating Intelligent Investment and Eliminating Behavioral Errors, 2013, John Wiley & Sons, Inc., Hoboken, New Jersey, Table 11.1.

[3] Kahneman, Daniel and Tversky, Amos: Intuitive Prediction: Biases and Corrective Procedures, in D. Kahneman, P. Slovic, and A. Tversky, Eds.: Judgment Under Uncertainty: Heuristics and Biases, 1982, Cambridge University Press, Cambridge, England.

[4] Lakonishok, Josef, Shleifer, Andrei and Vishny, Robert: Contrarian Investment, Extrapolation, and Risk, The Journal of Finance 49, no. 5, 1994, p. 1555. http://www.jstor.org/stable/2329262 OR http://lsvasset.com/pdf/research-papers/Contrarian-Investment-Extrapolation-and-Risk.pdf

[5] Carlisle, Tobias: Deep Value: Why Activist Investors and Other Contrarians Battle for Control of “Losing” Corporations, 2014, John Wiley & Sons, Inc., Hoboken, New Jersey, p. 133.

[6] Lakonishok, Josef, Shleifer, Andrei and Vishny, Robert: Contrarian Investment, Extrapolation, and Risk, The Journal of Finance 49, no. 5, 1994, p. 1549. http://www.jstor.org/stable/2329262 OR http://lsvasset.com/pdf/research-papers/Contrarian-Investment-Extrapolation-and-Risk.pdf

[7] Bannister, Barry, Stifel Financial Corp., and Eyquem Investment Management LLC, from Carlisle, Tobias: Deep Value: Why Activist Investors and Other Contrarians Battle for Control of “Losing” Corporations, 2014, John Wiley & Sons, Inc., Hoboken, New Jersey, p. 137.

Limits of Arbitrage

Many value investors acknowledge that there are many other smart traders, but believe these other traders somehow don’t understand value investing. It appears that a lot of value investors are hugely overconfident when it comes to their special insight, i.e., that value investing works and others just don’t get it. Yet there is overwhelming evidence that value investing does work and continues to work even after a lot has been written about it. So why does the value premium persist? Fortunately, there are better explanations than ignorance. Behavioral finance tries to explain the outperformance of value strategies by differentiating between noise traders, arbitrageurs, and the arbitrageurs’ clients. On the one hand, there has to be someone who sells an asset at a price below fundamental value (the noise trader), probably due to some bias, e.g., extending the recent negative earnings trend too far into the future and thereby ignoring regression to the mean. On the other hand, there has to be some reason why professional traders with vast resources do not arbitrage this price/value gap away immediately. This is crucial but often ignored. Shleifer and Vishny (1997) explore a possible reason why mispricings may occur even if specialized arbitrageurs are knowledgeable and rational.[1] They do this by assuming that the arbitrageur and the owner of the invested money are two separate entities. According to their model, the arbitrageur’s clients update their prior beliefs about the arbitrageur’s competence based on the recent performance of their investments. Understanding the limits of arbitrage can help us separate undervalued assets from superficially cheap but not actually underpriced assets.

The key insights are:

  • In academia, arbitrage is typically defined as riskless and requiring no capital. In practice, however, it does require capital (usually part of it from outside investors) and is associated with several forms of risk.
  • Arbitrage is typically performed by specialized traders who are not well diversified.
  • Especially in value situations, assets can decline further in price in the short run, even if they are good bets in the long run.
  • Clients do not have perfect knowledge of the arbitrageur’s competence. It can thus be a rational choice to withdraw capital from an underperforming manager. This forces the manager to sell off assets, even though the expected return actually increased after the price drop.
  • An agency problem breaks down the link between greater mispricing and higher expected return from the client’s perspective.
  • This can result in irrational prices while the arbitrageurs and their clients themselves act rationally.
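The client's reasoning in the fourth point can be illustrated with a simple Bayes' rule calculation. This is not the Shleifer and Vishny model itself, only a two-type sketch: the manager is either skilled or unskilled, and the prior (0.5) and the probabilities of underperforming in a given period (0.4 if skilled, 0.6 if unskilled) are assumptions chosen for illustration.

```python
def posterior_skilled(prior, p_bad_if_skilled, p_bad_if_unskilled):
    """P(manager is skilled | one period of underperformance), by Bayes' rule."""
    numerator = prior * p_bad_if_skilled
    evidence = numerator + (1 - prior) * p_bad_if_unskilled
    return numerator / evidence


# Assumed inputs: the client starts out undecided, and even a skilled
# value manager underperforms in 40% of periods.
belief = 0.5
history = []
for period in range(3):
    belief = posterior_skilled(belief, p_bad_if_skilled=0.4, p_bad_if_unskilled=0.6)
    history.append(round(belief, 3))

print(history)  # [0.4, 0.308, 0.229]
```

After three bad periods in a row, a client who updates rationally puts only about 23% on the manager being skilled, even though a skilled manager produces such a streak fairly often under these assumptions. Withdrawing capital is then a reasonable decision for the client, although it forces selling at exactly the moment the expected return is highest.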

In which situations is arbitrage most limited then?

  • First and foremost: Small Size. The absolute dollar amount that can be earned arbitraging in extremely small situations is just too small to make the return on invested resources attractive for professional fund managers. This just leaves individual investors. But in the smallest situations, even these investors have to be either inexperienced or so far unsuccessful, or they would have gathered enough capital to make it uneconomical for them as well. Sounds like weak competition to me!
  • For professional fund managers, very volatile markets increase the risk of looking incompetent in the short term. Therefore, all else equal, we should expect less arbitrage in volatile markets.
  • The risk of further price declines is greater in situations that take a longer time to play out and are unpredictable. Hence, we should see more arbitrage in strategies that play out (at least partially) before clients can withdraw capital and less when there may be many months or even years of underperformance before the manager is eventually proven right.
  • Building on that point, this effect should be more severe for assets where there is clearly something wrong, i.e., the typical deep value stock. On average, betting on value stocks can be a good idea, but you can look extremely incompetent on any given investment. Hindsight bias compounds this issue. Value situations that do not play out look like stupid investments in hindsight. It might thus be a good idea, as an individual investor, to explicitly focus on situations where one can look extremely incompetent or negligent on any individual investment to an ignorant outsider (incompetent or self-serving corporate insiders, industry downturn, loss of a major customer, negative earnings trend, regulatory issues, etc.).

Of course, identifying areas where mispricings are likely is not by itself a viable investment approach. Assets can be undervalued as well as overvalued. But combined with a value ranking, e.g., EV/EBIT, searching in less efficient markets can reduce the risk of buying statistically cheap stocks whose prices are actually justified. This is a completely different approach to excluding value traps than the typical qualitative assessment.

[1] Shleifer, Andrei, and Vishny, Robert W. “The Limits of Arbitrage.” The Journal of Finance 52, no. 1 (1997): 35-55. http://links.jstor.org/sici?sici=0022-1082%28199703%2952%3A1%3C35%3ATLOA%3E2.0.CO%3B2-3