Copystock Methodology Notes

Motivation

Suppose you want exposure similar to holding NVIDIA (NVDA), but using only ETFs. One simple approach is to look for ETFs that hold NVDA at a large weight. However, even if that weight is 25% in a semiconductor ETF, every $100 you invest in the ETF gives you only $25 of exposure to NVDA, along with $75 of exposure to other stocks! Now imagine instead that you bought one ETF that was long NVDA and another ETF that was short some of those other stocks. You would deploy more dollars, but you would build a more concentrated exposure to the actual target stock.

This tool uses historical returns to find a combination of ETFs whose combined return closely tracks the stock's return.

Data

First, we grab a list of stocks and ETFs. Next, to limit the set of stocks, we keep roughly the 1,000 largest by market cap from that original list. We remove any security that Yahoo Finance indicates is based outside the US. This simplifies the analysis and allows us to use daily returns -- if we kept any Asia-based stocks or ETFs, the non-overlapping trading sessions would likely force us to use weekly returns instead. Single-name ETFs were also removed, since they trivially replicate the exposure.
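The filtering steps above can be sketched in a few lines of Python. This is only an illustration: the `Security` record, its field names, the `kind` labels and the tickers are all hypothetical, and the notes do not say how the country or the single-name flag is actually stored.

```python
from dataclasses import dataclass

@dataclass
class Security:
    ticker: str
    country: str       # hypothetical field: domicile as reported by the data source
    market_cap: float
    kind: str          # hypothetical labels: "stock", "etf", "single_name_etf"

def build_universe(securities, n_stocks=1000, home="United States"):
    """Return (stocks, etfs) after the size, country and single-name filters."""
    stocks = [s for s in securities if s.kind == "stock"]
    # Keep the largest n_stocks by market cap, then drop non-US names.
    largest = sorted(stocks, key=lambda s: s.market_cap, reverse=True)[:n_stocks]
    us_stocks = [s for s in largest if s.country == home]
    # Keep US ETFs only; single-name ETFs carry their own label and are excluded.
    us_etfs = [s for s in securities if s.kind == "etf" and s.country == home]
    return us_stocks, us_etfs

# Made-up records, just to exercise the filters.
securities = [
    Security("AAA", "United States", 300.0, "stock"),
    Security("BBB", "Japan", 200.0, "stock"),
    Security("CCC", "United States", 100.0, "stock"),
    Security("DDD", "United States", 50.0, "stock"),
    Security("ETF1", "United States", 10.0, "etf"),
    Security("ETF2", "Hong Kong", 5.0, "etf"),
    Security("SNGL", "United States", 1.0, "single_name_etf"),
]
stocks, etfs = build_universe(securities, n_stocks=3)
stock_tickers = [s.ticker for s in stocks]
etf_tickers = [s.ticker for s in etfs]
```

Note that the size cut is applied before the country filter, following the order in the text, so the final stock list can be shorter than `n_stocks`.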

Model Fitting

To keep things simple, we fit each portfolio of ETFs using one year of data (2023). For any particular stock, we have about 2,000 ETFs under consideration. Regressing one year of a stock's daily returns (roughly 250 observations) on 2,000 regressors is a recipe for disaster, since there are far more regressors than observations. We also want to output "human-sized" portfolios, targeting around 5 ETFs. So instead, we compute the 2023 return correlation of the stock with each of the 2,000 ETFs, then select the 10 most positively correlated and the 10 most negatively correlated ETFs as a smaller set of reasonable candidates.
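The correlation screen can be illustrated with a short NumPy sketch. The function name `screen_candidates` and the synthetic data are assumptions made for this example. The code also assumes the stock and ETF return series are already aligned on the same dates and that no ETF column is constant.

```python
import numpy as np

def screen_candidates(stock_ret, etf_ret, n_each=10):
    """Pick the n_each most positively and most negatively correlated ETFs.

    stock_ret: shape (T,) daily returns of the target stock
    etf_ret:   shape (T, K) daily returns of K ETFs on the same dates
    Returns (positive_idx, negative_idx, corr) with column indices into etf_ret.
    """
    s = stock_ret - stock_ret.mean()
    e = etf_ret - etf_ret.mean(axis=0)
    # Pearson correlation of the stock with every ETF column at once.
    corr = (e.T @ s) / (np.linalg.norm(e, axis=0) * np.linalg.norm(s))
    order = np.argsort(corr)                 # ascending by correlation
    negative_idx = order[:n_each]            # most negative first
    positive_idx = order[::-1][:n_each]      # most positive first
    return positive_idx, negative_idx, corr

# Synthetic example: columns 0-2 move with the stock, columns 3-5 against it.
rng = np.random.default_rng(0)
T, K = 250, 50
stock = rng.normal(0.0, 0.02, T)
etfs = rng.normal(0.0, 0.01, (T, K))
etfs[:, :3] += 0.5 * stock[:, None]
etfs[:, 3:6] -= 0.5 * stock[:, None]
pos, neg, corr = screen_candidates(stock, etfs, n_each=3)
```

With the real data, `n_each=10` would give the 20 candidates described above.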

Next, we run a lasso regression of the stock's returns on the returns of these candidates. We also apply a non-negativity constraint on the coefficients, which prevents answers that involve shorting ETFs; negative exposure can still enter through a long position in a negatively correlated ETF. The lasso penalty helps in two ways: it selects a smaller set of ETFs (variable selection), and it mitigates exploding offsetting exposures (for example, $10,001 long in a semiconductor ETF against $10,000 in a short semiconductor ETF). But what penalty should we use?
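For reference, one common way to write this constrained regression is shown here. Here r_t is the stock's return on day t, e_{k,t} is the return of candidate ETF k, w_k is its weight, K is the number of candidates, T is the number of trading days in the fitting year, and \lambda is the penalty. The 1/(2T) scaling and the absence of an intercept are conventions assumed for this sketch; the notes do not specify them.

```latex
\hat{w}(\lambda) \;=\; \arg\min_{w \in \mathbb{R}^{K},\; w_k \ge 0}\;
\frac{1}{2T} \sum_{t=1}^{T} \Big( r_t - \sum_{k=1}^{K} w_k \, e_{k,t} \Big)^{2}
\;+\; \lambda \sum_{k=1}^{K} w_k
```

Because every w_k is non-negative, the usual lasso term \sum_k |w_k| reduces to \sum_k w_k. This also shows why offsetting positions are discouraged: if two candidates have nearly opposite returns, adding the same amount \delta to both weights barely changes the fit but raises the penalty by 2\lambda\delta.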

The penalty differs from stock to stock. We apply an iterative approach that starts with a small penalty and increases it until the regression selects at most 5 ETFs. The output is a fixed set of weights, one for each selected ETF.
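A minimal sketch of this search follows, assuming a non-negative lasso solved by plain coordinate descent in NumPy. The notes do not say which solver, starting penalty or step size was used, so `nonneg_lasso`, `fit_at_most`, `alpha0` and `growth` are hypothetical names and values. scikit-learn's `Lasso` with `positive=True` is a library alternative for the inner fit.

```python
import numpy as np

def nonneg_lasso(X, y, alpha, n_iter=500, tol=1e-10):
    """Minimise (1/(2T)) * ||y - X @ w||^2 + alpha * sum(w) subject to w >= 0
    by cyclic coordinate descent. No intercept is fitted (an assumption)."""
    T, K = X.shape
    w = np.zeros(K)
    col_sq = (X ** 2).sum(axis=0) / T            # mean square of each column
    resid = np.asarray(y, dtype=float).copy()    # equals y - X @ w while w == 0
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(K):
            if col_sq[j] == 0.0:
                continue                         # skip all-zero columns
            # Covariance of column j with the residual excluding j's own term.
            rho = X[:, j] @ resid / T + col_sq[j] * w[j]
            new_wj = max(0.0, rho - alpha) / col_sq[j]   # shrink, then clip at 0
            delta = new_wj - w[j]
            if delta != 0.0:
                resid -= X[:, j] * delta
                w[j] = new_wj
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return w

def fit_at_most(X, y, max_names=5, alpha0=1e-8, growth=1.5, max_steps=200):
    """Raise the penalty geometrically until at most max_names weights are nonzero."""
    alpha = alpha0
    for _ in range(max_steps):
        w = nonneg_lasso(X, y, alpha)
        if np.count_nonzero(w) <= max_names:
            return w, alpha
        alpha *= growth
    raise RuntimeError("penalty search did not reach the target portfolio size")

# Synthetic example with 20 candidates, 8 of which truly matter.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 0.01, (250, 20))
true_w = np.zeros(20)
true_w[:8] = [0.5, 0.4, 0.3, 0.2, 0.1, 0.05, 0.05, 0.05]
y = X @ true_w + rng.normal(0.0, 0.002, 250)
weights, alpha = fit_at_most(X, y, max_names=5)
```

The geometric step is a design choice for the sketch: a smaller `growth` finds a penalty closer to the smallest one that meets the size target, at the cost of more fits. The lasso shrinks the surviving weights toward zero, so they are not the same as the weights from an unpenalised fit on the selected ETFs.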

Testing

For each stock, using the fixed weights from the regressions over 2023, we compute the performance of the replicating portfolio of ETFs. We measure the following quantities both in-sample and out-of-sample. These descriptions are simplified definitions; more information can be found on Wikipedia.