This file provides a walk-through of some of the analysis methods that could have been used in the assignment.
I also highlight some code patterns that might help you in future assignments.
Converting missing values to NaN using to_numeric
Passing parse_dates=True and setting index_col parses the dates and sets them as the index. This is usually a good idea. pd.to_numeric can be used to convert a Series with strange missing value markers to NaN.
import pandas as pd
uk = pd.read_csv("DEXUSUK.csv", index_col="DATE", parse_dates=True).iloc[:, 0]
jp = pd.read_csv("DEXJPUS.csv", index_col="DATE", parse_dates=True).iloc[:, 0]
ch = pd.read_csv("DEXSZUS.csv", index_col="DATE", parse_dates=True).iloc[:, 0]
au = pd.read_csv("DEXUSAL.csv", index_col="DATE", parse_dates=True).iloc[:, 0]
uk = pd.to_numeric(uk, errors="coerce")
ch = pd.to_numeric(ch, errors="coerce")
au = pd.to_numeric(au, errors="coerce")
jp = pd.to_numeric(jp, errors="coerce")
jpuk = jp * uk
auch = au / ch
jpuk.name = "JPUK"
auch.name = "AUCH"
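As a quick illustration (made-up values, not the FRED files), errors="coerce" turns any value that cannot be parsed as a number into NaN:
raw = pd.Series(["1.25", "ND", "1.31"])      # "ND" is a hypothetical missing-value marker
clean = pd.to_numeric(raw, errors="coerce")  # 1.25, NaN, 1.31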
The CSV files were cleaned prior to import.
pd.to_datetime is a simple way to convert string or integer dates to pandas dates. The format argument accepts many different styles. The most important are:
%Y: 4-digit year
%m: 2-digit month
%d: 2-digit day
factors = pd.read_csv("ff-factors.csv")
factors["Date"] = pd.to_datetime(factors.Date, format="%Y%m%d")
mom = pd.read_csv("ff-momentum.csv")
mom["Date"] = pd.to_datetime(mom.Date, format="%Y%m%d")
factors = factors.set_index("Date")
mom = mom.set_index("Date")
ff_returns = pd.concat([factors, mom], axis=1)
# Add RF back to VWM, optional
# ff_returns["Mkt-RF"] = ff_returns["Mkt-RF"] + ff_returns["RF"].values
# Price index construction
ff = (1 + ff_returns / 100).cumprod()
ff.columns = ["vwm", "smb", "hml", "rf", "mom"]
ff = ff[["vwm", "smb", "hml", "rf", "mom"]]
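As a toy check (illustrative numbers only, not the Fama-French data), the cumulative product converts percentage returns into an index level relative to a starting value of 1:
toy_returns = pd.Series([1.0, -2.0, 0.5])      # returns in percent
toy_index = (1 + toy_returns / 100).cumprod()  # 1.0100, 0.9898, 0.9947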
The data was downloaded for the maximum period available.
kospi = pd.read_csv("^KS11.csv", index_col="Date", parse_dates=True)["Adj Close"]
bvsp = pd.read_csv("^BVSP.csv", index_col="Date", parse_dates=True)["Adj Close"]
dax = pd.read_csv("^GDAXI.csv", index_col="Date", parse_dates=True)["Adj Close"]
bond_fund = pd.read_csv("VBTLX.csv", index_col="Date", parse_dates=True)["Adj Close"]
kospi.name = "KOSPI"
bvsp.name = "BVSP"
dax.name = "DAX"
bond_fund.name = "bonds"
Here I combine all of the series into a DataFrame, which will let me loop over the columns easily. Date slicing works with Series and DataFrames that have DatetimeIndex indices. The slice includes all values in the range, including the end points. For example, [:"2019"] includes all data up-to-and-including December 31, 2019. You can loop over the columns of a DataFrame using the syntax:
for col in df:
    one_column = df[col]
    # do something useful with the column
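A small sketch (toy dates, not the assignment data) showing that string slices on a DatetimeIndex include both end points:
toy = pd.Series(1.0, index=pd.date_range("2018-12-30", "2020-01-05", freq="D"))
toy[:"2019"]           # everything up to and including 2019-12-31
toy["2019-06":"2019"]  # 2019-06-01 through 2019-12-31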
combined = pd.concat([ff, kospi, bvsp, dax, jpuk, auch, bond_fund], axis=1)
combined = combined[:"2019"]
combined.tail()
summary_statistics
One of the required functions for the autograder is presented below. I build a DataFrame from Series. Dictionaries (d = {}) are important for writing efficient code. They are also easy to convert to pandas Series, and dictionaries of Series can be easily converted into DataFrames. This is exactly how I build the DataFrame below. Dictionaries can also be looped over using the same syntax as a DataFrame:
d = {"a": 1, "b": 2, "c": 3}
for key in d:
    value = d[key]
    # do something useful
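A minimal sketch (toy numbers) of the two conversions used in the function below: a dict becomes a Series, and a dict of Series becomes a DataFrame.
moments = pd.Series({"mean": 0.08, "std": 0.20})        # dict -> Series
table = pd.DataFrame({"x": moments, "y": 2 * moments})  # dict of Series -> DataFrame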
import numpy as np
def summary_statistics(prices):
prices = prices.dropna()
values = {}
# A better approach to scale would be to use a dictionary
scales = {"D": 252, "W-FRI": 52, "M": 12, "Q": 4, "A": 1}
for freq in ("D", "W-FRI", "M", "Q", "A"):
resample = prices.resample(freq).last()
returns = resample.dropna().pct_change().dropna()
        # The better approach:
        # scale = scales[freq]
if freq == "D":
scale = 252
elif freq == "W-FRI":
scale = 52
elif freq == "M":
scale = 12
elif freq == "Q":
scale = 4
else:
scale = 1
# A dictionary to convert to a Series
moments = {
"mean": scale * returns.mean(),
"std": np.sqrt(scale) * returns.std(),
"skew": returns.skew(),
"kurt": returns.kurt() + 3,
}
# Store the Series in another dictionary.
values[freq] = pd.Series(moments)
all_freq = pd.DataFrame(values)
all_freq.columns = ["Daily", "Weekly", "Monthly", "Quarterly", "Annual"]
return all_freq
summary_statistics(combined.vwm)
Here I collect all of the analyses into groups by frequency by storing them in a dictionary.
def group_by_freq(prices, series):
# Empty dictionary where each will hold another dictionary
grouped = {}
for col in series:
temp = summary_statistics(prices[col])
for freq in temp:
single = temp[freq]
# Check if the key exists, if not, create it with an empty dictionary
if freq not in grouped:
grouped[freq] = {}
# Store the Series as an element in the grouped[freq] dictionary
grouped[freq][col] = single
    for freq in grouped:
        # Convert each inner dictionary to a DataFrame
        grouped[freq] = pd.DataFrame(grouped[freq])
return grouped
by_freq = group_by_freq(
combined,
["vwm", "smb", "hml", "mom", "DAX", "BVSP", "KOSPI", "AUCH", "JPUK", "bonds"],
)
by_freq["Daily"]
by_freq["Monthly"]
by_freq["Annual"]
There is another way to do the same, which is to use a MultiIndex where there are two levels in the index, freq and col.
all_stats = []
for col in ["vwm", "smb", "hml", "mom", "DAX", "BVSP", "KOSPI", "AUCH", "JPUK", "bonds"]:
stats = summary_statistics(combined[col])
stats.columns = pd.MultiIndex.from_tuples([(freq,col) for freq in stats])
all_stats.append(stats)
all_stats = pd.concat(all_stats, axis=1)
all_stats.loc[:,"Daily"]
all_stats.loc[:,"Weekly"]
I look at monthly stats for each decade. One could look at other sampling frequencies or decade definitions. Decades are selected using the DatetimeIndex, which lets blocks be selected with string dates, e.g., data["1950":"1959"], which selects all data with years in 1950 to 1959, inclusive.
def monthly_stats(prices, decade_starts):
values = {}
for decade_start in decade_starts:
sub = prices[str(decade_start) : str(decade_start + 9)]
values[decade_start] = summary_statistics(sub)["Monthly"]
return pd.DataFrame(values)
monthly_stats(combined.vwm, np.arange(1930, 2020, 10))
There are many ways to compare the statistics. I have plotted the statistics in 10-year blocks starting in every year (i.e., 1930-1939, 1931-1940, ...) along with the long-run value of the statistic.
import matplotlib.pyplot as plt
plt.rc("figure", figsize=(20, 10))
plt.rc("font", size=16)
def plot_monthly(series, all_decades=False):
decade_stats = {}
full = {}
for col in series:
data = combined[col].dropna()
first = (data.index[0].year // 10 + 1) * 10
last = (data.index[-1].year // 10 + 1) * 10
if all_decades:
first = data.index[0].year
decades = np.arange(first, 2011, 1)
else:
decades = np.arange(first, 2020, 10)
decade_stats[col] = monthly_stats(data, decades)
full[col] = summary_statistics(data)["Monthly"]
fig, ax = plt.subplots(4, 1)
for i, col in enumerate(decade_stats):
for j, stat in enumerate(decade_stats[col].index):
this_stat = decade_stats[col].loc[stat]
x = i * np.ones_like(this_stat)
ax[j].plot(x, this_stat, linewidth=0, marker="o", markersize=12, alpha=0.5)
ax[j].plot(i, full[col].loc[stat], linewidth=0, marker="s", markersize=24)
ax[j].set_title(stat)
for j in range(len(decade_stats[col].index)):
ax[j].set_xticks(np.arange(len(decade_stats)))
if j < (len(decade_stats) - 1):
ax[j].set_xticklabels([])
else:
ax[j].set_xticklabels(list(decade_stats.keys()))
fig.tight_layout(pad=1.0)
plot_monthly(["vwm", "smb", "hml", "mom"], all_decades=True)
plot_monthly(["DAX", "KOSPI", "BVSP", "bonds"], all_decades=True)
plot_monthly(["JPUK", "AUCH"], all_decades=True)
Finally, I compute the degrees of freedom (DoF) for all frequencies using both the MLE and the correct moment-based estimator,
$$ \hat{\nu}_{\text{Mom}} = \frac{4\kappa-6}{\kappa-3}, \quad \kappa > 3. $$
The MLE uses the same approach as in the companion course.
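The moment estimator comes from inverting the kurtosis of a Student's t, a standard result: for $\nu > 4$ the kurtosis is $\kappa = 3 + 6/(\nu-2-2)$, that is, $\kappa = 3 + 6/(\nu-4)$, so
$$ \kappa - 3 = \frac{6}{\nu - 4} \quad\Rightarrow\quad \hat{\nu} = 4 + \frac{6}{\kappa - 3} = \frac{4\kappa - 6}{\kappa - 3}. $$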
from scipy.special import gammaln
from scipy.optimize import minimize
def std_t_loglik(nu, x):
    # Negative log-likelihood of a standardized Student's t evaluated at x
    # These are fixed for now
    mu = 0
    sigma2 = 1
    sigma = np.sqrt(sigma2)
    a = gammaln((nu + 1) / 2)      # log Gamma((nu + 1) / 2)
    b = gammaln(nu / 2)            # log Gamma(nu / 2)
    c = np.sqrt(np.pi * (nu - 2))  # sqrt(pi * (nu - 2)) in the normalizing constant
    d = (nu + 1) / 2               # exponent on the kernel
    e = (x - mu) ** 2              # squared deviations from the mean
    f = sigma2 * (nu - 2)          # scale term in the kernel
    loglik = a - b - np.log(c) - np.log(sigma) - d * np.log(1 + e / f)
    return -(loglik.sum())
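For reference, the log-density that std_t_loglik implements, for a standardized Student's t with mean $\mu$, variance $\sigma^2$, and $\nu$ degrees of freedom, is
$$ \ln f(x;\nu) = \ln\Gamma\left(\tfrac{\nu+1}{2}\right) - \ln\Gamma\left(\tfrac{\nu}{2}\right) - \ln\sqrt{\pi(\nu-2)} - \ln\sigma - \frac{\nu+1}{2}\ln\left(1 + \frac{(x-\mu)^2}{\sigma^2(\nu-2)}\right), $$
and the function returns the negative of the sum over the observations so that minimize can be used.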
def moment_nu(rets):
kappa = rets.kurtosis() + 3
if kappa <= 3:
return float(np.inf)
return float((4 * kappa - 6) / (kappa - 3))
The DoF values are computed from the two functions and then stored in a dictionary. The key to the dictionary is a 2-element tuple of the form (column, frequency), which pandas automatically converts to a MultiIndex. This simplifies conversion later.
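A toy sketch of the pattern (illustrative numbers only): tuple keys become a 2-level MultiIndex on the columns.
toy = {("a", "D"): [1.0, 2.0], ("a", "M"): [3.0, 4.0]}
pd.DataFrame(toy, index=["MLE", "Moment"])  # columns form a 2-level MultiIndex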
dof = {}
for col in combined:
for freq in ("D", "W-FRI", "M", "Q", "A"):
key = (col, freq)
rets = (
combined[col].dropna().resample(freq).last().dropna().pct_change().dropna()
)
std_rets = (rets - rets.mean()) / rets.std()
starting_val = np.array([10])
opt = minimize(
std_t_loglik,
starting_val,
args=(std_rets,),
bounds=[(2.05, 100)],
options={"disp": False},
)
dof[key] = [float(np.squeeze(opt.x)), moment_nu(rets)]
dof = pd.DataFrame(dof, index=["MLE", "Moment"]).T
dof.head(15)
We can select the MLE values and then use unstack to transform the 2-level MultiIndex into a rectangular table. This will use the "outer" index value (the column name) as the new index, and the "inner" index values (the frequencies) as the columns.
mle = dof.MLE.unstack()[["D", "W-FRI", "M", "Q", "A"]]
mle
We can plot them by frequency to see if there is any obvious pattern.
ax = mle.T.plot(legend=False, marker="o", markersize=16, linewidth=0)
ax.set_ylim(0, 20)
_ = ax.legend(frameon=False, ncol=5)
moment = dof.Moment.unstack()[["D", "W-FRI", "M", "Q", "A"]]
moment
ax = moment.T.plot(legend=False, marker="o", markersize=16, linewidth=0)
ax.set_ylim(0, 20)
_ = ax.legend(frameon=False, ncol=5, loc=2)