I found this open data from from Technical University of Delft.

#collapse
import sklearn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import altair as alt
from io import StringIO
import re
import urllib

5-Step Systematic Process

I will approach this by trying to following the the following process:

Define the Problem
Prepare Data
Spot Check Algorithms
Improve Results
Present Results

As described here)

1. Define the Problem

Step 1: What is the problem? Describe the problem informally and formally and list assumptions and similar problems.

To predict the residuary resistance of sailing yachts to be used in the the initial design stage.

Step 2: Why does the problem need to be solved? List your motivation for solving the problem, the benefits a solution provides and how the solution will be used.

Martin needs practice.

Step 3: How would I solve the problem? Describe how the problem would be solved manually to flush domain knowledge.

Regression on the Delft data set will be conducted. The regression will be performed initially in the most simplistic way, where each row in the data will be treated as a unique individual. But in reality the rows are of course connected to each other if they are from the same ship. This will maybe be handled at a later stage.

2. Prepare Data

I followed a process as described here.

Step 1: Data Selection:

Consider what data is available, what data is missing and what data can be removed. The Delft data will be used which comprises 308 full-scale experiments, which were performed at the Delft Ship Hydromechanics Laboratory for that purpose. These experiments include 22 different hull forms, derived from a parent form closely related to the Standfast 43 designed by Frans Maas. I think that all of the data can be used

Step 2: Data Preprocessing:

Organize your selected data by formatting, cleaning and sampling from it.

Load data

Columns:

Column	Variable	Description
1.	lcg	Longitudinal position of the center of buoyancy, adimensional.
2.	cp	Prismatic coefficient, adimensional.
3.	volume	Length-displacement ratio, adimensional.
4.	b/d	Beam-draught ratio, adimensional.
5.	l/b	Length-beam ratio, adimensional.
6.	fn	Froude number, adimensional.
7.	r	residuary resistance per unit weight of displacement, adimensional

#collapse
columns = [
'lcg',   
'cp',    
'volume',
'b/d',   
'l/b',   
'fn',    
'r', 
]

data_url = r'http://archive.ics.uci.edu/ml/machine-learning-databases/00243/yacht_hydrodynamics.data'
with urllib.request.urlopen(data_url) as file:
    s_raw=file.read().decode("utf-8")
    
# remove some dirt:
regexp = re.compile(r' \n', flags=re.DOTALL)
s1 = regexp.sub('\n', s_raw)

regexp = re.compile(r' +', flags=re.DOTALL)
s2 = regexp.sub(' ', s1)
s2[0:200]
s=s2

data = StringIO(s)
data = pd.read_csv(data, sep=' ', encoding='utf-8', names=columns)

features = list(set(columns)-set(['r']))
label = 'r'

X=data[features].copy()
y=data[label].copy()

Initial look at data

#collapse
alt.Chart(data).mark_circle().encode(
    alt.X(alt.repeat("column"), type='quantitative', scale=alt.Scale(zero=False)),
    alt.Y(alt.repeat("row"), type='quantitative'),

).properties(
    width=100,
    height=150
).repeat(
    row=[label],
    column=features,
).interactive()

Step 3: Data Transformation:

Transform preprocessed data ready for machine learning by engineering features using scaling, attribute decomposition and attribute aggregation.

Scaling

The features are already made nondimensional, however it seems that the size of them differ a bit (ex: b/d in [3,5] and cp in [0.5, 0.6] so it might be worth to conduct some kind of more scaling to make them more equal.

Decomposition

This is something that I want to look more into, but not now.

Aggregation

No obvious aggregrations here

3. Spot Check Algorithms

#collapse
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=1)

First just looking at first order linear regression:

#collapse

linear_regression = LinearRegression()
steps = [
    ('linear_regression', linear_regression),
]

pipeline_linear = Pipeline(steps=steps)

scores = cross_val_score(estimator=pipeline_linear, X=X, y=y, scoring='r2', cv=cv, n_jobs=-1)

fig,ax = plt.subplots()
ax.hist(scores);
ax.set_xlabel('score')
ax.set_ylabel('occurances')
ax.set_title('Histogram over cross validations');

Add a scaler:

#collapse
from sklearn.preprocessing import StandardScaler

standard_scaler = StandardScaler()

steps = [
    ('scaler', standard_scaler),
    ('linear_regression', linear_regression),
]

pipeline_linear_scaled = Pipeline(steps=steps)

Add polynomial features:

#collapse
from sklearn.preprocessing import PolynomialFeatures

polynomial_features = PolynomialFeatures(degree=2)

steps = [
    ('scaler', standard_scaler),
    ('polynomial_features', polynomial_features),
    ('linear_regression', linear_regression),
]

pipeline_polynomial_scaled = Pipeline(steps=steps)

Add feature selection:

#collapse

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression

select_k_best = SelectKBest(score_func=f_regression, k=4)

steps = [
    ('scaler', standard_scaler),
    ('polynomial_features', polynomial_features),
    ('select_k_best', select_k_best),
    ('linear_regression', linear_regression),
]

pipeline_polynomial_scaled_selection = Pipeline(steps=steps)

Spot checking

#collapse

models = {
    'linear':pipeline_linear,
    'linear scaled':pipeline_linear_scaled,
    'polynomial scaled':pipeline_polynomial_scaled,
    'polynomial scaled with feature selection':pipeline_polynomial_scaled_selection
         }

scores = {}
for model_name, model in models.items():
    scores[model_name] = cross_val_score(estimator=model, X=X, y=y, scoring='r2', cv=cv, n_jobs=-1)
    
df_cross_validation = pd.DataFrame()

for model_name, model in models.items():

    scores_ = cross_val_score(estimator=model, X=X, y=y, scoring='r2', cv=cv, n_jobs=-1)
    validations = np.arange(0,len(scores_))
    df_=pd.DataFrame()
    df_['validation']=validations
    df_['score']=scores_
    df_['model']=model_name
    df_cross_validation=df_cross_validation.append(df_, ignore_index=True)

df_scores = pd.DataFrame(scores)

ax = sns.boxplot(x='model', y='score', data=df_cross_validation)
ax.set_xticklabels(ax.get_xticklabels(),rotation=90);

...so the spot checking tells us that it is worth to use:

A higher order polynomial
And to do feature selection

4. Improve Results

Algorithm Tuning:

where discovering the best models is treated like a search problem through model parameter space.

The optimal parameters to the model: polynomial scaled with feature selection will be found by conducting a grid search.

#collapse

df_scores = pd.DataFrame(scores)

ax = sns.boxplot(x='model', y='score', data=df_cross_validation)
ax.set_xticklabels(ax.get_xticklabels(),rotation=90);

from sklearn.model_selection import GridSearchCV

# define the grid
grid = dict()
grid['select_k_best__k'] = [i for i in range(X.shape[1]-20, X.shape[1]+1)]
grid['polynomial_features__degree'] = [i for i in range(1, 10)]


# define the grid search
search = GridSearchCV(estimator=pipeline_polynomial_scaled_selection, param_grid=grid, scoring='neg_mean_absolute_error', n_jobs=-1, cv=cv)
# perform the search
search_result = search.fit(X, y)

model = search_result.best_estimator_
model

Pipeline(memory=None,
         steps=[('scaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('polynomial_features',
                 PolynomialFeatures(degree=4, include_bias=True,
                                    interaction_only=False, order='C')),
                ('select_k_best',
                 SelectKBest(k=6,
                             score_func=<function f_regression at 0x157639C0>)),
                ('linear_regression',
                 LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
                                  normalize=False))],
         verbose=False)

The model is the same thing as the following polynomial:

#collapse
def find_polynomial_feature(model):
    found = False
    for part in model:
        if isinstance(part, PolynomialFeatures):
            polynomial_features = part
            found = True
            break
    if not found:
        raise ValueError('model pipeline must contain an instance of PolynomialFeatures')
    
    return polynomial_features

def find_select_k_best(model):
    found = False
    for part in model:
        if isinstance(part, SelectKBest):
            select_k_best = part
            found = True
            break
    if not found:
        raise ValueError('model pipeline must contain an instance of SelectKBest')
        
    return select_k_best

def model_to_string(model:sklearn.pipeline.Pipeline, feature_names:list, divide=' '):
    
    # Find polynomial features:
    polynomial_features = find_polynomial_feature(model=model)
    
    # Find select_k_best:
    select_k_best = find_select_k_best(model=model)
    
    polynomial_feature_names = np.array(polynomial_features.get_feature_names())
    best_polynomial_feature_names = polynomial_feature_names[select_k_best.get_support()]
    
    predictor = model[-1]  # Last item in the pipeline is assumed to be the precictor
    coefficients = predictor.coef_
    interception = predictor.intercept_
    
    x_names = ['x%i'%i for i in range(len(feature_names))]
    
    expression = ''
    expression+='%f' % interception
    for part,coefficient in zip(best_polynomial_feature_names,coefficients):
        
        nice_part = part.replace(' ','*')
        super_nice_part = nice_part
        for feature_name,x in zip(feature_names,x_names):
            super_nice_part=super_nice_part.replace(x,feature_name)
        
        if coefficient==0:
            continue
        elif coefficient<0:
            sign=''
        else:
            sign='+'
        
        sub_part = '%s%s%s%f*%s' % (divide,sign,divide,coefficient,super_nice_part)
    
        
        expression+=sub_part
    
    return expression

print(model_to_string(model=model, feature_names=features))

2.936902 + 3.458262*fn + 4.446093*fn^2 + 4.785126*fn^3 + 0.315299*fn*cp^2  -0.066452*fn*volume^2 + 1.740994*fn^4

And a bit nicer with model->sympy->latex:

#collapse
from sympy.parsing.sympy_parser import parse_expr
def model_to_sympy(model:sklearn.pipeline.Pipeline, feature_names:list, label='y'):
    
    model_string = model_to_string(model=model, feature_names=feature_names)
    model_string = model_string.replace('^','**')
    lhs = sp.Symbol(label)
    rhs = parse_expr(model_string)
    sympy_expression = sp.Eq(lhs=lhs, rhs=rhs)
    return sympy_expression

features_latex = {
    'b/d' : r'\frac{b}{d}',
    'l/b' : r'\frac{l}{b}', 
    'lcg' : r'L_{cg}',
    'fn' : 'F_n',
    'cp' : 'C_p', 
    'volume' : 'V',
}
latex_features = [features_latex[key] for key in features]

model_to_sympy(model=model, feature_names=latex_features, label='R')

Ensemble Methods:

where the predictions made by multiple models are combined.

Extreme Feature Engineering:

where the attribute decomposition and aggregation seen in data preparation is pushed to the limits.

scaler=model['scaler']
scaler.transform(X)

array([[ 0.09717082, -0.14870298,  0.05415696, -1.61245155,  0.16616243,
        -0.03418367],
       [ 0.09717082, -0.14870298,  0.05415696, -1.36438208,  0.16616243,
        -0.03418367],
       [ 0.09717082, -0.14870298,  0.05415696, -1.11631261,  0.16616243,
        -0.03418367],
       ...,
       [ 0.53568528, -1.92579533,  0.05415696,  1.11631261,  1.5423783 ,
        -1.7757515 ],
       [ 0.53568528, -1.92579533,  0.05415696,  1.36438208,  1.5423783 ,
        -1.7757515 ],
       [ 0.53568528, -1.92579533,  0.05415696,  1.61245155,  1.5423783 ,
        -1.7757515 ]])

X/scaler.transform(X)

	b/d	l/b	lcg	fn	cp	volume
0	41.061711	-21.317664	-42.469151	-0.077522	3.418342	-139.832866
1	41.061711	-21.317664	-42.469151	-0.109940	3.418342	-139.832866
2	41.061711	-21.317664	-42.469151	-0.156766	3.418342	-139.832866
3	41.061711	-21.317664	-42.469151	-0.230350	3.418342	-139.832866
4	41.061711	-21.317664	-42.469151	-0.362802	3.418342	-139.832866
...	...	...	...	...	...	...
303	7.896428	-1.417596	-42.469151	0.564358	0.389010	-2.444036
304	7.896428	-1.417596	-42.469151	0.431907	0.389010	-2.444036
305	7.896428	-1.417596	-42.469151	0.358323	0.389010	-2.444036
306	7.896428	-1.417596	-42.469151	0.311496	0.389010	-2.444036
307	7.896428	-1.417596	-42.469151	0.279078	0.389010	-2.444036