Machine Learning API

Supervised and unsupervised learning models, validation, and preprocessing utilities.

Table of Contents

  1. Overview
  2. Classification Models
    1. KNNClassifier
      1. Options
      2. Methods
        1. .fit()
        2. .predict()
        3. .predictProba()
      3. Example
    2. KNNRegressor
    3. DecisionTreeClassifier
      1. Options
      2. Methods
    4. DecisionTreeRegressor
    5. RandomForestClassifier
      1. Options
      2. Methods
    6. RandomForestRegressor
    7. MLPClassifier
      1. Options
      2. Example
  3. Regression Models
    1. MLPRegressor
    2. PolynomialRegressor
      1. Options
    3. GAMRegressor / GAMClassifier
  4. Clustering
    1. KMeans
      1. Options
      2. Methods
      3. Example
    2. HCA
      1. Options
      2. Methods
      3. Example
  5. Preprocessing
    1. StandardScaler
      1. Methods
    2. MinMaxScaler
  6. Recipe API
    1. Creating a Recipe
    2. Steps
    3. Execution
      1. .prep()
      2. .bake(newData)
    4. Complete Example
  7. Validation
    1. trainTestSplit
    2. crossValidate
  8. Hyperparameter Tuning
    1. GridSearchCV
  9. Metrics
    1. Classification
    2. Regression
  10. Pipelines
    1. Pipeline
    2. GridSearchCV (pipeline-level)
  11. Gaussian Process Regression
    1. GaussianProcessRegressor
      1. Options
      2. Methods
        1. .fit(X, y)
        2. .predict(X, options)
        3. .samplePosterior(X, nSamples, options)
        4. .samplePrior(X, nSamples, options)
      3. Example
    2. Kernels
      1. RBF (Radial Basis Function)
      2. Matern
      3. Periodic
      4. RationalQuadratic
      5. ConstantKernel
      6. SumKernel
      7. Kernel Methods
  12. Clustering (continued)
    1. DBSCAN
      1. Options
      2. Methods
      3. Properties
      4. Example
  13. Outlier Detection
    1. IsolationForest
      1. Options
      2. Methods
      3. Example
    2. LocalOutlierFactor
      1. Options
      2. Methods
      3. Example
    3. MahalanobisDistance
      1. Options
      2. Methods
  14. Missing Data Imputation
    1. SimpleImputer
      1. Options
      2. Methods
      3. Example
    2. KNNImputer
      1. Options
      2. Methods
      3. Features
      4. Example
    3. IterativeImputer
      1. Options
      2. Methods
      3. Example

Overview

The ds.ml module provides:

  • Models: KNN, Decision Trees, Random Forests, GAMs, MLP (neural networks)
  • Clustering: K-Means, DBSCAN, HCA
  • Preprocessing: Scaling, encoding, pipelines
  • Validation: Train/test split, cross-validation
  • Tuning: Grid search
  • Metrics: Accuracy, R-squared, RMSE, F1
  • Recipe API: Chainable preprocessing workflows

Classification Models

KNNClassifier

K-Nearest Neighbors classifier.

new ds.ml.KNNClassifier(options)

Options

{
  k: number,              // Number of neighbors (default: 5)
  weight: string,         // 'uniform' or 'distance' (default: 'uniform')
  metric: string          // 'euclidean' (default)
}

Methods

.fit()

Array API:

model.fit(X, y)

Table API:

model.fit({
  data: trainData,
  X: ['feature1', 'feature2'],
  y: 'label',
  encoders: metadata.encoders  // Optional
})

.predict()

Array API:

const predictions = model.predict(XTest)

Table API:

const predictions = model.predict({
  data: testData,
  X: ['feature1', 'feature2'],
  encoders: metadata.encoders  // Optional: decode to strings
})

.predictProba()

Get probability estimates.

const probabilities = model.predictProba(XTest)
// [[0.8, 0.2], [0.3, 0.7], ...]

Example

const knn = new ds.ml.KNNClassifier({ k: 5, weight: 'distance' });

knn.fit({
  data: trainData,
  X: ['sepal_length', 'sepal_width'],
  y: 'species'
});

const predictions = knn.predict({
  data: testData,
  X: ['sepal_length', 'sepal_width']
});

KNNRegressor

K-Nearest Neighbors regressor.

new ds.ml.KNNRegressor(options)

Same API as KNNClassifier. Returns continuous predictions instead of classes.
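
A minimal sketch mirroring the KNNClassifier example above (the column names here are illustrative):

```javascript
const knnReg = new ds.ml.KNNRegressor({ k: 7, weight: 'distance' });

knnReg.fit({
  data: trainData,
  X: ['carat', 'depth'],
  y: 'price'
});

// Predictions are continuous values rather than class labels
const prices = knnReg.predict({ data: testData, X: ['carat', 'depth'] });
```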


DecisionTreeClassifier

Decision tree for classification.

new ds.ml.DecisionTreeClassifier(options)

Options

{
  maxDepth: number,       // Maximum tree depth (default: Infinity)
  minSamplesSplit: number // Minimum samples to split (default: 2)
}

Methods

  • .fit(X, y) or .fit({ data, X, y })
  • .predict(X) or .predict({ data, X })
  • .predictProba(X) - Probability estimates
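
A short sketch of the Table API listed above (dataset and column names are illustrative):

```javascript
const tree = new ds.ml.DecisionTreeClassifier({ maxDepth: 4, minSamplesSplit: 5 });

tree.fit({
  data: trainData,
  X: ['sepal_length', 'sepal_width'],
  y: 'species'
});

const predictions = tree.predict({ data: testData, X: ['sepal_length', 'sepal_width'] });
```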

DecisionTreeRegressor

Decision tree for regression.

new ds.ml.DecisionTreeRegressor(options)

Same options as DecisionTreeClassifier.


RandomForestClassifier

Random forest ensemble for classification.

new ds.ml.RandomForestClassifier(options)

Options

{
  nEstimators: number,    // Number of trees (default: 100)
  maxDepth: number,       // Max depth per tree (default: Infinity)
  maxFeatures: string,    // Feature subset: 'sqrt', 'log2', or number
  seed: number            // Random seed
}

Methods

  • .fit(X, y) or .fit({ data, X, y })
  • .predict(X) or .predict({ data, X })
  • .predictProba(X) - Probability estimates
  • .featureImportances() - Feature importance scores
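
A brief sketch combining the methods above; `features` is assumed to be an array of column names:

```javascript
const rf = new ds.ml.RandomForestClassifier({
  nEstimators: 200,
  maxFeatures: 'sqrt',
  seed: 42
});

rf.fit({ data: trainData, X: features, y: 'species' });

const predictions = rf.predict({ data: testData, X: features });

// One importance score per feature, in column order
const importances = rf.featureImportances();
```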

RandomForestRegressor

Random forest ensemble for regression.

new ds.ml.RandomForestRegressor(options)

Same API as RandomForestClassifier.


MLPClassifier

Multilayer Perceptron (neural network) classifier.

new ds.ml.MLPClassifier(options)

Options

{
  hiddenLayers: Array<number>,  // Neurons per hidden layer (default: [100])
  activation: string,           // 'relu', 'tanh', 'sigmoid' (default: 'relu')
  learningRate: number,         // Learning rate (default: 0.001)
  maxIter: number,              // Maximum iterations (default: 200)
  batchSize: number,            // Batch size (default: 'auto')
  solver: string,               // 'adam', 'sgd' (default: 'adam')
  alpha: number,                // L2 regularization (default: 0.0001)
  earlyStop: boolean,           // Early stopping (default: false)
  validationFraction: number    // Validation split (default: 0.1)
}

Example

const mlp = new ds.ml.MLPClassifier({
  hiddenLayers: [50, 30],
  activation: 'relu',
  learningRate: 0.01,
  maxIter: 300
});

mlp.fit({
  data: scaledTrainData,  // MLP requires scaled features
  X: features,
  y: 'species'
});

const predictions = mlp.predict({ data: scaledTestData, X: features });

Regression Models

MLPRegressor

Multilayer Perceptron for regression.

new ds.ml.MLPRegressor(options)

Same options as MLPClassifier.

const mlp = new ds.ml.MLPRegressor({
  hiddenLayers: [64, 32],
  activation: 'relu',
  learningRate: 0.001,
  maxIter: 500
});

mlp.fit({
  data: scaledTrain,
  X: ['carat', 'depth', 'table'],
  y: 'price'
});

const predictions = mlp.predict({ data: scaledTest, X: features });

PolynomialRegressor

Polynomial regression.

new ds.ml.PolynomialRegressor(options)

Options

{
  degree: number  // Polynomial degree (default: 2)
}
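
A minimal sketch, assuming PolynomialRegressor follows the same fit/predict Table API as the other regressors:

```javascript
// degree 2 fits y ≈ b0 + b1*x + b2*x^2
const poly = new ds.ml.PolynomialRegressor({ degree: 2 });

poly.fit({ data: trainData, X: ['carat'], y: 'price' });
const predictions = poly.predict({ data: testData, X: ['carat'] });
```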

GAMRegressor / GAMClassifier

Generalized Additive Models.

new ds.ml.GAMRegressor(options)
new ds.ml.GAMClassifier(options)
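
A hedged sketch, assuming the GAM models share the common fit/predict Table API used elsewhere in this module (verify the options against your version):

```javascript
const gam = new ds.ml.GAMRegressor({});

gam.fit({
  data: trainData,
  X: ['carat', 'depth'],
  y: 'price'
});

const predictions = gam.predict({ data: testData, X: ['carat', 'depth'] });
```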

Clustering

KMeans

Partition data into k clusters minimizing within-cluster sum of squares.

new ds.ml.KMeans(options)

Options

{
  k: number,        // Number of clusters (default: 3)
  maxIter: number,  // Maximum iterations (default: 300)
  tol: number,      // Convergence tolerance (default: 1e-4)
  seed: number      // Random seed
}

Methods

  • .fit(X) or .fit({ data, columns })
  • .predict(X) - Assign new points to nearest centroid
  • .silhouetteScore(X, labels) - Compute silhouette score
  • .summary() - Iterations, inertia, convergence, centroids
  • .toJSON() / KMeans.fromJSON() - Persistence

Example

const km = new ds.ml.KMeans({ k: 3, seed: 42 });
km.fit({
  data: iris,
  columns: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
});

console.log(km.labels);     // Cluster assignments
console.log(km.centroids);  // Cluster centers

HCA

Hierarchical agglomerative clustering.

new ds.ml.HCA(options)

Options

{
  linkage: string,       // 'single', 'complete', 'average', 'ward' (default: 'average')
  omit_missing: boolean  // Drop rows with NaN (default: true)
}

Methods

  • .fit(X) or .fit({ data, columns })
  • .cut(k) - Return cluster labels for k clusters
  • .cutHeight(height) - Cut dendrogram at a distance threshold
  • .summary() - Linkage, observations, merge count, max distance
  • .toJSON() / HCA.fromJSON()

Example

const hca = new ds.ml.HCA({ linkage: 'ward' });
hca.fit({
  data: penguins,
  columns: ['bill_length_mm', 'flipper_length_mm', 'body_mass_g']
});

const labels = hca.cut(3);

Preprocessing

StandardScaler

Standardize features to zero mean and unit variance: z = (x - mean) / std.

new ds.ml.preprocessing.StandardScaler()

Methods

  • .fit({ data, columns }) - Compute mean and std from training data
  • .transform({ data, columns }) - Apply standardization
  • .fitTransform(X) - Fit and transform in one step

Important: Always fit on training data, then transform both train and test.

const scaler = new ds.ml.preprocessing.StandardScaler();
scaler.fit({ data: trainData, columns: numericFeatures });

const trainScaled = scaler.transform({ data: trainData, columns: numericFeatures });
const testScaled = scaler.transform({ data: testData, columns: numericFeatures });

MinMaxScaler

Scale features to [0, 1] range: x_scaled = (x - min) / (max - min).

new ds.ml.preprocessing.MinMaxScaler()

Same API as StandardScaler.
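
The formula can be checked with a few lines of plain JavaScript (this helper is an illustration, not part of ds.ml):

```javascript
// Min-max scale an array of numbers to [0, 1]; assumes max > min
function minMaxScale(values) {
  const min = Math.min(...values);
  const max = Math.max(...values);
  return values.map(x => (x - min) / (max - min));
}

minMaxScale([10, 15, 20]);  // [0, 0.5, 1]
```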


Recipe API

Chainable preprocessing workflows.

const recipe = ds.ml.recipe(config)
  .parseNumeric(columns)
  .oneHot(columns, options)
  .scale(columns, options)
  .split(options)

Creating a Recipe

const recipe = ds.ml.recipe({
  data: myData,
  X: ['feature1', 'feature2', 'category'],
  y: 'target'
})

Steps

  • .parseNumeric(columns) - Convert string columns to numbers
  • .clean(validCategories) - Remove rows with invalid categories
  • .oneHot(columns, { dropFirst }) - One-hot encode categorical columns
  • .scale(columns, { method }) - Scale numeric columns ('standard' or 'minmax')
  • .split({ ratio, shuffle, seed }) - Split into train/test sets

Execution

.prep()

Execute recipe and fit all transformers on training data.

const prepped = recipe.prep()

Returns:

{
  train: { data, X, y, metadata },
  test: { data, X, y, metadata },
  transformers: { scale, oneHot },
  steps: [...]
}

.bake(newData)

Apply fitted transformers to new data.

const newPrepped = recipe.bake(newData)

Complete Example

const recipe = ds.ml.recipe({
  data: diamondsData,
  X: ['carat', 'depth', 'table', 'cut', 'color'],
  y: 'price'
})
  .parseNumeric(['carat', 'depth', 'table', 'price'])
  .oneHot(['cut', 'color'], { dropFirst: false })
  .scale(['carat', 'depth', 'table'], { method: 'standard' })
  .split({ ratio: 0.8, shuffle: true, seed: 42 });

const prepped = recipe.prep();

const model = new ds.ml.MLPRegressor({ hiddenLayers: [64, 32] });
model.fit({
  data: prepped.train.data,
  X: prepped.train.X,
  y: prepped.train.y
});

// Apply to new data (uses fitted transformers)
const newPrepped = recipe.bake(newDiamonds);
const predictions = model.predict({ data: newPrepped.data, X: newPrepped.X });

Validation

trainTestSplit

Split data into training and testing sets.

Table API:

const split = ds.ml.validation.trainTestSplit(
  { data: myData, X: features, y: 'target' },
  { ratio: 0.7, shuffle: true, seed: 42 }
)

// Returns: { train: { data, X, y, metadata }, test: { ... } }

crossValidate

Perform k-fold cross-validation.

const cv = ds.ml.validation.crossValidate(
  (Xtr, ytr) => new ds.ml.KNNClassifier({ k: 5 }).fit(Xtr, ytr),
  (model, Xte, yte) => ds.ml.metrics.accuracy(yte, model.predict(Xte)),
  { data: myData, X: features, y: 'species' },
  { k: 5, shuffle: true }
);

console.log(`Mean accuracy: ${cv.scores.mean()}`);
console.log(`Std: ${cv.scores.std()}`);

Hyperparameter Tuning

GridSearchCV

Exhaustive search over parameter grid.

const paramGrid = {
  k: [3, 5, 7, 11],
  weight: ['uniform', 'distance']
};

const grid = ds.ml.tuning.GridSearchCV(
  (Xtr, ytr, params) => new ds.ml.KNNClassifier(params).fit(Xtr, ytr),
  (model, Xte, yte) => ds.ml.metrics.accuracy(yte, model.predict(Xte)),
  { data: trainData, X: features, y: 'species' },
  null,
  paramGrid,
  { k: 5, shuffle: true }
);

console.log('Best params:', grid.bestParams);
console.log('Best score:', grid.bestScore);

Metrics

Classification

ds.ml.metrics.accuracy(yTrue, yPred)       // Fraction correct
ds.ml.metrics.confusionMatrix(yTrue, yPred) // 2D array
ds.ml.metrics.f1Score(yTrue, yPred)         // F1 score

Regression

ds.ml.metrics.r2Score(yTrue, yPred)  // R-squared (1.0 = perfect)
ds.ml.metrics.rmse(yTrue, yPred)     // Root Mean Squared Error
ds.ml.metrics.mae(yTrue, yPred)      // Mean Absolute Error
ds.ml.metrics.mse(yTrue, yPred)      // Mean Squared Error
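
Each metric takes parallel arrays of true and predicted values, so they compose directly with any model's .predict() output. A small sketch:

```javascript
const yTrue = [3.0, 2.5, 4.0, 5.1];
const yPred = [2.8, 2.7, 3.9, 5.0];

const r2 = ds.ml.metrics.r2Score(yTrue, yPred);
const error = ds.ml.metrics.rmse(yTrue, yPred);
```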

Pipelines

Pipeline

Chain preprocessing and model steps.

new ds.ml.Pipeline(steps)
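
The exact shape of `steps` is not documented here; the sketch below assumes the common convention of an array of [name, step] pairs ending in a model, with fit/predict exposed on the pipeline itself. Verify both assumptions against your version:

```javascript
// Assumed steps format: [name, transformer-or-model] pairs
const pipe = new ds.ml.Pipeline([
  ['scale', new ds.ml.preprocessing.StandardScaler()],
  ['model', new ds.ml.KNNClassifier({ k: 5 })]
]);

pipe.fit(X_train, y_train);
const predictions = pipe.predict(X_test);
```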

GridSearchCV (pipeline-level)

ds.ml.GridSearchCV(fitFn, scoreFn, X, y, paramGrid, { k: 5 })

Gaussian Process Regression

GaussianProcessRegressor

Gaussian Process regression with uncertainty quantification.

new ds.ml.GaussianProcessRegressor(options)

Options

{
  kernel: string|Kernel,  // 'rbf', 'periodic', 'matern', 'rationalquadratic', 'constant', or Kernel instance
  lengthScale: number,    // Length scale (default: 1.0)
  variance: number,       // Signal variance / amplitude (default: 1.0)
  alpha: number,          // Noise level / regularization (default: 1e-10)
  period: number,         // Period for periodic kernel
  nu: number              // Smoothness for Matern kernel (0.5, 1.5, 2.5, or Infinity)
}

Methods

.fit(X, y)

Fit the GP to training data.

gp.fit(X_train, y_train)

.predict(X, options)

Make predictions with optional uncertainty.

// Mean predictions only
const predictions = gp.predict(X_test);

// With standard deviation
const { mean, std } = gp.predict(X_test, { returnStd: true });

// With full covariance matrix
const { mean, covariance } = gp.predict(X_test, { returnCov: true });

.samplePosterior(X, nSamples, options)

Draw samples from the posterior distribution.

const samples = gp.samplePosterior(X_test, 5, { seed: 42 });
// Returns array of 5 sample functions evaluated at X_test

.samplePrior(X, nSamples, options)

Draw samples from the prior distribution (before seeing data).

const priorSamples = gp.samplePrior(X_test, 3);

Example

const gp = new ds.ml.GaussianProcessRegressor({
  kernel: 'rbf',
  lengthScale: 1.0,
  variance: 1.0,
  alpha: 0.1
});

gp.fit(X_train, y_train);

const { mean, std } = gp.predict(X_test, { returnStd: true });

// Draw posterior samples for visualization
const samples = gp.samplePosterior(X_test, 10);

Kernels

Kernel functions for Gaussian Processes. All kernels support both positional and object-style construction.

RBF (Radial Basis Function)

Also known as Squared Exponential or Gaussian kernel. Produces very smooth functions.

new ds.ml.RBF(lengthScale, variance)
// or
new ds.ml.RBF({ lengthScale: 1.0, amplitude: 1.0 })

Formula: k(x1, x2) = variance * exp(-||x1 - x2||^2 / (2 * lengthScale^2))
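
The formula is easy to evaluate by hand in plain JavaScript (this helper is an illustration, not part of ds.ml):

```javascript
// k(x1, x2) = variance * exp(-||x1 - x2||^2 / (2 * lengthScale^2))
function rbf(x1, x2, lengthScale, variance) {
  const sqDist = x1.reduce((s, xi, i) => s + (xi - x2[i]) ** 2, 0);
  return variance * Math.exp(-sqDist / (2 * lengthScale ** 2));
}

rbf([0], [0], 1, 1);  // 1: identical points give the full variance
```

Covariance decays smoothly toward 0 as points move apart, which is what produces the smooth sample functions.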

Matern

Matern kernel with configurable smoothness. More flexible than RBF.

new ds.ml.Matern({ lengthScale: 1.0, nu: 1.5, amplitude: 1.0 })

Supported nu values:

  • 0.5 - Exponential kernel (rough, non-differentiable)
  • 1.5 - Once differentiable (default)
  • 2.5 - Twice differentiable
  • Infinity - Equivalent to RBF (infinitely differentiable)

Periodic

For modeling repeating/seasonal patterns.

new ds.ml.Periodic(lengthScale, period, variance)

Parameters:

  • period - Distance between repetitions
  • lengthScale - Smoothness within each period

RationalQuadratic

Mixture of RBF kernels with different length scales. Good for multi-scale patterns.

new ds.ml.RationalQuadratic(lengthScale, alpha, variance)
// or
new ds.ml.RationalQuadratic({ lengthScale: 1.0, alpha: 1.0, amplitude: 1.0 })

ConstantKernel

Returns a constant covariance. Useful for combining with other kernels.

new ds.ml.ConstantKernel({ value: 1.0 })

SumKernel

Combines multiple kernels by summing their outputs.

new ds.ml.SumKernel({
  kernels: [new ds.ml.RBF(1.0), new ds.ml.Periodic(1.0, 7.0)]
})

Kernel Methods

All kernels support:

  • .compute(x1, x2) - Compute covariance between two points
  • .call(X1, X2) - Compute covariance matrix between sets of points
  • .getParams() - Get current parameters
  • .setParams(params) - Update parameters
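
A brief sketch of these methods on an RBF kernel; `x1`, `x2`, `X1`, and `X2` are placeholders for your own points and point sets:

```javascript
const kernel = new ds.ml.RBF({ lengthScale: 2.0, amplitude: 1.0 });

const k12 = kernel.compute(x1, x2);  // scalar covariance between two points
const K = kernel.call(X1, X2);       // covariance matrix between point sets

const params = kernel.getParams();
kernel.setParams({ lengthScale: 0.5 });
```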

Clustering (continued)

DBSCAN

Density-Based Spatial Clustering of Applications with Noise. Finds clusters of arbitrary shape and identifies outliers as noise.

new ds.ml.DBSCAN(options)

Options

{
  eps: number,        // Maximum distance for neighborhood (default: 0.5)
  minSamples: number  // Minimum points to form dense region (default: 5)
}

Methods

  • .fit(X) or .fit({ data, columns }) - Cluster the data
  • .predict(X) - Assign new points to nearest cluster or noise (-1)
  • .summary() - Get clustering statistics

Properties

dbscan.labels           // Cluster assignments (-1 = noise, 0+ = cluster ID)
dbscan.nClusters        // Number of clusters found
dbscan.nNoise           // Number of noise points
dbscan.coreSampleIndices // Indices of core points
dbscan.coreSampleMask   // Boolean mask for core points
dbscan.components       // Core sample data points

Example

const dbscan = new ds.ml.DBSCAN({ eps: 0.3, minSamples: 5 });
dbscan.fit({
  data: myData,
  columns: ['x', 'y']
});

console.log(`Found ${dbscan.nClusters} clusters`);
console.log(`Noise points: ${dbscan.nNoise}`);
console.log(dbscan.labels);  // [-1, 0, 0, 1, 1, -1, ...]

Outlier Detection

IsolationForest

Tree-based anomaly detection. Outliers are isolated in fewer splits.

new ds.ml.IsolationForest(options)

Options

{
  n_estimators: number,   // Number of trees (default: 100)
  max_samples: number,    // Samples per tree (default: 'auto' = min(256, n))
  contamination: number,  // Expected outlier proportion (default: 0.1)
  random_state: number    // Random seed
}

Methods

  • .fit(X) or .fit({ data, columns, group }) - Fit the model
  • .predict(X) - Returns -1 for outliers, 1 for inliers
  • .score_samples(X) - Anomaly scores (lower = more anomalous)
  • .fit_predict(X) - Fit and predict in one step

Example

const iso = new ds.ml.IsolationForest({ contamination: 0.1 });
iso.fit({ data: myData, columns: ['feature1', 'feature2'] });

const predictions = iso.predict({ data: myData, columns: ['feature1', 'feature2'] });
// Returns array with -1 for outliers, 1 for inliers

LocalOutlierFactor

Density-based outlier detection using local density deviation.

new ds.ml.LocalOutlierFactor(options)

Options

{
  n_neighbors: number,    // Number of neighbors (default: 20)
  contamination: number,  // Expected outlier proportion (default: 0.1)
  novelty: boolean        // If true, can predict on new data (default: false)
}

Methods

  • .fit(X) - Fit the model
  • .fit_predict(X) - Fit and predict (for novelty=false)
  • .negative_outlier_factor - LOF scores (more negative = more anomalous)

Example

const lof = new ds.ml.LocalOutlierFactor({ n_neighbors: 20 });
const predictions = lof.fit_predict(X_train);
// -1 for outliers, 1 for inliers

MahalanobisDistance

Statistical distance-based outlier detection accounting for covariance.

new ds.ml.MahalanobisDistance(options)

Options

{
  contamination: number  // Expected outlier proportion (default: 0.1)
}

Methods

  • .fit(X) - Fit the model (compute mean and covariance)
  • .predict(X) - Returns -1 for outliers, 1 for inliers
  • .score_samples(X) - Negative Mahalanobis distances
  • .fit_predict(X) - Fit and predict in one step
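
A minimal sketch following the IsolationForest example above; `X_train` is a placeholder for your own feature matrix:

```javascript
const md = new ds.ml.MahalanobisDistance({ contamination: 0.05 });

md.fit(X_train);
const labels = md.predict(X_train);        // -1 for outliers, 1 for inliers
const scores = md.score_samples(X_train);  // more negative = farther from the mean
```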

Missing Data Imputation

SimpleImputer

Fill missing values with statistical measures.

new ds.ml.SimpleImputer(options)

Options

{
  strategy: string,     // 'mean', 'median', 'most_frequent', or 'constant'
  fill_value: any       // Value for 'constant' strategy
}

Methods

  • .fit(X) or .fit({ data, columns, group }) - Learn statistics from data
  • .transform(X) - Fill missing values
  • .fit_transform(X) - Fit and transform in one step

Example

const imputer = new ds.ml.SimpleImputer({ strategy: 'mean' });
imputer.fit({ data: trainData, columns: ['age', 'income'] });

const filled = imputer.transform({ data: testData, columns: ['age', 'income'] });

KNNImputer

Fill missing values using k-nearest neighbors.

new ds.ml.KNNImputer(options)

Options

{
  n_neighbors: number,  // Number of neighbors (default: 5)
  weights: string       // 'uniform' or 'distance' (default: 'uniform')
}

Methods

  • .fit(X) - Store training data
  • .transform(X) - Impute missing values
  • .fit_transform(X) - Fit and transform in one step

Features

  • Supports mixed numeric and categorical data
  • Uses Gower distance for mixed types
  • Categorical columns imputed with weighted mode

Example

const imputer = new ds.ml.KNNImputer({ n_neighbors: 5, weights: 'distance' });
const filled = imputer.fit_transform({
  data: myData,
  columns: ['age', 'income', 'category']
});

IterativeImputer

Multivariate imputation using chained equations (MICE algorithm). Models each feature as a function of others.

new ds.ml.IterativeImputer(options)

Options

{
  initial_strategy: string,  // Initial fill strategy (default: 'mean')
  max_iter: number,          // Maximum iterations (default: 10)
  tol: number,               // Convergence tolerance (default: 1e-3)
  min_value: number,         // Minimum imputed value (default: -Infinity)
  max_value: number,         // Maximum imputed value (default: Infinity)
  verbose: boolean           // Print progress (default: false)
}

Methods

  • .fit(X) - Fit initial imputer
  • .transform(X) - Iteratively impute missing values
  • .fit_transform(X) - Fit and transform in one step

Example

const imputer = new ds.ml.IterativeImputer({ max_iter: 10, verbose: true });
const filled = imputer.fit_transform({
  data: myData,
  columns: ['feature1', 'feature2', 'feature3']
});
