edamame.classifier#

classification#

class edamame.classifier.classification.TrainClassifier(X_train, y_train, X_test, y_test)[source]#

Bases: object

This class represents a pipeline for training and handling classification models.

X_train#

The input training data.

Type:

pd.DataFrame

y_train#

The target training data.

Type:

pd.Series

X_test#

The input test data.

Type:

pd.DataFrame

y_test#

The target test data.

Type:

pd.Series

Example

>>> from edamame.classifier import TrainClassifier
>>> classifier = TrainClassifier(X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test)
>>> logistic = classifier.logistic()
>>> classifier.model_metrics(model_name="logistic")
>>> classifier.save_model(model_name="logistic")
>>> nb = classifier.gaussian_nb()
>>> knn = classifier.knn()
>>> tree = classifier.tree()
>>> rf = classifier.random_forest()
>>> xgb = classifier.xgboost()
>>> svm = classifier.svm()
>>> classifier.model_metrics()
>>> # using AutoML
>>> models = classifier.auto_ml()
>>> classifier.save_model()
auto_ml(n_folds: int = 5, data: Literal['train', 'test'] = 'train') List[source]#

Perform automated machine learning with cross-validation on a list of classification models.

Parameters:
  • n_folds (int) – Number of cross-validation folds. Defaults to 5.

  • data (Literal['train', 'test']) – Target dataset for cross-validation. Must be either ‘train’ or ‘test’. Defaults to ‘train’.

Returns:

List of best-fit classification models for each algorithm.

Return type:

List

Example

>>> from edamame.classifier import TrainClassifier
>>> classifier = TrainClassifier(X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test)
>>> model_list = classifier.auto_ml()
gaussian_nb(**kwargs) GaussianNB[source]#

Trains a Gaussian Naive Bayes classifier using the training data and returns the fitted model.

Parameters:

**kwargs – Arbitrary keyword arguments to be passed to the Gaussian NB constructor.

Returns:

The trained Gaussian Naive Bayes classifier.

Return type:

GaussianNB

Example

>>> from edamame.classifier import TrainClassifier
>>> classifier = TrainClassifier(X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test)
>>> nb = classifier.gaussian_nb()
knn(n_neighbors: Tuple[int, int, int] = (1, 50, 50), n_folds: int = 5, **kwargs) KNeighborsClassifier[source]#

Trains a k-Nearest Neighbors classification model using the training data and performs a grid search to find the best value of the ‘n_neighbors’ hyperparameter.

Parameters:
  • n_neighbors (Tuple[int, int, int]) – A tuple of three integers: the first two define the range of the ‘n_neighbors’ hyperparameter searched by the grid search, and the third is the number of values to generate in the interval [n_neighbors[0], n_neighbors[1]]. Default is (1, 50, 50).

  • n_folds (int) – The number of cross-validation folds to use for the grid search. Default is 5.

  • **kwargs – Arbitrary keyword arguments to be passed to the KNN constructor.

Returns:

The trained k-Nearest Neighbors classification model with the best ‘n_neighbors’ hyperparameter found by the grid search.

Return type:

KNeighborsClassifier

Example

>>> from edamame.classifier import TrainClassifier
>>> classifier = TrainClassifier(X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test)
>>> knn = classifier.knn(n_neighbors=(1,50,50), n_folds=3)
logistic(**kwargs) LogisticRegression[source]#

Trains a logistic regression model using the training data and returns the fitted model.

Parameters:

**kwargs – Arbitrary keyword arguments to be passed to the logistic constructor.

Returns:

The trained logistic regression model.

Return type:

LogisticRegression

Example

>>> from edamame.classifier import TrainClassifier
>>> classifier = TrainClassifier(X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test)
>>> logistic = classifier.logistic()
model_metrics(model_name: Literal['all', 'logistic', 'gaussian_nb', 'knn', 'tree', 'random_forest', 'xgboost', 'svm'] = 'all', cm: bool = False) None[source]#

Display classification metrics (confusion matrix and classification report) for specified or all trained models.

Parameters:
  • model_name (Literal["all", "logistic", "gaussian_nb", "knn", "tree", "random_forest", "xgboost", "svm"]) – The name of the model to display the metrics for. Defaults to ‘all’.

  • cm (bool) – Whether to display the confusion matrix. Defaults to False.

Returns:

None

Example

>>> from edamame.classifier import TrainClassifier
>>> classifier = TrainClassifier(X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test)
>>> xgboost = classifier.xgboost(n_estimators=(10, 100, 5), n_folds=2)
>>> classifier.model_metrics(model_name="xgboost")
random_forest(n_estimators: Tuple[int, int, int] = (50, 1000, 5), n_folds: int = 2, **kwargs) RandomForestClassifier[source]#

Train a Random Forest classifier using the training data and return the fitted model.

Parameters:
  • n_estimators (Tuple[int, int, int]) – The grid-search range for the number of trees in the forest: minimum value, maximum value, and the number of values to try. Default is (50, 1000, 5).

  • n_folds (int) – The number of folds in cross-validation. Default is 2.

  • **kwargs – Arbitrary keyword arguments to be passed to the random forest constructor.

Returns:

The trained Random Forest classifier.

Return type:

RandomForestClassifier

Example

>>> from edamame.classifier import TrainClassifier
>>> classifier = TrainClassifier(X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test)
>>> rf = classifier.random_forest(n_estimators=(50, 1000, 5), n_folds=2)
save_model(model_name: Literal['all', 'logistic', 'gaussian_nb', 'knn', 'tree', 'random_forest', 'xgboost', 'svm'] = 'all') None[source]#

Saves the specified machine learning model or all models in the instance to a pickle file.

Parameters:

model_name (Literal["all", "logistic", "gaussian_nb", "knn", "tree", "random_forest", "xgboost", "svm"]) – The name of the model to save. Defaults to ‘all’.

Returns:

None

Example

>>> from edamame.classifier import TrainClassifier
>>> classifier = TrainClassifier(X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test)
>>> model_list = classifier.auto_ml()
>>> classifier.save_model(model_name="all")
svm(n_folds: int = 2, **kwargs) SVC[source]#

Trains an SVM classifier using the training data and returns the fitted model.

Parameters:
  • n_folds (int) – The number of folds in cross-validation. Default is 2.

  • **kwargs – Arbitrary keyword arguments to be passed to the SVC constructor.

Returns:

The trained SVM classifier.

Return type:

SVC

Example

>>> from edamame.classifier import TrainClassifier
>>> classifier = TrainClassifier(X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test)
>>> svm = classifier.svm(kernel="linear", C=1.0, gamma="auto")
tree(alpha: Tuple[float, float, int] = (0.0, 0.001, 5), impurity: Tuple[float, float, int] = (0.0, 1e-05, 5), n_folds: int = 5, **kwargs) DecisionTreeClassifier[source]#

Trains a decision tree classifier using the training data and returns the fitted model.

Parameters:
  • alpha (Tuple[float, float, int]) – A tuple containing the minimum and maximum values of ccp_alpha and the number of values to try (default: (0., 0.001, 5)).

  • impurity (Tuple[float, float, int]) – A tuple containing the minimum and maximum values of min_impurity_decrease and the number of values to try (default: (0., 0.00001, 5)).

  • n_folds (int) – The number of cross-validation folds to use for grid search (default: 5).

  • **kwargs – Arbitrary keyword arguments to be passed to the tree constructor.

Returns:

The trained decision tree classifier model.

Return type:

DecisionTreeClassifier

Example

>>> from edamame.classifier import TrainClassifier
>>> classifier = TrainClassifier(X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test)
>>> tree = classifier.tree(alpha=(0., 0.001, 5), impurity=(0., 0.00001, 5), n_folds=3)
xgboost(n_estimators: Tuple[int, int, int] = (10, 100, 5), n_folds: int = 2, **kwargs) XGBClassifier[source]#

Train an XGBoost classifier using the training data and return the fitted model.

Parameters:
  • n_estimators (Tuple[int, int, int]) – The grid-search range for the number of boosting rounds: minimum value, maximum value, and the number of values to try. Default is (10, 100, 5).

  • n_folds (int) – The number of folds in cross-validation. Default is 2.

  • **kwargs – Arbitrary keyword arguments to be passed to the xgboost constructor.

Returns:

The trained XGBoost classifier.

Return type:

XGBClassifier

Example

>>> from edamame.classifier import TrainClassifier
>>> classifier = TrainClassifier(X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test)
>>> xgboost = classifier.xgboost(n_estimators=(10, 100, 5), n_folds=2)
edamame.classifier.classification.classifier_metrics(model: LogisticRegression | GaussianNB | KNeighborsClassifier | DecisionTreeClassifier | RandomForestClassifier | XGBClassifier | SVC, X: DataFrame, y: DataFrame, cm: bool = False) None[source]#

Display classification metrics (confusion matrix and classification report) for the model passed as input to the function.

Parameters:
  • model (Union[LogisticRegression, GaussianNB, KNeighborsClassifier, DecisionTreeClassifier, RandomForestClassifier, XGBClassifier, SVC]) – Classification model.

  • X (pd.DataFrame) – Input features.

  • y (pd.DataFrame) – Target feature.

  • cm (bool) – Whether to display the confusion matrix. Defaults to False.

Returns:

None
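
Example

A minimal usage sketch; it assumes X_train, y_train, X_test, and y_test are user-supplied pandas objects and that the model comes from TrainClassifier:

>>> from edamame.classifier import TrainClassifier
>>> from edamame.classifier.classification import classifier_metrics
>>> classifier = TrainClassifier(X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test)
>>> logistic = classifier.logistic()
>>> # display the classification report and confusion matrix on the test data
>>> classifier_metrics(model=logistic, X=X_test, y=y_test, cm=True)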

diagnose#

class edamame.classifier.diagnose.ClassifierDiagnose(X_train: DataFrame, y_train: DataFrame, X_test: DataFrame, y_test: DataFrame)[source]#

Bases: object

A class for diagnosing classification models.

X_train#

The input training data.

Type:

pd.DataFrame

y_train#

The target training data.

Type:

pd.Series

X_test#

The input test data.

Type:

pd.DataFrame

y_test#

The target test data.

Type:

pd.Series

Examples

>>> from edamame.classifier import TrainClassifier
>>> from edamame.classifier.diagnose import ClassifierDiagnose
>>> classifier = TrainClassifier(X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test)
>>> nb = classifier.gaussian_nb()
>>> classifiers_diagnose = ClassifierDiagnose(X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test)
>>> classifiers_diagnose.class_prediction_error(model=nb)
class_prediction_error(model: LogisticRegression | GaussianNB | KNeighborsClassifier | DecisionTreeClassifier | RandomForestClassifier | XGBClassifier | SVC, figsize: Tuple[float, float] = (8.0, 6.0), train_data: bool = True) None[source]#

This plot method shows the support (number of training samples) for each class in the fitted classification model as a stacked bar chart. Each bar is segmented to show the proportion of predictions (including false negatives and false positives, as in a confusion matrix) for each class.

Parameters:
  • model (Union[LogisticRegression, GaussianNB, KNeighborsClassifier, DecisionTreeClassifier, RandomForestClassifier, XGBClassifier, SVC]) – Classification model.

  • figsize (Tuple[float, float]) – Figure size for the plot. Defaults to (8, 6).

  • train_data (bool) – Whether to plot the stacked bar chart on the training data (True) or the test data (False). Defaults to True.

Returns:

None
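
Example

A sketch assuming a model fitted with TrainClassifier, as in the class example above; here the plot is drawn on the test data:

>>> from edamame.classifier import TrainClassifier
>>> from edamame.classifier.diagnose import ClassifierDiagnose
>>> classifier = TrainClassifier(X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test)
>>> rf = classifier.random_forest()
>>> classifiers_diagnose = ClassifierDiagnose(X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test)
>>> classifiers_diagnose.class_prediction_error(model=rf, figsize=(8., 6.), train_data=False)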

plot_roc_auc(model: LogisticRegression | GaussianNB | KNeighborsClassifier | DecisionTreeClassifier | RandomForestClassifier | XGBClassifier | SVC, figsize: Tuple[float, float] = (8.0, 6.0), train_data: bool = True) None[source]#

Method for plotting the ROC curve and calculating the AUC values for a given model.

Parameters:
  • model (Union[LogisticRegression, GaussianNB, KNeighborsClassifier, DecisionTreeClassifier, RandomForestClassifier, XGBClassifier, SVC]) – Classification model.

  • figsize (Tuple[float, float]) – Figure size for the plot. Defaults to (8, 6).

  • train_data (bool) – Whether to plot the ROC curve on the training data (True) or the test data (False). Defaults to True.

Returns:

None
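
Example

A minimal sketch, reusing the setup from the class example; the ROC curve is plotted on the test data:

>>> from edamame.classifier import TrainClassifier
>>> from edamame.classifier.diagnose import ClassifierDiagnose
>>> classifier = TrainClassifier(X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test)
>>> logistic = classifier.logistic()
>>> classifiers_diagnose = ClassifierDiagnose(X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test)
>>> classifiers_diagnose.plot_roc_auc(model=logistic, train_data=False)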

random_forest_fi(model: RandomForestClassifier, figsize: Tuple[float, float] = (8.0, 6.0)) None[source]#

Displays the feature importance plot of the random forest model.

Parameters:
  • model (RandomForestClassifier) – The input random forest model.

  • figsize (Tuple[float, float]) – Figure size for the plot. Defaults to (8, 6).

Returns:

None
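
Example

A sketch assuming a random forest fitted with TrainClassifier:

>>> from edamame.classifier import TrainClassifier
>>> from edamame.classifier.diagnose import ClassifierDiagnose
>>> classifier = TrainClassifier(X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test)
>>> rf = classifier.random_forest()
>>> classifiers_diagnose = ClassifierDiagnose(X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test)
>>> classifiers_diagnose.random_forest_fi(model=rf)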

xgboost_fi(model: XGBClassifier, figsize: Tuple[float, float] = (8.0, 6.0)) None[source]#

Displays the feature importance plot of the xgboost model.

Parameters:
  • model (XGBClassifier) – The input xgboost model.

  • figsize (Tuple[float, float]) – Figure size for the plot. Defaults to (8, 6).

Returns:

None
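
Example

A sketch assuming an XGBoost model fitted with TrainClassifier:

>>> from edamame.classifier import TrainClassifier
>>> from edamame.classifier.diagnose import ClassifierDiagnose
>>> classifier = TrainClassifier(X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test)
>>> xgb = classifier.xgboost()
>>> classifiers_diagnose = ClassifierDiagnose(X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test)
>>> classifiers_diagnose.xgboost_fi(model=xgb)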

edamame.classifier.diagnose.check_random_forest(model: RandomForestClassifier) None[source]#

Checks whether the model passed is a random forest classifier.

Parameters:

model (RandomForestClassifier) – The input model to be checked.

Raises:

TypeError – If the input model is not a random forest classification model.

Returns:

None
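
Example

A minimal sketch of the check: per the Raises section above, the function returns None for a RandomForestClassifier and raises TypeError for anything else:

>>> from edamame.classifier.diagnose import check_random_forest
>>> from sklearn.ensemble import RandomForestClassifier
>>> check_random_forest(model=RandomForestClassifier())  # passes silently
>>> check_random_forest(model="not a model")  # raises TypeError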

edamame.classifier.diagnose.check_xgboost(model: XGBClassifier) None[source]#

Checks whether the model passed is an XGBoost classifier.

Parameters:

model (XGBClassifier) – The input model to be checked.

Raises:

TypeError – If the input model is not an XGBoost classification model.

Returns:

None
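
Example

A minimal sketch mirroring check_random_forest; per the Raises section above, the function raises TypeError for anything that is not an XGBClassifier:

>>> from edamame.classifier.diagnose import check_xgboost
>>> from xgboost import XGBClassifier
>>> check_xgboost(model=XGBClassifier())  # passes silently
>>> check_xgboost(model="not a model")  # raises TypeError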