edamame.eda

eda

edamame.eda.eda.correlation_categorical(data: DataFrame) → None

The function performs the Chi-Square Test of Independence between categorical variables of the dataset.

Parameters:

data (pd.DataFrame) – A pandas DataFrame passed in input.

Returns:

None

Example

>>> import edamame.eda as eda
>>> import pandas as pd
>>> df = pd.DataFrame({'category': ['A', 'B', 'A', 'B', 'C', 'A'], 'value': [1, 2, 3, 4, 4, 5], 'category_2': ['A2', 'A2', 'B2', 'B2', 'A2', 'B2']})
>>> eda.correlation_categorical(df)
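
To make the quantity behind this function concrete, the sketch below computes the chi-square statistic and the degrees of freedom by hand for the two categorical columns of the example, using only the standard library. It illustrates the test itself and is not edamame's implementation; the helper name chi_square_statistic is hypothetical.

```python
from collections import Counter

def chi_square_statistic(x, y):
    """Return (statistic, degrees_of_freedom) for two categorical sequences."""
    n = len(x)
    observed = Counter(zip(x, y))   # joint counts, i.e. the contingency table
    row_totals = Counter(x)
    col_totals = Counter(y)
    statistic = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            expected = r_count * c_count / n   # count expected under independence
            statistic += (observed[(r, c)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof

category = ['A', 'B', 'A', 'B', 'C', 'A']
category_2 = ['A2', 'A2', 'B2', 'B2', 'A2', 'B2']
stat, dof = chi_square_statistic(category, category_2)
print(round(stat, 4), dof)  # 1.3333 2
```

A statistic that is large relative to the chi-square distribution with dof degrees of freedom suggests that the two variables are not independent. With only six rows, as here, the expected counts are too small for the test to be reliable.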

edamame.eda.eda.correlation_pearson(data: DataFrame, threshold: float = 0.0) → None

The function computes Pearson’s correlation between each pair of columns.

Parameters:
  • data (pd.DataFrame) – A pandas DataFrame passed in input.

  • threshold (float) – Only correlation values higher than the threshold are shown in the matrix. Defaults to 0.0.

Returns:

None

Example

>>> import edamame.eda as eda
>>> import pandas as pd
>>> df = pd.DataFrame({'category': ['A', 'B', 'A', 'B', 'C', 'A'], 'value': [1, 2, 3, 4, 4, 5], 'type': [0,1,0,1,0,1]})
>>> eda.correlation_pearson(df, threshold=0.3)
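
As a rough illustration of what the threshold does, the following dependency-free sketch computes Pearson's correlation for the numerical columns of the example and keeps only the pairs above a threshold. The pearson helper and the threshold value are hypothetical. Comparing the absolute value of the coefficient is an assumption, because the text above does not say how negative correlations are treated.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

columns = {'value': [1, 2, 3, 4, 4, 5], 'type': [0, 1, 0, 1, 0, 1]}
threshold = 0.3  # hypothetical value chosen for illustration

names = list(columns)
shown = {}
for i, a in enumerate(names):
    for b in names[i + 1:]:
        r = pearson(columns[a], columns[b])
        # Assumption: the absolute value is compared against the threshold.
        if abs(r) > threshold:
            shown[(a, b)] = round(r, 4)

print(shown)  # {('value', 'type'): 0.3721}
```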

edamame.eda.eda.correlation_phik(data: DataFrame, theory: bool = False) → None

The function computes the Phik correlation index between the columns of the dataset.

Paper link: https://arxiv.org/pdf/1811.11440.pdf

Parameters:
  • data (pd.DataFrame) – A pandas DataFrame passed in input.

  • theory (bool) – A boolean value for displaying insight into the theory of the Phik correlation index. Defaults to False.

Returns:

None

Example

>>> import edamame.eda as eda
>>> import pandas as pd
>>> df = pd.DataFrame({'category': ['A', 'B', 'A', 'B', 'C', 'A'], 'value': [1, 2, 3, 4, 4, 5], 'category_2': ['A2', 'A2', 'B2', 'B2', 'A2', 'B2']})
>>> eda.correlation_phik(df, theory=True)

edamame.eda.eda.describe_distribution(data: DataFrame) → None

The function displays the result of the describe() method applied to a pandas DataFrame, split into numerical and object columns.

Parameters:

data (pd.DataFrame) – A pandas DataFrame passed in input.

Returns:

None

Example

>>> import edamame.eda as eda
>>> import pandas as pd
>>> df = pd.DataFrame({'category': ['A', 'B', 'A', 'B'], 'value': [1, 2, 3, 4]})
>>> eda.describe_distribution(df)

edamame.eda.eda.dimensions(data: DataFrame) → None

The function displays the number of rows and columns of the pandas DataFrame passed.

Parameters:

data (pd.DataFrame) – A pandas DataFrame passed in input.

Returns:

None

Example

>>> import edamame.eda as eda
>>> import pandas as pd
>>> df = pd.DataFrame({'category': ['A', 'B', 'A', 'B'], 'value': [1, 2, 3, 4]})
>>> eda.dimensions(df)

edamame.eda.eda.drop_columns(data: DataFrame, col: List[str])

The function returns a pandas DataFrame with the selected columns dropped.

Parameters:
  • data (pd.DataFrame) – A pandas DataFrame passed in input.

  • col (List[str]) – A list of strings containing the names of columns to drop.

Returns:

pd.DataFrame

Example

>>> import edamame.eda as eda
>>> import pandas as pd
>>> df = pd.DataFrame({'category': ['A', 'B', 'A', 'B'], 'value': [1, 2, 3, 4], 'type': [0,1,0,1]})
>>> df = eda.drop_columns(df, col=["category", "type"])

edamame.eda.eda.handling_missing(data: DataFrame, col: List[str], missing_val: float | int = nan, method: List[str] = []) → DataFrame

The function returns a pandas DataFrame in which the selected columns have been modified to handle NaN values. It is designed to be used after running the missing function, whose outputs can be passed as col.

Parameters:
  • data (pd.DataFrame) – A pandas DataFrame passed in input.

  • col (List[str]) – A list of the names of the dataframe columns to handle.

  • missing_val (Union[float, int]) – The value that represents NA in the columns passed. Defaults to np.nan, but can be set to another value such as 0.

  • method (List[str]) – A list of the methods (mean, median, most_frequent, drop) applied to the columns passed. If no method is given, the function applies most_frequent to all the columns passed. If fewer methods than columns are given, the remaining columns are handled with most_frequent.

Returns:

The processed DataFrame.

Return type:

pd.DataFrame

Example

>>> import edamame.eda as eda
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'category': ['0', '1', '0', '1', np.nan], 'value': [1, 2, 3, 4, np.nan], 'value_2': [-1,-2,0,0,0]})
>>> nan_quant, nan_qual, zero_col = eda.missing(df)
>>> df = eda.handling_missing(df, col = nan_quant, missing_val = np.nan, method = ['mean']*len(nan_quant)) # handle NaN for numerical columns
>>> df = eda.handling_missing(df, col = nan_qual, missing_val=np.nan, method=['most_frequent']*len(nan_qual)) # handle NaN for categorical columns
>>> df = eda.handling_missing(df, col = zero_col, missing_val=0, method=['mean']*len(zero_col)) # handle 0 for columns with too many zeros
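
The imputation strategies and the autocompletion rule described above can be sketched for a single column with the standard library alone. This is an illustration under stated assumptions, not the library's code. The helpers is_missing, pad_methods and impute are hypothetical names, and the drop method is left out for brevity.

```python
import math
from collections import Counter
from statistics import mean, median

def is_missing(x, missing_val):
    """True if x equals the marker used for missing data (NaN-aware)."""
    if isinstance(missing_val, float) and math.isnan(missing_val):
        return isinstance(x, float) and math.isnan(x)
    return x == missing_val

def pad_methods(cols, methods):
    """Complete the method list with 'most_frequent', one method per column."""
    return list(methods) + ["most_frequent"] * (len(cols) - len(methods))

def impute(values, method, missing_val=float("nan")):
    """Replace missing entries of one column with the chosen statistic."""
    present = [v for v in values if not is_missing(v, missing_val)]
    if method == "mean":
        fill = mean(present)
    elif method == "median":
        fill = median(present)
    elif method == "most_frequent":
        fill = Counter(present).most_common(1)[0][0]
    else:
        raise ValueError(f"unsupported method: {method}")
    return [fill if is_missing(v, missing_val) else v for v in values]

nan = float("nan")
print(pad_methods(["a", "b", "c"], ["mean"]))   # ['mean', 'most_frequent', 'most_frequent']
print(impute([1, 2, 3, 4, nan], "mean"))        # [1, 2, 3, 4, 2.5]
print(impute(["0", "1", "0", "0", nan], "most_frequent"))  # ['0', '1', '0', '0', '0']
```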

edamame.eda.eda.identify_types(data: DataFrame) → Tuple[List[str], List[str]]

The function displays the result of the dtypes attribute and returns the names of the numerical and categorical columns.

Parameters:

data (pd.DataFrame) – A pandas DataFrame passed in input.

Returns:

A tuple containing a list of the numerical columns and a list of the categorical/object columns.

Return type:

Tuple[List[str], List[str]]

Example

>>> import edamame.eda as eda
>>> import pandas as pd
>>> df = pd.DataFrame({'category': ['A', 'B', 'A', 'B'], 'value': [1, 2, 3, 4]})
>>> quant_col, qual_col = eda.identify_types(df)

edamame.eda.eda.inspection(data: DataFrame, threshold: int = 10, bins: int = 50, figsize: Tuple[float, float] = (6.0, 4.0)) → None

The function displays an interactive plot for analysing the distribution of a variable based on the distinct cardinalities of the target variable.

Parameters:
  • data (pd.DataFrame) – A pandas DataFrame passed in input.

  • threshold (int) – A value for determining the maximum number of distinct cardinalities the target variable can have. By default is set to 10.

  • bins (int) – The number of bins used by the histograms. By default bins=50.

  • figsize (Tuple[float, float]) – A tuple to determine the plot size.

Returns:

None

Example

>>> import edamame.eda as eda
>>> import pandas as pd
>>> df = pd.DataFrame({'category': ['A', 'B', 'A', 'B', 'C', 'A'], 'value': [1, 2, 3, 4, 4, 5]})
>>> eda.inspection(df)

edamame.eda.eda.interaction(data: DataFrame) → None

The function displays an interactive plot for analysing relationships between numerical columns with a scatterplot.

Parameters:

data (pd.DataFrame) – A pandas DataFrame passed in input.

Returns:

None

Example

>>> import edamame.eda as eda
>>> import pandas as pd
>>> df = pd.DataFrame({'category': ['A', 'B', 'A', 'B', 'C', 'A'], 'value': [1, 2, 3, 4, 4, 5]})
>>> eda.interaction(df)

edamame.eda.eda.missing(data: DataFrame) → Tuple[List[str], List[str], List[str]]

The function displays the following elements:

  • A table with the percentage of NA records for every column.

  • A table with the percentage of records equal to 0 for every column.

  • A table with the percentage of duplicate rows.

  • Three lists containing the names of the numerical columns with NA, the names of the categorical columns with NA, and the names of the columns with 0 as a record.

Parameters:

data (pd.DataFrame) – A pandas DataFrame passed in input.

Returns:

A Tuple that contains the name of the numerical columns with NA, the name of the categorical columns with NA and the name of the columns with 0 as a record.

Return type:

Tuple[List[str], List[str], List[str]]

Example

>>> import edamame.eda as eda
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'category': ['0', '1', '0', '1', np.nan], 'value': [1, 2, 3, 4, np.nan]})
>>> nan_quant, nan_qual, zero_col = eda.missing(df)
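
The first two tables listed above boil down to per-column percentages. The sketch below computes them without pandas for the columns of the example, plus the value_2 column borrowed from the handling_missing example. It is only an illustration: the helper names are hypothetical, and the rule the library uses to decide which columns end up in zero_col is not reproduced here.

```python
import math

def is_nan(v):
    """True for float NaN, False for every other value (including strings)."""
    return isinstance(v, float) and math.isnan(v)

def percentage(values, predicate):
    """Percentage of entries in values for which predicate is true."""
    return 100 * sum(1 for v in values if predicate(v)) / len(values)

nan = float("nan")
columns = {
    'category': ['0', '1', '0', '1', nan],
    'value': [1, 2, 3, 4, nan],
    'value_2': [-1, -2, 0, 0, 0],
}

na_pct = {name: percentage(vals, is_nan) for name, vals in columns.items()}
zero_pct = {name: percentage(vals, lambda v: v == 0) for name, vals in columns.items()}

print(na_pct)    # {'category': 20.0, 'value': 20.0, 'value_2': 0.0}
print(zero_pct)  # {'category': 0.0, 'value': 0.0, 'value_2': 60.0}
```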

edamame.eda.eda.modify_cardinality(data: DataFrame, col: List[str], threshold: List[int]) → DataFrame

The function returns a pandas dataframe with the cardinalities of the columns selected modified.

Parameters:
  • data (pd.DataFrame) – A pandas DataFrame passed in input.

  • col (List[str]) – A list of strings containing the names of columns for which we want to modify the cardinalities.

  • threshold (List[int]) – A list of integer values containing the threshold values for every variable.

Returns:

pd.DataFrame

Example

>>> import edamame.eda as eda
>>> import pandas as pd
>>> df = pd.DataFrame({'category': ['A', 'B', 'A', 'B', 'C', 'A'], 'value': [1, 2, 3, 4, 4, 5], 'type': [0,1,0,1,0,1]})
>>> df = eda.modify_cardinality(df, col = ['category'], threshold=[3])

edamame.eda.eda.num_to_categorical(data: DataFrame, col: List[str]) → DataFrame

The function returns a dataframe with the columns transformed into an “object”.

Parameters:
  • data (pd.DataFrame) – A pandas DataFrame passed in input.

  • col (List[str]) – A list of strings containing the names of columns to convert.

Returns:

Dataframe with numerical columns passed converted to categorical.

Return type:

pd.DataFrame

Example

>>> import edamame.eda as eda
>>> import pandas as pd
>>> df = pd.DataFrame({'category': ['0', '1', '0', '1'], 'value': [1, 2, 3, 4]})
>>> df = eda.num_to_categorical(df, col=["category"])

edamame.eda.eda.num_variable_study(data: DataFrame, col: str, bins: int = 50, epsilon: float = 0.0001, theory: bool = False) → None

The function displays the following transformations of the variable col passed: log(x), sqrt(x), x^2, Box-Cox, 1/x.

Parameters:
  • data (pd.DataFrame) – A pandas DataFrame passed in input.

  • col (str) – The name of the dataframe column to study.

  • bins (int) – The number of bins used by the histograms. By default bins=50.

  • epsilon (float) – A constant for handling variables that are not strictly positive. By default epsilon = 0.0001.

  • theory (bool) – A boolean value for displaying insight into the transformations applied.

Returns:

None

Example

>>> import edamame.eda as eda
>>> import pandas as pd
>>> df = pd.DataFrame({'category': ['A', 'B', 'A', 'B', 'C', 'A'], 'value': [1, 2, 3, 4, 4, 5]})
>>> eda.num_variable_study(df, 'value', bins = 50, theory=True)
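
For reference, the five transformations listed above can be written out for a single value as follows. This is a sketch under two assumptions that the text does not confirm: epsilon is added only when the value is zero, and the Box-Cox parameter (here the hypothetical argument lam) is fixed rather than estimated from the data.

```python
import math

def transformations(x, epsilon=0.0001, lam=0.5):
    """Return log, sqrt, square, Box-Cox and inverse of a non-negative value x."""
    if x < 0:
        raise ValueError("this sketch only handles non-negative values")
    if x == 0:
        x = x + epsilon  # assumption: shift zeros so that log and 1/x are defined
    # Box-Cox with a fixed lambda; lambda = 0 reduces to the logarithm.
    box_cox = math.log(x) if lam == 0 else (x ** lam - 1) / lam
    return {
        'log': math.log(x),
        'sqrt': math.sqrt(x),
        'square': x ** 2,
        'box_cox': box_cox,
        'inverse': 1 / x,
    }

result = transformations(4)
print({name: round(value, 4) for name, value in result.items()})
# {'log': 1.3863, 'sqrt': 2.0, 'square': 16, 'box_cox': 2.0, 'inverse': 0.25}
```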

edamame.eda.eda.plot_categorical(data: DataFrame, col: List[str]) → None

The function displays a sequence of tables and plots for the selected columns.

Parameters:
  • data (pd.DataFrame) – A pandas DataFrame passed in input.

  • col (List[str]) – A list of strings containing the names of columns to plot.

Returns:

None

Example

>>> import edamame.eda as eda
>>> import pandas as pd
>>> df = pd.DataFrame({'category': ['A', 'B', 'A', 'B'], 'value': [1, 2, 3, 4], 'type': [0,1,0,1]})
>>> num_col, qual_col = eda.identify_types(df)
>>> eda.plot_categorical(df, qual_col)

edamame.eda.eda.plot_numerical(data: DataFrame, col: List[str], bins: int = 50) → None

The function displays a sequence of tables and plots for the selected columns.

Parameters:
  • data (pd.DataFrame) – A pandas DataFrame passed in input.

  • col (List[str]) – A list of strings containing the names of columns to plot.

  • bins (int) – Number of bins to use in the histogram plot.

Returns:

None

Example

>>> import edamame.eda as eda
>>> import pandas as pd
>>> df = pd.DataFrame({'category': ['A', 'B', 'A', 'B'], 'value': [1, 2, 3, 4], 'type': [0,1,0,1]})
>>> num_col, qual_col = eda.identify_types(df)
>>> eda.plot_numerical(df, num_col, bins = 100)

edamame.eda.eda.split_and_scaling(data: DataFrame, target: str, minmaxscaler: bool = False) → Tuple[DataFrame, DataFrame]

The function returns two pandas DataFrames:

  • The regressor matrix X, which contains all the predictors for the model.

  • The target y, which contains the values of the response variable.

In addition, the function scales the numerical columns of the X matrix: standard scaling by default, or min-max scaling when minmaxscaler is True.

Parameters:
  • data (pd.DataFrame) – A pandas DataFrame passed in input.

  • target (str) – The response variable column name.

  • minmaxscaler (bool) – Select the type of scaling to apply to the numerical columns. If False (the default), standard scaling is applied; if True, min-max scaling is applied.

Returns:

The regressor matrix X and the target column y.

Return type:

Tuple[pd.DataFrame, pd.DataFrame]

Example

>>> import edamame.eda as eda
>>> import pandas as pd
>>> df = pd.DataFrame({'category': ['A', 'B', 'A', 'B', 'C', 'A'], 'value': [1, 2, 3, 4, 4, 5], 'target': ['A2', 'A2', 'B2', 'B2', 'A2', 'B2']})
>>> X, y = eda.split_and_scaling(df, 'target')

edamame.eda.eda.view_cardinality(data: DataFrame, col: List[str]) → None

The function displays the number of unique values of the selected columns. It is especially useful for studying the cardinalities of categorical variables.

Parameters:
  • data (pd.DataFrame) – A pandas DataFrame passed in input.

  • col (List[str]) – A list of strings containing the names of columns for which we want to show the number of unique values.

Returns:

None

Example

>>> import edamame.eda as eda
>>> import pandas as pd
>>> df = pd.DataFrame({'category': ['A', 'B', 'A', 'B'], 'value': [1, 2, 3, 4], 'type': [0,1,0,1]})
>>> num_col, qual_col = eda.identify_types(df)
>>> eda.view_cardinality(df, qual_col)
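
Cardinality here means the number of distinct values in a column. A minimal, dependency-free sketch of that count for the example data is shown below. The variable names are illustrative, and the way the library presents the result is not reproduced.

```python
from collections import Counter

columns = {'category': ['A', 'B', 'A', 'B'], 'value': [1, 2, 3, 4], 'type': [0, 1, 0, 1]}
qual_col = ['category']  # the categorical columns of the example

# Number of distinct values per selected column.
cardinality = {name: len(Counter(columns[name])) for name in qual_col}
print(cardinality)  # {'category': 2}

# Frequency of each level, useful to spot rare categories.
print(dict(Counter(columns['category'])))  # {'A': 2, 'B': 2}
```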

tools

edamame.eda.tools.dataframe_review(data: DataFrame) → None

The function checks if the object passed is a Pandas dataframe.

Parameters:

data (pd.DataFrame) – A pandas DataFrame passed in input.

Raises:

TypeError – If the object passed is not a pandas DataFrame.

Returns:

None

edamame.eda.tools.dummy_control(data: DataFrame) → None

The function checks if the Pandas dataframe passed is encoded with dummy or OHE.

Parameters:

data (pd.DataFrame) – A pandas DataFrame passed in input.

Raises:

TypeError – If the input DataFrame contains non-numerical columns.

Returns:

None

edamame.eda.tools.load_model(path: str)

The function loads a model previously saved in pickle format.

Parameters:

path (str) – Path to the model saved in .pkl format.
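
Loading a pickled object generally amounts to a pickle.load call on a file opened in binary mode. The round trip below sketches that idea with the standard library. The function load_pickled_model is a hypothetical stand-in, not the library's implementation, and the dictionary is a placeholder for a fitted model.

```python
import os
import pickle
import tempfile

def load_pickled_model(path):
    """Hypothetical stand-in for load_model: unpickle the object stored at path."""
    with open(path, "rb") as f:
        return pickle.load(f)

model = {"name": "demo", "coef": [0.5, -1.2]}  # placeholder for a fitted model

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    with open(path, "wb") as f:
        pickle.dump(model, f)          # save in pickle format
    restored = load_pickled_model(path)

print(restored == model)  # True
```

Unpickling can execute arbitrary code, so only files from a trusted source should be loaded this way.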

edamame.eda.tools.ohe(array: ndarray) → ndarray

Converts a NumPy array containing the categorical labels of the target variable into its one-hot encoding.

Parameters:

array (np.ndarray) – The target variables passed in input.

Returns:

The one-hot encoded NumPy array.

Return type:

np.ndarray

Example

>>> import edamame.eda as eda
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> y = iris.target
>>> y_ohe = eda.ohe(y)
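
One-hot encoding maps each label to a row with a single 1 in the position of its class. The sketch below shows the mapping on plain Python lists. The function one_hot is hypothetical, and the column order (sorted class labels) is an assumption made for the illustration.

```python
def one_hot(labels):
    """Return one row per label, with a 1 in the column of its class."""
    classes = sorted(set(labels))                 # assumption: columns follow sorted labels
    index = {c: i for i, c in enumerate(classes)}
    return [[1 if index[label] == j else 0 for j in range(len(classes))]
            for label in labels]

encoded = one_hot([0, 1, 2, 1])
print(encoded)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
```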

edamame.eda.tools.scaling(X: DataFrame, minmaxscaler: bool = False) → DataFrame

The function returns the normalised/standardized matrix.

Parameters:
  • X (pd.DataFrame) – The model matrix X/X_train/X_test.

  • minmaxscaler (bool) – Select the type of scaling to apply to the numerical columns. By default (False), the StandardScaler is used. If minmaxscaler is True, the numerical columns are transformed to the range [0, 1].

Returns:

pd.DataFrame

Example

>>> import edamame.eda as eda
>>> import pandas as pd
>>> X = pd.DataFrame({'category': ['A', 'B', 'A', 'B', 'C', 'A'], 'value': [1, 2, 3, 4, 4, 5]})
>>> X = eda.scaling(X)
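
The two scaling options can be compared on the value column of the example. The functions below are a hand-written sketch of the arithmetic, with hypothetical names, and do not call scikit-learn. The standard scaling uses the population standard deviation, which is what scikit-learn's StandardScaler uses.

```python
from statistics import pstdev

def standard_scale(values):
    """Subtract the mean and divide by the population standard deviation."""
    mu = sum(values) / len(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

def min_max_scale(values):
    """Map the values linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

value = [1, 2, 3, 4, 4, 5]
standardized = standard_scale(value)
rescaled = min_max_scale(value)

print(rescaled)  # [0.0, 0.25, 0.5, 0.75, 0.75, 1.0]
```

After standard scaling the column has mean 0 and unit variance. After min-max scaling its smallest value is 0 and its largest is 1.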

edamame.eda.tools.setup(X: DataFrame, y: DataFrame, dummy: bool = False, seed: int = 42, size: float = 0.25) → Tuple[DataFrame, DataFrame, DataFrame, DataFrame]

The function returns the following elements: X_train, y_train, X_test, y_test.

Parameters:
  • X (pd.DataFrame) – The model matrix X (features matrix).

  • y (pd.DataFrame) – The target variable.

  • dummy (bool) – If False, the function applies one-hot encoding (OHE); if True, dummy encoding.

  • seed (int) – Random seed passed to the train_test_split function.

  • size (float) – Size of the test dataset.

Returns:

X_train, y_train, X_test, y_test.

Return type:

Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame]

Example

>>> import edamame.eda as eda
>>> import pandas as pd
>>> df = pd.DataFrame({'category': ['A', 'B', 'A', 'B', 'C', 'A'], 'value': [1, 2, 3, 4, 4, 5], 'target': ['A2', 'A2', 'B2', 'B2', 'A2', 'B2']})
>>> X, y = eda.split_and_scaling(df, 'target')
>>> X_train, y_train, X_test, y_test = eda.setup(X, y)
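
The roles of seed and size can be illustrated with a seeded shuffle of the row indices. The split function below is a hypothetical, dependency-free sketch. It treats size as the fraction of rows held out, which is an assumption, and it will not reproduce the exact rows chosen by scikit-learn. It returns its outputs in the order documented above.

```python
import math
import random

def split(X, y, seed=42, size=0.25):
    """Hold out a `size` fraction of the rows, shuffled with a fixed seed.

    Returns X_train, y_train, X_test, y_test, in that order.
    """
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)   # same seed, same shuffle
    n_test = math.ceil(len(X) * size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def take(rows, idx):
        return [rows[i] for i in idx]

    return take(X, train_idx), take(y, train_idx), take(X, test_idx), take(y, test_idx)

X = [[1], [2], [3], [4], [4], [5]]
y = ['A2', 'A2', 'B2', 'B2', 'A2', 'B2']
X_train, y_train, X_test, y_test = split(X, y)
print(len(X_train), len(X_test))  # 4 2
```

Note that this return order differs from scikit-learn's train_test_split, which returns X_train, X_test, y_train, y_test.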