edamame.eda

eda

edamame.eda.eda.correlation_categorical(data: DataFrame) → None

The function performs the Chi-Square Test of Independence between the categorical variables of the dataset.

Parameters: data (pd.DataFrame) – A pandas DataFrame passed as input.
Returns: None

Example

>>> import edamame.eda as eda
>>> import pandas as pd
>>> df = pd.DataFrame({'category': ['A', 'B', 'A', 'B', 'C', 'A'], 'value': [1, 2, 3, 4, 4, 5], 'category_2': ['A2', 'A2', 'B2', 'B2', 'A2', 'B2']})
>>> eda.correlation_categorical(df)
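
As a rough illustration of what a Chi-Square Test of Independence involves, the sketch below builds a contingency table for two categorical columns and tests it with scipy.stats.chi2_contingency. It is not edamame's implementation; the use of pd.crosstab and SciPy is an assumption made for this example.

```python
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({
    'category': ['A', 'B', 'A', 'B', 'C', 'A'],
    'category_2': ['A2', 'A2', 'B2', 'B2', 'A2', 'B2'],
})

# Contingency table of observed counts for the two categorical columns.
observed = pd.crosstab(df['category'], df['category_2'])

# chi2_contingency returns the test statistic, the p-value, the degrees of
# freedom and the table of counts expected under independence.
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2={chi2:.3f}, p-value={p_value:.3f}, dof={dof}")
```

A small p-value would suggest that the two columns are not independent; with only six rows, as here, the test has very little power.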

edamame.eda.eda.correlation_pearson(data: DataFrame, threshold: float = 0.0) → None

The function computes the Pearson’s correlation between the column pairs.

Parameters:

data (pd.DataFrame) – A pandas DataFrame passed as input.
threshold (float) – Only the correlation values higher than the threshold are shown in the matrix. Defaults to 0.

Returns:

None

Example

>>> import edamame.eda as eda
>>> import pandas as pd
>>> df = pd.DataFrame({'category': ['A', 'B', 'A', 'B', 'C', 'A'], 'value': [1, 2, 3, 4, 4, 5], 'type': [0,1,0,1,0,1]})
>>> eda.correlation_pearson(df, threshold=0.3)
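
The effect of threshold can be approximated in plain pandas. The sketch below is not the library's code: it assumes the comparison is made on the signed correlation value (the description above does not say whether absolute values are used), and the threshold value of 0.5 is arbitrary.

```python
import pandas as pd

df = pd.DataFrame({'value': [1, 2, 3, 4, 4, 5], 'type': [0, 1, 0, 1, 0, 1]})
threshold = 0.5  # arbitrary value chosen for the illustration

# Pearson correlation matrix of the numerical columns.
corr = df.corr(method='pearson')

# Keep only the entries above the threshold; the others become NaN.
filtered = corr.where(corr > threshold)
print(filtered)
```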

edamame.eda.eda.correlation_phik(data: DataFrame, theory: bool = False) → None

The function displays the Phik correlation index between the columns of the dataset.

Paper link: https://arxiv.org/pdf/1811.11440.pdf

Parameters:

data (pd.DataFrame) – A pandas DataFrame passed as input.
theory (bool) – A boolean value for displaying insight into the theory of the Phik correlation index. Defaults to False.

Returns:

None

Example

>>> import edamame.eda as eda
>>> import pandas as pd
>>> df = pd.DataFrame({'category': ['A', 'B', 'A', 'B', 'C', 'A'], 'value': [1, 2, 3, 4, 4, 5], 'category_2': ['A2', 'A2', 'B2', 'B2', 'A2', 'B2']})
>>> eda.correlation_phik(df, theory=True)

edamame.eda.eda.describe_distribution(data: DataFrame) → None

The function displays the result of the describe() method applied to a pandas dataframe, split into numerical and object columns.

Parameters: data (pd.DataFrame) – A pandas DataFrame passed as input.
Returns: None

Example

>>> import edamame.eda as eda
>>> import pandas as pd
>>> df = pd.DataFrame({'category': ['A', 'B', 'A', 'B'], 'value': [1, 2, 3, 4]})
>>> eda.describe_distribution(df)

edamame.eda.eda.dimensions(data: DataFrame) → None

The function displays the number of rows and columns of the pandas dataframe passed.

Parameters: data (pd.DataFrame) – A pandas DataFrame passed as input.
Returns: None

Example

>>> import edamame.eda as eda
>>> import pandas as pd
>>> df = pd.DataFrame({'category': ['A', 'B', 'A', 'B'], 'value': [1, 2, 3, 4]})
>>> eda.dimensions(df)

edamame.eda.eda.drop_columns(data: DataFrame, col: List[str])

The function returns a pandas dataframe with the selected columns dropped.

Parameters:

data (pd.DataFrame) – A pandas DataFrame passed as input.
col (List[str]) – A list of strings containing the names of the columns to drop.

Returns:

pd.DataFrame

Example

>>> import edamame.eda as eda
>>> import pandas as pd
>>> df = pd.DataFrame({'category': ['A', 'B', 'A', 'B'], 'value': [1, 2, 3, 4], 'type': [0,1,0,1]})
>>> df = eda.drop_columns(df, col=["category", "type"])

edamame.eda.eda.handling_missing(data: DataFrame, col: List[str], missing_val: float | int = nan, method: List[str] = []) → DataFrame

The function returns a pandas dataframe in which the selected columns are modified to handle the NaN values. It is convenient to call it after the missing function, whose outputs can be passed as col.

Parameters:

data (pd.DataFrame) – A pandas DataFrame passed as input.
col (List[str]) – A list of the names of the dataframe columns to handle.
missing_val (Union[float, int]) – The value that represents NA in the columns passed. Defaults to np.nan, but can be set to another value such as 0.
method (List[str]) – A list of the names of the methods (mean, median, most_frequent, drop) applied to the columns passed. If no method is given, the function applies the most_frequent method to all the columns passed. If fewer methods than columns are given, the list is completed with the most_frequent method.

Returns:

Return the processed dataframe

Return type:

pd.DataFrame

Example

>>> import edamame.eda as eda
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'category': ['0', '1', '0', '1', np.nan], 'value': [1, 2, 3, 4, np.nan], 'value_2': [-1,-2,0,0,0]})
>>> nan_quant, nan_qual, zero_col = eda.missing(df)
>>> df = eda.handling_missing(df, col = nan_quant, missing_val = np.nan, method = ['mean']*len(nan_quant)) # handle NaN for numerical columns
>>> df = eda.handling_missing(df, col = nan_qual, missing_val=np.nan, method=['most_frequent']*len(nan_qual)) # handle NaN for categorical columns
>>> df = eda.handling_missing(df, col = zero_col, missing_val=0, method=['mean']*len(zero_col)) # handle 0 for columns with too many zeros
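
The completion rule for method and the fill strategies can be pictured with the sketch below. The helpers pad_methods and impute are hypothetical names introduced for this illustration and are not part of edamame; the drop method is left out, and the way ties are broken for most_frequent (first value returned by Series.mode) is an assumption.

```python
import numpy as np
import pandas as pd

def pad_methods(col, method):
    """Hypothetical helper: complete `method` with 'most_frequent' up to len(col)."""
    return list(method) + ['most_frequent'] * (len(col) - len(method))

def impute(series, how):
    """Hypothetical helper: fill the NaN values of a Series with one strategy."""
    if how == 'mean':
        return series.fillna(series.mean())
    if how == 'median':
        return series.fillna(series.median())
    if how == 'most_frequent':
        return series.fillna(series.mode().iloc[0])
    raise ValueError(f"unsupported method: {how}")

df = pd.DataFrame({'category': ['0', '1', '0', '1', np.nan],
                   'value': [1, 2, 3, 4, np.nan]})

cols = ['value', 'category']
methods = pad_methods(cols, ['mean'])  # one method given for two columns
for name, how in zip(cols, methods):
    df[name] = impute(df[name], how)
print(methods)
print(df)
```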

edamame.eda.eda.identify_types(data: DataFrame) → Tuple[List[str], List[str]]

The function displays the result of the dtypes attribute and returns the column names split by type.

Parameters: data (pd.DataFrame) – A pandas DataFrame passed as input.
Returns: A tuple containing a list of the numerical columns and a list of the categorical/object columns.
Return type: Tuple[List[str], List[str]]

Example

>>> import edamame.eda as eda
>>> import pandas as pd
>>> df = pd.DataFrame({'category': ['A', 'B', 'A', 'B'], 'value': [1, 2, 3, 4]})
>>> quant_col, qual_col = eda.identify_types(df)
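
A split like the one returned by identify_types can be approximated with DataFrame.select_dtypes. This is a minimal sketch, assuming that every non-numerical column counts as categorical; the library may classify columns differently.

```python
import pandas as pd

df = pd.DataFrame({'category': ['A', 'B', 'A', 'B'], 'value': [1, 2, 3, 4]})

# Numerical columns on one side, everything else on the other.
quant_col = df.select_dtypes(include='number').columns.tolist()
qual_col = df.select_dtypes(exclude='number').columns.tolist()
print(quant_col, qual_col)
```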

edamame.eda.eda.inspection(data: DataFrame, threshold: int = 10, bins: int = 50, figsize: Tuple[float, float] = (6.0, 4.0)) → None

The function displays an interactive plot for analysing the distribution of a variable based on the distinct values of the target variable.

Parameters:

data (pd.DataFrame) – A pandas DataFrame passed as input.
threshold (int) – The maximum number of distinct values the target variable can have. Defaults to 10.
bins (int) – The number of bins used by the histograms. Defaults to 50.
figsize (Tuple[float, float]) – A tuple that determines the plot size. Defaults to (6.0, 4.0).

Returns:

None

Example

>>> import edamame.eda as eda
>>> import pandas as pd
>>> df = pd.DataFrame({'category': ['A', 'B', 'A', 'B', 'C', 'A'], 'value': [1, 2, 3, 4, 4, 5]})
>>> eda.inspection(df)

edamame.eda.eda.interaction(data: DataFrame) → None

The function displays an interactive scatterplot for analysing the relationships between numerical columns.

Parameters: data (pd.DataFrame) – A pandas DataFrame passed as input.
Returns: None

Example

>>> import edamame.eda as eda
>>> import pandas as pd
>>> df = pd.DataFrame({'category': ['A', 'B', 'A', 'B', 'C', 'A'], 'value': [1, 2, 3, 4, 4, 5]})
>>> eda.interaction(df)

edamame.eda.eda.missing(data: DataFrame) → Tuple[List[str], List[str], List[str]]

The function displays the following elements:

A table with the percentage of NA records for every column.
A table with the percentage of records equal to 0 for every column.
A table with the percentage of duplicate rows.
A list of lists that contains the names of the numerical columns with NA, the names of the categorical columns with NA and the names of the columns with 0 as a record.

Parameters: data (pd.DataFrame) – A pandas DataFrame passed as input.
Returns: A tuple that contains the names of the numerical columns with NA, the names of the categorical columns with NA and the names of the columns with 0 as a record.
Return type: Tuple[List[str], List[str], List[str]]

Example

>>> import edamame.eda as eda
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'category': ['0', '1', '0', '1', np.nan], 'value': [1, 2, 3, 4, np.nan]})
>>> nan_quant, nan_qual, zero_col = eda.missing(df)
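
The quantities listed above can be reproduced approximately with plain pandas. The sketch below is not the library's code: it assumes that zeros are only counted in numerical columns and that a column is reported as soon as it contains at least one NA or one zero.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'category': ['0', '1', '0', '1', np.nan],
                   'value': [1, 2, 3, 4, np.nan]})

num_cols = df.select_dtypes(include='number').columns

na_pct = df.isna().mean() * 100             # percentage of NA per column
zero_pct = df[num_cols].eq(0).mean() * 100  # percentage of zeros per numerical column
dup_pct = df.duplicated().mean() * 100      # percentage of duplicated rows

# Column names grouped as in the tuple described above.
nan_quant = [c for c in num_cols if df[c].isna().any()]
nan_qual = [c for c in df.columns if c not in num_cols and df[c].isna().any()]
zero_col = [c for c in num_cols if df[c].eq(0).any()]
print(na_pct, zero_pct, dup_pct, sep="\n")
```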

edamame.eda.eda.modify_cardinality(data: DataFrame, col: List[str], threshold: List[int]) → DataFrame

The function returns a pandas dataframe in which the cardinalities of the selected columns are modified.

Parameters:

data (pd.DataFrame) – A pandas DataFrame passed as input.
col (List[str]) – A list of strings containing the names of the columns whose cardinalities are to be modified.
threshold (List[int]) – A list of integers containing one threshold value for every column in col.

Returns:

pd.DataFrame

Example

>>> import edamame.eda as eda
>>> import pandas as pd
>>> df = pd.DataFrame({'category': ['A', 'B', 'A', 'B', 'C', 'A'], 'value': [1, 2, 3, 4, 4, 5], 'type': [0,1,0,1,0,1]})
>>> df = eda.modify_cardinality(df, col = ['category'], threshold=[3])

edamame.eda.eda.num_to_categorical(data: DataFrame, col: List[str]) → DataFrame

The function returns a dataframe with the selected columns converted to the “object” type.

Parameters:

data (pd.DataFrame) – A pandas DataFrame passed as input.
col (List[str]) – A list of strings containing the names of the columns to convert.

Returns:

Dataframe with numerical columns passed converted to categorical.

Return type:

pd.DataFrame

Example

>>> import edamame.eda as eda
>>> import pandas as pd
>>> df = pd.DataFrame({'category': [0, 1, 0, 1], 'value': [1, 2, 3, 4]})
>>> df = eda.num_to_categorical(df, col=["category"])

edamame.eda.eda.num_variable_study(data: DataFrame, col: str, bins: int = 50, epsilon: float = 0.0001, theory: bool = False) → None

The function displays the following transformations of the variable col passed: log(x), sqrt(x), x^2, Box-Cox, 1/x.

Parameters:

data (pd.DataFrame) – A pandas DataFrame passed as input.
col (str) – The name of the dataframe column to study.
bins (int) – The number of bins used by the histograms. Defaults to 50.
epsilon (float) – A constant for handling variables that are not strictly positive. Defaults to 0.0001.
theory (bool) – A boolean value for displaying insight into the transformations applied. Defaults to False.

Returns:

None

Example

>>> import edamame.eda as eda
>>> import pandas as pd
>>> df = pd.DataFrame({'category': ['A', 'B', 'A', 'B', 'C', 'A'], 'value': [1, 2, 3, 4, 4, 5]})
>>> eda.num_variable_study(df, 'value', bins = 50, theory=True)
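
The transformations listed above can be computed by hand as follows. This is only a sketch: adding epsilon to every value as an offset is an assumption about how the constant is used, and scipy.stats.boxcox is used here as one way to obtain the Box-Cox transform, which requires strictly positive data.

```python
import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame({'value': [1, 2, 3, 4, 4, 5]})
epsilon = 0.0001

# Assumption: epsilon shifts the variable so that log, 1/x and Box-Cox are defined.
x = df['value'].astype(float) + epsilon

transformed = pd.DataFrame({
    'log': np.log(x),
    'sqrt': np.sqrt(x),
    'square': x ** 2,
    'reciprocal': 1 / x,
})

# boxcox returns the transformed values and the fitted lambda parameter.
boxcox_values, lam = stats.boxcox(x)
transformed['boxcox'] = boxcox_values
print(transformed.round(3))
```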

edamame.eda.eda.plot_categorical(data: DataFrame, col: List[str]) → None

The function displays a sequence of tables and plots for the categorical columns passed.

Parameters:

data (pd.DataFrame) – A pandas DataFrame passed as input.
col (List[str]) – A list of strings containing the names of the columns to plot.

Returns:

None

Example

>>> import edamame.eda as eda
>>> import pandas as pd
>>> df = pd.DataFrame({'category': ['A', 'B', 'A', 'B'], 'value': [1, 2, 3, 4], 'type': [0,1,0,1]})
>>> num_col, qual_col = eda.identify_types(df)
>>> eda.plot_categorical(df, qual_col)

edamame.eda.eda.plot_numerical(data: DataFrame, col: List[str], bins: int = 50) → None

The function displays a sequence of tables and plots for the numerical columns passed.

Parameters:

data (pd.DataFrame) – A pandas DataFrame passed as input.
col (List[str]) – A list of strings containing the names of the columns to plot.
bins (int) – The number of bins used by the histograms. Defaults to 50.

Returns:

None

Example

>>> import edamame.eda as eda
>>> import pandas as pd
>>> df = pd.DataFrame({'category': ['A', 'B', 'A', 'B'], 'value': [1, 2, 3, 4], 'type': [0,1,0,1]})
>>> num_col, qual_col = eda.identify_types(df)
>>> eda.plot_numerical(df, num_col, bins = 100)

edamame.eda.eda.split_and_scaling(data: DataFrame, target: str, minmaxscaler: bool = False) → Tuple[DataFrame, DataFrame]

The function returns two pandas objects:

The regressor matrix X, which contains all the predictors for the model.

The series y, which contains the values of the response variable.

In addition, the function scales the numerical columns of the X matrix: standard scaling by default, or min-max scaling if minmaxscaler is set to True.

Parameters:

data (pd.DataFrame) – A pandas DataFrame passed as input.
target (str) – The response variable column name.
minmaxscaler (bool) – Select the type of scaling to apply to the numerical columns. Defaults to False (standard scaling).

Returns:

Return the regression matrix and the target column.

Return type:

Tuple[pd.DataFrame, pd.DataFrame]

Example

>>> import edamame.eda as eda
>>> import pandas as pd
>>> df = pd.DataFrame({'category': ['A', 'B', 'A', 'B', 'C', 'A'], 'value': [1, 2, 3, 4, 4, 5], 'target': ['A2', 'A2', 'B2', 'B2', 'A2', 'B2']})
>>> X, y = eda.split_and_scaling(df, 'target')

edamame.eda.eda.view_cardinality(data: DataFrame, col: List[str]) → None

The function displays the number of unique values of the columns passed. It is especially useful for studying the cardinalities of the categorical variables.

Parameters:

data (pd.DataFrame) – A pandas DataFrame passed as input.
col (List[str]) – A list of strings containing the names of the columns for which to show the number of unique values.

Returns:

None

Example

>>> import edamame.eda as eda
>>> import pandas as pd
>>> df = pd.DataFrame({'category': ['A', 'B', 'A', 'B'], 'value': [1, 2, 3, 4], 'type': [0,1,0,1]})
>>> num_col, qual_col = eda.identify_types(df)
>>> eda.view_cardinality(df, qual_col)

tools

edamame.eda.tools.dataframe_review(data: DataFrame) → None

The function checks if the object passed is a pandas dataframe.

Parameters: data (pd.DataFrame) – The object to check.
Raises: TypeError – If the object passed is not a pandas DataFrame.
Returns: None

edamame.eda.tools.dummy_control(data: DataFrame) → None

The function checks if the pandas dataframe passed is encoded with dummy or one-hot encoding (OHE).

Parameters: data (pd.DataFrame) – A pandas DataFrame passed as input.
Raises: TypeError – If the input DataFrame contains non-numerical columns.
Returns: None

edamame.eda.tools.load_model(path: str)

The function loads a model previously saved in the pickle format.

Parameters: path (str) – Path to the model saved as a .pkl file.
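
Loading a pickled object typically looks like the sketch below. It assumes the model was written with pickle.dump from the standard library; the dictionary used as a stand-in model and the file name model.pkl are placeholders for this illustration, not names used by edamame.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model; any picklable object behaves the same way.
model = {'coef': [0.5, 1.5], 'intercept': 0.1}

# Save the object to a temporary .pkl file.
path = os.path.join(tempfile.mkdtemp(), 'model.pkl')
with open(path, 'wb') as f:
    pickle.dump(model, f)

# Load it back, which is the step load_model is described as performing.
with open(path, 'rb') as f:
    restored = pickle.load(f)
print(restored)
```

Pickle files can execute arbitrary code when loaded, so only files from a trusted source should be opened this way.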

edamame.eda.tools.ohe(array: ndarray) → ndarray

The function takes a NumPy array that contains the categorical labels of the target variable and returns its one-hot encoding.

Parameters: array (np.ndarray) – The target variable passed as input.
Returns: The one-hot encoded NumPy array.
Return type: np.ndarray

Example

>>> import edamame.eda as eda
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> y = iris.target
>>> y_ohe = eda.ohe(y)
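
For readers unfamiliar with one-hot encoding, the transformation can be written in a few lines of NumPy. This is an illustrative sketch, not the library's implementation; the column order (sorted unique labels) is a choice made for this example.

```python
import numpy as np

y = np.array([0, 2, 1, 2])  # integer class labels

# Map every label to the index of its class among the sorted unique labels.
classes, idx = np.unique(y, return_inverse=True)

# Row i of the identity matrix selected by idx[i] has a single 1
# in the column of the class of y[i].
y_ohe = np.eye(len(classes))[idx]
print(y_ohe)
```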

edamame.eda.tools.scaling(X: DataFrame, minmaxscaler: bool = False) → DataFrame

The function returns the normalised/standardized matrix.

Parameters:

X (pd.DataFrame) – The model matrix X/X_train/X_test.
minmaxscaler (bool) – Select the type of scaling to apply to the numerical columns. By default the StandardScaler is used. If minmaxscaler is set to True, the numerical columns are transformed to [0,1].

Returns: pd.DataFrame

Example

>>> import edamame.eda as eda
>>> import pandas as pd
>>> X = pd.DataFrame({'category': ['A', 'B', 'A', 'B', 'C', 'A'], 'value': [1, 2, 3, 4, 4, 5]})
>>> X = eda.scaling(X)
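
The two scaling modes can be written out by hand to show what each one does to a numerical column. The function scale_sketch is a hypothetical name used only here; it mirrors the formulas of scikit-learn's StandardScaler (population standard deviation) and MinMaxScaler (range [0, 1]), and leaving non-numerical columns untouched is an assumption.

```python
import pandas as pd

def scale_sketch(X, minmaxscaler=False):
    """Hypothetical helper: scale the numerical columns of a copy of X."""
    out = X.copy()
    for c in out.select_dtypes(include='number').columns:
        v = out[c].astype(float)
        if minmaxscaler:
            # Min-max scaling: map the column to the interval [0, 1].
            out[c] = (v - v.min()) / (v.max() - v.min())
        else:
            # Standard scaling: zero mean, unit (population) standard deviation.
            out[c] = (v - v.mean()) / v.std(ddof=0)
    return out

X = pd.DataFrame({'category': ['A', 'B', 'A', 'B', 'C', 'A'],
                  'value': [1, 2, 3, 4, 4, 5]})
X_std = scale_sketch(X)
X_mm = scale_sketch(X, minmaxscaler=True)
print(X_std)
print(X_mm)
```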

edamame.eda.tools.setup(X: DataFrame, y: DataFrame, dummy: bool = False, seed: int = 42, size: float = 0.25) → Tuple[DataFrame, DataFrame, DataFrame, DataFrame]

The function returns the following elements: X_train, y_train, X_test, y_test.

Parameters:

X (pd.DataFrame) – The model matrix X (features matrix).
y (pd.DataFrame) – The target variable.
dummy (bool) – If False, the function applies one-hot encoding (OHE). If True, it applies dummy encoding. Defaults to False.
seed (int) – Random seed passed to the train_test_split function. Defaults to 42.
size (float) – Size of the test dataset. Defaults to 0.25.

Returns:

X_train, y_train, X_test, y_test.

Return type:

Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame]

Example

>>> import edamame.eda as eda
>>> import pandas as pd
>>> df = pd.DataFrame({'category': ['A', 'B', 'A', 'B', 'C', 'A'], 'value': [1, 2, 3, 4, 4, 5], 'target': ['A2', 'A2', 'B2', 'B2', 'A2', 'B2']})
>>> X, y = eda.split_and_scaling(df, 'target')
>>> X_train, y_train, X_test, y_test = eda.setup(X, y)
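
The difference between the two encodings selected by dummy can be seen with pd.get_dummies. This is only a sketch of the general technique: whether setup uses get_dummies internally, and which level it drops for dummy encoding, are assumptions.

```python
import pandas as pd

X = pd.DataFrame({'category': ['A', 'B', 'A', 'B', 'C', 'A'],
                  'value': [1, 2, 3, 4, 4, 5]})

# One-hot encoding: one indicator column per level of the categorical variable.
X_ohe = pd.get_dummies(X, columns=['category'])

# Dummy encoding: the first level is dropped and acts as the reference level.
X_dummy = pd.get_dummies(X, columns=['category'], drop_first=True)

print(list(X_ohe.columns))
print(list(X_dummy.columns))
```

Note that the return order documented above (X_train, y_train, X_test, y_test) differs from that of scikit-learn's train_test_split, which returns X_train, X_test, y_train, y_test.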