API Reference¶
Classes¶
- class rapa.Project.Classification(project: Optional[Project] = None)[source]¶
Bases:
RAPABase
RAPA class meant for classification problems.
- class rapa.Project.Regression(project: Optional[Project] = None)[source]¶
Bases:
RAPABase
RAPA class meant for regression problems.
- class rapa.base.RAPABase[source]¶
Bases:
object
The base of regression and classification RAPA analysis
- POSSIBLE_TARGET_TYPES = ['ALL', 'ANOMALY', 'BINARY', 'MULTICLASS', 'MULTILABEL', 'REGRESSION', 'UNSTRUCTURED']¶
_classification = None # Set by child classes
target_type = None # Set at initialization
project = None # Set at initialization or with ‘perform_parsimony()’
- create_submittable_dataframe(input_data_df: DataFrame, target_name: str, n_features: int = 19990, n_splits: int = 6, filter_function: Optional[Callable[[DataFrame, ndarray], List[ndarray]]] = None, random_state: Optional[int] = None) DataFrame [source]¶
Prepares the input data for submission as either a regression or classification problem on DataRobot.
Creates pre-determined k-fold cross-validation splits and, if necessary, filters the feature set down to a size that DataRobot can receive as input.
- Parameters
- input_data_df: pandas.DataFrame
The dataset to prepare for submission to DataRobot.
- target_name: str
Name of the prediction target column in input_data_df.
- n_features: int, optional (default: 19990)
The number of features to reduce the feature set in input_data_df down to. DataRobot’s maximum feature set size is 20,000. If n_features equals the number of features already in input_data_df, NaN values are allowed because no feature filtering will occur.
- n_splits: int, optional (default: 6)
The number of cross-validation splits to create. One of the splits will be retained as a holdout split, so by default this function sets up the dataset for 5-fold cross-validation with a holdout. NOTE: CV Fold 0 is the holdout set by default.
- filter_function: callable, optional (default: None)
The function used to calculate the importance of each feature in the initial filtering step that reduces the feature set down to n_features.
This filter function must take a feature matrix as the first input and the target array as the second input, then return two separate arrays containing the feature importance of each feature and the P-value for that correlation, in that order.
When None, the filter function is determined by child class. If an instance of RAPAClassif(), sklearn.feature_selection.f_classif is used. If RAPARegress(), sklearn.feature_selection.f_regression is used. See scikit-learn’s f_classif function for an example: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_regression.html
- random_state: int, optional (default: None)
The random number generator seed for RAPA. Use this parameter to make sure that RAPA will give you the same results each time you run it on the same input data set with that seed.
- Returns
A pandas.DataFrame that contains the pre-determined k-fold cross-validation splits and was filtered down to at most ‘n_features’ features using the ‘filter_function’
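As an illustration of the filter_function contract described above, the sketch below implements a toy importance measure (absolute Pearson correlation, with a large-sample normal approximation for the p-value) rather than rapa’s f_classif/f_regression defaults; the commented rapa calls are hypothetical and assume a pandas DataFrame `df` with a ‘target’ column:

```python
import numpy as np
from math import erfc, sqrt

def abs_corr_filter(X, y):
    # Toy filter function: absolute Pearson correlation as the importance
    # score, with a large-sample normal approximation for the p-value.
    # (rapa's defaults are sklearn's f_classif / f_regression.)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc * yc[:, None]).sum(axis=0) / np.sqrt(
        (Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    t = np.abs(r) * sqrt(n - 2) / np.sqrt(1.0 - r ** 2)
    p_values = np.array([erfc(ti / sqrt(2)) for ti in t])  # two-sided approx.
    return [np.abs(r), p_values]

# Hypothetical usage (requires a configured DataRobot connection and a
# pandas DataFrame `df` with a "target" column):
# clf = rapa.Project.Classification()
# submittable = clf.create_submittable_dataframe(
#     df, target_name="target", n_features=2000,
#     filter_function=abs_corr_filter, random_state=42)
```

The function returns the two arrays in the order the parameter documentation requires: importances first, then p-values.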
- perform_parsimony(feature_range: List[Union[float, int]], project: Optional[Union[Project, str]] = None, starting_featurelist_name: str = 'Informative Features', featurelist_prefix: str = 'RAPA Reduced to', mode: str = 'auto', lives: Optional[int] = None, cv_average_mean_error_limit: Optional[float] = None, feature_impact_metric: str = 'median', progress_bar: bool = True, to_graph: Optional[List[str]] = None, metric: Optional[str] = None, verbose: bool = True)[source]¶
Performs parsimony analysis by repeatedly extracting feature importance from DataRobot models and creating new models with reduced features (smaller featurelists).
NOTICE: Feature impact scores are only gathered from models that have had their cross-validation accuracy tested!
- Parameters
- feature_range: list[int] | list[float]
Either a list containing integers representing desired featurelist lengths, or a list containing floats representing desired featurelist percentages (of the original featurelist size)
- project: datarobot.Project | str, optional (default = None)
Either a datarobot project, or a string of its id or name. If None, uses the project that was provided to create the rapa class
- starting_featurelist_name: str, optional (default = ‘Informative Features’)
The name or id of the featurelist that rapa will start parsimony analysis with
- featurelist_prefix: str, optional (default = ‘RAPA Reduced to’)
The desired prefix for the featurelists that rapa creates in datarobot. Each featurelist will start with the prefix, include a space, and then end with the number of features in that featurelist
- mode: str (enum), optional (default: datarobot.AUTOPILOT_MODE.FULL_AUTO)
The modeling mode to start the DataRobot project in. Options:
datarobot.AUTOPILOT_MODE.FULL_AUTO
datarobot.AUTOPILOT_MODE.QUICK
datarobot.AUTOPILOT_MODE.MANUAL
datarobot.AUTOPILOT_MODE.COMPREHENSIVE: Runs all blueprints in the repository (warning: this may be extremely slow).
- lives: int, optional (default = None)
The number of times allowed for reducing the featurelist and obtaining a worse model. By default, ‘lives’ are off and the entire ‘feature_range’ will be run, but if supplied a number >= 0, then that is the number of ‘lives’ there are.
Ex: lives = 0, feature_range = [100, 90, 80, 50]. RAPA finds that after making all the models for the length-80 featurelist, the ‘best’ model was created with the length-90 featurelist, so it stops and doesn’t make a featurelist of length 50.
Similar to ‘lives’ in DataRobot’s Feature Importance Rank Ensembling for advanced feature selection (FIRE) package: https://www.datarobot.com/blog/using-feature-importance-rank-ensembling-fire-for-advanced-feature-selection/
- cv_average_mean_error_limit: float, optional (default = None)
The limit of cross-validation mean error to help avoid overfitting. By default, the limit is off and each entry in ‘feature_range’ will be run. The limit exists only if supplied a number >= 0.0
- Ex: cv_average_mean_error_limit = 0.1, feature_range = [100, 90, 80, 50]
RAPA finds that the average AUC for each CV fold is [.8, .6, .9, .5] respectively; the mean of these is 0.7, so the average error is ±0.15. Since 0.15 >= cv_average_mean_error_limit, the training stops.
- feature_impact_metric: str, optional (default = ‘median’)
How RAPA will decide each feature’s importance over every model in a featurelist
- Options:
median
mean
cumulative
- progress_bar: bool, optional (default = True)
If True, displays a simple progress bar showing complete and incomplete featurelists. If False, provides updates in stdout (e.g. current worker count, current featurelist, etc.)
- to_graph: List[str], optional (default = None)
- A list of keys choosing which graphs to produce. Possible Keys:
‘models’: seaborn boxplot with model performances with provided metric
‘feature_performance’: matplotlib.pyplot stackplot of feature performances
- metric: str, optional (default = None)
The metric used for scoring models, when finding the ‘best’ model, and when plotting model performance
When None, the metric is determined by what class inherits from base. For instance, a RAPAClassif instance’s default is ‘AUC’, and RAPARegress is ‘R Squared’
- verbose: bool, optional (default = True)
If True, prints updates from DataRobot and rapa during parsimonious feature reduction
- Returns
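A full parsimony run typically follows submission of a prepared dataset. The sketch below is hypothetical: it assumes the DataRobot API has already been initialized (e.g. with rapa.utils.initialize_dr_api) and that `submittable_df` was produced by create_submittable_dataframe; the project name and ‘target’ column are placeholders.

```python
import rapa

clf = rapa.Project.Classification()
project = clf.submit_datarobot_project(
    submittable_df, target_name="target", project_name="rapa_demo")

clf.perform_parsimony(
    feature_range=[0.9, 0.7, 0.5, 0.3, 0.1],  # percentages of the original featurelist
    project=project,
    featurelist_prefix="RAPA Reduced to",
    lives=3,                          # stop after 3 reductions without improvement
    feature_impact_metric="median",
    to_graph=["models", "feature_performance"],
)
```

Passing floats in feature_range requests featurelists sized as fractions of the starting featurelist; a list of integers would request exact lengths instead.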
- submit_datarobot_project(input_data_df: DataFrame, target_name: str, project_name: str, target_type: Optional[str] = None, worker_count: int = -1, metric: Optional[str] = None, mode: str = 'auto', random_state: Optional[int] = None) Project [source]¶
Submits the input data to DataRobot as a new modeling project.
It is suggested to prepare the input_data_df using the ‘create_submittable_dataframe’ function first with an instance of either RAPAClassif or RAPARegress.
- Parameters
- input_data_df: pandas.DataFrame
The dataset to submit to DataRobot.
- target_name: str
Name of the prediction target column in input_data_df.
- project_name: str
Name of the project in DataRobot.
- target_type: str (enum)
Indicator to DataRobot of whether the new modeling project should be a binary classification, multiclass classification, or regression project.
- Options:
datarobot.TARGET_TYPE.BINARY
datarobot.TARGET_TYPE.REGRESSION
datarobot.TARGET_TYPE.MULTICLASS
- worker_count: int, optional (default: -1)
The number of worker engines to assign to the DataRobot project. By default, -1 tells DataRobot to use all available worker engines.
- metric: str, optional (default: None)
Name of the metric to use for evaluating models. You can query the metrics available for the target by way of Project.get_metrics. If none is specified, then the default recommended by DataRobot is used.
- mode: str (enum), optional (default: datarobot.AUTOPILOT_MODE.FULL_AUTO)
The modeling mode to start the DataRobot project in.
- Options:
datarobot.AUTOPILOT_MODE.FULL_AUTO
datarobot.AUTOPILOT_MODE.QUICK
datarobot.AUTOPILOT_MODE.MANUAL
datarobot.AUTOPILOT_MODE.COMPREHENSIVE: Runs all blueprints in the repository (this may be extremely slow).
- random_state: int, optional (default: None)
The random number generator seed for DataRobot. Use this parameter to make sure that DataRobot will give you the same results each time you run it on the same input data set with that seed.
- Returns
The datarobot.Project created for the submitted data
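For a regression problem, the submission can be sketched with explicit options as below; this is hypothetical usage that assumes a configured DataRobot connection and a `submittable_df` prepared beforehand (the project name and ‘target’ column are placeholders).

```python
import rapa
import datarobot as dr

reg = rapa.Project.Regression()
project = reg.submit_datarobot_project(
    input_data_df=submittable_df,
    target_name="target",
    project_name="rapa_regression_demo",
    target_type=dr.TARGET_TYPE.REGRESSION,
    worker_count=-1,                  # use all available worker engines
    mode=dr.AUTOPILOT_MODE.QUICK,
    random_state=42,
)
```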
Utility Functions¶
- rapa.utils.feature_performance_stackplot(project: Project, featurelist_prefix: str = 'RAPA Reduced to', starting_featurelist: Optional[str] = None, feature_impact_metric: str = 'median', metric: Optional[str] = None, vlines: bool = False)[source]¶
Utilizes matplotlib.pyplot.stackplot to show feature performance during parsimony analysis.
- Parameters
- project: datarobot.Project
The project to plot feature performance for
- featurelist_prefix: str, optional (default = ‘RAPA Reduced to’)
The desired prefix for the featurelists that will be used for plotting feature performance. Each featurelist will start with the prefix, include a space, and then end with the number of features in that featurelist
- starting_featurelist: str, optional (default = None)
The starting featurelist used for parsimony analysis. If None, only the featurelists with the desired prefix in featurelist_prefix will be plotted
- feature_impact_metric: str, optional (default = ‘median’)
Which metric to use when finding the most representative feature importance of all models in the featurelist
- Options:
median
mean
cumulative
- metric: str, optional (default = ‘AUC’ or ‘RMSE’) [classification and regression]
Which metric to use when finding feature importance of each model
- vlines: bool, optional (default = False)
Whether to add vertical lines at the featurelist lengths
- Returns
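A minimal usage sketch, assuming `project` is the datarobot.Project that perform_parsimony ran against and that the default featurelist prefix was used:

```python
import rapa

rapa.utils.feature_performance_stackplot(
    project,
    featurelist_prefix="RAPA Reduced to",
    feature_impact_metric="median",
    vlines=True,   # mark each featurelist length on the x-axis
)
```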
- rapa.utils.find_project(project: str) Project [source]¶
Uses the DataRobot api to find a current project.
Uses datarobot.Project.get() and datarobot.Project.list() to test whether ‘project’ is either an id or a name of a project in DataRobot, then returns the project found.
- Parameters
- project: str
Either the id or the name of the project to find
- Returns
The first/only project returned when searching by id or by project name. Returns None if no project is found.
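A minimal usage sketch (the project name "rapa_demo" is a placeholder, and a DataRobot connection is assumed to be configured):

```python
import rapa

project = rapa.utils.find_project("rapa_demo")
if project is None:
    raise ValueError("No project with that id or name was found")
```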
- rapa.utils.get_best_model(project: Project, featurelist_prefix: Optional[str] = None, starred: bool = False, metric: Optional[str] = None, fold: str = 'crossValidation', highest: Optional[bool] = None) Model [source]¶
Attempts to find the ‘best’ model in a DataRobot project by searching cross-validation scores of all the models in a supplied project.
CURRENTLY SUPPORTS METRICS WHERE HIGHER = BETTER
Warning
Actually finding the ‘best’ model takes more than averaging cross-validation scores, and it is suggested that the ‘best’ model is decided and starred in DataRobot. (Make sure ‘starred = True’ if starring the ‘best’ model)
Note
Some models may not have scores for the supplied fold because they were not run. These models are ignored by this function. Make sure all models of interest have scores for the fold being provided if those models should be considered.
- Parameters
- project: datarobot.Project
The project to search for the ‘best’ model
- featurelist_prefix: str, optional (default = None)
The desired featurelist prefix used to search in for models using specific rapa featurelists
- starred: bool, optional (default = False)
If True, return the starred model. If there is more than one starred model, warn the user and return the ‘best’ one
- metric: str, optional (default = ‘AUC’ or ‘RMSE’) [classification and regression]
What model metric to use when finding the ‘best’
- fold: str, optional (default = ‘crossValidation’)
- The fold of data used in DataRobot. Options are as follows:
[‘validation’, ‘crossValidation’, ‘holdout’, ‘training’, ‘backtestingScores’, ‘backtesting’]
- highest: bool, optional (default for classification = True, default for regression = False)
Whether to take the highest value (highest = True) or the lowest value (highest = False). Change this when the assumed direction of the metric (highest is best for classification, lowest is best for regression) does not match the metric being used
- Returns
The ‘best’ model from the provided datarobot project
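A minimal usage sketch, assuming `project` was retrieved with rapa.utils.find_project and that the rapa featurelists use the default prefix:

```python
import rapa

best = rapa.utils.get_best_model(
    project,
    featurelist_prefix="RAPA Reduced to",
    metric="AUC",
    fold="crossValidation",
)
```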
- rapa.utils.get_featurelist(featurelist: str, project: Project) Featurelist [source]¶
Uses the DataRobot api to search for a desired featurelist.
Uses datarobot.Project.get_featurelists() to retrieve all the featurelists in the project. Then, it searches the list for ids, and if it doesn’t find any, it searches the list again for names. Returns the first featurelist it finds.
- Parameters
- featurelist: str
Either the id or the name of the featurelist to find
- project: datarobot.Project
The project that is being searched for the featurelist
- Returns
The datarobot Featurelist that was found
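A minimal usage sketch; the featurelist name "RAPA Reduced to 50" is a placeholder following the default prefix convention, and `project` is assumed to be an existing datarobot.Project:

```python
import rapa

featurelist = rapa.utils.get_featurelist("RAPA Reduced to 50", project)
```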
- rapa.utils.get_starred_model(project: Project, metric: Optional[str] = None, featurelist_prefix: Optional[str] = None) Model [source]¶
Alias for rapa.utils.get_best_model() with starred = True
- rapa.utils.initialize_dr_api(token_key: Optional[str] = None, file_path: str = 'data/dr-tokens.pkl', endpoint: str = 'https://app.datarobot.com/api/v2')[source]¶
Initializes the DataRobot API with a pickled dictionary created by the user.
Accesses a file that should be a pickled dictionary. This dictionary has the API token as the value to the provided token_key. Ex: {token_key: ‘API_TOKEN’}
- Parameters
- token_key: str, optional (default = None)
The key in the pickled dictionary whose value is the DataRobot API token
- file_path: str, optional (default = ‘data/dr-tokens.pkl’)
Path to the pickled dictionary containing the API token
- endpoint: str, optional (default = ‘https://app.datarobot.com/api/v2’)
The endpoint is usually the URL you would use to log into the DataRobot Web User Interface
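Creating the pickled token dictionary is a one-time setup step. The sketch below writes one with Python's standard pickle module; the key name and token value are placeholders, and the commented rapa call shows the subsequent usage (a real API token is required for it to succeed):

```python
import os
import pickle

# Build the dictionary initialize_dr_api expects: {token_key: API_TOKEN}.
os.makedirs("data", exist_ok=True)
with open("data/dr-tokens.pkl", "wb") as f:
    pickle.dump({"my_key": "YOUR-DATAROBOT-API-TOKEN"}, f)

# Then, in the analysis script:
# import rapa
# rapa.utils.initialize_dr_api(token_key="my_key")
```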
- rapa.utils.parsimony_performance_boxplot(project: Project, featurelist_prefix: str = 'RAPA Reduced to', starting_featurelist: Optional[str] = None, metric: Optional[str] = None, split: str = 'crossValidation', featurelist_lengths: Optional[list] = None)[source]¶
Uses seaborn’s boxplot function to plot featurelist size vs. performance for all models that use the given featurelist prefix. There is a different boxplot for each featurelist length.
- Parameters
- project: datarobot.Project
The project to plot parsimony performance for
- featurelist_prefix: str, optional (default = ‘RAPA Reduced to’)
The desired prefix for the featurelists that will be used for plotting parsimony performance. Each featurelist will start with the prefix, include a space, and then end with the number of features in that featurelist
- starting_featurelist: str, optional (default = None)
The starting featurelist used for parsimony analysis. If None, only the featurelists with the desired prefix in featurelist_prefix will be plotted
- metric: str, optional (default = ‘AUC’ or ‘RMSE’) [classification and regression]
The metric used for plotting accuracy of models
- split: str, optional (default = ‘crossValidation’)
Which split’s performance to plot. Options: [‘crossValidation’, ‘holdout’]
- featurelist_lengths: list, optional (default = None)
A list of featurelist lengths to plot
- Returns
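A minimal usage sketch, assuming `project` is the datarobot.Project used for the parsimony run; the featurelist lengths given are placeholder values:

```python
import rapa

rapa.utils.parsimony_performance_boxplot(
    project,
    featurelist_prefix="RAPA Reduced to",
    split="crossValidation",
    featurelist_lengths=[100, 90, 80, 50],  # restrict the plot to these lengths
)
```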