rapa package
Submodules
rapa.Project module
rapa.base module
- class rapa.base.RAPABase
Bases: object
The base of regression and classification RAPA analysis
- POSSIBLE_TARGET_TYPES = ['ALL', 'ANOMALY', 'BINARY', 'MULTICLASS', 'MULTILABEL', 'REGRESSION', 'UNSTRUCTURED']
_classification = None # Set by child classes
target_type = None # Set at initialization
project = None # Set at initialization or with ‘perform_parsimony()’
- create_submittable_dataframe(input_data_df: DataFrame, target_name: str, n_features: int = 19990, n_splits: int = 6, filter_function: Optional[Callable[[DataFrame, ndarray], List[ndarray]]] = None, random_state: Optional[int] = None) → DataFrame
Prepares the input data for submission as either a regression or classification problem on DataRobot.
Creates pre-determined k-fold cross-validation splits and, if necessary, filters the feature set down to a size that DataRobot can accept as input.
- Parameters
- target_name: str
Name of the prediction target column in input_data_df.
- n_features: int, optional (default: 19990)
The number of features to reduce the feature set in input_data_df down to. DataRobot’s maximum feature set size is 20,000. If n_features equals the number of features in input_data_df, NaN values are allowed because no feature filtering will occur.
- n_splits: int, optional (default: 6)
The number of cross-validation splits to create. One of the splits will be retained as a holdout split, so by default this function sets up the dataset for 5-fold cross-validation with a holdout. NOTE: CV Fold 0 is the holdout set by default.
- filter_function: callable, optional (default: None)
The function used to calculate the importance of each feature in the initial filtering step that reduces the feature set down to n_features.
This filter function must take a feature matrix as the first input and the target array as the second input, then return two separate arrays containing the feature importance of each feature and the P-value for that correlation, in that order.
When None, the filter function is determined by the child class: an instance of RAPAClassif uses sklearn.feature_selection.f_classif, and an instance of RAPARegress uses sklearn.feature_selection.f_regression. See scikit-learn’s f_regression function for an example: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_regression.html
- random_state: int, optional (default: None)
The random number generator seed for RAPA. Use this parameter to make sure that RAPA will give you the same results each time you run it on the same input data set with that seed.
- Returns
A DataFrame that contains pre-determined k-fold cross-validation splits and has been filtered down to ‘n_features’ features using the ‘filter_function’.
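The fold layout described for n_splits can be pictured with a small standard-library sketch. The helper `assign_folds` is hypothetical and is not part of rapa; it uses a plain shuffled round-robin assignment, which may differ from how rapa actually builds its splits (for example, rapa may stratify by target). It only illustrates that n_splits = 6 yields one holdout fold (fold 0) plus five cross-validation folds, and that a fixed random_state reproduces the same assignment.

```python
import random
from collections import Counter


def assign_folds(n_rows, n_splits=6, random_state=None):
    """Assign each row a fold label in 0..n_splits-1 (hypothetical helper).

    Fold 0 plays the role of the holdout; folds 1..n_splits-1 are the
    cross-validation folds.
    """
    rng = random.Random(random_state)
    order = list(range(n_rows))
    rng.shuffle(order)  # shuffle row positions reproducibly
    folds = [0] * n_rows
    for position, row in enumerate(order):
        folds[row] = position % n_splits  # round-robin keeps folds balanced
    return folds


folds = assign_folds(n_rows=60, n_splits=6, random_state=42)
fold_sizes = Counter(folds)
holdout_rows = [i for i, fold in enumerate(folds) if fold == 0]
cv_fold_labels = sorted(set(folds) - {0})

print(len(holdout_rows))  # rows reserved for the holdout
print(cv_fold_labels)     # labels of the cross-validation folds
```

With 60 rows and 6 splits, each fold receives 10 rows, so the holdout takes one sixth of the data and the remaining five folds are used for cross-validation.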
- perform_parsimony(feature_range: List[Union[float, int]], project: Optional[Union[Project, str]] = None, starting_featurelist_name: str = 'Informative Features', featurelist_prefix: str = 'RAPA Reduced to', mode: str = 'auto', lives: Optional[int] = None, cv_average_mean_error_limit: Optional[float] = None, feature_impact_metric: str = 'median', progress_bar: bool = True, to_graph: Optional[List[str]] = None, metric: Optional[str] = None, verbose: bool = True)
Performs parsimony analysis by repeatedly extracting feature importance from DataRobot models and creating new models with reduced features (smaller feature lists).
NOTICE: Feature impact scores are only gathered from models that have had their cross-validation accuracy tested!
- Parameters
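- feature_range: list of int or float
Either a list of ints representing desired featurelist lengths, or a list containing floats representing desired featurelist percentages (of the original featurelist size)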
- project: datarobot.Project | str, optional (default = None)
Either a DataRobot project, or a string of its ID or name. If None, uses the project that was provided when the rapa class was created
- starting_featurelist_name: str, optional (default = ‘Informative Features’)
The name or ID of the featurelist that rapa will start parsimony analysis with
- featurelist_prefix: str, optional (default = ‘RAPA Reduced to’)
The desired prefix for the featurelists that rapa creates in datarobot. Each featurelist will start with the prefix, include a space, and then end with the number of features in that featurelist
- mode: str (enum), optional (default: datarobot.AUTOPILOT_MODE.FULL_AUTO)
The modeling mode to start the DataRobot project in. Options:
datarobot.AUTOPILOT_MODE.FULL_AUTO
datarobot.AUTOPILOT_MODE.QUICK
datarobot.AUTOPILOT_MODE.MANUAL
datarobot.AUTOPILOT_MODE.COMPREHENSIVE: Runs all blueprints in the repository (warning: this may be extremely slow).
- lives: int, optional (default = None)
The number of times RAPA may reduce the featurelist and obtain a worse model before stopping. By default, ‘lives’ are off and the entire ‘feature_range’ will be run; if a number >= 0 is supplied, that is the number of ‘lives’.
Ex: lives = 0, feature_range = [100, 90, 80, 50]. RAPA finds that after making all the models for the length 80 featurelist, the ‘best’ model was created with the length 90 featurelist, so it stops and doesn’t make a featurelist of length 50.
Similar to the ‘lifes’ in DataRobot’s Feature Importance Rank Ensembling (FIRE) package for advanced feature selection: https://www.datarobot.com/blog/using-feature-importance-rank-ensembling-fire-for-advanced-feature-selection/
- cv_average_mean_error_limit: float, optional (default = None)
The limit of cross-validation mean error, used to help avoid overfitting. By default, the limit is off and every entry in ‘feature_range’ will be run. The limit exists only if a number >= 0.0 is supplied.
- Ex: cv_average_mean_error_limit = 2.5, feature_range = [100, 90, 80, 50]
RAPA finds that the average AUC for each CV fold is [.8, .6, .9, .5] respectively; the mean of these is 0.7, and the average error is ±0.15. If 0.15 >= cv_average_mean_error_limit, the training stops.
- feature_impact_metric: str, optional (default = ‘median’)
How RAPA will decide each feature’s importance over every model in a feature list.
Options: median, mean, cumulative
- progress_bar: bool, optional (default = True)
If True, displays a simple progress bar showing complete and incomplete featurelists. If False, provides updates in stdout (e.g. current worker count, current featurelist).
- to_graph: List[str], optional (default = None)
- A list of keys choosing which graphs to produce. Possible Keys:
‘models’: seaborn boxplot with model performances with provided metric
‘feature_performance’: matplotlib.pyplot stackplot of feature performances
- metric: str, optional (default = None)
The metric used for scoring models, when finding the ‘best’ model, and when plotting model performance
When None, the metric is determined by what class inherits from base. For instance, a RAPAClassif instance’s default is ‘AUC’, and RAPARegress is ‘R Squared’
- verbose: bool, optional (default = True)
If True, prints updates from DataRobot and rapa during parsimonious feature reduction
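The two stopping rules described for ‘lives’ and ‘cv_average_mean_error_limit’ can be sketched in a few lines of plain Python. Both helpers below are hypothetical and are not rapa's code: `sizes_run_with_lives` assumes that a higher score is better and that one life is spent each time a reduced featurelist fails to match the best score so far, and `cv_mean_error` assumes the "average error" is the mean absolute deviation of the fold scores from their mean, which reproduces the 0.15 in the example above. The scores are made up.

```python
from statistics import mean


def sizes_run_with_lives(feature_range, score_by_size, lives=None):
    """Return the featurelist sizes that get modeled before stopping.

    Hypothetical sketch: assumes a higher score is better and that a life
    is spent each time a reduction fails to match the best score so far.
    """
    run = []
    best_so_far = None
    remaining = lives
    for size in feature_range:
        run.append(size)  # models for this size are built and scored
        score = score_by_size[size]
        if best_so_far is None or score >= best_so_far:
            best_so_far = score
            continue
        if remaining is None:
            continue  # lives are off: always run the whole range
        if remaining == 0:
            break  # no lives left: stop reducing
        remaining -= 1
    return run


def cv_mean_error(fold_scores):
    """Mean absolute deviation of the fold scores from their mean."""
    center = mean(fold_scores)
    return mean(abs(score - center) for score in fold_scores)


scores = {100: 0.80, 90: 0.85, 80: 0.83, 50: 0.90}  # made-up best scores
print(sizes_run_with_lives([100, 90, 80, 50], scores, lives=0))     # [100, 90, 80]
print(sizes_run_with_lives([100, 90, 80, 50], scores, lives=None))  # [100, 90, 80, 50]

error = cv_mean_error([0.8, 0.6, 0.9, 0.5])
print(round(error, 2))  # compare this value against the chosen limit
```

In this sketch, lives = 0 stops after the length 80 featurelist because its best score falls below the one found at length 90, so the length 50 featurelist is never built.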
- Returns
- submit_datarobot_project(input_data_df: DataFrame, target_name: str, project_name: str, target_type: Optional[str] = None, worker_count: int = -1, metric: Optional[str] = None, mode: str = 'auto', random_state: Optional[int] = None) → Project
Submits the input data to DataRobot as a new modeling project.
It is suggested to prepare the input_data_df using the ‘create_submittable_dataframe’ function first with an instance of either RAPAClassif or RAPARegress.
- Parameters
- target_name: str
Name of the prediction target column in input_data_df.
- project_name: str
Name of the project in DataRobot.
- target_type: str (enum)
Indicator to DataRobot of whether the new modeling project should be a binary classification, multiclass classification, or regression project.
- Options:
datarobot.TARGET_TYPE.BINARY
datarobot.TARGET_TYPE.REGRESSION
datarobot.TARGET_TYPE.MULTICLASS
- worker_count: int, optional (default: -1)
The number of worker engines to assign to the DataRobot project. By default, -1 tells DataRobot to use all available worker engines.
- metric: str, optional (default: None)
Name of the metric to use for evaluating models. You can query the metrics available for the target by way of Project.get_metrics. If none is specified, then the default recommended by DataRobot is used.
- mode: str (enum), optional (default: datarobot.AUTOPILOT_MODE.FULL_AUTO)
The modeling mode to start the DataRobot project in.
- Options:
datarobot.AUTOPILOT_MODE.FULL_AUTO
datarobot.AUTOPILOT_MODE.QUICK
datarobot.AUTOPILOT_MODE.MANUAL
datarobot.AUTOPILOT_MODE.COMPREHENSIVE: Runs all blueprints in the repository (this may be extremely slow).
- random_state: int, optional (default: None)
The random number generator seed for DataRobot. Use this parameter to make sure that DataRobot will give you the same results each time you run it on the same input data set with that seed.
- Returns
rapa.config module
rapa.utils module
- rapa.utils.feature_performance_stackplot(project: Project, featurelist_prefix: str = 'RAPA Reduced to', starting_featurelist: Optional[str] = None, feature_impact_metric: str = 'median', metric: Optional[str] = None, vlines: bool = False)
Utilizes matplotlib.pyplot.stackplot to show feature performance during parsimony analysis.
- Parameters
- featurelist_prefix: str, optional (default = ‘RAPA Reduced to’)
The desired prefix for the featurelists that will be used for plotting feature performance. Each featurelist will start with the prefix, include a space, and then end with the number of features in that featurelist
- starting_featurelist: str, optional (default = None)
The starting featurelist used for parsimony analysis. If None, only the featurelists with the desired prefix in featurelist_prefix will be plotted
- feature_impact_metric: str, optional (default = ‘median’)
Which metric to use when finding the most representative feature importance of all models in the featurelist
- Options:
median
mean
cumulative
- metric: str, optional (default = ‘AUC’ or ‘RMSE’) [classification and regression]
Which metric to use when finding feature importance of each model
- vlines: bool, optional (default = False)
Whether to add vertical lines at the featurelist lengths or not, False by default
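As an illustration of the three feature_impact_metric options, the sketch below combines per-model feature impact scores into one value per feature. The function `aggregate_feature_impact` and the impact numbers are hypothetical; rapa's own aggregation may differ in detail (for example, in how it normalizes scores). ‘cumulative’ is interpreted here as a plain sum across models, which is an assumption.

```python
from statistics import mean, median

# Maps each option name to the function used to combine scores across models.
AGGREGATORS = {"median": median, "mean": mean, "cumulative": sum}


def aggregate_feature_impact(impacts_by_model, feature_impact_metric="median"):
    """Combine per-model impact scores into one score per feature."""
    aggregate = AGGREGATORS[feature_impact_metric]
    per_feature = {}
    for impacts in impacts_by_model:
        for feature, impact in impacts.items():
            per_feature.setdefault(feature, []).append(impact)
    return {feature: aggregate(values) for feature, values in per_feature.items()}


# Made-up impact scores from three models trained on the same featurelist.
impacts_by_model = [
    {"age": 1.0, "income": 0.5, "zip": 0.0},
    {"age": 0.5, "income": 0.25, "zip": 0.25},
    {"age": 0.75, "income": 1.0, "zip": 0.125},
]

by_median = aggregate_feature_impact(impacts_by_model, "median")
by_sum = aggregate_feature_impact(impacts_by_model, "cumulative")
ranking = sorted(by_median, key=by_median.get, reverse=True)
print(ranking)  # features ordered from most to least important by median
```

The median is less sensitive than the mean to a single model that assigns an unusually high or low impact to a feature, which is one plausible reason for it being the default.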
- Returns
- rapa.utils.find_project(project: str) → Project
Uses the DataRobot api to find a current project.
Uses datarobot.Project.get() and datarobot.Project.list() to test whether ‘project’ is either an ID or possibly a name of a project in DataRobot, then returns the project found.
- Parameters
- Returns
first/only project returned by searching by project name. Returns None if the list is empty.
- rapa.utils.get_best_model(project: Project, featurelist_prefix: Optional[str] = None, starred: bool = False, metric: Optional[str] = None, fold: str = 'crossValidation', highest: Optional[bool] = None) → Model
Attempts to find the ‘best’ model in a DataRobot project by searching the cross-validation scores of all the models in the supplied project.
CURRENTLY SUPPORTS METRICS WHERE HIGHER = BETTER
Warning
Actually finding the ‘best’ model takes more than averaging cross-validation scores, and it is suggested that the ‘best’ model is decided and starred in DataRobot. (Make sure ‘starred = True’ if starring the ‘best’ model)
Note
Some models may not have scores for the supplied fold because they were not run. These models are ignored by this function. Make sure all models of interest have scores for the fold being provided if those models should be considered.
- Parameters
- featurelist_prefix: str, optional (default = None)
The desired featurelist prefix used to search in for models using specific rapa featurelists
- starred: bool, optional (default = False)
If True, return the starred model. If there is more than one starred model, then warn the user and return the ‘best’ one
- metric: str, optional (default = ‘AUC’ or ‘RMSE’) [classification and regression]
What model metric to use when finding the ‘best’
- fold: str, optional (default = ‘crossValidation’)
- The fold of data used in DataRobot. Options are as follows:
[‘validation’, ‘crossValidation’, ‘holdout’, ‘training’, ‘backtestingScores’, ‘backtesting’]
- highest: bool, optional (default for classification = True, default for regression = False)
Whether to take the highest value (highest = True), or the lowest value (highest = False). Change this when assumed switch is
- Returns
from the provided datarobot project
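The interaction between the ‘highest’ flag and models that lack a score for the chosen fold can be sketched without the DataRobot client. The function `pick_best` and the score dictionaries are hypothetical stand-ins; real scores would come from the models in a DataRobot project. Models whose score is None are skipped, mirroring the note above about models that were not run on the fold.

```python
def pick_best(scores_by_model, highest=True):
    """Return the name of the best-scoring model, or None if none are scored."""
    # Ignore models that have no score for the requested fold.
    scored = {name: score for name, score in scores_by_model.items() if score is not None}
    if not scored:
        return None
    chooser = max if highest else min  # AUC-like metrics vs. error-like metrics
    return chooser(scored, key=scored.get)


auc_scores = {"model_a": 0.81, "model_b": 0.88, "model_c": None}   # made-up values
rmse_scores = {"model_a": 2.1, "model_b": 2.7, "model_c": None}    # made-up values

print(pick_best(auc_scores, highest=True))    # model_b
print(pick_best(rmse_scores, highest=False))  # model_a
```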
- rapa.utils.get_featurelist(featurelist: str, project: Project) → Featurelist
Uses the DataRobot api to search for a desired featurelist.
Uses datarobot.Project.get_featurelists() to retrieve all the featurelists in the project. Then, it searches the list for matching IDs, and if it doesn’t find any, it searches the list again for matching names. Returns the first featurelist it finds.
- Parameters
- project: datarobot.Project
The project that is being searched for the featurelist
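The ID-first, name-second search order described above can be sketched with stand-in objects. `FakeFeaturelist` and `find_featurelist` are hypothetical and exist only for this illustration; the real function works on the objects returned by datarobot.Project.get_featurelists(). Exact string matching is assumed here, and rapa's actual matching rule may be looser.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FakeFeaturelist:
    """Stand-in for a DataRobot featurelist with only the fields used here."""
    id: str
    name: str


def find_featurelist(query: str, featurelists: List[FakeFeaturelist]) -> Optional[FakeFeaturelist]:
    """Look for a matching ID first, then fall back to a matching name."""
    for featurelist in featurelists:
        if featurelist.id == query:
            return featurelist
    for featurelist in featurelists:
        if featurelist.name == query:
            return featurelist
    return None


available = [
    FakeFeaturelist(id="abc123", name="Informative Features"),
    FakeFeaturelist(id="def456", name="RAPA Reduced to 90"),
]

print(find_featurelist("def456", available).name)             # RAPA Reduced to 90
print(find_featurelist("Informative Features", available).id)  # abc123
```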
- Returns
- rapa.utils.get_starred_model(project: Project, metric: Optional[str] = None, featurelist_prefix: Optional[str] = None) → Model
Alias for rapa.utils.get_best_model() with starred = True.
- rapa.utils.initialize_dr_api(token_key: Optional[str] = None, file_path: str = 'data/dr-tokens.pkl', endpoint: str = 'https://app.datarobot.com/api/v2')
Initializes the DataRobot API with a pickled dictionary created by the user.
Accesses a file that should be a pickled dictionary. This dictionary has the API token as the value to the provided token_key. Ex: {token_key: ‘API_TOKEN’}
- Parameters
- file_path: str, optional (default = ‘data/dr-tokens.pkl’)
Path to the pickled dictionary containing the API token
- endpoint: str, optional (default = ‘https://app.datarobot.com/api/v2’)
The endpoint is usually the URL you would use to log into the DataRobot Web User Interface
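The pickled dictionary that initialize_dr_api expects can be created with the standard library. The sketch below writes a dictionary of the form {token_key: ‘API_TOKEN’} and reads it back; the key name "tutorial" and the token value are placeholders, and a temporary directory is used so the example leaves no files behind. In practice the file would be saved at the path passed as file_path.

```python
import pickle
import tempfile
from pathlib import Path

# Placeholder key and token; replace "API_TOKEN" with a real DataRobot API token.
tokens = {"tutorial": "API_TOKEN"}

with tempfile.TemporaryDirectory() as tmp_dir:
    token_path = Path(tmp_dir) / "dr-tokens.pkl"

    # Write the dictionary as a pickle file.
    with token_path.open("wb") as handle:
        pickle.dump(tokens, handle)

    # Read it back the way a consumer of the file would.
    with token_path.open("rb") as handle:
        loaded = pickle.load(handle)

token = loaded["tutorial"]  # the value looked up by token_key
print(sorted(loaded))
```

With a file like this saved at data/dr-tokens.pkl, the call would presumably be rapa.utils.initialize_dr_api(token_key='tutorial'), based on the signature above.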
- rapa.utils.parsimony_performance_boxplot(project: Project, featurelist_prefix: str = 'RAPA Reduced to', starting_featurelist: Optional[str] = None, metric: Optional[str] = None, split: str = 'crossValidation', featurelist_lengths: Optional[list] = None)
Uses seaborn’s boxplot function to plot featurelist size vs performance for all models that use that featurelist prefix. There is a different boxplot for each featurelist length.
- Parameters
- featurelist_prefix: str, optional (default = ‘RAPA Reduced to’)
The desired prefix for the featurelists that will be used for plotting parsimony performance. Each featurelist will start with the prefix, include a space, and then end with the number of features in that featurelist
- starting_featurelist: str, optional (default = None)
The starting featurelist used for parsimony analysis. If None, only the featurelists with the desired prefix in featurelist_prefix will be plotted
- metric: str, optional (default = ‘AUC’ or ‘RMSE’) [classification and regression]
The metric used for plotting accuracy of models
- split: str, optional (default = ‘crossValidation’)
Which split’s performance to plot. Known options: [‘crossValidation’, ‘holdout’]; other splits may also be accepted.
- featurelist_lengths: list, optional (default = None)
A list of featurelist lengths to plot
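The featurelist naming convention described for featurelist_prefix (prefix, a space, then the number of features) makes it possible to recover featurelist lengths from names. The helpers `featurelist_name` and `featurelist_length` below are hypothetical and only illustrate that convention, together with how a featurelist_lengths filter could narrow the selection; they are not rapa functions.

```python
def featurelist_name(prefix, n_features):
    """Build a name following the '<prefix> <count>' convention."""
    return f"{prefix} {n_features}"


def featurelist_length(name, prefix="RAPA Reduced to"):
    """Return the feature count encoded in a name, or None if it does not match."""
    head = prefix + " "
    if not name.startswith(head):
        return None
    tail = name[len(head):]
    return int(tail) if tail.isdigit() else None


names = [
    "Informative Features",
    featurelist_name("RAPA Reduced to", 90),
    featurelist_name("RAPA Reduced to", 50),
    "Other Prefix 10",
]

# Keep only the names that follow the convention for the default prefix.
lengths = {name: featurelist_length(name) for name in names}
matching = [name for name, length in lengths.items() if length is not None]

# Optionally narrow the selection to specific lengths.
wanted_lengths = [50]
selected = [name for name in matching if lengths[name] in wanted_lengths]
print(matching)
print(selected)
```

Using a distinct prefix for each analysis keeps the featurelists of separate runs from being mixed together when they are looked up by prefix.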
- Returns