Feature Engineering · Beginner · Prompt 01
Extract all useful features from the date and datetime columns in this dataset.
For each date column, create:
- year, month, day, day_of_week (0=Monday), day_of_year
- quarter, week_of_year
- is_weekend (boolean)
- is_month_start, is_month_end (boolean)
- is_quarter_start, is_quarter_end (boolean)
- days_since_epoch (numeric, for ordinal encoding)
- If the column is a datetime: hour, minute, part_of_day (morning/afternoon/evening/night)
Also compute time-difference features if multiple date columns exist:
- days_between_[col1]_and_[col2] for all meaningful pairs
Return the feature creation code in pandas and a list of all new column names created.
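A minimal sketch of the pandas code this prompt asks for, covering the date-only features (the hour/minute/part-of-day additions for full datetimes would follow the same pattern). The column name `order_date` is a hypothetical example:

```python
import pandas as pd

def add_date_features(df: pd.DataFrame, col: str) -> list[str]:
    """Append calendar features derived from `col`; return the new column names."""
    d = pd.to_datetime(df[col])
    df[f"{col}_year"] = d.dt.year
    df[f"{col}_month"] = d.dt.month
    df[f"{col}_day"] = d.dt.day
    df[f"{col}_day_of_week"] = d.dt.dayofweek          # 0 = Monday
    df[f"{col}_day_of_year"] = d.dt.dayofyear
    df[f"{col}_quarter"] = d.dt.quarter
    df[f"{col}_week_of_year"] = d.dt.isocalendar().week.astype(int)
    df[f"{col}_is_weekend"] = d.dt.dayofweek >= 5
    df[f"{col}_is_month_start"] = d.dt.is_month_start
    df[f"{col}_is_month_end"] = d.dt.is_month_end
    df[f"{col}_is_quarter_start"] = d.dt.is_quarter_start
    df[f"{col}_is_quarter_end"] = d.dt.is_quarter_end
    df[f"{col}_days_since_epoch"] = (d - pd.Timestamp("1970-01-01")).dt.days
    return [c for c in df.columns if c.startswith(f"{col}_")]

df = pd.DataFrame({"order_date": ["2024-01-01", "2024-03-30"]})
new_cols = add_date_features(df, "order_date")
```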
Feature Engineering · Advanced · Prompt 02
Generate numeric features from the text columns in this dataset for use in a machine learning model.
For each text column:
1. Basic statistical features: character count, word count, sentence count, average word length, punctuation count
2. Lexical features: unique word ratio (vocabulary richness), stopword ratio, uppercase ratio
3. Sentiment features: positive score, negative score, neutral score, compound score using VADER
4. TF-IDF features: top 50 unigrams and top 20 bigrams (sparse matrix)
5. Dense embedding: use sentence-transformers (all-MiniLM-L6-v2) to produce a 384-dimensional embedding, then reduce to 10 dimensions using UMAP or PCA
Return code for each feature group as a modular function.
Note which features are suitable for tree models vs neural networks.
Feature Engineering · Advanced · Chain 05
Step 1: Profile the raw features - types, missing rates, cardinality, correlation with {{target_variable}}. Identify the weakest features (near-zero variance, low target correlation).
Step 2: Clean and encode - impute missing values, encode categoricals (ordinal for low-cardinality, target encoding for high-cardinality), scale numerics.
Step 3: Engineer new features - create interaction features, lag features if time-ordered, group aggregations, and domain-specific features based on the dataset context.
Step 4: Select features - use SHAP values from a quick LightGBM model to rank all features. Drop features whose SHAP importance falls below a threshold.
Step 5: Check for leakage - verify no feature uses future information, and check that no feature's correlation with the target is suspiciously close to perfect (>0.95).
Step 6: Output a final feature list with: name, description, type, importance rank, and the code to reproduce it end-to-end.
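The leakage screen in Step 5 can be sketched as a simple correlation filter. This is only the statistical half of the check (the "no future information" half needs domain knowledge); the column names here are hypothetical:

```python
import pandas as pd

def flag_suspicious_features(df: pd.DataFrame, target: str,
                             threshold: float = 0.95) -> list[str]:
    """Return numeric features whose |correlation| with the target exceeds the threshold."""
    corr = df.corr(numeric_only=True)[target].drop(target)
    return corr[corr.abs() > threshold].index.tolist()

df = pd.DataFrame({
    "target": [1.0, 2.0, 3.0, 4.0],
    "leaky":  [2.0, 4.0, 6.0, 8.0],   # deterministic function of the target
    "honest": [5.0, 1.0, 4.0, 2.0],
})
suspects = flag_suspicious_features(df, "target")
```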
Feature Engineering · Intermediate · Prompt 06
Create group-level aggregation features by computing statistics at the level of each categorical group.
For each meaningful categorical column in the dataset:
1. Group by that column and compute these statistics for each numeric column:
- mean, median, std, min, max
- count of rows in the group
- percentile rank of each row within its group
- deviation of each row from its group mean (row_value - group_mean)
- ratio of each row to its group mean (row_value / group_mean)
2. Name features systematically: [numeric_col]_[statistic]_by_[group_col]
Example: revenue_mean_by_region, revenue_rank_by_region
3. Flag any group with fewer than 10 members - statistics on tiny groups are unreliable
Return code using pandas groupby + transform, and a list of all features created.
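The groupby + transform pattern above can be sketched as follows, assuming a numeric column `revenue` grouped by a categorical `region` (both names hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "region":  ["east", "east", "west", "west"],
    "revenue": [100.0, 300.0, 50.0, 150.0],
})
g = df.groupby("region")["revenue"]
df["revenue_mean_by_region"]  = g.transform("mean")
df["revenue_std_by_region"]   = g.transform("std")
df["revenue_count_by_region"] = g.transform("count")
df["revenue_rank_by_region"]  = g.rank(pct=True)   # percentile rank within group
df["revenue_dev_by_region"]   = df["revenue"] - df["revenue_mean_by_region"]
df["revenue_ratio_by_region"] = df["revenue"] / df["revenue_mean_by_region"]
# Flag rows whose group is too small for reliable statistics
df["region_is_small_group"]   = df["revenue_count_by_region"] < 10
```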
Feature Engineering · Intermediate · Prompt 08
Create lag and rolling window features for this time-ordered dataset.
Assume the data is ordered by {{date_column}} with one row per {{entity_column}} per time period.
Create per entity:
- Lag features: value at t-1, t-2, t-3, t-7, t-14, t-28 periods back
- Rolling mean: 7-period, 14-period, 28-period window
- Rolling standard deviation: 7-period and 28-period window
- Rolling min and max: 7-period window
- Exponentially weighted moving average (alpha=0.3)
- Trend: slope of a linear regression fitted on the last 7 values
Critical: ensure no data leakage โ all features must use only information available at prediction time (strictly historical).
Return the feature creation code and confirm the leakage-free construction.
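A leakage-safe sketch of a few of the features above, assuming the frame is already sorted by date within each entity (column names hypothetical). The key move is applying `shift(1)` before any rolling window, so every window ends at t-1 and never includes the current row:

```python
import pandas as pd

df = pd.DataFrame({
    "entity": ["a"] * 5,
    "value":  [1.0, 2.0, 3.0, 4.0, 5.0],
})
g = df.groupby("entity")["value"]
df["lag_1"] = g.shift(1)
df["lag_2"] = g.shift(2)
# shift(1) first, then roll: each window uses strictly historical values
df["roll_mean_3"] = g.transform(lambda s: s.shift(1).rolling(3).mean())
df["roll_std_3"]  = g.transform(lambda s: s.shift(1).rolling(3).std())
df["ewma"]        = g.transform(lambda s: s.shift(1).ewm(alpha=0.3).mean())
```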
Feature Engineering · Beginner · Prompt 09
Implement missing value imputation for machine learning on this dataset.
1. Profile missing values: count, percentage, and missingness pattern (MCAR, MAR, or MNAR) for each column
2. Implement and compare three imputation strategies:
a. Simple imputation: median for numeric, mode for categorical
b. KNN imputation: k=5 nearest neighbors based on complete features
c. Iterative imputation (MICE): model each feature as a function of others, iterate until convergence
3. Evaluate each strategy by artificially masking 10% of known values and measuring reconstruction error (RMSE)
4. Add missingness indicator columns (is_missing_[col]) for columns with more than 5% missing - these can be predictive features
5. Always fit imputation on training data only, then apply to validation and test sets
Return: comparison table of imputation strategies, code for the best strategy, and list of missingness indicator columns created.
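Items 4 and 5 can be sketched with pandas alone, using the simple-imputation strategy (median for numeric). Column names are hypothetical; the point is that the medians are computed from the training frame only and then reused on the test frame:

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"age": [20.0, np.nan, 40.0, 30.0]})
test  = pd.DataFrame({"age": [np.nan, 25.0]})

# Indicator columns for anything with >5% missing (age is 25% missing in train)
for col in ["age"]:
    if train[col].isna().mean() > 0.05:
        train[f"is_missing_{col}"] = train[col].isna()
        test[f"is_missing_{col}"]  = test[col].isna()

# Fit on training data only, then apply the same statistics everywhere
medians = train.select_dtypes("number").median()
train = train.fillna(medians)
test  = test.fillna(medians)
```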
Feature Engineering · Advanced · Prompt 10
Create polynomial and spline features to capture non-linear relationships in this dataset.
1. Identify the top 5 numeric features by correlation with {{target_variable}}
2. For each, test whether the relationship is linear, quadratic, or higher-order:
- Fit linear, quadratic, and cubic regression
- Compare R² values and plot each fit
3. For features with non-linear relationships:
a. Add polynomial features (degree 2 and 3)
b. Add natural cubic spline features with 4 knots at the 25th, 50th, 75th, and 90th percentiles
4. Add the polynomial/spline features to the model and compare:
- CV score before adding
- CV score after adding
- Risk of overfitting (train vs val gap)
5. Use SHAP to verify the model is using the polynomial features meaningfully
Return: relationship type table, feature code, and CV performance comparison.