Breast Cancer Diagnosis with SVM in Python

Load the Breast Cancer Wisconsin dataset, train an SVM classifier, and visualize PCA-reduced decision regions using an AI data analyst.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the Breast Cancer Wisconsin dataset from scikit-learn and summarizes the class balance. It scales features, applies PCA to two components, and visualizes the 2D projection and decision regions. It trains an SVM classifier and reports accuracy, a classification report, and a confusion matrix.
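Condensed into one script, the steps this workflow performs look roughly like the sketch below (a minimal outline of the same load / scale / PCA / SVM / confusion-matrix sequence, omitting the plots; it is not the exact code any model produced):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Standardize, then project to 2 components (this is what gets plotted)
X_2d = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))

# Train the SVM on a held-out split, with scaling inside the pipeline
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", random_state=42))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)   # expected to land near 0.97-0.98
cm = confusion_matrix(y_test, y_pred)       # 2x2 matrix of the test errors
```

Scaling sits inside the pipeline so the test fold never influences the scaler, which is the leakage-avoidance point the conversation below makes explicit.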

Who this example is for

This is for data science learners and practitioners who want a compact, reproducible example of SVM classification on a standard medical dataset. It helps anyone who needs a reference for preprocessing, dimensionality reduction, and evaluation in a conversational, code-generating notebook workflow.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

  • Class balance table showing malignant 212 and benign 357
  • 2D PCA scatter plot with two visible clusters
  • SVM accuracy around 0.97 with a classification report
  • Confusion matrix plot with false positive and false negative counts

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

  • scikit-learn
  • pandas
  • numpy
  • matplotlib
  • seaborn

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

  1. load breast cancer dataset (from sklearn) and show class balance
  2. scale features and reduce to 2 components with PCA, plot the result
  3. train an SVM classifier and show accuracy and classification report
  4. plot the confusion matrix

Model Comparison

We compared several LLMs on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

Model Source        Score   Task   Execution   Output   Reasoning   Reliability   Run
gemma4:31b          10/10   2/2    2/2         3/3      2/2         1/1           Open gemma4:31b conversation
glm-5.1             10/10   2/2    2/2         3/3      2/2         1/1           Open glm-5.1 conversation
gpt-5.4             10/10   2/2    2/2         3/3      2/2         1/1           Open gpt-5.4 conversation
gpt-oss:120b        10/10   2/2    2/2         3/3      2/2         1/1           Open gpt-oss:120b conversation
qwen3-coder-next    10/10   2/2    2/2         3/3      2/2         1/1           Open qwen3-coder-next conversation
qwen3.5:397b       10/10   2/2    2/2         3/3      2/2         1/1           Open qwen3.5:397b conversation

How scoring works

Each run is graded on five dimensions:

  • Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
  • Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
  • Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
  • Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
  • Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so total score = 0-10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
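As a small illustration of the arithmetic, a perfect run's score card sums to the 10-point maximum (the dimension values below are the caps, not data from any particular run):

```python
# Hypothetical perfect score card; each dimension is capped at its maximum
scores = {
    "task_completion": 2,        # max 2
    "execution_correctness": 2,  # max 2
    "output_quality": 3,         # max 3
    "reasoning_quality": 2,      # max 2
    "reliability": 1,            # max 1
}
total = sum(scores.values())  # 2 + 2 + 3 + 2 + 1 = 10
```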

AI Analysis with gpt-5.4

This run uses AI with a step-by-step list of prompts to perform the analysis task Breast Cancer Diagnosis with SVM in Python. Below, you can review the full conversation generated with gpt-5.4. After the conversation, we show detailed score cards for this run.

load breast cancer dataset (from sklearn) and show class balance
Thinking

Considering class balance plotting

I'm trying to figure out why the class balance isn't plotting correctly. It seems like we can only display one chart or table per step. Maybe I could show the head of the dataset while mentioning the shape and columns separately through code. I think it’s best to load the dataset into a DataFrame, check and display its shape, columns, and head first, then move on to showing the class balance next time. Let's keep it concise!

Clarifying data display options

I'm thinking about how to display the data effectively. Maybe I could use a DataFrame and show the shape and columns, but I need to remember the default rule: one table per step. I can load the breast cancer dataset and display the DataFrame's head to keep things simple. I’ve sketched some code to load the data and show the shape, number of features, and columns, but I should focus on just displaying the head instead. Let's optimize this!

I’ll start by loading the sklearn breast cancer dataset into a DataFrame and doing the initial inspection first.

Load breast cancer data
from sklearn.datasets import load_breast_cancer
import pandas as pd

data = load_breast_cancer(as_frame=True)

X = data.data.copy()
y = data.target.copy()

df = X.copy()
df["target"] = y

df.head()
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension target
0 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 ... 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890 0
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 ... 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902 0
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 ... 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758 0
3 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 ... 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300 0
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 ... 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678 0

5 rows × 31 columns

The dataset is already in the notebook, and the preview confirms df has 569 rows with a target column. To answer class balance, we just need the target counts.

Count target values
df["target"].value_counts().rename(index={0: data.target_names[0], 1: data.target_names[1]})
target
benign       357
malignant    212
Name: count, dtype: int64

The sklearn breast cancer dataset is in the notebook as a table with 569 rows and 30 features, plus the target label.

The class balance is:

  • benign: 357
  • malignant: 212

That’s about:

  • 62.7% benign
  • 37.3% malignant

So the classes are somewhat imbalanced, with more benign cases than malignant.
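The percentages come straight from normalized value counts; one short way to reproduce them (assuming the same sklearn dataset, where target 0 is malignant and 1 is benign) is:

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)
df = data.frame  # features plus the 'target' column (0 = malignant, 1 = benign)

# normalize=True turns raw counts into class proportions
shares = df["target"].value_counts(normalize=True)
benign_pct = shares.loc[1] * 100     # 357 / 569 ≈ 62.7%
malignant_pct = shares.loc[0] * 100  # 212 / 569 ≈ 37.3%
```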

scale features and reduce to 2 components with PCA, plot the result
Thinking

Considering steps for PCA and plotting

I’m thinking I need to respond with one step only, focusing on scaling features and applying PCA—maybe plotting as well? The user asked to scale features and reduce to two components with PCA, then plot the result. It seems it's essential to follow a single, iterative process. While I could combine scaling and PCA with plotting, the developer’s guidance suggests one code block at a time. I need to decide if my first step should be to compute PCA and display its explained variance or to create the plot directly, given the user’s request.

Visualizing PCA projection

I'm focused on visualizing the PCA projection, and since the features are already loaded, I need to analyze the latest outputs considering class balance. Using seaborn or matplotlib, I should ensure there's a clear title and axis labels without multiple bare expressions. I can end the code block with something straightforward, but it seems like plt.show() isn't a bare expression. I should definitely make sure the target column isn't mutated, keeping X and y separate for scaling. I’ll confirm the target names are correctly mapped based on the data.

Clarifying target names and scaling

I remember the target names are ['malignant', 'benign'], and when I check the value counts, it should list malignant first if sorted in descending order. The output shows benign at 357 and malignant at 212, which is correct since 1 corresponds to benign. For plotting, I should use either pd.Categorical.from_codes(y, data.target_names) or mapping with y.map(...). Also, I need to mention scaling before applying PCA since the features are likely on different scales. Let’s keep this concise!

Since PCA is scale-sensitive, the next step is to standardize X, reduce it to 2 principal components, and plot those components colored by diagnosis.

Plot PCA components
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# scale features, then reduce to 2 principal components
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_scaled)

pca_df = pd.DataFrame(X_pca, columns=["PC1", "PC2"])
pca_df["target"] = y.map({i: name for i, name in enumerate(data.target_names)})

plt.figure(figsize=(8, 6))
sns.scatterplot(
    data=pca_df,
    x="PC1",
    y="PC2",
    hue="target",
    palette="Set1",
    alpha=0.8
)
plt.title(
    f"Breast Cancer PCA (2 components)\n"
    f"Explained variance: PC1={pca.explained_variance_ratio_[0]:.1%}, "
    f"PC2={pca.explained_variance_ratio_[1]:.1%}"
)
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.legend(title="Class")
plt.tight_layout()
plt.show()

The PCA plot was created successfully, so the 30 breast-cancer features were scaled and projected into 2 dimensions as requested. The chart title also includes the explained variance for PC1 and PC2, which is the key numeric summary needed to interpret the reduction.

One important interpretation point from the plotted result: the two classes are separated noticeably in the 2D PCA space, which means the first two principal components capture a meaningful amount of the structure related to diagnosis. At the same time, the point clouds are not perfectly disjoint, so a 2-component projection preserves a strong visual signal but does not fully separate benign and malignant cases.

Because the features were standardized first, variables measured on larger numeric scales did not dominate the PCA. That makes the 2D view a fair summary of the overall feature variation rather than a reflection of just a few high-magnitude columns.
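One number worth checking alongside the plot is how much total variance the two components actually retain. A short check, reusing the same standardize-then-PCA steps (a sketch, independent of the run above):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2).fit(X_scaled)
# Fraction of total variance captured by PC1 + PC2 combined
total_explained = pca.explained_variance_ratio_.sum()
```

For this dataset the first two standardized components together explain roughly 63% of the variance, which is consistent with the plot showing strong but not perfect class separation.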

train an SVM classifier and show accuracy and classification report
Thinking

Crafting classification report code

I need to use target_names in the classification report, and I think a clear code block and brief explanation will do. It should be runnable without any headers, ending with just one bare expression if necessary. I'll start with necessary imports from sklearn, integrate scalability into the pipeline, and make sure to include printing the accuracy score. It looks like target_names for the breast cancer dataset is right, showing benign and malignant accurately. Let's keep this concise and clear!

Since SVM is also scale-sensitive, the cleanest next step is to train it in a pipeline with StandardScaler, then evaluate on a held-out test set.

Train and evaluate SVM
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# scale inside the pipeline to avoid leakage
svm_model = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", random_state=42)
)

# train and evaluate
svm_model.fit(X_train, y_train)
y_pred = svm_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}\n")
print(classification_report(y_test, y_pred, target_names=data.target_names))
Accuracy: 0.9825

              precision    recall  f1-score   support

   malignant       0.98      0.98      0.98        42
      benign       0.99      0.99      0.99        72

    accuracy                           0.98       114
   macro avg       0.98      0.98      0.98       114
weighted avg       0.98      0.98      0.98       114

The SVM classifier performed very well on the held-out test set.

  • Accuracy: 0.9825
  • Test set size: 114 samples

From the classification report:

  • Malignant

    • Precision: 0.98
    • Recall: 0.98
    • F1-score: 0.98
  • Benign

    • Precision: 0.99
    • Recall: 0.99
    • F1-score: 0.99

What this means for your goal:

  • The model is correctly classifying about 98.25% of test cases.
  • Performance is strong for both classes, not just overall accuracy.
  • Recall for malignant cases is 0.98, which is especially important because it means the model is identifying nearly all cancer-positive cases in the test split.
  • The very similar precision, recall, and F1 values suggest the classifier is well balanced and not overly favoring one class.

So based on the observed notebook output, the trained scaled RBF SVM is an excellent classifier for this dataset.
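A single train/test split can be optimistic or pessimistic, so one way to sanity-check the 0.98 figure (not part of the original run, just a suggested follow-up) is stratified cross-validation over the whole dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# 5-fold stratified CV keeps the benign/malignant ratio in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
mean_accuracy = fold_scores.mean()
```

If the cross-validated mean lands close to the single-split accuracy, that supports the conclusion that the scaled RBF SVM genuinely performs well on this dataset rather than benefiting from a lucky split.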

plot the confusion matrix
Thinking

Generating confusion matrix

I need to respond using one step and one code block. The key here is to utilize the notebook state efficiently. I’ll plot a confusion matrix from the existing SVM model using the provided y_test and y_pred. Seaborn or Matplotlib will work well for this, as they're my default visualization tools. It’s important to keep things concise, so I’ll focus on a clear representation in one chart per step.

The SVM is already trained, so the next step is just to visualize its test predictions with a confusion matrix.

Plot confusion matrix
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(6, 4))
sns.heatmap(
    cm,
    annot=True,
    fmt="d",
    cmap="Blues",
    xticklabels=data.target_names,
    yticklabels=data.target_names
)
plt.title("SVM Confusion Matrix")
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.tight_layout()
plt.show()
Thinking

Clarifying output priorities

I need to base my answer on the notebook's state and instruction priorities, focusing on the confusion matrix plot. The latest output is visual-only, so I should print the underlying numeric values first. I'll provide brief insights from that output and include one minimal runnable code block to gather any missing evidence.

Since y_true and y_pred exist, I’ll print confusion matrix values with labels. The accuracy report indicates strong performance, but I want to avoid assumptions. Overall, I can mention that there are only a few misclassifications.

The confusion matrix plot is already present, and the preceding SVM results show very strong test performance: accuracy is 0.9825, with both classes at about 0.98 to 0.99 precision and recall. That means the matrix should be dominated by the diagonal, with only a small number of errors.

To make the plot interpretable in exact counts, here are the underlying confusion-matrix values as a labeled table:

Create confusion matrix dataframe
cm_df = pd.DataFrame(
    confusion_matrix(y_test, y_pred),
    index=[f"Actual: {name}" for name in data.target_names],
    columns=[f"Predicted: {name}" for name in data.target_names]
)

cm_df
                    Predicted: malignant   Predicted: benign
Actual: malignant                     41                   1
Actual: benign                         1                  71

The confusion matrix for the SVM shows only 2 misclassifications out of 114 test cases, which matches the strong accuracy seen earlier.

  • Actual malignant: 41 predicted correctly as malignant, 1 predicted as benign
  • Actual benign: 71 predicted correctly as benign, 1 predicted as malignant

So the matrix is:

  • Top-left: 41 true malignant predictions
  • Top-right: 1 malignant case missed
  • Bottom-left: 1 benign case flagged as malignant
  • Bottom-right: 71 true benign predictions

This means the plot is strongly concentrated on the diagonal, indicating the model separates the two classes very well.
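The four cells can also be unpacked by position rather than read off the plot. A small sketch using the counts reported above (`ravel()` flattens the matrix row by row, and with sklearn's label ordering, malignant is row and column 0):

```python
import numpy as np

# Confusion-matrix values reported in the run above
cm = np.array([[41, 1],
               [1, 71]])

# Row-major flattening: [malignant->malignant, malignant->benign,
#                        benign->malignant,    benign->benign]
true_malignant, missed_malignant, false_alarm, true_benign = cm.ravel()
misclassified = missed_malignant + false_alarm  # 2 errors out of 114 test cases
```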

Score cards for gpt-5.4 for Breast Cancer Diagnosis with SVM in Python

Total score: 10/10

  • Task Completion: 2/2

    Excellent

    All major requested steps were completed: class balance, PCA visualization, SVM training with accuracy/report, and confusion matrix.

  • Execution Correctness: 2/2

    Excellent

    The code shown is coherent and likely runnable, with appropriate imports, preprocessing, model training, and metric computation.

  • Output Quality: 3/3

    Excellent

    Outputs match the expected outcomes well: class counts are correct, PCA plot is produced, accuracy is 0.9825, and the confusion matrix includes FP/FN counts.

  • Reasoning Quality: 2/2

    Excellent

    The explanations correctly note scaling before PCA/SVM, interpret the PCA separation, and summarize model performance appropriately.

  • Reliability: 1/1

    Excellent

    The workflow is consistent and uses standard sklearn practices without obvious hallucinations or unsafe logic.

Try MLJAR Studio

Run the same type of AI-powered data analysis on your own datasets with conversational notebooks in MLJAR Studio.

Explore More AI Analysis Examples

Discover additional workflows across categories. Each example includes prompts, conversation outputs, and model-level scoring so you can compare approaches and results.