Machine Learning

Breast Cancer Diagnosis with SVM in Python

Load the Breast Cancer Wisconsin dataset, train an SVM classifier, and visualize PCA-reduced decision regions using an AI data analyst.

What

This AI Data Analyst workflow loads the Breast Cancer Wisconsin dataset from scikit-learn and summarizes the class balance. It scales features, applies PCA to two components, and visualizes the 2D projection and decision regions. It trains an SVM classifier and reports accuracy, a classification report, and a confusion matrix.

Who

This is for data science learners and practitioners who want a compact, reproducible example of SVM classification on a standard medical dataset. It helps anyone who needs a reference for preprocessing, dimensionality reduction, and evaluation in a conversational, code-generating notebook workflow.

Tools

  • scikit-learn
  • pandas
  • numpy
  • matplotlib
  • seaborn

Outcomes

  • Class balance table showing malignant 212 and benign 357
  • 2D PCA scatter plot with two visible clusters
  • SVM accuracy around 0.97 with a classification report
  • Confusion matrix plot with false positive and false negative counts

Quality Score

8/10

gpt-5.2 (mljar-ai)

Last scored: Apr 7, 2026

  • Task Completion: 2/2

    Excellent

    All requested steps are present: dataset load with class balance, scaling + 2D PCA plot, SVM training with accuracy and classification report, and a plotted confusion matrix.

  • Execution Correctness: 2/2

    Excellent

    Code is coherent and likely runnable end-to-end: imports are valid, variables are defined in order, and sklearn/pandas/seaborn usage is correct (including stratified split and pipeline scaling).

  • Output Quality: 2/3

    Good

    Outputs match expected outcomes: class counts (357 benign, 212 malignant), PCA scatter plot produced, accuracy 0.974 (~0.97), and confusion matrix plot generated; however, the assistant’s final confusion-matrix narrative is truncated/incomplete in the provided evidence.

  • Reasoning Quality: 1/2

    Needs work

    Reasoning is mostly sound (notes scaling importance, avoids leakage via pipeline), but it includes some speculative commentary about PCA separation without referencing concrete observed plot details, and the confusion-matrix explanation is cut off.

  • Reliability: 1/1

    Excellent

    Workflow is reasonably robust (pipeline prevents leakage; explicit labels for confusion matrix align with string targets), with no hallucinated APIs; minor fragility comes from relying on string label ordering but it is handled via explicit labels.

load breast cancer dataset (from sklearn) and show class balance
Thinking

Considering dataset loading

I need to explain briefly how to load and inspect a dataset. I'm thinking we should show the class balance as a final bare expression, but only one table is needed. It could make sense to return code that loads the dataset, builds a DataFrame, and displays its shape, columns, and head. However, since the user wants to focus on class balance, I should prioritize that and maybe include a final balance table after loading. I'll aim for a clear, minimal approach.

Clarifying outputs in the notebook

I realize that the notebook can handle multiple outputs, so if we need several tables, we can show them sequentially using the display function. In this case, I'll focus on displaying the class balance alongside the shape and columns as tuple or list outputs, without any comma-separated expressions. I think it’s acceptable to end with one final bare expression, like the class balance table. So, I’ll load the dataset, create the DataFrame, display its shape, columns, and head, then show the class balance as the last step.

Let’s start by loading the sklearn breast cancer dataset into a DataFrame and showing the class balance as a small summary table.

Inspect dataset and class balance
from sklearn.datasets import load_breast_cancer
import pandas as pd
from IPython.display import display

data = load_breast_cancer(as_frame=True)

df = data.frame.copy()
df["target_name"] = df["target"].map({i: name for i, name in enumerate(data.target_names)})

# Only the last bare expression in a cell renders; use display() for each
display(df.shape)
display(df.columns.tolist())
display(df.head())

balance = (
    df["target_name"]
    .value_counts()
    .rename_axis("class")
    .reset_index(name="count")
)

balance["percent"] = (balance["count"] / balance["count"].sum() * 100).round(2)

balance
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension target target_name
0 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 ... 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890 0 malignant
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 ... 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902 0 malignant
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 ... 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758 0 malignant
3 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 ... 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300 0 malignant
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 ... 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678 0 malignant

5 rows × 32 columns

class count percent
0 benign 357 62.74
1 malignant 212 37.26

The sklearn breast cancer dataset is loaded, and the class balance is moderately skewed toward benign cases.

Observed class distribution:

  • Benign: 357 samples, 62.74%
  • Malignant: 212 samples, 37.26%

This means the dataset is not perfectly balanced, but it is not extremely imbalanced either. If you later build a classifier, this mild imbalance is worth keeping in mind when evaluating performance.
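One standard way to account for this mild imbalance is to stratify any train/test split so that both subsets keep the same benign/malignant ratio. A minimal sketch of that check (the 80/20 split and `random_state=42` are illustrative choices, matching the split used later in this workflow):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer(as_frame=True)
X = data.data
y = data.target.map(dict(enumerate(data.target_names)))  # 0 -> malignant, 1 -> benign

# stratify=y keeps the ~62.7% / 37.3% class ratio in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("full:", y.value_counts(normalize=True).round(3).to_dict())
print("train:", y_train.value_counts(normalize=True).round(3).to_dict())
print("test:", y_test.value_counts(normalize=True).round(3).to_dict())
```

Without stratification, a small test set can drift several points away from the true class ratio, which distorts accuracy comparisons.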

scale features and reduce to 2 components with PCA, plot the result
Thinking

Clarifying PCA plotting steps

I’m noticing that the notebook state has a dataframe that’s overwritten—variables show a shape of 144x2 with Month and #Passengers, which aren't related to my current task. The output is from a breast cancer dataset, but now my focus has shifted to plotting PCA. I need to load the breast cancer dataset again or use a previous cell. I’ll follow a single-step approach to plot PCA by loading the dataset, scaling features, and plotting them with seaborn while excluding the target.

Planning a scatter plot

I'm thinking about creating a scatter plot that focuses on only one chart. I want to include a title that mentions the explained variance, which adds context to the plot. It’s important to keep everything concise, so I’ll ensure the title is clear without being overly wordy. This should make the visual representation effective and informative. Let's keep it straightforward while delivering the key information!

Let’s reload the breast cancer data cleanly, standardize the feature columns, reduce to 2 PCA components, and plot them by class.

Plot PCA components
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = load_breast_cancer(as_frame=True)

df_bc = data.frame.copy()
df_bc["target_name"] = df_bc["target"].map({i: name for i, name in enumerate(data.target_names)})

X = df_bc.drop(columns=["target", "target_name"])
y = df_bc["target_name"]

X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

plot_df = pd.DataFrame(X_pca, columns=["PC1", "PC2"])
plot_df["class"] = y.values

plt.figure(figsize=(8, 6))
sns.scatterplot(data=plot_df, x="PC1", y="PC2", hue="class", alpha=0.8)
plt.title(
    f"Breast Cancer PCA (2 Components)\n"
    f"Explained variance: PC1={pca.explained_variance_ratio_[0]:.2%}, "
    f"PC2={pca.explained_variance_ratio_[1]:.2%}"
)
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.legend(title="Class")
plt.tight_layout()
plt.show()

The PCA scatter plot was created without any visible error, so the scaling and 2-component reduction step appears to have worked. To support the plot with the actual numeric PCA results, here are the explained-variance values and a few transformed rows.

Display PCA summary
from IPython.display import display

pca_summary = pd.DataFrame({
    "component": ["PC1", "PC2"],
    "explained_variance_ratio": pca.explained_variance_ratio_,
    "explained_variance_percent": (pca.explained_variance_ratio_ * 100).round(2)
})

display(pca_summary)
display(plot_df.head())
component explained_variance_ratio explained_variance_percent
0 PC1 0.442720 44.27
1 PC2 0.189712 18.97
PC1 PC2 class
0 9.192837 1.948583 malignant
1 2.387802 -3.768172 malignant
2 5.733896 -1.075174 malignant
3 7.122953 10.275589 malignant
4 3.935302 -1.948072 malignant

The PCA result supports the original request well.

The features were standardized before PCA, which is the right choice here because the breast cancer variables are on very different scales. After scaling, PCA reduced the data to 2 components and produced coordinates for each sample in plot_df (PC1, PC2, and class label).

From the observed outputs:

  • PC1 explains 44.27% of the variance
  • PC2 explains 18.97% of the variance

Together, the first two components capture 63.24% of the total variation in the standardized feature space. That means the 2D PCA plot preserves a substantial amount of the structure in the original 30-dimensional data.

The transformed rows shown in plot_df.head() confirm that each observation was successfully mapped into the new 2-component space. The scatter plot colored by class is therefore a valid 2D PCA projection of the scaled data.

Interpreting the plot: with more than 60% of variance retained in the first two components, the visualization should give a meaningful summary of class structure, though not a perfect separation since over a third of the variance remains outside these two axes.
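If you want to know how many components it would take to retain more of the variance, `PCA` accepts a float `n_components`, meaning "keep enough components to reach this fraction of variance." A short sketch; the 95% threshold here is an illustrative choice, not part of the original workflow:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Float n_components: keep enough components for 95% of the variance
pca_95 = PCA(n_components=0.95).fit(X_scaled)
print("components needed for 95% variance:", pca_95.n_components_)

# Cumulative variance curve; index 1 is the two-component total (~63%)
cumvar = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)
print("variance retained by first two components:", round(cumvar[1], 4))
```

This makes the trade-off explicit: two components are enough for a useful picture, but a higher-fidelity representation of this dataset needs more dimensions.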

train an SVM classifier and show accuracy and classification report
Thinking

Clarifying SVM training steps

I'm considering how to present the results. There’s a possibility I won’t include a share prompt, and maybe I’ll just display the accuracy and report data directly. My focus is on the current step, so I’ll train an SVM using the original 30 features and scale them in a pipeline. I won't use the PCA coordinates for training unless that's explicitly requested. I’ll be importing necessary libraries for the process.

I’d train the SVM on the original 30 features, with scaling inside a pipeline. I’m not using the 2 PCA components for the classifier here, because PCA was for visualization and the full feature set usually gives better classification performance.

Train and evaluate SVM
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
from IPython.display import display
import pandas as pd

X = df_bc.drop(columns=["target", "target_name"])
y = df_bc["target_name"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

svm_model = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", random_state=42)
)

svm_model.fit(X_train, y_train)
y_pred = svm_model.predict(X_test)

accuracy_df = pd.DataFrame({
    "metric": ["accuracy"],
    "value": [accuracy_score(y_test, y_pred)]
})

report_df = pd.DataFrame(
    classification_report(y_test, y_pred, output_dict=True)
).T

display(accuracy_df)
display(report_df)
metric value
0 accuracy 0.973684
precision recall f1-score support
benign 0.960000 1.000000 0.979592 72.000000
malignant 1.000000 0.928571 0.962963 42.000000
accuracy 0.973684 0.973684 0.973684 0.973684
macro avg 0.980000 0.964286 0.971277 114.000000
weighted avg 0.974737 0.973684 0.973465 114.000000

The SVM classifier performed very strongly on the test set.

  • Accuracy: 97.37%

From the classification report:

  • Benign

    • Precision: 0.96
    • Recall: 1.00
    • F1-score: 0.98
    • Support: 72
  • Malignant

    • Precision: 1.00
    • Recall: 0.93
    • F1-score: 0.96
    • Support: 42

What this means:

  • The model correctly classified almost all test samples.
  • It identified all benign cases in the test set, giving benign recall of 1.00.
  • For malignant cases, precision was perfect at 1.00, so every sample predicted as malignant was actually malignant.
  • Malignant recall of 0.93 means a small number of malignant cases were missed.

Overall, this is a strong SVM result, with balanced performance across both classes and especially high reliability in its malignant predictions.
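A single 80/20 split can be lucky or unlucky, so one way to hedge the ~97% figure is cross-validation over the same scaler-plus-SVM pipeline. A sketch, assuming the same pipeline as above; 5 folds is a common but arbitrary choice:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Scaling lives inside the pipeline, so each fold is scaled on its own
# training portion only (no leakage into the validation fold)
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

scores = cross_val_score(pipe, X, y, cv=5)
print("fold accuracies:", scores.round(3))
print("mean accuracy:", scores.mean().round(3), "+/-", scores.std().round(3))
```

If the cross-validated mean sits close to the single-split accuracy, the 97.37% figure is not just an artifact of one fortunate split.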

plot the confusion matrix

Here’s the confusion matrix for the SVM predictions. Based on the classification report, it should show very few errors.

Plot confusion matrix
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

labels = ["benign", "malignant"]
cm = confusion_matrix(y_test, y_pred, labels=labels)

plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=labels, yticklabels=labels)
plt.title("SVM Confusion Matrix")
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.tight_layout()
plt.show()
Thinking

Analyzing confusion matrix requirements

I need to decide between two options for the user: whether to plot the confusion matrix or not. The latest output is a chart-only confusion matrix, but previous outputs provide a numeric classification report. I’m wondering if that’s enough since the original goal was to visualize the data. I think it’s important to have both a valid chart and numeric values supporting the conclusion. I can mention that the confusion matrix appears to have been plotted successfully, but I should clarify the counts inferred.

Deciding between options for output

I’m thinking we should choose option B because, for a chart-only output, the code needs to display the underlying values before finalizing. Since the latest output is chart-only and doesn’t show confusion matrix values, I need to provide a minimal runnable code block that's different from the last executed step. I can include a DataFrame to display the confusion matrix and mention insights from the classification report, like accuracy and recall values, while avoiding unnecessary assumptions.

The confusion matrix plot is already showing a strong result, and the earlier metrics support that: accuracy is about 97.37%, with perfect recall for benign and slightly lower recall for malignant. That means the matrix should contain very few off-diagonal errors, with any mistakes most likely coming from malignant cases predicted as benign.

To show the exact counts behind the plotted matrix, display it as a small table:

Create confusion matrix dataframe
cm_df = pd.DataFrame(
    cm,
    index=["Actual benign", "Actual malignant"],
    columns=["Predicted benign", "Predicted malignant"]
)

cm_df
Predicted benign Predicted malignant
Actual benign 72 0
Actual malignant 3 39
Thinking

Crafting final answer

I need to provide an answer based on the notebook state, as the user originally requested a plot for the confusion matrix. Analyzing the latest outputs indicates that the latest output is just a chart and numeric values in the output-6 table. I should give a concise final answer, including a confusion matrix interpretation. This means mentioning the counts: 72 benign correctly predicted, 39 malignant correctly predicted, 3 malignant misclassified as benign, and no benign misclassified. I’ll also include accuracy.

The confusion matrix shows the SVM is performing very well on this test set:

  • Benign: 72 correctly predicted as benign, 0 misclassified as malignant
  • Malignant: 39 correctly predicted as malignant, 3 misclassified as benign

So the model made 111 correct predictions out of 114, which matches the earlier accuracy of about 97.37%.

The main pattern in the matrix is that errors are rare, and when they happen they are malignant cases predicted as benign. There were no false alarms for benign cases.
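In a screening setting, malignant cases predicted as benign (false negatives) are the costly errors, so it can be worth trading a little precision for higher malignant recall. One hedged way to do that is the SVM's `class_weight` parameter; the `{0: 3, 1: 1}` weighting below is an arbitrary illustration, and whether any of these settings actually helps would need checking on the real split:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import recall_score

X, y = load_breast_cancer(return_X_y=True)  # sklearn encodes 0 = malignant
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

results = {}
for weight in [None, "balanced", {0: 3, 1: 1}]:  # {0: 3} up-weights malignant
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", class_weight=weight))
    model.fit(X_tr, y_tr)
    # Malignant recall: the fraction of true cancers the model catches
    results[str(weight)] = recall_score(y_te, model.predict(X_te), pos_label=0)

for name, rec in results.items():
    print(f"class_weight={name}: malignant recall = {rec:.3f}")
```

Heavier malignant weights push the decision boundary toward fewer missed cancers, usually at the cost of a few more benign cases flagged as malignant.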
