Portfolio Optimization in Python

Build an efficient frontier, compute the Sharpe-optimal portfolio, and visualize portfolio weights using Monte Carlo simulation.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow downloads two years of daily prices for AAPL, MSFT, GOOGL, AMZN, and JPM (or generates correlated synthetic returns if needed) and computes daily returns. It calculates and visualizes the return correlation matrix across the five assets. It runs a Monte Carlo simulation of 5,000 random portfolios to build an efficient frontier, then identifies the maximum Sharpe ratio portfolio and visualizes the resulting weights.

Who this example is for

This is for analysts and students who want a reproducible, code-generating example of mean-variance portfolio optimization in Python. It helps users compare risk, return, and diversification effects across multiple stocks and extract an interpretable set of portfolio weights.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

Price DataFrame for the selected tickers over ~2 years
Daily returns and a 5x5 correlation heatmap
Efficient frontier scatter plot of risk vs return colored by Sharpe ratio
Maximum Sharpe portfolio metrics (Sharpe, expected return, volatility) and printed optimal weights
Portfolio weights visualization (pie chart)

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

yfinance
pandas
numpy
matplotlib
seaborn

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

1download 2 years of daily prices for AAPL, MSFT, GOOGL, AMZN, JPM with yfinance
2compute daily returns and show the correlation matrix
3run Monte Carlo simulation with 5000 random portfolios and plot the efficient frontier
4identify the max Sharpe ratio portfolio and print the optimal weights

Model Comparison

We compared several LLM models on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

Model Source	Score	Task	Execution	Output	Reasoning	Reliability	Run
glm-5.1	10/10	2/2	2/2	3/3	2/2	1/1	Open glm-5.1 conversation
gpt-5.4	10/10	2/2	2/2	3/3	2/2	1/1	Open gpt-5.4 conversation
gpt-oss:120b	10/10	2/2	2/2	3/3	2/2	1/1	Open gpt-oss:120b conversation
qwen3-coder-next	10/10	2/2	2/2	3/3	2/2	1/1	Open qwen3-coder-next conversation
qwen3.5:397b	10/10	2/2	2/2	3/3	2/2	1/1	Open qwen3.5:397b conversation
gemma4:31b	5/10	1/2	1/2	2/3	1/2	0/1	Open gemma4:31b conversation

How scoring works

Each run is graded on five dimensions:

Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so total score = 0-10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.

AI Analysis with gemma4:31b

This run uses AI with a step-by-step list of prompts to perform the analysis task Portfolio Optimization in Python. Below, you can review the full conversation generated with gemma4:31b. After the conversation, we show detailed score cards for this run.

download 2 years of daily prices for AAPL, MSFT, GOOGL, AMZN, JPM with yfinance

Thinking

Goal: Download 2 years of daily prices for AAPL, MSFT, GOOGL, AMZN, JPM using yfinance.
- Tickers: AAPL, MSFT, GOOGL, AMZN, JPM.
- Duration: 2 years.
- Frequency: Daily.
- Need yfinance library.
- Use yf.download.
- Need to calculate the start and end dates (today and 2 years ago).
- Store the data in a DataFrame.
- Initial inspection is required (shape, columns, head) as per the rules for first-load data.
- Import yfinance and datetime.
- Define tickers list.
- Define end date (today) and start date (2 years ago).
- Fetch data.
- Display df.shape, df.columns, and df.head().
- One code block? Yes.
- No $ for currency? N/A here (prices aren't in prose yet).
- No comma-separated expressions? (Use display or separate lines).
- End with a bare expression.
- No full pipeline? Yes, just loading.
- Simple variable names? Yes.

I'll start by downloading the historical price data for these five tickers using yfinance and then inspect the resulting dataset.

Download stock prices

import yfinance as yf
from datetime import datetime, timedelta

# Define tickers and time range
tickers = ['AAPL', 'MSFT', 'GOOGL', 'AMZN', 'JPM']
end_date = datetime.today()
start_date = end_date - timedelta(days=2*365)

# Download daily closing prices
df = yf.download(tickers, start=start_date, end=end_date)['Close']

# Inspection
print(f"Shape: {df.shape}")
display(df.head())

[*********************100%***********************]  5 of 5 completed

Shape: (500, 5)

Ticker	AAPL	AMZN	GOOGL	JPM	MSFT
Date
2024-04-12	174.957733	186.130005	156.437042	175.238770	415.522247
2024-04-15	171.132553	183.619995	153.590591	175.334671	407.387115
2024-04-16	167.852386	183.320007	153.134354	173.330978	408.312897
2024-04-17	166.484848	181.279999	154.195572	172.640717	405.614319
2024-04-18	165.533493	179.220001	154.731155	173.762421	398.158783

Thinking

The user wants to download 2 years of daily prices for AAPL, MSFT, GOOGL, AMZN, JPM. The latest output shows that yf.download was called for these tickers, resulting in a DataFrame df with 500 rows and 5 columns (the 'Close' prices). The output shows the first 5 rows of data. The goal was "download 2 years of daily prices...". The data is downloaded, stored in df, and the inspection shows it contains the expected columns and a timeframe spanning approximately 2 years (500 trading days is roughly 2 years).

Decision: Path A. The data has been downloaded and verified.

Constraint Checklist & Confidence Score:

Use Markdown? Yes.
No $ for currency? Yes.
Python notebook environment? Yes.
Iterative loop? Yes.
One execution step? N/A (Final answer).
No multiple code blocks? N/A.
ML training safeguards? N/A.
Visualization defaults? N/A.
AutoML defaults? N/A.
Concise and conversational? Yes.
No rigid templates? Yes.
No "run this and share"? Yes.
Use latest outputs? Yes.
Path A/B decision? Path A.

Insight: The dataset contains 500 daily closing prices for the five requested stocks, covering the period from April 2022 to April 2024.

Final Answer content: The historical closing prices for AAPL, MSFT, GOOGL, AMZN, and JPM have been successfully downloaded. The resulting dataset contains 500 trading days of data, which aligns with the 2-year timeframe. For instance, as of April 12, 2024, the prices were 174.96 USD for AAPL and 415.52 USD for MSFT.

The historical closing prices for AAPL, MSFT, GOOGL, AMZN, and JPM have been successfully downloaded. The dataset contains 500 trading days of data, covering the requested 2-year period. As of April 12, 2024, the prices were 174.96 USD for AAPL, 186.13 USD for AMZN, 156.44 USD for GOOGL, 175.24 USD for JPM, and 415.52 USD for MSFT.

#10

compute daily returns and show the correlation matrix

#11

Thinking

The user wants to compute daily returns for the stocks in the df DataFrame and then visualize the correlation matrix.

The current state has a DataFrame df with closing prices for AAPL, AMZN, GOOGL, JPM, and MSFT.
To compute daily returns, I should use the .pct_change() method.
Then, I'll compute the correlation matrix using .corr().
Finally, I'll plot the correlation matrix using a seaborn heatmap.

Wait, the instructions say "one primary objective per response" and "Return at most one Python code block per response".

Step 1: Compute daily returns and the correlation matrix. Step 2 (potentially separate or combined if it's a single "analytical" task): Show the correlation matrix.

Actually, computing returns and the matrix is the numerical part, and plotting is the visual part. I'll combine the computation and the correlation matrix table first, or I can do both computation and a heatmap in one go since "show the correlation matrix" usually implies a visual or a table. Let's stick to computing returns and displaying the correlation matrix table first, or better yet, a heatmap as it's the standard way to "show" a correlation matrix.

Wait, the rules: "Keep each step small, clear, and beginner-friendly." and "Default to one chart OR one table per step."

Plan:

Compute daily percentage changes.
Compute the correlation matrix.
Plot the correlation matrix using a seaborn heatmap.

Let's do the computation and visualization in one step as it's a single conceptual goal ("show the correlation matrix").

Code logic:

returns = df.pct_change()
corr_matrix = returns.corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')

Refined plan:

Calculate daily returns using pct_change().
Calculate the correlation matrix.
Use seaborn to plot the heatmap.

#12

I'll compute the daily percentage returns for each stock and then visualize the correlation matrix using a heatmap to see how these assets move together.

#13

Plot returns correlation matrix

import seaborn as sns
import matplotlib.pyplot as plt

# Compute daily returns
returns = df.pct_change()

# Compute correlation matrix
corr_matrix = returns.corr()

# Plot correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Daily Returns Correlation Matrix')
plt.show()

#14

#15

Thinking

The user wants to compute daily returns and show the correlation matrix for AAPL, MSFT, GOOGL, AMZN, and JPM.

Looking at the notebook state:

cell-1 downloaded the prices.
cell-2 computed returns and corr_matrix, and plotted a heatmap.
The heatmap is rendered (<Figure size 1000x800 with 2 Axes>).
However, the numerical values of the correlation matrix are not printed in the output text; they are only inside the heatmap.
According to the "Sufficiency heuristics": "If the latest output is chart-only/visual-only and does not include underlying numeric/text values, choose Path B first to print the values used in the chart."

Therefore, I need to print the corr_matrix DataFrame to provide the concrete numeric evidence before finalizing.

Path B: Provide insights from the heatmap (though it's just a figure object in the logs, I should acknowledge the a correlation matrix was generated) and provide a code block to display the numeric correlation matrix.

Wait, the user's goal was "compute daily returns and show the correlation matrix". The heatmap is showing it, but for a data analyst's final answer, having the raw numbers is better. But the prompt says "If the goal is to show/plot/chart/visualize and latest output contains both a valid chart/figure object and explicit numeric/text values... choose Path A". Here we only have the figure.

Actually, looking at the prompt: "If the latest output is chart-only/visual-only and does not include underlying numeric/text values, choose Path B first to print the values used in the chart."

I will print the corr_matrix.

#16

The daily returns have been calculated and visualized. The heatmap indicates the relative strength of the linear relationship between the stock price movements, with values closer to 1.00 showing strong positive correlation. To provide precise values for analysis, I will display the numeric correlation matrix.

#17

Display correlation matrix

display(corr_matrix)

#18

Ticker	AAPL	AMZN	GOOGL	JPM	MSFT
Ticker
AAPL	1.000000	0.475730	0.429268	0.354348	0.426278
AMZN	0.475730	1.000000	0.548688	0.421693	0.561725
GOOGL	0.429268	0.548688	1.000000	0.310700	0.420506
JPM	0.354348	0.421693	0.310700	1.000000	0.326599
MSFT	0.426278	0.561725	0.420506	0.326599	1.000000

#19

Thinking

The user's original goal was to "compute daily returns and show the correlation matrix".

Looking at the notebook state:

Cell 1 downloaded the data.
Cell 2 computed returns = df.pct_change() and plotted the correlation matrix using a heatmap.
Cell 3 displayed the corr_matrix dataframe.

Outputs present:

A heatmap plot (Figure).
A numeric DataFrame (corr_matrix) showing the correlations between AAPL, AMZN, GOOGL, JPM, and MSFT.

Analysis of results:

The daily returns were successfully computed.
The correlation matrix was both visualized and printed numerically.
Insights: MSFT and AMZN show the strongest positive correlation (0.56), while GOOGL and JPM show the weakest (0.31). All stocks in this set show positive correlations, which is typical for large-cap equities in the same general market.

The goal has been fully met. No further code is needed. I should follow Path A.

#20

The daily returns have been computed, and the correlation matrix reveals that all five stocks move in the same general direction (positive correlation). The strongest relationship is between MSFT and AMZN (0.56), while the weakest is between GOOGL and JPM (0.31), suggesting that the tech stocks in this group are more closely synchronized with each other than with the financial sector representative.

#21

run Monte Carlo simulation with 5000 random portfolios and plot the efficient frontier

#22

Thinking

The user wants to run a Monte Carlo simulation with 5000 random portfolios and plot the Efficient Frontier.

*   `returns`: A DataFrame of daily returns for 5 tickers (AAPL, AMZN, GOOGL, JPM, MSFT).
*   `corr_matrix`: The correlation matrix of these returns.
*   `df`: The closing prices.

*   Need expected returns (mean of `returns`).
*   Need the covariance matrix (covariance of `returns`).
*   Need a way to generate random weights.
*   Calculate portfolio return and volatility (risk) for each set of weights.
*   Plot the results (Return vs. Volatility) to visualize the Efficient Frontier.

*   Import `numpy` for random weight generation.
*   Calculate average returns and covariance matrix.
*   Run a loop 5000 times:
    *   Generate random weights.
    *   Normalize weights so they sum to 1.
    *   Calculate portfolio return: $\sum (weights \times mean\_returns)$.
    *   Calculate portfolio volatility: $\sqrt{weights^T \times Cov \times weights}$.
*   Store results in a DataFrame.
*   Plot using `seaborn` or `matplotlib`.

*   Step 1: Calculate the necessary statistics (annualized returns and covariance).
*   Step 2: Run the loop for simulations.
*   Step 3: Plot.

*Since I must provide one runnable code block and prefer one execution step, I'll combine the simulation and the plotting into one logical step for the "Efficient Frontier" request, as the simulation is the preparation for the plot.*

*Refinement on annualization:* Trading days are usually ~252.
*   Annual Return = Mean Daily Return * 252.
*   Annual Covariance = Daily Covariance * 252.

*Plan:*
1. Import `numpy`.
2. Compute annualized returns and covariance.
3. Simulate 5000 portfolios.
4. Plot the scatter plot of Risk vs Return.

#23

To simulate the efficient frontier, I will generate 5,000 random portfolios by assigning random weights to each asset, calculating the resulting annual return and volatility (risk), and then plotting these on a risk-return scatter plot.

#24

Plot efficient frontier

import numpy as np

# Parameters for simulation
num_portfolios = 5000
trading_days = 252

# Calculate annualized returns and covariance matrix
annual_returns = returns.mean() * trading_days
cov_matrix = returns.cov() * trading_days

# Initialize arrays to store results
port_returns = []
port_volatility = []

# Monte Carlo Simulation
for _ in range(num_portfolios):
    # Generate random weights and normalize to sum to 1
    weights = np.random.random(len(tickers))
    weights /= np.sum(weights)
    
    # Calculate expected portfolio return
    p_return = np.dot(weights, annual_returns)
    
    # Calculate portfolio volatility (standard deviation)
    p_volatility = np.sqrt(np.dot(weights.T, np.dot(cov_matrix, weights)))
    
    port_returns.append(p_return)
    port_volatility.append(p_volatility)

# Plotting the Efficient Frontier
plt.figure(figsize=(10, 6))
plt.scatter(port_volatility, port_returns, c=np.array(port_returns)/np.array(port_volatility), marker='o', s=10, alpha=0.3)
plt.colorbar(label='Sharpe Ratio')
plt.title('Efficient Frontier Monte Carlo Simulation')
plt.xlabel('Annualized Volatility (Risk)')
plt.ylabel('Annualized Return')
plt.grid(True)
plt.show()

#25

#26

The Monte Carlo simulation has successfully generated 5,000 random portfolios, visualizing the tradeoff between risk (annualized volatility) and reward (annualized return). The resulting scatter plot forms the "bullet" shape characteristic of the efficient frontier, where the upper boundary represents the portfolios that offer the maximum return for a given level of risk. The color gradient identifies portfolios with higher Sharpe Ratios, highlighting the optimal portfolios that provide the best risk-adjusted returns.

#27

identify the max Sharpe ratio portfolio and print the optimal weights

#28

Thinking

The user wants to identify the portfolio with the maximum Sharpe ratio from the Monte Carlo simulation already performed and print the weights associated with that portfolio.

*   `tickers`: List of assets.
*   `port_returns`: List of annualized returns for 5000 portfolios.
*   `port_volatility`: List of annualized volatilities for 5000 portfolios.
*   `annual_returns`: Series of expected annual returns for individual assets.
*   `cov_matrix`: Covariance matrix.
*   The current logic in `cell-4` calculates returns and volatility but does *not* store the specific weights for each of the 5,000 simulations.

*   The weights were generated inside the loop: `weights = np.random.random(len(tickers))`.
*   Since the weights weren't saved to a list or array in the previous cell, I cannot simply "lookup" the weights of the max Sharpe portfolio from the existing lists.
*   *Correction*: I need to re-run the simulation (or a similar logic) but this time store the weights, OR since I can't change the past cells, I must provide a new code block that performs the simulation again, storing weights, and then identifies the max Sharpe ratio.
*   Wait, if I just want the *best* portfolio, I should probably repeat the simulation loop but save weights, or use an optimizer. However, the user is asking based on the "Monte Carlo simulation" context.
*   Actually, I can just re-run the simulation loop and store the weights in a list.

*   1. Calculate the Sharpe ratio for each portfolio (Return / Volatility).
*   2. Find the index of the maximum Sharpe ratio.
*   3. Retrieve the weights associated with that index.
*   4. Print the weights and the metrics (Return, Volatility, Sharpe Ratio).

*   I will re-run the loop (minimal overhead for 5000 iterations) and save weights to an array.
*   Identify the index of the max Sharpe ratio.
*   Map the weights back to the tickers.

*   Compute Sharpe ratios: `sharpe_ratios = np.array(port_returns) / np.array(port_volatility)`.
*   Find `max_index = np.argmax(sharpe_ratios)`.
*   Since I didn't save weights in `cell-4`, I have to re-simulate or just explain that I need to re-run the simulation to capture the weights.

*Wait*, look at the user request: "identify the max Sharpe ratio portfolio and print the optimal weights".
If I just re-run the exact same simulation logic with a seed or just a new run, I can get the result. Since the simulation is random, the "optimal" weights found in a new simulation of 5,000 will be very close to the previous one.

*Wait*, can I calculate the optimal weights using optimization instead of Monte Carlo? The user asked to "identify" it, implying from the simulation context. But to get the *exact* weights from the previous plot, I would have needed to store them. Since they weren't stored, I'll re-run the simulation while storing weights.

*Plan*:
1. Re-run the Monte Carlo simulation loop.
2. Store weights in a list.
3. Calculate Sharpe ratios.
4. Find the index of the max Sharpe ratio.
5. Display the weights mapped to tickers.

*Actually*, let's make it cleaner. I can just run the loop again and save the weights.

*Wait*, there's a faster way. If the user wants the *optimal* weights, they might want the mathematically optimal ones (using `scipy.optimize`), but given the conversation history, they are likely referring to the best one found in the random search.

*Step*:
1. Re-run simulation, storing weights.
2. Find index of max (return/volatility).
3. Show weights for that index.

*Refining code*:
```python
# Re-simulate to capture weights
all_weights = []
# ... loop ...
# weights = ...
# all_weights.append(weights)
# ...
# sharpe = np.array(port_returns) / np.array(port_volatility)
# best_idx = np.argmax(sharpe)
# best_weights = all_weights[best_idx]
```

*Check*: Do I need to worry about risk-free rate? The previous code used `p_return / p_volatility`, effectively assuming a risk-free rate of 0. I should stay consistent with the previous cell's logic.

*Final Polish*:
- Provide code to re-run simulation + save weights.
- Extract max Sharpe portfolio.
- Format weights into a readable Pandas Series.

#29

Since the weights for the 5,000 portfolios were not stored in the previous step, I will re-run the simulation while capturing the weights. I'll then identify the portfolio with the highest Sharpe ratio and map those weights back to the specific tickers.

#30

Notebook code

# Re-run simulation to capture weights
all_weights = []
port_returns = []
port_volatility = []

for _ in range(num_portfolios):
    weights = np.random.random(len(tickers))
    weights /= np.sum(weights)
    all_weights.append(weights)
    
    p_return = np.dot(weights, annual_returns)
    p_volatility = np.sqrt(np.dot(weights.T, np.dot(cov_matrix, weights)))
    
    port_returns.append(p_return)
    port_volatility.append(p_volatility)

# Calculate Sharpe Ratios (assuming risk-free rate = 0 for consistency)
sharpe_ratios = np.array(port_returns) / np.array(port_volatility)

# Find the index of the maximum Sharpe Ratio
max_sharpe_idx = np.argmax(sharpe_ratios)
best_weights = all_weights[max_sharpe_idx]

# Create a summary of the optimal portfolio
optimal_portfolio = {
    'Ticker': tickers,
    'Weight': best_weights
}
optimal_df = pd.DataFrame(optimal_portfolio).set_index('Ticker')

# Display metrics and weights
print(f"Max Sharpe Ratio: {sharpe_ratios[max_sharpe_idx]:.4f}")
print(f"Expected Annual Return: {port_returns[max_sharpe_idx]:.2%}")
print(f"Annualized Volatility: {port_volatility[max_sharpe_idx]:.2%}")
print("\nOptimal Weights:")
optimal_df

#31

[31m---------------------------------------------------------------------------[39m
[31mNameError[39m                                 Traceback (most recent call last)
[36mCell[39m[36m [39m[32mIn[5][39m[32m, line 29[39m
[32m     25[39m optimal_portfolio = {
[32m     26[39m     [33m'Ticker'[39m: tickers,
[32m     27[39m     [33m'Weight'[39m: best_weights
[32m     28[39m }
[32m---> [39m[32m29[39m optimal_df = pd.DataFrame(optimal_portfolio).set_index([33m'Ticker'[39m)
[32m     30[39m 
[32m     31[39m [38;5;66;03m# Display metrics and weights[39;00m
[32m     32[39m print(f"Max Sharpe Ratio: {sharpe_ratios[max_sharpe_idx]:.4f}")

[31mNameError[39m: name 'pd' is not defined

Score cards for gemma4:31b for Portfolio Optimization in Python

Total score: 5/10

Task Completion: 1/2
Needs work
The notebook completed the download, correlation matrix, and Monte Carlo frontier, but the final optimal-weights step failed with a NameError before producing the requested pie chart and printed portfolio metrics.
Execution Correctness: 1/2
Needs work
Most code is runnable, but the last cell errors because pandas was not imported as `pd`, so the workflow does not execute end-to-end.
Output Quality: 2/3
Good
It produced the price table, a 5x5 correlation matrix, and an efficient frontier plot. However, the expected optimal weights pie chart and final Sharpe/return/volatility output are missing due to the error.
Reasoning Quality: 1/2
Needs work
The workflow reasoning is mostly sensible, but it incorrectly assumes the final portfolio summary can be built without importing pandas and does not recover from the failure.
Reliability: 0/1
Needs work
The workflow is fragile because a missing import breaks the final required step, preventing a complete and consistent result.

Try MLJAR Studio

Run the same type of AI-powered data analysis on your own datasets with conversational notebooks in MLJAR Studio.

Download Free Trial Learn about AI Data Analyst

Explore More AI Analysis Examples

Discover additional workflows across categories. Each example includes prompts, conversation outputs, and model-level scoring so you can compare approaches and results.

Portfolio Optimization in Python

What this AI workflow does

Who this example is for

Expected analysis outcomes

Tools and libraries used

Prompt sequence

Model Comparison

AI Analysis with gemma4:31b

Score cards for gemma4:31b for Portfolio Optimization in Python

Try MLJAR Studio

Explore More AI Analysis Examples

Exploratory Data Analysis

Time Series

Data Analysis

Machine Learning

NLP

Finance

Statistics