Spark Job Optimization AI Prompt
Use when a Spark job is slow, expensive, or unstable.
This prompt focuses on improving Spark jobs by diagnosing the real bottleneck before suggesting tuning changes. It helps teams reason about partitioning, skew, joins, caching, shuffle behavior, and cluster configuration in a disciplined way. The goal is not random tuning tips, but an optimization plan tied to runtime and cost impact.
Optimize this Spark job for performance, cost, and reliability.
Current job: {{job_description}}
Current runtime: {{current_runtime}}
Current cost: {{current_cost}}
1. Diagnose first with Spark UI:
- Identify stages with the longest duration
- Check for data skew: are some partitions processing 10× more data than others?
- Check for shuffle volume: large shuffles are the most common performance killer
- Check for spill: memory spill to disk indicates insufficient executor memory
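The Spark UI checks above can be backed by a quick numeric test: compare the largest partition to the mean partition size. This is a minimal pure-Python sketch; the commented `glom` line assumes PySpark and a DataFrame named `df`.

```python
def skew_ratio(partition_sizes):
    """Ratio of the largest partition to the mean partition size.
    Values well above ~5-10 indicate significant data skew."""
    mean = sum(partition_sizes) / len(partition_sizes)
    return max(partition_sizes) / mean

# With PySpark, per-partition record counts can be collected like this:
#   sizes = df.rdd.glom().map(len).collect()
#   skew_ratio(sizes)
```

A ratio near 1 means evenly sized partitions; a ratio around 10 matches the "10× more data" symptom described above.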
2. Partitioning optimization:
- Repartition before a join or aggregation to an appropriate partition count: num_partitions = total_data_size_GB × 1024 / 128 (≈ 8 partitions per GB, targeting ~128 MB per partition)
- Use repartition(n, key_column) to co-locate related records and reduce shuffle
- Use coalesce() to reduce partition count before writing (avoids full shuffle)
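The sizing rule above can be wrapped in a small helper. The function itself is plain Python; the commented `repartition` call assumes PySpark and a DataFrame named `df`.

```python
import math

def target_partitions(total_size_gb, partition_mb=128):
    """Partition count that yields roughly partition_mb-sized partitions."""
    return max(1, math.ceil(total_size_gb * 1024 / partition_mb))

# A 50 GB shuffle at ~128 MB per partition:
#   target_partitions(50)  -> 400
# PySpark usage (sketch):
#   df = df.repartition(target_partitions(50), "join_key")
```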
3. Join optimization:
- Broadcast join: use for any table < {{broadcast_threshold_mb}}MB — eliminates the shuffle for that join entirely
- Sort-merge join (default): ensure both sides are partitioned and sorted on the join key
- Skew join: handle skewed keys by salting (append a random prefix to the key)
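The broadcast decision above can be sketched as a helper. The 100 MB default here is an illustrative assumption (Spark's own `spark.sql.autoBroadcastJoinThreshold` defaults to 10 MB); the commented join assumes PySpark DataFrames named `big` and `dim`.

```python
def should_broadcast(table_size_mb, threshold_mb=100):
    """Heuristic: broadcast-join any table below the size threshold.
    threshold_mb=100 is an illustrative assumption, not Spark's default."""
    return table_size_mb < threshold_mb

# PySpark usage (sketch):
#   from pyspark.sql.functions import broadcast
#   if should_broadcast(dim_size_mb):
#       joined = big.join(broadcast(dim), "join_key")
```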
4. Data skew handling:
- Identify skewed keys: SELECT join_key, COUNT(*) AS cnt FROM table GROUP BY join_key ORDER BY cnt DESC LIMIT 20
- Salt skewed keys: join_key_salted = concat(join_key, '_', floor(rand() * N))
- Process skewed keys separately and union with normal results
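The salting pattern above has two halves: the large side gets a random salt, and the small side is replicated across every salt value so each bucket still finds its match. A pure-Python illustration of the key construction (the PySpark comments are a sketch assuming DataFrames `big` and `small` and N salt buckets):

```python
import random

def salt_key(key, n_salts, rng=random):
    """Large side: append a random salt so one hot key fans out to n_salts buckets."""
    return f"{key}_{rng.randrange(n_salts)}"

def explode_key(key, n_salts):
    """Small side: emit every salted variant so each bucket can still join."""
    return [f"{key}_{i}" for i in range(n_salts)]

# PySpark equivalent (sketch):
#   from pyspark.sql import functions as F
#   big = big.withColumn("salted", F.concat("join_key", F.lit("_"),
#                                           (F.rand() * N).cast("int")))
#   small = small.withColumn("salt", F.explode(F.array(*[F.lit(i) for i in range(N)])))
```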
5. Caching strategy:
- cache() / persist() DataFrames used more than once in the same job
- Use MEMORY_AND_DISK_SER for large DataFrames that don't fit in memory
- Unpersist cached DataFrames when no longer needed
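The caching guidance above can be condensed into a rough decision rule. The 50% cutoff is an illustrative assumption, not a Spark rule; the commented calls assume PySpark and a DataFrame named `df`.

```python
def pick_storage_level(df_size_gb, executor_memory_gb):
    """Rough heuristic (assumption): serialize and allow disk spill once a
    cached DataFrame approaches available executor memory."""
    if df_size_gb > 0.5 * executor_memory_gb:
        return "MEMORY_AND_DISK_SER"
    return "MEMORY_ONLY"

# PySpark usage (sketch; PySpark stores cached data serialized by default):
#   from pyspark import StorageLevel
#   df.persist(StorageLevel.MEMORY_AND_DISK)
#   ...  # reuse df in several actions
#   df.unpersist()
```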
6. Configuration tuning:
- spark.sql.adaptive.enabled=true (AQE): enables runtime partition coalescing and join strategy switching
- spark.sql.adaptive.skewJoin.enabled=true: automatically handles skewed joins
- Executor memory = (node_memory - overhead) / executors_per_node
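The executor-memory formula above, sketched as a helper. The 10% overhead fraction is an illustrative assumption; real overhead depends on `spark.executor.memoryOverhead` and the cluster manager.

```python
def executor_memory_gb(node_memory_gb, executors_per_node, overhead_fraction=0.10):
    """Per-executor memory after reserving OS/daemon overhead.
    overhead_fraction=0.10 is an illustrative assumption."""
    usable = node_memory_gb * (1 - overhead_fraction)
    return usable / executors_per_node

# e.g. a 64 GB node running 4 executors:
#   executor_memory_gb(64, 4)  # ≈ 14.4 GB per executor
# Pair with AQE in spark-defaults.conf or the SparkSession builder:
#   spark.sql.adaptive.enabled=true
#   spark.sql.adaptive.skewJoin.enabled=true
```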
Return: diagnosis procedure, optimization implementations with estimated impact, and configuration recommendations.
When to use this prompt
When analyzing Spark UI evidence after repeated job failures or long runtimes.
When preparing a performance tuning plan for a critical batch workflow.
When you need concrete Spark code and configuration changes with expected impact.
What the AI should return
Return a diagnosis-first optimization plan, not just a list of best practices. Identify the likely bottlenecks, then recommend specific changes to partitioning, joins, skew handling, caching, and configuration. For each recommendation, include why it helps, how to implement it, and the estimated runtime or cost benefit.
How to use this prompt
Open your data context
Load your dataset, notebook, or working environment so the AI can operate on the actual project context.
Copy the prompt text
Use the copy button above and paste the prompt into the AI assistant or prompt input area.
Review the output critically
Check whether the result matches your data, assumptions, and desired format before moving on.
Chain into the next prompt
Once you have the first result, continue deeper with related prompts in Pipeline Design.
Frequently asked questions
What does the Spark Job Optimization prompt do?
It gives you a structured, diagnosis-first starting point for Spark job optimization work and helps you move faster without starting from a blank page.
Who is this prompt for?
It is designed for data engineer workflows and marked as intermediate, so it works well as a guided starting point for that level of experience.
What type of prompt is this?
Spark Job Optimization is a single prompt. You can copy it as-is, adapt it, or use it as one step inside a larger workflow.
Can I use this outside MLJAR Studio?
Yes. The prompt text works in other AI tools too, but MLJAR Studio is the best fit when you want local execution, visible Python code, and reusable notebooks.
What should I open next?
Natural next steps from here are Backfill Strategy, DAG Design for Airflow, and dbt Project Structure.