My Spark Job Kept OOMing — Until I Broke the Lineage

I spent the better part of a day staring at a Spark driver log full of java.lang.OutOfMemoryError and StackOverflowError exceptions. The job was an ETL pipeline that had always run fine. Nothing obvious had changed. After bumping executor memory (no help), tweaking partitions (no help), and reading more Stack Overflow threads than I care to admit, I found the culprit: a concept called lineage, and the fact that I wasn't managing it. A coworker eventually pointed me toward the fix — one of those things that's obvious once you know it.

Here's what I learned.

What Spark Lineage Is

Spark is lazy. When you call .filter(), .join(), .groupBy(), none of that runs immediately. Spark just records the instruction — "when I need the result, do this transformation to that parent dataset." This chain of recorded instructions is the lineage, and it's represented internally as a directed acyclic graph (DAG).

Lineage is actually a feature. It lets Spark recover from executor failures by recomputing lost partitions from scratch. No data needs to be written to disk; Spark just replays the chain. Elegant.

The problem is that the chain keeps growing.

When Lineage Becomes a Liability

The job was an ETL pipeline built around two main datasets that combined to over 90 million records. On top of those, there were a bunch of supporting datasets being joined in to enrich the output — lookup tables, dimension data, the usual. Lots of joins, lots of transforms, and near the end a final SQL UNION to stitch everything together.

Nothing about it was iterative. It was just a long, wide DAG. But by the time Spark got to the final union, the logical plan describing how to compute every input branch had grown enormous. The driver ran out of memory trying to optimize and serialize it — long before any executor had a chance to struggle with the actual data.

The data volume was never the problem. The plan describing how to compute it was.

The Fix: Write to S3 and Read Back

The trick my coworker showed me was almost embarrassingly simple: write the DataFrame to S3 and immediately read it back.

When Spark reads a Parquet file, it has no idea how that file was produced. There's no lineage to inherit — it's just data on disk. The DAG resets to zero.

In my case, the right place to break lineage was right before the final union. Each side of the union had its own deep chain of joins and transforms. By materializing one (or both) sides to S3 and reading them back as fresh DataFrames, the union saw two simple Parquet sources instead of two giant subgraphs.

INTERMEDIATE = "s3://my-bucket/intermediate/pre-union/"

# left_df and right_df are each the result of many joins and transforms
left_df = build_left_side(...)   # deep DAG
right_df = build_right_side(...) # deep DAG

# Materialize the heavier side to S3 and re-read it.
left_df.write.mode("overwrite").parquet(INTERMEDIATE)
left_df = spark.read.parquet(INTERMEDIATE)
# left_df is now a fresh parquet read — its lineage is gone.

left_df.createOrReplaceTempView("left_side")
right_df.createOrReplaceTempView("right_side")

result = spark.sql("""
    SELECT * FROM left_side
    UNION ALL
    SELECT * FROM right_side
""")

result.write.parquet("s3://my-bucket/output/")

That single write-and-read was enough. The driver no longer had to hold the full plan for both branches plus the union on top — it just had to plan the union over two Parquet inputs. The OOMs stopped.

Other Approaches Worth Knowing

Spark also has checkpoint() — a built-in way to truncate lineage that writes to HDFS or S3 under the hood. It's conceptually the same thing, and works well if you want Spark to manage the checkpoint directory for you. The S3 write-and-read approach is more explicit, and I find it easier to reason about: the intermediate path is right there in the code, it survives job restarts, and you can inspect the data at any point.

cache() and persist() are different — they keep data in memory to avoid recomputation, but they don't truncate the lineage. Useful for other problems; not the right tool here.

What Actually Fixed My Job

For me, one write-and-read inserted right before the final union was enough. No loop, no checkpoint cadence to tune — just a single materialization at the spot where two deep subgraphs were about to be combined. The driver memory pressure dropped immediately, and the job finished cleanly on the next run.

It's invisible until it bites you. Lineage doesn't show up in your data or your code — only in the driver's memory. Now I treat a write-and-read as a standard tool in my toolkit, not a hack. If a job has a lot of joins feeding into a final combine step, I break the lineage right before the combine and move on.

← Back to posts