Justy: Cody, this one hits because stale data is the invisible product bug. The page loads, the ad shows, search works, but it’s quietly yesterday.
Cody: Yeah, and the article is basically a scar-tissue writeup on that. They had a delta index pipeline for search and ads retrieval, and the painful part wasn’t that Spark couldn’t process the data. It was that scheduled batch runs left dead time everywhere.
Justy: Also, I made coffee in your kitchen and somehow used three mugs for one drink, so I’m already in pipeline-waste mode. [chuckles] Anyway, this system had new incremental data arriving every five to seven minutes, right?
Cody: Right. Ads, campaign updates, product and item signals, customer signals, conversion, performance, co-purchase stuff. The full index was hundreds of gigs and took two to three hours to rebuild, then validation and deployment could push the whole thing near five hours. So they had a smaller delta index, maybe around a tenth of the full index, to keep production fresh between full swaps.
Justy: The user story is pretty plain: somebody updates an ad or campaign metadata, and the retrieval system should reflect that before the moment is gone. If an ad burns through budget before the latest version lands, that’s not an elegant distributed systems problem. That’s just money and relevance leaking out of the product.
Cody: And the clever move was not jumping all the way to per-record streaming. They tried that direction, or at least started there, and it didn’t match the shape of the work. The indexing logic grouped around product or item level representations, then expanded back out, so one ad change could mean recomputing a grouped chunk. If you stream single records, you risk having part of the grouped index updated and part of it lagging behind.
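A minimal sketch of the grouping point Cody describes, assuming each ad belongs to exactly one item group. The record layout and the names `ADS`, `build_group_entry`, and `apply_change` are hypothetical and are not taken from the article. It shows one ad edit forcing a rebuild of the whole group entry, which is then swapped in as a unit.

```python
# Hypothetical records: every ad maps to one item (product) group.
ADS = {
    "ad1": {"item": "itemA", "terms": ["shoes", "red"]},
    "ad2": {"item": "itemA", "terms": ["shoes", "sale"]},
    "ad3": {"item": "itemB", "terms": ["hat"]},
}


def build_group_entry(item_id, ads):
    """Recompute the full grouped representation for one item."""
    members = sorted(a for a, rec in ads.items() if rec["item"] == item_id)
    terms = sorted({t for a in members for t in ads[a]["terms"]})
    return {"ads": members, "terms": terms}


def apply_change(index, ads, ad_id, new_terms):
    """Apply one ad edit, then rebuild that ad's whole item group."""
    ads[ad_id]["terms"] = new_terms
    item_id = ads[ad_id]["item"]
    # The group entry is replaced as one unit, never patched ad by ad.
    index[item_id] = build_group_entry(item_id, ads)
    return item_id


# Initial build covers every item group once.
index = {
    item: build_group_entry(item, ADS)
    for item in {rec["item"] for rec in ADS.values()}
}

# A single ad edit touches the entire itemA entry and leaves itemB alone.
touched = apply_change(index, ADS, "ad2", ["shoes", "clearance"])
```

In this toy version, per-record streaming would have to reproduce `build_group_entry` for every event anyway, which is roughly why a partition-sized batch can be the more natural unit of work.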
Justy: That’s the adoption barrier I recognize. People hear “streaming” and immediately picture a whole new operational lifestyle. More dashboards, weird long-running job failures, on-call anxiety, and a team quietly asking if the batch job was ugly but at least understandable.
Cody: Totally fair anxiety. [pause] Their compromise was Spark Structured Streaming in micro-batch mode, but not with Kafka-style event semantics. Data was time-partitioned files in S3-style object storage. They used streaming more like a continuous executor that wakes up, finds the partition state, processes a bounded slice, and moves on.
Justy: So it’s not trying to make every individual record sacred. It’s saying, the meaningful unit is a complete partition, and the product wants freshness within minutes, not a perfectly replayed museum of every intermediate state.
Cody: Exactly. And they didn’t lean on Spark’s native checkpointing or event-time watermarks as the source of truth. They kept an external logical watermark, basically the latest processed partition timestamp. That matters because the input wasn’t an ordered event log. It was object storage listings and partition interpretation, which is a very different beast.
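One way an external logical watermark could look, sketched with Python's built-in `sqlite3`. The table name, the `pipeline` key, and the functions `open_store`, `read_watermark`, and `advance_watermark` are assumptions for illustration. The article does not say how the team stored theirs.

```python
import sqlite3
from datetime import datetime


def open_store(path=":memory:"):
    """Open (or create) a one-table watermark store."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS watermark ("
        "pipeline TEXT PRIMARY KEY, last_partition TEXT NOT NULL)"
    )
    return conn


def read_watermark(conn, pipeline):
    """Return the latest processed partition timestamp, or None."""
    row = conn.execute(
        "SELECT last_partition FROM watermark WHERE pipeline = ?",
        (pipeline,),
    ).fetchone()
    return datetime.fromisoformat(row[0]) if row else None


def advance_watermark(conn, pipeline, partition_ts):
    """Move the watermark forward only; refuse to go backwards."""
    current = read_watermark(conn, pipeline)
    if current is not None and partition_ts <= current:
        return False
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO watermark (pipeline, last_partition) "
            "VALUES (?, ?)",
            (pipeline, partition_ts.isoformat()),
        )
    return True


conn = open_store()
first = advance_watermark(conn, "delta_index", datetime(2026, 5, 4, 10, 5))
stale = advance_watermark(conn, "delta_index", datetime(2026, 5, 4, 10, 0))
current = read_watermark(conn, "delta_index")
```

Because the watermark lives outside the job, a restarted process can call `read_watermark` and continue from the logical partition, whatever state the engine's own checkpoint is in.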
Justy: The success-file bit stood out to me. I’ve seen teams treat completion markers like little magic stamps. Then reality shows up: eventual consistency, late files, retries, weird producer behavior, and suddenly the stamp is less comforting.
Cody: Yeah. [exhales] The article’s take is that deterministic, rate-based progress can be more reliable than waiting for perfect completion signals. In steady state, process the latest available partition according to a controlled cadence. For this freshness path, skipping ahead after lag could be better than replaying every missed historical partition, because the output covers overlapping windows anyway, around the last five hours.
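A rough sketch of the skip-ahead idea, assuming partitions are identified by timestamps and that each output covers a trailing window of about five hours, as mentioned above. The function name `plan_tick` and the returned dictionary are hypothetical. The `uncovered` list is one possible reading of when skipping stops being safe, not something the article specifies.

```python
from datetime import datetime, timedelta


def plan_tick(available, watermark, window=timedelta(hours=5)):
    """Decide what a single scheduled tick should process.

    Returns None when nothing is newer than the watermark. Otherwise
    picks the newest partition and reports what was skipped.
    """
    pending = sorted(
        ts for ts in available if watermark is None or ts > watermark
    )
    if not pending:
        return None
    latest = pending[-1]
    skipped = pending[:-1]
    # Assumption: a skipped partition is still reflected in the output
    # if it falls inside the trailing window ending at `latest`.
    uncovered = [ts for ts in skipped if latest - ts > window]
    return {"process": latest, "skipped": skipped, "uncovered": uncovered}


day = datetime(2026, 5, 4)
available = [
    day.replace(hour=3, minute=0),
    day.replace(hour=10, minute=0),
    day.replace(hour=10, minute=5),
    day.replace(hour=10, minute=10),
    day.replace(hour=10, minute=15),
]
plan = plan_tick(available, watermark=day.replace(hour=2, minute=55))
idle = plan_tick(available, watermark=day.replace(hour=10, minute=15))
```

A non-empty `uncovered` list is the point where "skip to latest" would actually drop data, so it is a natural place to alert or fall back to a replay.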
Justy: I like that, but I’d be careful selling it outside this exact shape. If you’re doing billing, compliance-y reconciliation, or anything where every intermediate state matters, “skip to latest” is a trap. For ads retrieval freshness, though, I get it. The latest usable index is the thing users feel.
Cody: Same. The questionable part is mostly portability of the lesson. The strong lesson is not “always skip backlog.” It’s to design lag behavior explicitly. They treated restarts as normal too, which I love. Long-running jobs leak memory, dependencies wobble, clusters get weird. A clean scheduled restart can be a feature if the watermark and partition logic are solid.
Justy: Build-next version, I’d keep it tiny. Take Spark Structured Streaming, write partitioned Parquet files into MinIO every few minutes, and maintain a watermark row in SQLite or Postgres. Then deliberately kill the job and see if it resumes from your logical partition instead of whatever Spark feels like doing.
Cody: Yeah, solo builder weekend: clone something like apache/spark examples, run `docker compose up` with MinIO, then launch `pyspark` with a local checkpoint directory. Generate folders like `dt=2026-05-04/hour=10/minute=05`, process only the newest complete-looking partition, and write a small inverted index file, even just term to document IDs. If you want the lakehouse flavor, try delta-io/delta with Spark, but don’t confuse Delta Lake with their “delta index” concept. Different things.
Justy: Three deltas walk into a backlog, nobody knows who owns the ticket. Cody, that’s useful. Tiny version first, then decide if streaming is actually earned.
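For the tiny version, the partition picking and the term-to-document-IDs index can be prototyped in plain Python before any Spark is involved. This is a sketch under stated assumptions: the folder layout follows the `dt=.../hour=.../minute=...` pattern mentioned above, "complete-looking" is approximated by a hypothetical grace delay, and the bucket path and function names are made up for illustration.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

PARTITION_RE = re.compile(
    r"dt=(\d{4})-(\d{2})-(\d{2})/hour=(\d{2})/minute=(\d{2})"
)


def partition_ts(path):
    """Parse a partition timestamp out of a path, or return None."""
    match = PARTITION_RE.search(path)
    if not match:
        return None
    year, month, day, hour, minute = map(int, match.groups())
    return datetime(year, month, day, hour, minute)


def newest_settled_partition(paths, now, grace=timedelta(minutes=5)):
    """Pick the newest partition that is at least `grace` old.

    Assumption: a partition older than the grace delay is treated as
    complete-looking; there is no success-file check here.
    """
    cutoff = now - grace
    stamped = [(partition_ts(p), p) for p in paths]
    settled = [(ts, p) for ts, p in stamped if ts is not None and ts <= cutoff]
    return max(settled)[1] if settled else None


def inverted_index(docs):
    """Map each lowercased term to the sorted document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(index.items())}


base = "s3a://bucket/events/"  # hypothetical MinIO bucket path
paths = [
    base + "dt=2026-05-04/hour=10/minute=00",
    base + "dt=2026-05-04/hour=10/minute=05",
    base + "dt=2026-05-04/hour=10/minute=10",
    base + "_tmp",
]
chosen = newest_settled_partition(paths, now=datetime(2026, 5, 4, 10, 12))
index = inverted_index({"d1": "red shoes", "d2": "Red hat"})
```

With `now` at 10:12 and a five-minute grace, the 10:10 folder is still considered in flight, so the 10:05 partition is chosen.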