Canceling Operations with Dependencies.

Concurrent hard is programming.

Context

I have to implement a long-running, multi-step, cancelable task. For simplicity, let’s assume there are 3 steps. Cancellation might be requested by the user, or by the task itself – to abort early in case an error occurs.

Obviously, this task cannot be run on the main thread, so I’ll have to run it on some kind of background queue.

Implementation

I generally use Dispatch for this kind of problem, but there’s no support for cancellation and handling that manually sounds… inconvenient.

So I choose OperationQueue (a.k.a. NSOperationQueue), which supports cancellation. Each step of the task can be modeled as a single Operation.

OperationQueue, concurrent at heart, can be made serial by setting its maxConcurrentOperationCount property to 1. It, however, makes no guarantee that the operations will be executed in the order in which they are provided to the queue.

To fix that, the docs suggest we create dependencies between the operations to enforce our desired execution order. I create a chain where each step is given a dependency on the previous. This way, the only step that is ready to execute at any given time, as it has no unsatisfied dependencies, is the first one.

let queue = OperationQueue()
queue.maxConcurrentOperationCount = 1

let operations: [Operation] = [step1, step2, step3]

step3.addDependency(step2)
step2.addDependency(step1)

queue.addOperations(allOperations, waitUntilFinished: true)

And because each of the steps implements its own cancellation correctly, all I need to do to cancel the entire multi-step task is:

queue.cancelAllOperations()

…right?

The Symptom

Rarely, when a step fails and tries to abort the whole task by canceling the queue, this… doesn’t work. All steps are eventually canceled, but canceling during Step 1 sometimes causes Step 3 to start executing for a split second. Sometimes, it’s Step 2 that gets a brief execution window.

This is unacceptable because the dependencies between these operations aren’t just a way to trick OperationQueue into following our intended order; some steps actually need the data produced by previous steps, so this breaks the task’s internal consistency.

Don’t you just love it when bug descriptions start with “Rarely”?

A Note On Cancellation

The docs read:

Canceling the operations does not automatically remove them from the queue or stop those that are currently executing. For operations that are queued and waiting execution, the queue must still attempt to execute the operation before recognizing that it is canceled and moving it to the finished state.

So cancel() doesn’t magically stop running code, it just marks it as cancelled by setting isCancelled on the Operation object. Executing operations should respond to cancel() by finishing as soon as possible. And the queue still needs to go through each non-executing operation, and determine whether to run or skip it based on the aforementioned flag.

But the docs also say that when the queue encounters a non-executing operation where isCancelled is true, the operation will be removed from the queue without running its code at all (i.e. without invoking its main() method). This doesn’t match our observations.

The Bug

To figure this out, we can try to imagine how cancelAllOperations() might be implemented. My guess is:

for operation in operations {
  operation.cancel()
}

And here lies the issue.

This is the part where you stop reading if you want to figure it out for yourself. If you did, welcome back!

Canceling operations in array order means that the first operation to receive a cancel() is Step 1. This marks it as complete, instantly satisfying all of Step 2’s dependencies! So technically, before cancel() is called on it, Step 2 is ready to execute. And when the planets align, it does.

Speculation Corner

I’d have to see the code to be sure, but I think the issue is exacerbated by OperationQueue’s internals.

Imagine that Step 1 fails and wants to abort the entire task. To do so, it calls cancelAllOperations() on the queue (probably indirectly via a delegate callback, but that’s not important right now).

In turn, cancelAllOperations() starts by calling cancel() on Step 1. This changes Step 1’s isCancelled flag, which (judging by the documentation) is observed by the queue via KVO.

In the KVO observer, OperationQueue can realize that Step 2’s dependencies are satisfied and change its flags to mark it as ready to execute.

This change will in turn be observed by the queue, which can decide to kick off Step 2.

And because all of the above happens on the same underlying DispatchQueue where Step 1 is running, which is now free since the task is serial and Step 1 is done, the queue might decide that this is a fantastic place to run Step 2 and just start it synchronously – all before the call to cancel() on Step 1 even returns.

So even if the thread executing cancelAllOperations() isn’t pre-empted halfway through its loop, the bug could still strike.

The Fix

Let’s make sure to never accidentally satisfy any step’s dependencies. On any serial, ordered example like the above, all we need to do is cancel the steps in reverse order. At no point during this there will be a step which is not cancelled and has satisfied dependencies (other than the currently executing one, of course).

for operation in queue.operations.reversed() {
  operation.cancel()
}

For more complex dependency trees, operations will need to be canceled from the leaves up.

Existential Questions

I’m not even sure whether this can be considered a bug in OperationQueue. Intuitively, it seems like cancelAllOperations() should mark all operations as cancelled in a semantically atomic way. I definitely did not expect it to start executing anything. But then again, the documentation doesn’t make any promises either way.

Two days ago me sure wished this post existed. Let me know if this helped you!