TensorFlow crashes with a segfault

I have a script that crashes with a segmentation fault on Apple M1 hardware with tensorflow-macos 2.11.0. It does not crash on other hardware.

I think the code is correct, since it works fine on other hardware. But even if something is wrong with the code, TensorFlow should raise an exception rather than crash.

I also reported this here: https://github.com/tensorflow/tensorflow/issues/59780

On Apple M1 hardware:

  • Check out https://github.com/rwth-i6/returnn. (To be sure, use commit 3a67da87c2fd8783c5c2469d72cf1319b5b45837.)
  • Run: python3 tests/test_TFUtil.py test_get_variable_grad_from_update_ops
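
To tell a real crash apart from an ordinary test failure, the run step above can be wrapped in a small script. This is only a sketch under assumptions: the helper name run_and_classify and the REPRO_CMD list are made up for illustration and are not part of RETURNN or TensorFlow, and the signal check relies on POSIX behaviour (macOS, Linux), where a negative return code means the child was killed by a signal.

```python
import signal
import subprocess
import sys


def run_and_classify(cmd):
    """Run cmd in a child process and describe how it ended (hypothetical helper)."""
    proc = subprocess.run(cmd)
    if proc.returncode < 0:
        # POSIX only: a negative return code is the number of the signal that killed the child.
        return "killed by " + signal.Signals(-proc.returncode).name
    if proc.returncode != 0:
        # An uncaught Python exception or a failed test ends up here.
        return "failed with exit code %d" % proc.returncode
    return "ok"


# The reproduction command from the steps above, to be run from the RETURNN checkout.
# "-X faulthandler" asks the interpreter to dump the Python stack on a fatal signal.
REPRO_CMD = [
    sys.executable, "-X", "faulthandler",
    "tests/test_TFUtil.py", "test_get_variable_grad_from_update_ops",
]
```

Calling print(run_and_classify(REPRO_CMD)) from the checkout should report a signal such as SIGSEGV if the process crashes, and an exit code if the test merely fails with an exception.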

The relevant code:

Relevant log output:

...
grad: Tensor("test_get_variable_grad_from_update_ops/gradients_2/test_get_variable_grad_from_update_ops/sub_grad/tuple/control_dependency:0", shape=(), dtype=float32)
Fatal Python error: Segmentation fault

Thread 0x0000000103500580 (most recent call first):
  File "/Users/az/.local/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1454 in _call_tf_sessionrun
  File "/Users/az/.local/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1361 in _run_fn
  File "/Users/az/.local/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1378 in _do_call
  File "/Users/az/.local/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1371 in _do_run
  File "/Users/az/.local/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1191 in _run
  File "/Users/az/.local/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 968 in run
  File "/Users/az/Programmierung/crnn/tests/test_TFUtil.py", line 3529 in test_get_variable_grad_from_update_ops
  File "/Users/az/Programmierung/crnn/tests/test_TFUtil.py", line 4559 in <module>
fish: Job 1, 'python3 tests/test_TFUtil.py te…' terminated by signal SIGSEGV (Address boundary error)

Stack trace in LLDB in the crashing thread:

* thread #28, queue = 'metal gpu stream', stop reason = EXC_BAD_ACCESS (code=1, address=0xbeaddc3f8010)
  * frame #0: 0x00000001836ea5a0 libobjc.A.dylib`objc_msgSend + 32
    frame #1: 0x000000018df96d38 MPSNDArray`___lldb_unnamed_symbol1550 + 2292
    frame #2: 0x000000018df98bbc MPSNDArray`___lldb_unnamed_symbol1567 + 300
    frame #3: 0x000000018df991e8 MPSNDArray`___lldb_unnamed_symbol1569 + 176
    frame #4: 0x0000000159a7d2b8 libmetal_plugin.dylib`invocation function for block in double dispatchOneKernel<MPSNDArrayIdentity>(MetalStream*, MPSNDArrayIdentity*, NSArray*, MPSNDArray*, char const*, MPSKernelDAGObject*) + 120
    frame #5: 0x00000001836a01b4 libdispatch.dylib`_dispatch_client_callout + 20
    frame #6: 0x00000001836af414 libdispatch.dylib`_dispatch_lane_barrier_sync_invoke_and_complete + 56
    frame #7: 0x0000000159a7d140 libmetal_plugin.dylib`double dispatchOneKernel<MPSNDArrayIdentity>(MetalStream*, MPSNDArrayIdentity*, NSArray*, MPSNDArray*, char const*, MPSKernelDAGObject*) + 120
    frame #8: 0x0000000159a7fffc libmetal_plugin.dylib`metal_plugin::MPSApplyMomentumOp<float>::Compute(metal_plugin::OpKernelContext*) + 2768
    frame #9: 0x0000000159a7f2fc libmetal_plugin.dylib`void metal_plugin::ComputeOpKernel<metal_plugin::MPSApplyMomentumOp<float> >(void*, TF_OpKernelContext*) + 44
    frame #10: 0x000000014cd00028 libtensorflow_framework.2.dylib`tensorflow::PluggableDevice::Compute(tensorflow::OpKernel*, tensorflow::OpKernelContext*) + 148
    frame #11: 0x000000014cc847f0 libtensorflow_framework.2.dylib`tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::Process(tensorflow::SimplePropagatorState::TaggedNode, long long) + 3764
    frame #12: 0x000000028a47eb6c _pywrap_tensorflow_internal.so`Eigen::ThreadPoolTempl<tsl::thread::EigenEnvironment>::WorkerLoop(int) + 1496
    frame #13: 0x000000028a47e468 _pywrap_tensorflow_internal.so`tsl::thread::EigenEnvironment::CreateThread(std::__1::function<void ()>)::'lambda'()::operator()() const + 80
    frame #14: 0x000000014cb9e878 libtensorflow_framework.2.dylib`tsl::(anonymous namespace)::PThread::ThreadFn(void*) + 120
    frame #15: 0x000000018386426c libsystem_pthread.dylib`_pthread_start + 148

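As the log output shows, the crash happens in the last session.run([minimize_op, grad]). The LLDB trace places it in metal_plugin::MPSApplyMomentumOp<float>::Compute, on the 'metal gpu stream' queue.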