TensorFlow-Metal Error "could not find registered platform" on Intel Mac

Following the TensorFlow-Metal installation instructions, I get the following error when running the test script:

2023-01-19 19:09:54.661559: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:418 : NOT_FOUND: could not find registered platform with id: 0x7f7fff9c4750

My system: Mac mini (2018), macOS Monterey 12.6.2, 32 GB memory, AMD Radeon RX Vega 64 8 GB (eGPU)

Replies

Additional info: Python 3.10

418 : NOT_FOUND: could not find registered platform with id: 0x7f7fff9c4750
Traceback (most recent call last):
  File "/Users/jnevin/Library/Mobile Documents/com~apple~CloudDocs/JimStuff/RealtoStudios/VirtualRealityTheater/ML-HumanReconstruct/test-if-tensorflow-mac-gpu.py", line 13, in <module>
    model.fit(x_train, y_train, epochs=5, batch_size=64)
  File "/Users/jnevin/venv-metal/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/Users/jnevin/venv-metal/lib/python3.10/site-packages/tensorflow/python/eager/execute.py", line 52, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.NotFoundError: Graph execution error:

Detected at node 'StatefulPartitionedCall_212' defined at (most recent call last):
    File "/Users/jnevin/Library/Mobile Documents/com~apple~CloudDocs/JimStuff/RealtoStudios/VirtualRealityTheater/ML-HumanReconstruct/test-if-tensorflow-mac-gpu.py", line 13, in <module>
      model.fit(x_train, y_train, epochs=5, batch_size=64)
    File "/Users/jnevin/venv-metal/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/Users/jnevin/venv-metal/lib/python3.10/site-packages/keras/engine/training.py", line 1650, in fit
      tmp_logs = self.train_function(iterator)
    File "/Users/jnevin/venv-metal/lib/python3.10/site-packages/keras/engine/training.py", line 1249, in train_function
      return step_function(self, iterator)
    File "/Users/jnevin/venv-metal/lib/python3.10/site-packages/keras/engine/training.py", line 1233, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/Users/jnevin/venv-metal/lib/python3.10/site-packages/keras/engine/training.py", line 1222, in run_step
      outputs = model.train_step(data)
    File "/Users/jnevin/venv-metal/lib/python3.10/site-packages/keras/engine/training.py", line 1027, in train_step
      self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
    File "/Users/jnevin/venv-metal/lib/python3.10/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 527, in minimize
      self.apply_gradients(grads_and_vars)
    File "/Users/jnevin/venv-metal/lib/python3.10/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1140, in apply_gradients
      return super().apply_gradients(grads_and_vars, name=name)
    File "/Users/jnevin/venv-metal/lib/python3.10/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 634, in apply_gradients
      iteration = self._internal_apply_gradients(grads_and_vars)
    File "/Users/jnevin/venv-metal/lib/python3.10/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1166, in _internal_apply_gradients
      return tf.__internal__.distribute.interim.maybe_merge_call(
    File "/Users/jnevin/venv-metal/lib/python3.10/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1216, in _distributed_apply_gradients_fn
      distribution.extended.update(
    File "/Users/jnevin/venv-metal/lib/python3.10/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1211, in apply_grad_to_update_var
      return self._update_step_xla(grad, var, id(self._var_key(var)))
Node: 'StatefulPartitionedCall_212'
could not find registered platform with id: 0x7f7fff9c4750
	 [[{{node StatefulPartitionedCall_212}}]] [Op:__inference_train_function_23355]

Hi @jnevin,

In base TensorFlow 2.11, the optimizer API changed and it broke the current pluggable architecture: jit_compile=True was turned on by default for optimizers, and that path goes through XLA, which is not supported by pluggable devices. We are working on a fix for this issue. In the meantime, can you use the legacy optimizer API as a workaround:

import tensorflow as tf
from tensorflow.keras.optimizers.legacy import Adam

cifar = tf.keras.datasets.cifar100
(x_train, y_train), (x_test, y_test) = cifar.load_data()
model = tf.keras.applications.ResNet50(
    include_top=True,
    weights=None,
    input_shape=(32, 32, 3),
    classes=100,)

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=Adam(), loss=loss_fn, metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=64)
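
If you want to keep the new optimizer instead, one alternative that should avoid the XLA path (not verified on Metal here) is to turn off its XLA-compiled update step explicitly, since jit_compile defaults to True. Continuing from the snippet above:

# Sketch, assuming TF 2.11: the new optimizer exposes a jit_compile flag that
# defaults to True and routes the variable-update step through XLA. Setting it
# to False should keep the update on the regular, pluggable-device path.
optimizer = tf.keras.optimizers.Adam(jit_compile=False)
model.compile(optimizer=optimizer, loss=loss_fn, metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=64)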

Thanks for the timely response.

However, the recommended legacy optimizer API fix did not correct the issue.

I interactively re-ran the updated script using the legacy optimizer.

The script succeeds up to the final step, including creating the TensorFlow PluggableDevice.

It fails in the first epoch, when it attempts to fit the model to the training data, i.e.:

>>> model.fit(x_train, y_train, epochs=5, batch_size=64)

Note the UserWarning:

UserWarning: "sparse_categorical_crossentropyreceivedfrom_logits=True, but the output argument was produced by a Softmax activation and thus does not represent logits. Was this intended?

Then note:

OP_REQUIRES failed at xla_ops.cc:418 : NOT_FOUND: could not find registered platform with id: 0x7fc2eb58dc70

HERE IS A FULL DUMP OF THE SESSION:

2023-01-21 17:12:37.085285: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
>>> from tensorflow.keras.optimizers.legacy import Adam
>>> cifar = tf.keras.datasets.cifar100
>>> (x_train, y_train), (x_test, y_test) = cifar.load_data()
>>> model = tf.keras.applications.ResNet50(
...     include_top=True,
...     weights=None,
...     input_shape=(32, 32, 3),
...     classes=100,)
2023-01-21 17:14:51.596506: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Metal device set to: AMD Radeon RX Vega 64

systemMemory: 32.00 GB
maxCacheSize: 3.99 GB

2023-01-21 17:14:51.597304: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-01-21 17:14:51.597346: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
>>> loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
>>> model.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])

>>> model.fit(x_train, y_train, epochs=5, batch_size=64)
Epoch 1/5
/Users/jnevin/venv-metal/lib/python3.10/site-packages/keras/backend.py:5585: UserWarning: "`sparse_categorical_crossentropy` received `from_logits=True`, but the `output` argument was produced by a Softmax activation and thus does not represent logits. Was this intended?
  output, from_logits = _get_logits(
2023-01-21 17:18:33.774397: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2023-01-21 17:18:36.989723: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:418 : NOT_FOUND: could not find registered platform with id: 0x7fc2eb58dc70
2023-01-21 17:18:36.989759: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:418 : NOT_FOUND: could not find registered platform with id: 0x7fc2eb58dc70

.............  (same as above)

2023-01-21 17:18:38.299162: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:418 : NOT_FOUND: could not find registered platform with id: 0x7fc2eb58dc70
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/jnevin/venv-metal/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/Users/jnevin/venv-metal/lib/python3.10/site-packages/tensorflow/python/eager/execute.py", line 52, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.NotFoundError: Graph execution error:

Detected at node 'StatefulPartitionedCall_212' defined at (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/Users/jnevin/venv-metal/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/Users/jnevin/venv-metal/lib/python3.10/site-packages/keras/engine/training.py", line 1650, in fit
      tmp_logs = self.train_function(iterator)
    File "/Users/jnevin/venv-metal/lib/python3.10/site-packages/keras/engine/training.py", line 1249, in train_function
      return step_function(self, iterator)
    File "/Users/jnevin/venv-metal/lib/python3.10/site-packages/keras/engine/training.py", line 1233, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/Users/jnevin/venv-metal/lib/python3.10/site-packages/keras/engine/training.py", line 1222, in run_step
      outputs = model.train_step(data)
    File "/Users/jnevin/venv-metal/lib/python3.10/site-packages/keras/engine/training.py", line 1027, in train_step
      self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
    File "/Users/jnevin/venv-metal/lib/python3.10/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 527, in minimize
      self.apply_gradients(grads_and_vars)
    File "/Users/jnevin/venv-metal/lib/python3.10/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1140, in apply_gradients
      return super().apply_gradients(grads_and_vars, name=name)
    File "/Users/jnevin/venv-metal/lib/python3.10/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 634, in apply_gradients
      iteration = self._internal_apply_gradients(grads_and_vars)
    File "/Users/jnevin/venv-metal/lib/python3.10/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1166, in _internal_apply_gradients
      return tf.__internal__.distribute.interim.maybe_merge_call(
    File "/Users/jnevin/venv-metal/lib/python3.10/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1216, in _distributed_apply_gradients_fn
      distribution.extended.update(
    File "/Users/jnevin/venv-metal/lib/python3.10/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1211, in apply_grad_to_update_var
      return self._update_step_xla(grad, var, id(self._var_key(var)))
Node: 'StatefulPartitionedCall_212'
could not find registered platform with id: 0x7fc2eb58dc70
	 [[{{node StatefulPartitionedCall_212}}]] [Op:__inference_train_function_23355]
>>>

It looks like, even when using the legacy optimizer, the path still goes to the unsupported XLA.
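
Note that in the dump above the model was compiled with optimizer="adam" (a string), not the legacy Adam() instance; as far as I can tell, Keras resolves that string to the new optimizer class, which would explain why the XLA path was still taken. A quick way to check which class is actually in use:

# After model.compile(...), inspect the optimizer Keras actually instantiated.
print(type(model.optimizer))
# For the workaround this should be a class from keras.optimizers.legacy;
# if it is the new (optimizer_experimental) Adam, the XLA-compiled update
# step is still in play.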

OK, I made one adjustment to the model.compile parameters, now:

model.compile(optimizer=Adam(), loss=loss_fn, metrics=["accuracy"])

and it now loads the legacy optimizer, but it produces a segmentation fault after an "unrecognized selector" error:

Metal device set to: AMD Radeon RX Vega 64

systemMemory: 32.00 GB
maxCacheSize: 3.99 GB

2023-01-22 17:00:28.647637: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-01-22 17:00:28.647677: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
Epoch 1/5
/Users/jnevin/miniconda3/envs/tf-metal/lib/python3.10/site-packages/keras/backend.py:5585: UserWarning: "`sparse_categorical_crossentropy` received `from_logits=True`, but the `output` argument was produced by a Softmax activation and thus does not represent logits. Was this intended?
  output, from_logits = _get_logits(
2023-01-22 17:00:36.273504: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2023-01-22 17:00:36.949 python3[20749:1246068] -[MPSGraph adamUpdateWithLearningRateTensor:beta1Tensor:beta2Tensor:epsilonTensor:beta1PowerTensor:beta2PowerTensor:valuesTensor:momentumTensor:velocityTensor:gradientTensor:name:]: unrecognized selector sent to instance 0x6000199908c0
Segmentation fault: 11
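
The "unrecognized selector" right before the segmentation fault suggests (though I can't confirm it) that the tensorflow-metal plugin is calling an MPSGraph method this macOS build doesn't provide, i.e. a plugin/OS version mismatch. It seems worth recording the exact versions when reporting it; the package names below are assumptions, adjust if your install uses plain tensorflow instead of tensorflow-macos:

import platform
from importlib import metadata

# Capture the macOS release and the installed TensorFlow/Metal package versions.
print("macOS:", platform.mac_ver()[0])
for pkg in ("tensorflow-macos", "tensorflow-metal", "keras"):
    try:
        print(pkg, metadata.version(pkg))
    except metadata.PackageNotFoundError:
        print(pkg, "not installed")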

I also encountered this problem and managed to solve it by using an older version of TensorFlow; here is the article:

https://medium.com/@yningz/how-to-install-and-actually-run-with-tensorflow-on-m1-m2-mac-with-metal-plugin-in-3-steps-81341d0a9363
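
For anyone taking the same route: after installing the older versions (the exact pins are in the article above, not repeated here), a short sanity check that the Metal device is visible and a small fit() no longer hits the XLA error might look like this:

import tensorflow as tf

print(tf.__version__)                           # expect a pre-2.11 release
print(tf.config.list_physical_devices("GPU"))   # the Metal PluggableDevice should be listed

# Tiny model and a single short epoch, just to confirm training runs without the error.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10),
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
model.fit(x_train[:2048] / 255.0, y_train[:2048], epochs=1, batch_size=64)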