Metal Performance Shaders


Use Metal Performance Shaders to optimize graphics and compute performance with kernels that are fine-tuned for the unique characteristics of each Metal GPU family.

Metal Performance Shaders Documentation

Posts under Metal Performance Shaders tag

39 Posts
Post not yet marked as solved
1 Reply
171 Views
I'm brand new to Metal. I've googled, but can't get the right answer to come up. (Thanks, unhelpful ChatGPT-generated answers polluting everything, but I digress...) Ultimately, I'm trying to figure out how to use Metal to render 3D DICOM data on iOS specifically. If you're not familiar with DICOM, let's just say I've got a whole stack of CT image slices. Or, to get really simple, I've got a cube of voxel values with differing values at each voxel coordinate. Where do I even start in Metal to render something like this? (I was trying to get the VTK toolkit compiled for iOS, which uses OpenGL, but that appears to be a dead end. And besides, Metal is supposed to be so much better.) Thanks for any tips/leads/suggestions/general pointers.
Posted. Last updated.
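As a possible starting point for the volume-rendering question above (not a full answer): a common first step is to upload the voxel cube into a 3D MTLTexture, which a compute or fragment shader can then sample along rays to produce a maximum-intensity or transfer-function rendering. A minimal sketch follows, assuming the CT slices have already been assembled into a contiguous array of 16-bit values; the function and parameter names here are illustrative, not from the original post.

import Metal

/// Uploads a cube of voxel values into a 3D texture that a ray-marching
/// shader can sample. Assumes `voxels` holds width*height*depth UInt16
/// values in x-fastest order (hypothetical layout; adjust to your series).
func makeVolumeTexture(device: MTLDevice,
                       voxels: [UInt16],
                       width: Int, height: Int, depth: Int) -> MTLTexture? {
    let descriptor = MTLTextureDescriptor()
    descriptor.textureType = .type3D
    descriptor.pixelFormat = .r16Uint          // one channel per voxel
    descriptor.width = width
    descriptor.height = height
    descriptor.depth = depth
    descriptor.usage = [.shaderRead]

    guard let texture = device.makeTexture(descriptor: descriptor) else { return nil }

    let bytesPerRow = width * MemoryLayout<UInt16>.size
    let bytesPerImage = bytesPerRow * height
    voxels.withUnsafeBytes { raw in
        texture.replace(region: MTLRegionMake3D(0, 0, 0, width, height, depth),
                        mipmapLevel: 0,
                        slice: 0,
                        withBytes: raw.baseAddress!,
                        bytesPerRow: bytesPerRow,
                        bytesPerImage: bytesPerImage)
    }
    return texture
}

The ray-marching shader itself is application-specific, but once the data lives in a .type3D texture it can be sampled like any other texture from Metal Shading Language.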
Post not yet marked as solved
0 Replies
305 Views
Hello, I have been following the excellent and informative "Metal for Machine Learning" session from WWDC19 to learn how to do on-device training (I have a specific use case for this), and it is all working really well using MPSNNGraph. However, I would like to call my own Metal compute/render function/pipeline to transform the inference result before calculating the loss. Does anyone know if this is possible, and what it would look like in code? Please see my current code below; at the comment I need to call an intermediate compute/render function to transform the inference result image before passing it to the MPSNNForwardLossNode.
let rgbImageNode = MPSNNImageNode(handle: nil)
let inferGraph = makeInferenceGraph()
let reshape = MPSNNReshapeNode(source: inferGraph.resultImage, resultWidth: 64, resultHeight: 64, resultFeatureChannels: 4)
// Need to call render or compute pipeline to post-process the inference result image
let rgbLoss = MPSNNForwardLossNode(source: reshape.resultImage, labels: rgbImageNode, lossDescriptor: lossDescriptor)
let initGrad = MPSNNInitialGradientNode(source: rgbLoss.resultImage)
let gradNodes = initGrad.trainingGraph(withSourceGradient: nil, nodeHandler: nil)
guard let trainGraph = MPSNNGraph(device: device, resultImage: gradNodes![0].resultImage, resultImageIsNeeded: true) else {
    fatalError("Unable to get training graph.")
}
Thanks
Posted by IainA. Last updated.
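Since MPSNNGraph has no node type for arbitrary user kernels, one approach that may fit the question above is to split the network at the point where the custom pass is needed: run an inference graph up to the reshape node, encode a custom compute pass on the intermediate MPSImage's texture in the same command buffer, then feed the transformed image into a second graph containing the loss and training nodes. The following is only a sketch under those assumptions; inferenceGraph, trainingGraph, and postProcessPipeline are illustrative names, not part of the original post.

import MetalPerformanceShaders

// Sketch: split the network where the custom kernel is needed.
// `inferenceGraph` ends at the reshape node; `trainingGraph` begins at the
// loss node and takes the post-processed image (plus labels) as its inputs.
func encodeTrainingStep(commandBuffer: MTLCommandBuffer,
                        inferenceGraph: MPSNNGraph,
                        trainingGraph: MPSNNGraph,
                        postProcessPipeline: MTLComputePipelineState,
                        input: MPSImage,
                        labels: MPSImage) -> MPSImage? {
    // 1. Run inference up to the reshape node.
    guard let inferred = inferenceGraph.encode(to: commandBuffer,
                                               sourceImages: [input]) else { return nil }

    // 2. Post-process the intermediate result with a custom compute kernel,
    //    writing into a fresh MPSImage of the same shape (64x64x4 here).
    let descriptor = MPSImageDescriptor(channelFormat: .float16,
                                        width: 64, height: 64, featureChannels: 4)
    let transformed = MPSImage(device: commandBuffer.device, imageDescriptor: descriptor)
    if let encoder = commandBuffer.makeComputeCommandEncoder() {
        encoder.setComputePipelineState(postProcessPipeline)
        encoder.setTexture(inferred.texture, index: 0)
        encoder.setTexture(transformed.texture, index: 1)
        let threadsPerGroup = MTLSize(width: 8, height: 8, depth: 1)
        let groups = MTLSize(width: (64 + 7) / 8, height: (64 + 7) / 8, depth: 1)
        encoder.dispatchThreadgroups(groups, threadsPerThreadgroup: threadsPerGroup)
        encoder.endEncoding()
    }

    // 3. Feed the transformed image and the labels into the loss/training graph.
    return trainingGraph.encode(to: commandBuffer, sourceImages: [transformed, labels])
}

One caveat with this split: MPS will not differentiate through the custom pass, so the transform either needs no gradient or has to be expressed with existing MPSNN nodes (or its backward pass encoded by hand).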
Post not yet marked as solved
0 Replies
261 Views
Hi there, I've run into a problem with my working project's build settings: there is no Metal Compiler build setting, although it appears in my demo project. And I'm certain I've moved the shader files (.metal) into the bundle. How can I resolve this problem? Xcode Version 15.0, macOS 13.6.1 (22G313)
Posted. Last updated.
Post not yet marked as solved
0 Replies
290 Views
It appears that the Metal debugging interface does not support this method; at least the function-hashing algorithm does not have a pattern for it in the symbol dictionary as presented. Where do we get updated C libraries and functions that sync with the things presented in the demo kits and samples that Apple puts in the user domain? Why does this stuff get out into the wild insufficiently tested? It seems that the demo kits made available to users should be included in the test domain used to verify new code releases. I came from a development environment where the six-month release cycle involved automated execution of the test suite before anything went to beta or anywhere else.
Posted by mfstanton. Last updated.
Post not yet marked as solved
0 Replies
313 Views
Hello, I'm currently facing an issue with MetalPerformanceShaders while testing a Python project. Code: https://github.com/Thinklab-SJTU/Crossformer
Error description:
/AppleInternal/Library/BuildRoots/4e1473ee-9f66-11ee-8daf-cedaeb4cabe2/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:126: failed assertion `[MPSNDArrayDescriptor sliceDimension:withSubrange:] error: the range subRange.start + subRange.length does not fit in dimension[2] (15)'
I've tried updating macOS to the latest version. I've also attempted running the code with MPS_NO_DEVICE_CHECK=1 to bypass device checks, but the error happened again. I'm seeking insights or solutions to this problem; if anyone has encountered a similar issue or has suggestions on how to troubleshoot and resolve this assertion failure, I would greatly appreciate your help.
Project details:
Language: Python
Environment: PyCharm CE
Relevant technologies: MetalPerformanceShaders, MPSNDArray
Thank you for your time and assistance!
Posted by FVery. Last updated.
Post not yet marked as solved
0 Replies
261 Views
Hi, I am using Xcode's frame capture to profile my app's shaders, and I have some questions about the per-line profiling statistics for a shader. Please see the two screenshots first; they show my compute shader (one of the beginning, labeled "Begin", and one of the end, labeled "End"). The first image is the head of the shader: the profile shows that the shader entry function takes 72.44% of the time. At the end of the shader, the profile shows that the closing brace '}' takes 60.45%. Here are my questions: How should I interpret the profile data? What is the real performance breakdown of this shader? Why does the shader entry function not take 100% of the time? Can someone help me answer these questions? Thanks! Boson
Posted by Boson. Last updated.
Post not yet marked as solved
2 Replies
573 Views
In Instruments' CPU profiling tool I've noticed that a significant portion (22.5%) of the CPU-side overhead of MPS matrix multiplication (GEMM) is spent in a call to getenv(). Please see the attached screenshot. It seems unnecessary to perform this same check over and over; whatever hack needs this should be able to perform the getenv() only once and cache the result for future use.
Posted by jacobgorm. Last updated.
Post not yet marked as solved
6 Replies
859 Views
I only get this error when using the JAX Metal device (the CPU is fine). It seems to be a problem whenever I want to modify values of an array in place using .at and .set.
note: see current operation: %2903 = "mhlo.scatter"(%arg3, %2902, %2893) ({ ^bb0(%arg4: tensor<f32>, %arg5: tensor<f32>): "mhlo.return"(%arg5) : (tensor<f32>) -> () }) {indices_are_sorted = true, scatter_dimension_numbers = #mhlo.scatter<update_window_dims = [0, 1], inserted_window_dims = [1], scatter_dims_to_operand_dims = [1]>, unique_indices = true} : (tensor<10x100x4xf32>, tensor<1xsi32>, tensor<10x4xf32>) -> tensor<10x100x4xf32>
blocks = blocks.at[i].set( ...
Posted by Cemlyn. Last updated.
Post not yet marked as solved
2 Replies
468 Views
Device: iPhone 11. OS: iOS 15.6. I have a Metal application on iOS where a series of compute shaders are encoded, then dispatched and all committed together at the end. When I capture a GPU trace of my application, however, I notice gaps between each compute shader invocation, and these gaps seem to take up a big part of the GPU time. I'm wondering what these gaps are and what is causing them. Since all compute dispatch commands are committed together at once, these gaps shouldn't be synchronization between the CPU and GPU. PS: In my application, later compute commands mostly depend on earlier ones and use the result buffers from earlier invocations. But as shown in the picture, the bandwidth and read/write buffer limiters are not high as far as I can tell.
Posted. Last updated.
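One thing worth checking for the question above, sketched under an assumption: gaps like these often appear when each dispatch lives in its own compute encoder (or its own command buffer), because ending an encoder forces the GPU to drain before the next pass starts. If that is the case here, encoding the dependent dispatches into a single MTLComputeCommandEncoder and expressing the buffer dependencies with memoryBarrier(scope:) may close the gaps. The pipeline and buffer names below are illustrative.

import Metal

// Sketch: keep dependent dispatches in one encoder and use memory barriers
// between them, rather than one encoder per pass.
func encodePasses(commandBuffer: MTLCommandBuffer,
                  pipelines: [MTLComputePipelineState],
                  ioBuffer: MTLBuffer,
                  threadgroups: MTLSize,
                  threadsPerThreadgroup: MTLSize) {
    guard let encoder = commandBuffer.makeComputeCommandEncoder() else { return }
    for (index, pipeline) in pipelines.enumerated() {
        encoder.setComputePipelineState(pipeline)
        encoder.setBuffer(ioBuffer, offset: 0, index: 0)
        encoder.dispatchThreadgroups(threadgroups,
                                     threadsPerThreadgroup: threadsPerThreadgroup)
        // Make this pass's buffer writes visible to the next pass without
        // ending the encoder (ending it is what introduces an idle gap).
        if index < pipelines.count - 1 {
            encoder.memoryBarrier(scope: .buffers)
        }
    }
    encoder.endEncoding()
    commandBuffer.commit() // commit everything together, as before
}

Whether this helps depends on what the trace actually shows; if each pass is already in a single encoder, the gaps are more likely plain serialization caused by the dependency chain itself.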
Post not yet marked as solved
2 Replies
695 Views
I'm experimenting with visionOS and Apple Vision Pro using the Xcode beta. I'm using Xcode 15.1 beta and visionOS 1.0 beta 4. I'm currently doing a project where I draw a polygon using a mesh generated from MeshDescriptor/MeshResource and present it in an ImmersiveView. I want to change the color of parts (i.e. not all) of my 3D-rendered polygon, and I want to do it dynamically, for example when the user presses a button. I have gotten into shaders and CustomMaterial from RealityKit, only to find out that CustomMaterial is not supported on visionOS! Does anyone know how I can color portions/parts of a mesh that is generated from MeshDescriptor and MeshResource?
Posted by tom_swift. Last updated.
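Since CustomMaterial is unavailable on visionOS, one workaround that may fit the question above is to split the generated mesh across material slots: MeshDescriptor can assign a material index per face, and the entity's materials array can then be swapped at runtime, for example from a button action. A sketch under that assumption follows; the vertex, index, and per-face index arrays are placeholders, not from the original post.

import RealityKit
import UIKit

// Sketch: build a mesh whose faces are split across two material slots,
// then recolor one region later by replacing the material in that slot.
func makeTwoToneEntity(positions: [SIMD3<Float>],
                       triangleIndices: [UInt32],
                       faceMaterialIndices: [UInt32]) throws -> ModelEntity {
    var descriptor = MeshDescriptor(name: "polygon")
    descriptor.positions = MeshBuffers.Positions(positions)
    descriptor.primitives = .triangles(triangleIndices)
    // One entry per triangle: 0 = base region, 1 = highlight region.
    descriptor.materials = .perFace(faceMaterialIndices)

    let mesh = try MeshResource.generate(from: [descriptor])
    let base = SimpleMaterial(color: .gray, isMetallic: false)
    let highlight = SimpleMaterial(color: .gray, isMetallic: false)
    return ModelEntity(mesh: mesh, materials: [base, highlight])
}

// Later, e.g. from a button action: recolor only the highlight region.
func setHighlightColor(_ entity: ModelEntity, color: UIColor) {
    guard var model = entity.model, model.materials.count > 1 else { return }
    model.materials[1] = SimpleMaterial(color: color, isMetallic: false)
    entity.model = model
}

If the coloring needs to vary per pixel rather than per face, ShaderGraphMaterial (authored in Reality Composer Pro) is the other visionOS-friendly route.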
Post not yet marked as solved
7 Replies
1.6k Views
Hello - I have been struggling to find a solution online and I hope you can help me in a timely manner. I have installed the latest tensorflow and tensorflow-metal, and I even went ahead and installed tensorflow-nightly. My app generates the following as a result of my fit function on a CNN model with 8 layers.
2023-09-29 22:21:06.115768: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M1 Pro
2023-09-29 22:21:06.115846: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 16.00 GB
2023-09-29 22:21:06.116048: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 5.33 GB
2023-09-29 22:21:06.116264: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-09-29 22:21:06.116483: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: )
Most importantly, the learning process is very slow and I'd like to take advantage of all the new features of the latest versions. What can I do?
Posted by erezkatz. Last updated.
Post not yet marked as solved
1 Reply
542 Views
Hello, I've been working on an app that involves training a neural network model on the iPhone. I've been using the Metal Performance Shaders Graph (MPS Graph) for this purpose. In the training process the loss becomes NaN on iOS 17 (21A329). I noticed that the official sample code for Training a Neural Network using MPS Graph (link) works perfectly fine on Xcode 14.3.1 with iOS 16.6.1. However, when I run the same code on Xcode 15.0 beta 8 with iOS 17.0 (21A329), the training process produces a NaN loss in the function updateProgressCubeAndLoss. The official sample code and my own app exhibit the same issue. Has anyone else experienced this issue? Is this a known bug, or is there something specific that needs to be adjusted for iOS 17? Any guidance would be greatly appreciated. Thank you!
Posted by shenjiaqi. Last updated.
Post not yet marked as solved
1 Reply
1.7k Views
I'm interested in using CatBoost and XGBoost for some machine learning projects on my Mac, and I was wondering if it's possible to run these algorithms on my GPU(s) to speed up training times. I have a Mac with an AMD Radeon Pro 5600M and an Intel UHD Graphics 630 GPU, and I'm running macOS Ventura 13.2.1. I've read that both CatBoost and XGBoost support GPU acceleration, but I'm not sure if this is possible on my system. Can anyone point me in the right direction for getting started with GPU-accelerated CatBoost/XGBoost on macOS? Are there any specific drivers or tools I need to install, or any other considerations I should be aware of? Thank you.
Posted. Last updated.
Post marked as solved
3 Replies
989 Views
Hello everyone! I have a small concern about one little thing when it comes to programming in Metal. There are some models that I wish to use along with animations and skins on them; the file extension for them is .gltf. glTF has been used in a number of projects such as Unity, Unreal Engine, Godot, and Blender. I was wondering whether Metal supports this file format or not. Does anyone here know the answer?
Posted. Last updated.
Post not yet marked as solved
0 Replies
413 Views
Hi, I am trying to extend the PyTorch library. I would like to add an MPS-native Cholesky decomposition. I finally got it working (mostly), but I am struggling to implement the status codes. What I did:
// init status
id<MTLBuffer> status = [device newBufferWithLength:sizeof(int) options:MTLResourceStorageModeShared];
if (status) {
    int* statusPtr = (int*)[status contents];
    *statusPtr = 42; // Set the initial content to 42
    NSLog(@"Status Value: %d", *statusPtr);
} else {
    NSLog(@"Failed to allocate status buffer");
}
...
[commandBuffer addCompletedHandler:^(id<MTLCommandBuffer> commandBuffer) {
    // Your completion code here
    int* statusPtr = (int*)[status contents];
    int statusVal = *statusPtr;
    NSLog(@"Status Value: %d", statusVal);
    // Update the 'info' tensor here based on statusVal
    // ...
}];
for (const auto i : c10::irange(batchSize)) {
    ...
    [filter encodeToCommandBuffer:commandBuffer sourceMatrix:sourceMatrix resultMatrix:solutionMatrix status:status];
}
(full code here: https://github.com/pytorch/pytorch/blob/ab6a550f35be0fdbb58b06ff8bfda1ab0cc236d0/aten/src/ATen/native/mps/operations/LinearAlgebra.mm)
But this code prints the following when given a non-positive-definite tensor:
2023-09-02 19:06:24.167 python[11777:2982717] Status Value: 42
2023-09-02 19:06:24.182 python[11777:2982778] Status Value: 0
initial tensor:
tensor([[-0.0516, 0.7090, 0.9474],
        [ 0.8520, 0.3647, -1.5575],
        [ 0.5346, -0.3149, 1.9950]], device='mps:0')
L:
tensor([[-0.0516, 0.0000, 0.0000],
        [ 0.8520, -0.3612, 0.0000],
        [ 0.5346, -0.3149, 1.2689]], device='mps:0')
What am I doing wrong? Why do I get a 0 (success) status even though the matrix is not positive definite? Thank you in advance!
Posted. Last updated.
Post marked as solved
1 Reply
423 Views
Hello. I'm working with Metal in Apple Vision Pro, and I've assumed that I can use Mesh shaders to work with Meshlets. But when creating the RenderPipeline, I get the following error message: "device does not support mesh shaders". The test is on the simulator, and my question is: Will Apple Vision Pro support Mesh shaders on physical devices? Thanks.
Posted. Last updated.
Post not yet marked as solved
2 Replies
730 Views
Hello, I am trying to use the GPU for a machine learning task on my Apple machine, using "mps" as the device, but it is not working. I am using the PyTorch stable version. How can I use my MacBook's GPU for machine learning tasks?
Posted by mishrr. Last updated.
Post not yet marked as solved
0 Replies
536 Views
With my MacBook M2, the code works correctly on both CPU and GPU, but it is much slower on the GPU! I have loaded my data and my model onto the GPU, and that seemed to work (screenshot: /Users/guoyijun/Desktop/iShot_2023-08-20_09.57.41.png). I printed my code's runtime: when the following "train" function is called, the loop inside it runs extraordinarily slowly.
def train(net, device, train_features, train_labels, test_features, test_labels,
          num_epochs, learning_rate, weight_decay, batch_size):
    train_ls, test_ls = [], []
    train_iter = d2l.load_array((train_features, train_labels), batch_size, device)
    # Adam
    optimizer = torch.optim.Adam(net.parameters(), lr = learning_rate, weight_decay = weight_decay)
    for epoch in range(num_epochs):
        for X, y in train_iter:
            optimizer.zero_grad()
            l = loss(net(X), y)
            l.backward()
            optimizer.step()
        # train_ls.append(log_rmse(net, train_features, train_labels))
    return train_ls, test_ls
Posted by GPIO. Last updated.
Post not yet marked as solved
0 Replies
537 Views
I am in the process of developing a matrix-vector multiplication kernel. While conducting performance evaluations, I've noticed that on M1/M1 Pro/M1 Max, the kernel demonstrates an impressive memory bandwidth utilization of around 90%. However, when executed on the M1 Ultra/M2 Ultra, this figure drops to approximately 65%. My suspicion is that this discrepancy is attributed to the dual-die architecture of the M1 Ultra/M2 Ultra. It's plausible that the necessary data might be stored within the L2 cache of the alternate die. Could you kindly provide any insights or recommendations for mitigating the occurrence of on-die L2 cache misses on the Ultra chips? Additionally, I would greatly appreciate any general advice aimed at enhancing memory load speeds on these particular chips.
Posted by lshzh. Last updated.