Segmenting a single hand from multiple available hands in hand pose estimation

Hi, I'm developing an app that needs to work in a crowd with multiple people (and therefore many hands in frame).

From what I understand, the Vision framework currently uses a "largest hand" heuristic to decide which hand counts as the detected hand. That won't work for my application, since the largest hand won't always be the one of interest. In fact, the hand of interest will be the one that is pointing.

I know how to train a model with Create ML to identify a hand that is pointing. Where I'm running into trouble is that there is no straightforward way to directly override the Vision framework's built-in largest-hand heuristic when you're relying solely on Swift and Create ML.
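For the per-hand check, I'm planning on the usual Create ML hand pose classifier setup, where each VNHumanHandPoseObservation is converted with keypointsMultiArray() and fed to the model. A minimal sketch (the PointingHandClassifier class name, its poses input, and the "pointing" label are placeholders for whatever the exported model actually uses):

```swift
import CoreML
import Vision

// Returns true if the Create ML classifier thinks this hand is pointing.
// PointingHandClassifier is a placeholder for the class Xcode generates from the
// exported .mlmodel; the `poses` input and `label` output follow the usual
// Create ML hand pose classifier convention and may differ in a real project.
func isPointing(_ hand: VNHumanHandPoseObservation,
                using classifier: PointingHandClassifier) -> Bool {
    guard let keypoints = try? hand.keypointsMultiArray(),
          let prediction = try? classifier.prediction(poses: keypoints) else {
        return false
    }
    return prediction.label == "pointing"
}
```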

I would like my pipeline to be:

  1. Request hand landmarks
  2. Process image
  3. The Create ML model reports which hand is pointing
  4. We use the pointing hand to collect position data for the index finger joints (sketched below)
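Roughly, here is what I mean, sketched on top of VNDetectHumanHandPoseRequest with the pointing check passed in as a predicate (e.g. the isPointing helper above); the hand count of 10 is just the upper bound from my use case:

```swift
import CoreVideo
import Vision

// Detect up to `maximumHandCount` hands, ask a caller-supplied check (e.g. the
// isPointing helper above) which one is pointing, and return that hand's index
// finger joints in normalized Vision coordinates (origin at the lower left).
func indexFingerOfPointingHand(
    in pixelBuffer: CVPixelBuffer,
    isPointing: (VNHumanHandPoseObservation) -> Bool
) throws -> [VNHumanHandPoseObservation.JointName: VNRecognizedPoint]? {
    let request = VNDetectHumanHandPoseRequest()
    request.maximumHandCount = 10   // illustrative upper bound from my use case

    let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:])
    try handler.perform([request])

    guard let hands = request.results,
          let pointingHand = hands.first(where: isPointing) else {
        return nil
    }

    // Step 4: position data for the index finger of the pointing hand only.
    return try pointingHand.recognizedPoints(.indexFinger)
}
```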

But within the Vision framework, if you set the number of hands to collect data for to 1, it will just choose the largest hand and report position data for that hand only. Of course, the easy workaround is to raise that limit to some number X, but on an iOS device this gets computationally intensive, since my app could be handling up to 10 hands at a time.
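The only mitigation I've sketched so far is to still request all the hands but run the classifier only on hands that pass a cheap geometric pre-check for an extended index finger; the thresholds here are purely illustrative:

```swift
import CoreGraphics
import Vision

// Cheap geometric pre-filter: treat a hand as a candidate only if its index
// finger looks extended, i.e. the tip is clearly farther from the wrist than
// the knuckle (MCP). The 0.3 confidence cutoff and the 1.7 squared-distance
// ratio are illustrative, not tuned values.
func hasExtendedIndexFinger(_ hand: VNHumanHandPoseObservation) -> Bool {
    guard let wrist = try? hand.recognizedPoint(.wrist),
          let knuckle = try? hand.recognizedPoint(.indexMCP),
          let tip = try? hand.recognizedPoint(.indexTip),
          min(wrist.confidence, knuckle.confidence, tip.confidence) > 0.3 else {
        return false
    }

    func squaredDistance(_ a: CGPoint, _ b: CGPoint) -> CGFloat {
        let dx = a.x - b.x
        let dy = a.y - b.y
        return dx * dx + dy * dy
    }

    return squaredDistance(tip.location, wrist.location)
        > 1.7 * squaredDistance(knuckle.location, wrist.location)
}

// Usage: only the survivors get the (more expensive) classifier call.
// let candidates = hands.filter(hasExtendedIndexFinger)
```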

Has anyone come up with a simpler solution to this problem, or is anyone aware of something within visionOS that can do this?

Replies

Day 1 of bumping until an Apple engineer sees this :(