
Real-time 3D hand pose estimation using MediaPipe

Most machine learning research on human body keypoint estimation deals with 2D coordinate estimation. This is because the mathematics requires at least two cameras viewing the same point to triangulate a 3D coordinate. Additionally, the cameras must be calibrated, with their relative orientations and positions known, which is a rather stringent requirement. On the other hand, if we can detect the 2D coordinates of a point from two camera views, we can simply triangulate the 3D coordinate using the camera calibration parameters. In this post, I show how to do this using MediaPipe’s hand keypoints detector. The code for this demo is uploaded to my repository: click here. I assume that you already have two calibrated cameras looking at the same scene and that their projection matrices are already known.
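For reference, a projection matrix combines a camera’s intrinsic matrix K with its rotation R and translation t relative to a common world frame, P = K [R | t]. Below is a minimal sketch of how the matrices P1 and P2 used later could be formed, assuming your calibration results are available as numpy arrays (the values here are only placeholders):

import numpy as np

# intrinsics K (3x3), rotation R (3x3) and translation t (3x1) come from
# your own stereo calibration; the values below are placeholders
K = np.array([[600., 0., 320.],
              [0., 600., 240.],
              [0.,   0.,   1.]])
R = np.eye(3)
t = np.zeros((3, 1))

# projection matrix P = K [R | t], shape (3, 4)
P = K @ np.hstack([R, t])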

Figure 1. Multi-view inputs.

Figure 2. 3D triangulated keypoints.

MediaPipe is Google’s machine learning solutions toolbox that provides various detectors and keypoint estimators. Check their official repository here: link. We will be using the MediaPipe hand keypoints estimator (paper). The hands module uses the keypoint estimates from the previous frame to crop out the hand region in the next frame. This means that we have to create two hand keypoint estimators, one for each of our cameras.

import mediapipe as mp

mp_drawing = mp.solutions.drawing_utils
mp_hands = mp.solutions.hands

# one estimator per camera, each tracking at most one hand
hands0 = mp_hands.Hands(min_detection_confidence=0.5, max_num_hands=1, min_tracking_confidence=0.5)
hands1 = mp_hands.Hands(min_detection_confidence=0.5, max_num_hands=1, min_tracking_confidence=0.5)
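To feed the two estimators we need a pair of frames, one from each camera. Below is a rough sketch of how these could be grabbed with OpenCV; the device indices 0 and 1 are assumptions that depend on your setup. Note that OpenCV returns BGR frames while MediaPipe expects RGB, so we convert before running the estimators.

import cv2

cap0 = cv2.VideoCapture(0)  # first camera, device index assumed
cap1 = cv2.VideoCapture(1)  # second camera, device index assumed

ret0, frame0 = cap0.read()
ret1, frame1 = cap1.read()

# MediaPipe expects RGB input, OpenCV delivers BGR
frame0 = cv2.cvtColor(frame0, cv2.COLOR_BGR2RGB)
frame1 = cv2.cvtColor(frame1, cv2.COLOR_BGR2RGB)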

Keypoint detection is then a single call to each estimator with its RGB input frame.

results0 = hands0.process(frame0)
results1 = hands1.process(frame1)
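The landmarks MediaPipe returns are normalized to the [0, 1] range, so to get the pixel coordinates that the triangulation below expects we scale them by the frame width and height. A minimal sketch, assuming a hand was detected in both views (the names frame0_keypoints and frame1_keypoints match the loop further down):

h, w = frame0.shape[:2]

frame0_keypoints = []
for hand_landmarks in results0.multi_hand_landmarks:
    for lm in hand_landmarks.landmark:
        # convert normalized coordinates to pixel coordinates
        frame0_keypoints.append([int(round(lm.x * w)), int(round(lm.y * h))])

# frame1_keypoints is built in the same way from results1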

The output of MediaPipe follows the same keypoint numbering scheme that OpenPose uses:

Figure 3. Keypoint numbering scheme. Source: https://cmu-perceptual-computing-lab.github.io/openpose/web/html/doc/md_doc_02_output.html

Once the keypoints are obtained from both views, we triangulate each pair with the direct linear transform (DLT), solved through singular value decomposition (SVD). If you want to know how this works, check my blog post here: link.

import numpy as np
from scipy import linalg

def DLT(P1, P2, point1, point2):
    # build the 4x4 DLT system from the two projection matrices and the
    # pixel coordinates of the same point seen by each camera
    A = [point1[1]*P1[2,:] - P1[1,:],
         P1[0,:] - point1[0]*P1[2,:],
         point2[1]*P2[2,:] - P2[1,:],
         P2[0,:] - point2[0]*P2[2,:]
        ]
    A = np.array(A).reshape((4, 4))

    # solve A x = 0 via SVD of A^T A; the solution is the singular vector
    # associated with the smallest singular value
    B = A.transpose() @ A
    U, s, Vh = linalg.svd(B, full_matrices=False)

    # dehomogenize: divide by the last (homogeneous) coordinate
    return Vh[3, 0:3] / Vh[3, 3]

frame_p3ds = []
for uv1, uv2 in zip(frame0_keypoints, frame1_keypoints):
    #uv1, uv2 are the pixel coordinates of the same keypoint in camera1 and camera2
    #P1, P2 are the projection matrices of camera1 and camera2
    p3d = DLT(P1, P2, uv1, uv2)
    frame_p3ds.append(p3d)
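To sanity-check the result (as in Figure 2), the triangulated keypoints can be scattered in 3D. A quick sketch with matplotlib, assuming frame_p3ds holds the 21 triangulated keypoints collected in the loop above:

import numpy as np
import matplotlib.pyplot as plt

p3ds = np.array(frame_p3ds)  # shape (21, 3)

fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(p3ds[:, 0], p3ds[:, 1], p3ds[:, 2])
plt.show()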

That’s it. We obtain the 3D coordinate of each keypoint in p3d and collect them in frame_p3ds. The hand estimator runs on the CPU in real time, which allows us to obtain the 3D coordinates of the hand in real time as well. In my repository code there are additional checks to make sure the keypoints were actually found by the estimator, but the main logic is what we see above.
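For completeness, the kind of check I mean looks roughly like this (a sketch, not the exact repository code; extract_pixel_coordinates is a hypothetical helper standing in for the extraction shown earlier): if MediaPipe finds no hand in a frame, we fill that frame’s keypoints with dummy values so the keypoint indices of the two views stay aligned.

num_keypoints = 21  # MediaPipe reports 21 hand landmarks

if results0.multi_hand_landmarks:
    # a hand was found; extract pixel coordinates as shown earlier
    frame0_keypoints = extract_pixel_coordinates(results0, frame0)  # hypothetical helper
else:
    # no hand detected in this frame; fill with placeholders so indices stay aligned
    frame0_keypoints = [[-1, -1]] * num_keypoints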