
Real-time 3D body pose estimation using MediaPipe

In this post, I show how to obtain the 3D body pose from two calibrated cameras using MediaPipe’s pose keypoint detector. The code for this demo is uploaded to my repository: click here. I assume that you already have two calibrated cameras looking at the same scene and that their projection matrices are known. If you need a guide on how to calibrate stereo cameras, check my post here: click here.
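For completeness, here is roughly how the projection matrices used later for triangulation can be assembled from the calibration outputs. This is a minimal sketch, assuming the intrinsic matrices K1 and K2 and the stereo extrinsics (R, T) of the second camera relative to the first are already available from calibration; the variable names here are my own placeholders, not from the repository.

import numpy as np

#the first camera is taken as the world origin, so P1 = K1 [I | 0]
RT1 = np.concatenate([np.eye(3), np.zeros((3, 1))], axis=-1)
P1 = K1 @ RT1

#the second camera is placed by the stereo extrinsics, so P2 = K2 [R | T]
RT2 = np.concatenate([R, T.reshape(3, 1)], axis=-1)
P2 = K2 @ RT2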

Figure 1. Multi-view inputs.

Figure 2. 3D triangulated keypoints.

MediaPipe is Google’s toolbox of ready-made machine learning solutions for various detection and keypoint estimation tasks. Check the official repository here: link. We will be using the MediaPipe pose keypoint estimator (paper). The body pose module tracks the person from one frame to the next to obtain bounding box estimates, so each camera needs its own tracking state. This means we have to create two pose keypoint estimators, one for each camera.

import mediapipe as mp

mp_drawing = mp.solutions.drawing_utils
mp_drawing_styles = mp.solutions.drawing_styles
mp_pose = mp.solutions.pose


pose0 = mp_pose.Pose(min_detection_confidence=0.5, min_tracking_confidence=0.5)
pose1 = mp_pose.Pose(min_detection_confidence=0.5, min_tracking_confidence=0.5)

Keypoint detection is simple: we just call the estimator on the input frame.
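If you grab the frames with OpenCV, keep in mind that MediaPipe expects RGB images while cv2.VideoCapture returns BGR frames, so convert them first. A minimal sketch of the capture side, opening the streams once and converting each pair of frames before passing them to the estimators (the device indices are placeholders for your own camera setup):

import cv2

cap0 = cv2.VideoCapture(0)
cap1 = cv2.VideoCapture(1)

ret0, frame0 = cap0.read()
ret1, frame1 = cap1.read()

#mediapipe works on RGB images, while OpenCV captures BGR frames
frame0 = cv2.cvtColor(frame0, cv2.COLOR_BGR2RGB)
frame1 = cv2.cvtColor(frame1, cv2.COLOR_BGR2RGB)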

results0 = pose0.process(frame0)
results1 = pose1.process(frame1)

The MediaPipe pose output uses the following keypoint indices:

Figure 3. Keypoint numbering scheme. Source: https://google.github.io/mediapipe/images/mobile/pose_tracking_full_body_landmarks.png
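One detail to keep in mind: the landmark coordinates returned by MediaPipe are normalized to the [0, 1] range, so they have to be scaled back to pixel coordinates before triangulation. A minimal sketch for camera 0, assuming frame_width and frame_height hold that camera’s resolution (these variable names are placeholders; the same is done for camera 1 to fill frame1_keypoints):

frame0_keypoints = []
if results0.pose_landmarks:
    for landmark in results0.pose_landmarks.landmark:
        #scale the normalized landmark coordinates back to pixel coordinates
        pixel_x = landmark.x * frame_width
        pixel_y = landmark.y * frame_height
        frame0_keypoints.append([pixel_x, pixel_y])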

Once the keypoints are obtained from the estimator, we apply the direct linear transform (DLT) via singular value decomposition (SVD) to triangulate each keypoint in 3D. If you want to know how this works in detail, check my blog post here: link.
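Very briefly, each camera contributes two linear constraints on the homogeneous 3D point $X$, and stacking them gives a system $AX = 0$ that the code below solves in the least-squares sense. Writing $(u_i, v_i)$ for the pixel coordinates in camera $i$ and $\mathbf{p}^i_j$ for the $j$-th row of projection matrix $P_i$, the stacked system is

$$
A = \begin{bmatrix}
v_1\,\mathbf{p}^1_3 - \mathbf{p}^1_2 \\
\mathbf{p}^1_1 - u_1\,\mathbf{p}^1_3 \\
v_2\,\mathbf{p}^2_3 - \mathbf{p}^2_2 \\
\mathbf{p}^2_1 - u_2\,\mathbf{p}^2_3
\end{bmatrix}, \qquad AX = 0,
$$

and $X$ is taken as the singular vector of $A$ associated with the smallest singular value.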

import numpy as np
from scipy import linalg

def DLT(P1, P2, point1, point2):
    #each camera contributes two rows to the homogeneous system A X = 0
    A = [point1[1]*P1[2,:] - P1[1,:],
         P1[0,:] - point1[0]*P1[2,:],
         point2[1]*P2[2,:] - P2[1,:],
         P2[0,:] - point2[0]*P2[2,:]
        ]
    A = np.array(A).reshape((4,4))

    #solve A X = 0 in the least-squares sense via SVD of A^T A
    B = A.transpose() @ A
    U, s, Vh = linalg.svd(B, full_matrices = False)

    #the solution is the last row of Vh; dehomogenize to get the 3D point
    return Vh[3,0:3]/Vh[3,3]

p3ds = []
for uv1, uv2 in zip(frame0_keypoints, frame1_keypoints):
    #uv1 is the pixel coordinate of a keypoint in camera1
    #uv2 is the pixel coordinate of the same keypoint in camera2
    #P1 and P2 are the projection matrices of camera1 and camera2
    p3d = DLT(P1, P2, uv1, uv2)
    p3ds.append(p3d)

That’s it. Each call to DLT gives the 3D coordinate of one keypoint, and we collect them in p3ds. The pose estimator runs on the CPU in real time, so the 3D pose coordinates are also obtained in real time. My repository code adds a few extra checks to make sure the keypoints were actually detected by the estimator, but the core of the method is what you see above.
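For reference, those checks can be as simple as skipping a frame when either view has no detection, and dropping individual keypoints with low visibility scores. A minimal sketch of the idea (the 0.5 threshold and the -1 placeholder are my own choices for illustration, not necessarily what the repository uses):

#inside the per-frame loop: skip triangulation if either view has no detection
if results0.pose_landmarks is None or results1.pose_landmarks is None:
    continue

frame_p3ds = []
for lm0, lm1, uv1, uv2 in zip(results0.pose_landmarks.landmark,
                              results1.pose_landmarks.landmark,
                              frame0_keypoints, frame1_keypoints):
    #only triangulate keypoints that both detectors are reasonably confident about
    if lm0.visibility < 0.5 or lm1.visibility < 0.5:
        frame_p3ds.append([-1, -1, -1])
        continue
    frame_p3ds.append(DLT(P1, P2, uv1, uv2))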