MediaPipe for Dummies

A wide range of potential Machine Learning applications today rely on several fundamental baseline Machine Learning tasks. For example, both gestural navigation and sign language detectors rely on the ability of a program to identify and track human hands. Given that building something like a hand tracking model is time-consuming and resource-intensive, a developmental bottleneck exists in the creation of all applications that rely on hand tracking. To address this problem, Google invented MediaPipe.

Examples of Artificial Intelligence applications - A Disney-character selfie filter (left, source) and vehicular object detection (right, source)

MediaPipe provides cornerstone Machine Learning models for common tasks like hand tracking, therefore removing the same developmental bottleneck that exists for a host of Machine Learning applications. These models, along with their excessively easy-to-use APIs, in turn streamline the development process and reduce project lifetime for many applications that rely on Computer Vision.

In this tutorial, we'll learn how to use some of MediaPipe's Python APIs to accomplish foundational Computer Vision tasks in just a few lines of code, including facial tracking and pose extraction. We'll even see how this data can be used to power tools like facial motion capture in Blender! Let's dive in.

0:00

/0:13

Facial motion capture data in the Blender 3D animation software

Introduction

MediaPipe is a project by Google that offers "open-source, cross-platform, customizable ML solutions for live and streaming media". In other words, MediaPipe provides access to a wide variety of powerful Machine Learning models built with the hardware limitations of mobile devices in mind. These models include:

Anatomical Models
- Hand Tracking
- Pose Tracking
- Facial Mesh Tracking
- Holstic Tracking (above 3 combined)
Segmentation Models
- Seflie Segmentation
- Hair Segmentation
Object Models
- 2D Object Detection/Tracking
- 3D Object Detection and Pose Estimation

Let's take a look at how to employ some of these APIs in Python now!

Getting Started

Before we can get to working with MediaPipe itself, we'll need to perform some preliminary installations / imports. If you just want to follow along without having to do this, you can instead open up this Colab notebook and move on to the next section.

Installations

MediaPipe is available for C++, Android, and more; but, in this tutorial, we will be working only with Python. To install MediaPipe for Python, simply pip install it to your desired environment:

pip install mediapipe

If you wish to follow along with this tutorial exactly, you can skip the above step of installing MediaPipe directly and instead pip install from the requirements file in the associated GitHub. To do this, navigate into the directory in which you would like to clone the project folder and execute the following commands in a command prompt/terminal:

Windows

git clone https://github.com/AssemblyAI-Examples/mediapipe-python.git
cd mediapipe-python
python3 -m venv mp_env
mp_env\Scripts\activate.bat
pip install -r requirements.txt

UNIX-like systems

git clone https://github.com/AssemblyAI-Examples/mediapipe-python.git
cd mediapipe-python
python3 -m venv mp_env
source mp_env/bin/activate
pip install -r requirements.txt

Then, open up the notebook.ipynb Jupyter notebook to follow along with the code, (note that the mp_env virtual environment will need to be added to Jupyter as a kernel).

Imports

Next, we finish the setup with some Python imports and variable declarations:

import cv2
import mediapipe as mp
import urllib.request
import numpy as np
import pickle
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib import animation
import PyQt5
from PIL import Image
from IPython.display import Video
import nb_helpers

mp_drawing = mp.solutions.drawing_utils
mp_drawing_styles = mp.solutions.drawing_styles
mp_holistic = mp.solutions.holistic
mp_pose = mp.solutions.pose
mp_face_mesh = mp.solutions.face_mesh

Face Mesh and Motion Capture with MediaPipe

To begin working with MediaPipe, we'll first explore its Face Mesh API and see how it is used as a backend to power applications like facial motion capture in Blender.

Image Fetching

First, fetch the image with the following command:

face_url = "https://1vw4gb3u6ymm1ev2sp2nlcxf-wpengine.netdna-ssl.com/wp-content/uploads/shutterstock_149962697-946x658.jpg"
urllib.request.urlretrieve(face_url, "face_image.jpg")

img = Image.open('face_image.jpg')
display(img)

This will download the below image as face_image.jpg

Let's see how we can use MediaPipe to analyze this image!

Face Mesh Processing

We'll use MediaPipe's FaceMesh object to extract a 3D mesh corresponding to the surface of the face in the image.

# Define image filename and drawing specifications
file = 'face_image.jpg'
drawing_spec = mp_drawing.DrawingSpec(thickness=1, circle_radius=1)

# Create a face mesh object
with mp_face_mesh.FaceMesh(
        static_image_mode=True,
        max_num_faces=1,
        refine_landmarks=True,
        min_detection_confidence=0.5) as face_mesh:

    # Read image file with cv2 and process with face_mesh
    image = cv2.imread(file)
    results = face_mesh.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

# Define boolean corresponding to whether or not a face was detected in the image
face_found = bool(results.multi_face_landmarks)

Additional Details

The arguments of FaceMesh provide additional information that will be used when processing our image.

static_mode=True specifies that we will be processing a static image with this object (in contrast to a video stream).
max_num_faces=1 specifies that the FaceMesh will detect at most one face.
refine_landmarks=True applies an attention mesh model to refine landmark coordinates around eyes and lips, and output additional landmarks around the irises.
Finally, min_detection_confidence=0.5 set a lower bound for the detection confidence required to return an identified face. This value corresponds to a probability and its value therefore must lie in [0.0, 1.0].

Reading in the image with OpenCV (cv2) yields BGR color channels, so the image is first converted to RGB before being processed by face_mesh.process()

Note that the below line is where the magic happens:

results = face_mesh.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

All processing is performed with one simple line. After processing, we have highly accurate results that did not require any data gathering, data preparation, model definition, model tuning, model testing, etc. This is the value of MediaPipe - it provides easy access to several very powerful Machine Learning models that produce easy-to-parse results. Let's take a look at what we can do with these results now.

Drawing Face Mesh Tessellation

We can use MediaPipe's drawing utilities to visualize our results on top of our images.

if face_found:
    # Create a copy of the image
    annotated_image = image.copy()
    
    # Draw landmarks on face
    mp_drawing.draw_landmarks(
        image=annotated_image,
        landmark_list=results.multi_face_landmarks[0],
        connections=mp_face_mesh.FACEMESH_TESSELATION,
        landmark_drawing_spec=None,
        connection_drawing_spec=mp_drawing_styles
            .get_default_face_mesh_tesselation_style())
        
    # Save image
    cv2.imwrite('face_tesselation_only.png', annotated_image)

# Open image
img = Image.open('face_tesselation_only.png')
display(img)

Additional Details

The arguments of draw_landmarks() provide additional information that is needed to display our tesselation.

image=annotated_image specifies the image that we would like to map the tesselation onto.
landmark_list=results.multi_face_landmarks[0] specifies the location data for the landmarks that will be drawn onto the image. results.multi_face_landmarks is a list containing MediaPipe NormalizedLandmarkLists - one for each face. Since out image has only one face, we access the sole NormalizedLandmarkList element of results.multi_face_landmarks by specifying it's index ([0]), and ultimately set this list as the landmark_list parameter.
NormalizedLandmarkLists are ordered lists, and the mp_face_mesh.FACEMESH_TESSELATION object is a frozen set defining how the points in these lists connect. We specify these connections with connections=mp_face_mesh.FACEMESH_TESSELATION.
landmark_drawing_spec=drawing_spec uses the drawing specification we set above to draw a small green circle at each landmark.
Finally, connection_drawing_spec=mp_drawing_styles .get_default_face_mesh_tesselation_style() uses the default drawing specification for the connections between the landmarks.

Face image with MediaPipe Face Mesh drawn on top

Drawing Face Mesh Contours and Irises

Our results object does not contain only the facial tesselation information. The contours and irises of the face are also identified and can be drawn on separately. Let's drawn both on the original image now:

if face_found:
    # Create a copy of the image
    annotated_image = image.copy()
    
    # For each face in the image (only one in this case)
    for face_landmarks in results.multi_face_landmarks:
        
        # Draw the facial contours of the face onto the image
        mp_drawing.draw_landmarks(
            image=annotated_image,
            landmark_list=face_landmarks,
            connections=mp_face_mesh.FACEMESH_CONTOURS,
            landmark_drawing_spec=None,
            connection_drawing_spec=mp_drawing_styles
                .get_default_face_mesh_contours_style())
        
        # Draw the iris location boxes of the face onto the image
        mp_drawing.draw_landmarks(
            image=annotated_image,
            landmark_list=face_landmarks,
            connections=mp_face_mesh.FACEMESH_IRISES,
            landmark_drawing_spec=None,
            connection_drawing_spec=mp_drawing_styles
                .get_default_face_mesh_iris_connections_style())

	# Save the image
    cv2.imwrite('face_contours_and_irises.png', annotated_image)

The details of this code are very similar to the tessellation code above, so see the "Additional Details" dropdown in the section above to learn more.

Face Mesh Motion Capture with Blender

Simply drawing our results onto the image is all well and good, but what if we want to actually do something with this data. In addition to processing images, MediaPipe can also process videos to acquire time-sequential XYZ spatial facial mesh data. We can then export this data, effectively can using MediaPipe as a "backend" to compute motion capture data. BlendArMocap, an open-source project for the Blender 3D graphics and animation software, does exactly this.

Here are several views of 3D facial mesh data imported into Blender using BlendArMocap:

0:00

/0:13

Facial tracking data imported into Blender

This data can be used to drive a "rig", which defines (in this case) a humanoid anatomy. From here, a character model can be mapped onto the rig in order to animate a character model in 3D.

0:00

/0:19

Blender rig driver by MediaPipe facial tracking data

It is difficult to overstate how impressive this process is - a sequence of 2D image data, which is simply many arrays of RGB values, is processed and a human face is automatically recognized. Salient features of the face are then identified and tracked through time, extracting 3D data at each frame. The data is then used to animate character movements for potentially photo-realistic rendering that can incorporate ray-tracking, VFX, etc.

This entire process is powered by MediaPipe and OpenCV.

While using open-source tools like BlendArMocap is great, some users may want to process videos and work with raw data for their own purposes. Let's check out how to do this now:

Pose Detection and Segmentation with MediaPipe

MediaPipe's Pose Detection is similar to Face Mesh except for the fact that it, unsurprisingly, identifies human poses rather than faces. Let's explore this capability now.

Pose Processing and Drawing

First, as before we download the image we will be using for processing:

img_url = "https://static.turbosquid.com/Preview/2015/11/10__11_56_36/anthonystanding23dmetry3dhuman01.jpg5e774d4d-9b9e-456d-9d7b-fc4d741cf940Large.jpg"
urllib.request.urlretrieve(img_url, "pose.jpg")

img = Image.open('pose.jpg')
display(img)

We can again process this image with a MediaPipe object, this time using a Pose instance, and draw the detected pose on the image. The details of this step are effectively isomorphic to the Face Mesh case, so see the above section for a more in-depth explanation.

# Specify the image filename
file = 'pose.jpg'

# Create a MediaPipe `Pose` object
with mp_pose.Pose(static_image_mode=True, 
		  model_complexity=2,
                  enable_segmentation=True) as pose:
        
    # Read the file in and get dims
    image = cv2.imread(file)

    # Convert the BGR image to RGB and then process with the `Pose` object.
    results = pose.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

# Copy the iamge
annotated_image = image.copy()

# Draw pose, left and right hands, and face landmarks on the image with drawing specification defaults.
mp_drawing.draw_landmarks(annotated_image, 
                          results.pose_landmarks, 
                          mp_pose.POSE_CONNECTIONS,
                          landmark_drawing_spec=mp_drawing_styles.get_default_pose_landmarks_style())

# Save image with drawing
filename = "pose_wireframe.png"
cv2.imwrite(filename, annotated_image)

# Open image
display(Image.open(filename))

Image Segmentation

Included in our results is the information required to segment our image. Each pixel of the image is given a probability of being within the object whose pose has been identified. We simply need to set a probability threshold and isolate all pixels that are above this threshold to isolate the human. Let's do this now - we'll replace the background with a green screen.

# Copy the image
segmented_image = image.copy()

# Probability threshold in [0, 1] that says how "tight" to make the segmentation. Greater value => tighter.
tightness = .3

# Stack the segmentation mask for 3 RGB channels, and then create a filter for which pixels to keep
condition = np.stack((results.segmentation_mask,) * 3, axis=-1) > tightness

# Creates a black background image
bg_image = np.zeros(image.shape, dtype=np.uint8)

# Can change the color of this background by specifying (0-255) RGB values. We choose green-screen green.
bg_image[:] = [4, 244, 4]

# For every pixel location, display the corresponding pixel from the original imgae if the condition in our filter is met (i.e. the probability of being part of the object is above the inclusiogn threshold), or else display corresponding pixel from the background array (i.e. green)
segmented_image = np.where(condition, segmented_image, bg_image)

filename = "pose_green_screen.png"
cv2.imwrite(filename, segmented_image)
display(Image.open(filename))

Exporting Raw Data from MediaPipe

Exporting the raw XYZ data from MediaPipe is very easy, but we'll need some additional information to plot the data in a comprehensible way.

In particular, it is important to understand that MediaPipe stores landmarks in a consistently ordered way. That is, mp.solutions.pose.PoseLandmark specifies which XYZ coordinates in results.pose_landmarks (the output after processing an image) correspond to which body landmark (nose, right elbow, etc.). Note that the order of the body landmarks in this list is arbitrary but consistent within MediaPipe from run-to-run. You can access a more readable version of the ordered list of anatomical landmarks with nb_helpers.poselandmarks_list.

MediaPipe's POSE_CONNECTIONS object is a frozen set that contains tuples defining how these landmarks map to each other. For example, the tuple (0, 4) is in POSE_CONNECTIONS, indicating that the zeroth and fourth spatial points in results.pose_landmarks connect with each other (corresponding to "NOSE" and the "LEFT_OUTER_EYE" in mp.solutions.pose.PoseLandmark - the zeroth and fourth body landmarks respectively). Let's print out a few of these connections to make this idea concrete:

poselandmarks_list = nb_helpers.poselandmarks_list

num = 0
for i in mp_holistic.POSE_CONNECTIONS:
    if num < 5:
        print(poselandmarks_list[i[0]], '-->', poselandmarks_list[i[1]])
    else:
        break
    num += 1

LEFT_WRIST --> LEFT_THUMB
RIGHT_WRIST --> RIGHT_INDEX
RIGHT_PINKY --> RIGHT_INDEX
LEFT_EYE_OUTER --> LEFT_EAR
RIGHT_ELBOW --> RIGHT_WRIST

This point having been understood, we are ready to handle the raw MediaPipe spatial data. Let's save the above pose data as a NumPy array and then display it

# Create a 3x33 array to store XYZ data for 33 landmarks
data = np.empty((3, len(mp_holistic.PoseLandmark)))

# Store the XYZ data for each landmark
landmarks = results.pose_world_landmarks.landmark
for i in range(len(mp_holistic.PoseLandmark)):
    data[:, i] = (landmarks[i].x, landmarks[i].y, landmarks[i].z)   

# Plot the data
fig = plt.figure()
fig.set_size_inches(5, 5, True)
ax = fig.add_subplot(projection='3d')

nb_helpers.plot_data(data, ax)
nb_helpers.scale_axes(ax)

# Save a rotation animation of the data
filename = 'pose_rotation.mp4'
nb_helpers.rotate_and_save(fig, ax, filename, save=True)

Additional Details

The nb_helpers.plot_data() function simply plots each of the landmark points and then iterates throught POSE_CONNECTIONS to draw lines between the related points. The data is also rotated to be better oriented for plotting with PyPlot.
The nb_helpers.scale_axes() function scales the axes to plot the data with equal unit length along different axes.
Finally, the nb_helpers.rotate_and_save() function creates a rotation animation and saves it as an .mp4 so that we can observe our data.

All of these functions are available in the nb_helpers.py file on GitHub, but the specifics of their implementations are not within the purview of this article and so details have been omitted.

0:00

/0:12

We see that we have successfully plotted the raw MediaPipe spatial data and created an animation to observe it. You can save the data NumPy array in whatever export format is required by your downstream task.

Working with Video Data in MediaPipe

The last aspect of MediaPipe that we will look at in-depth is working with video data. We will work with saved video files in particular, but the same principles apply to real-time camera data.

We will use MediaPipe to extract time-sequential poses of a man walking:

# Download the necessary video
url = 'https://github.com/AssemblyAI-Examples/mediapipe-python/blob/main/walking.mp4?raw=true'
urllib.request.urlretrieve(url, 'walking.mp4') 

# Specify the video filename and create a `Pose` object as before
file = 'walking.mp4'
with mp_pose.Pose(min_detection_confidence=0.5, min_tracking_confidence=0.5) as pose:

	# Create VideoCapture object
    cap = cv2.VideoCapture(file)

	# Raise error if file cannot be opened
    if cap.isOpened() == False:
        print("Error opening video stream or file")
        raise TypeError

	# Get the number of frames in the video
    length = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    
    # Create a NumPy array to store the pose data as before
    # The shape is 3x33x144 - 3D XYZ data for 33 landmarks across 144 frames
    data = np.empty((3, len(poselandmarks_list), length))    
    
	# For each image in the video, extract the spatial pose data and save it in the appropriate spot in the `data` array 
    frame_num = 0
    while cap.isOpened():
        ret, image = cap.read()
        if not ret:
            break

        image = cv2.cvtColor(cv2.flip(image, 1), cv2.COLOR_BGR2RGB)
        results = pose.process(image)
        
        landmarks = results.pose_world_landmarks.landmark
        for i in range(len(mp_pose.PoseLandmark)):
            data[:, i, frame_num] = (landmarks[i].x, landmarks[i].y, landmarks[i].z)  
        
        frame_num += 1
    
    # Close the video file
    cap.release()

The shape of the data is (3, 33, 144) - 3-dimensional XYZ data for 33 anatomical locations across 144 frames. Again, we plot the data and animate it, except in this instance we animate over time with another helper function. We also plot the original video. Note that the plotted data looks mirrored - this is because the pose detection works on images that were captured by a video frame,

fig = plt.figure()
fig.set_size_inches(5, 5, True)
ax = fig.add_subplot(projection='3d')

anim = nb_helpers.time_animate(data, fig, ax)

# Save
anim.save('walking_wireframe.mp4', fps=30, extra_args=['-vcodec', 'libx264'], dpi=300)

0:00

/0:04

Non-Python MediaPipe Solutions

As mentioned previously, MediaPipe has a host of other solutions with varying availability across Android, iOS, C++, JavaScript, and Coral.

Iris

MediaPipe Iris is a solution for accurate iris estimation, and is able to track iris, pupil, and eye contour landmarks with just a camera. This task is generally difficult due to variable light conditions and occlusions, especially in a limited-compute environment like a smartphone. MediaPipe Iris offers a solution that overcomes these hurdles.

Hair Segmentation

MediaPipe Hair Segmentation is just like Selfie Segmentation, except it segments only a person's hair. Such a solution can be used to power augmented reality applications such as SnapChat Lens.

Object Detection

MediaPipe Object Detection is similar to Objectron, except for the face that it does not do pose estimation and instead identifies only bounding boxes. This solution is more lightweight and is therefore preferred for applications where object pose is not required.

Box Tracking

MediaPipe Box Tracking is similar to object detection, except it provides much more robust bounding box capabilities. Box Tracking records bounding boxes with time stamps.

Instant Motion Tracking

MediaPipe Instant Motion Tracking is valuable for augmented reality applications, and the uses for low-cost VFX are immediately obvious. Motion Tracking is actually built on top of Box Tracking.

KNIFT

MediaPipe KNIFT, or Keypoint Neural Invariant Feature Transform, is a template-based solution for feature matching. Being able to understand spatial relationships from different perspectives is a difficult problem and lies at the foundation of a wide variety of applications. KNIFT is a general-purpose solution for this problem, and is robust to changes in illumination, orientation, and scaling.

AutoFlip

AutoFlip is a simple solution that allows for the smart cropping of videos to arbitrary aspect ratios.

Final Words

MediaPipe is a very powerful tool that can facilitate the development of apps built on a number of Machine Learning capabilities. MediaPipe's exceedingly simple API provides access to several State-of-the-Art (SOTA) models built specifically with the hardware limitations of mobile computing in mind.

While MediaPipe provides many great solutions for Computer Vision and Visual Intelligence, there are many applications today that rely instead on Audio Intelligence. AssemblyAI's easy-to-use APIs provide access to a range of SOTA models for Audio Intelligence, just as MediaPipe provides access to SOTA Visual Intelligence models. Grab an API token below for free access to a host of Audio Intelligence solutions for Sentiment Analysis, Entity Detection, PII Redaction, and more.

Get a Free API Token

MediaPipe for Dummies

Introduction

Getting Started

Installations

Imports

Face Mesh and Motion Capture with MediaPipe

Image Fetching

Face Mesh Processing

Drawing Face Mesh Tessellation

Drawing Face Mesh Contours and Irises

Face Mesh Motion Capture with Blender

Pose Detection and Segmentation with MediaPipe

Pose Processing and Drawing

Image Segmentation

Exporting Raw Data from MediaPipe

Working with Video Data in MediaPipe

More MediaPipe Solutions for Python

Face Detection

Hand Tracking

Holistic Tracking

Selfie Segmentation

Objectron

Non-Python MediaPipe Solutions

Iris

Hair Segmentation

Object Detection

Box Tracking

Instant Motion Tracking

KNIFT

AutoFlip

Final Words

Popular posts

AI trends in 2024: Graph Neural Networks

AI for Universal Audio Understanding: Qwen-Audio Explained

Combining Speech Recognition and Diarization in one model

How DALL-E 2 Actually Works