The Internet, 2024/06/26

Introduction

Goal: crop a horizontal video into a vertical one, following a moving object.

To follow the subject we will use a tracker, and we will keep the subject in the middle of the frame.

To do that we will use Python, OpenCV (the cv2 module), and ffmpeg.
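
If you want to follow along, this is a rough setup sketch (the package names are the usual PyPI ones and are an assumption about your environment; opencv-contrib-python ships the trackers used below):

$ pip install opencv-contrib-python imutils numpy scipy

ffmpeg and yt-dlp are separate command-line tools and have to be installed on their own.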

Download and prepare a video (optional)

To start let’s say we want to take this video from YouTube:

Funny Cats Compilation

And in particular, we want to use only the first 15 seconds.

To download it we can use yt-dlp:

$ yt-dlp https://www.youtube.com/watch?v=XPQlMmMDm-A -f "bestvideo/best"  --merge-output-format webm -o "XPQlMmMDm-A.webm"

We are downloading the video in the best quality available, in the WebM format (which is in fact a container that usually wraps VP9 video).

We are not downloading the audio, as it is not relevant for this tutorial.

Cutting the video

We want to use the first 15 seconds. We can use ffmpeg to cut the video.

$ ffmpeg -i "XPQlMmMDm-A.webm" -ss 00:00:00 -to 00:00:15 "XPQlMmMDm-A-cut.mp4"

Here we are not just cutting the video: since we are not copying the streams, ffmpeg also re-encodes it with its default codec for the MP4 container (typically H.264).

If you want to avoid this re-encoding (which is not a big deal quality-wise, just time-consuming), you can use this command instead:

$ ffmpeg -i "XPQlMmMDm-A.webm" -ss 00:00:00 -to 00:00:15  -avoid_negative_ts 1 -c copy "XPQlMmMDm-A-cut.mp4"

This will cut the video without re-encoding it (-c copy), but since stream copy can only cut at keyframes, the actual cut points will be the keyframes closest to the requested times; -avoid_negative_ts 1 just fixes the timestamps so they don’t become negative after the cut. It’s faster but less precise.

Result

And this is the extracted video:

Coding the object tracker

We will use Python and OpenCV to track the object. I’m not an expert on trackers, and I’ve found them less smart than I expected, but they are still useful.

Choosing the tracker

OpenCV offers a variety of trackers; in this article you can find a comparison.

I’ve chosen TrackerCSRT because it’s the most accurate, especially if you know that the object always stays in the frame.
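
For reference, this is roughly how a few of the available trackers can be created in recent OpenCV builds (4.5+); the exact class names vary between versions, so treat this as a sketch rather than an exhaustive list:

import cv2

tracker = cv2.TrackerCSRT.create()   # accurate but slower, the one used in this article
# tracker = cv2.TrackerKCF.create()  # faster, usually less accurate
# tracker = cv2.TrackerMIL.create()  # older baseline, mostly useful for comparison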

Let’s start with the code

Initialization

This is not the full code; for the complete version you can check the GitHub repository.

def main():
    arguments = ProgramArguments()  # read terminal arguments
    vs = cv2.VideoCapture(arguments.file)  # read video file
    frame_0 = vs.read()[1]  # read first frame
    total_frames = int(vs.get(cv2.CAP_PROP_FRAME_COUNT))  # get total frames
    rectangles: dict[int, Rectangle] = {}  # dictionary mapping frame number -> tracked rectangle
    frame_height, frame_width = frame_0.shape[:2]  # this is the size of the original video
    final_width = int(frame_height / 16 * 9)  # this is the width of the final video
    final_height = frame_height  # We will keep the height of the final video the same as the original
    tracker = cv2.TrackerCSRT.create()  # initialize the tracker
    roi_found = False  # we will use this variable to check if the roi (region of interest) is found
    cur_frame_number = 0
    vs.set(cv2.CAP_PROP_POS_FRAMES, cur_frame_number)  # We reset the video to the first frame

Reading the frame and some embellishments

We resize the frame to make it faster to process, and we optionally convert it to grayscale. We also display the previous frame’s rectangle and the final crop, so that we can better understand what the tracker is doing.

    while cur_frame_number < total_frames:  # until we reach the end of the video
        frame = vs.read()[1]  # read the frame
        if frame is None:  # sometimes the frame is None before the end
            break
        # we reduce the size of the frame to make it faster to process
        resized_frame = imutils.resize(frame, width=int(frame_width / arguments.ratio))
        # sometimes it's better to work with gray images
        if arguments.gray:
            resized_frame = cv2.cvtColor(resized_frame, cv2.COLOR_BGR2GRAY)
        if cur_frame_number - 1 > 0 and rectangles.get(cur_frame_number - 1, None) is not None:
            # to better understand the tracker we display the tracked rectangle of the previous frame
            rectangle = rectangles[cur_frame_number - 1]
            resized_rectangle = rectangle.scale(1 / arguments.ratio)
            final_rectangle = resized_rectangle.to_final_frame_cut(final_width, final_height)
            # this is the previous roi
            cv2.rectangle(resized_frame, resized_rectangle.point1(), resized_rectangle.point2(), prev_frame_color, 2)
            # this is the final frame crop
            cv2.rectangle(resized_frame, final_rectangle.point1(), final_rectangle.point2(), final_crop_color, 10)

Selecting the ROI

The ROI (Region of Interest) is the rectangle that we want to track. We are selecting it manually (it could be automated, but normally only you know what you want to track).

        # ... in the same loop
        if not roi_found:  # the tracker has lost the roi, or it's the first frame
            roi = cv2.selectROI(arguments.file, resized_frame, fromCenter=False)  # we select the roi manually (SPACE/ENTER to confirm)
            rectangles[cur_frame_number] = Rectangle.from_roi(roi, cur_frame_number)  # we store the roi
            tracker.init(resized_frame, roi)  # we initialize the tracker
            roi_found = True
        else:
            (roi_found, box) = tracker.update(resized_frame)  # we update the tracker
            if roi_found:  # if the tracker has found the roi
                (x, y, w, h) = [int(v) for v in box]
                cv2.rectangle(resized_frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
                rectangles[cur_frame_number] = Rectangle(x, x + w, y, y + h, cur_frame_number)

Displaying the frame

We display each frame (with the frame number overlaid for debugging) and move on to the next one; the tracked rectangle has already been stored in the dictionary.

        # ... in the same loop
        cv2.putText(resized_frame, "Frame: {}".format(cur_frame_number), (10, 20),
            cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2) # we add some text to debug
        cv2.imshow(arguments.file, resized_frame) # we display the frame
        cv2.waitKey(1) & 0xFF # this is just to make the display work
        cur_frame_number += 1

Result up to now

We can now test the code. We will see a window where we can select the ROI, and then we will see the tracker in action.

$ python3 tracker.py -f XPQlMmMDm-A-cut.mp4
Partial result

The green rectangle is the tracked object, the red one represents the final crop. At this stage we are not cropping the video yet, we are just tracking the object. The result is not perfect, but let’s proceed with the actual cropping.

Cropping the video

To crop with ffmpeg we will create a filter script that, for each frame, swaps the rectangle we want to keep with the left part of the frame. The template of each line of the script is the following:

swaprect=%s:%s:%s:0:%s:%s:enable='between(n,%s,%s)'

Things get a bit tricky because swaprect doesn’t work if the two rectangles overlap, so we need to split them into smaller slices that don’t overlap.
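
To make this concrete, here is what a couple of generated lines could look like (the numbers are made up for illustration): for a 1920x1080 source and a 608x1080 vertical crop whose left edge sits at x=1000 (no overlap with the left column), each frame needs a single swap, and one final crop keeps only the left column:

swaprect=608:1080:0:0:1000:0:enable='between(n,120,120)',
swaprect=608:1080:0:0:1004:0:enable='between(n,121,121)',
crop=608:1080:0:0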

And this is the code that produces the right lines for each rectangle center:

def output(center_x, center_y, start, end, frame_width, frame_height, width, height):
    # x2, y2 is the top-left corner of the region we want to keep
    # (the second rectangle of swaprect; the first one is the left column at 0,0)
    x2 = int(center_x - width / 2)
    y2 = int(center_y - height / 2)
    if y2 < 0:  # we don't want to go out of the frame
        y2 = 0
    if y2 + height > frame_height:  # we don't want to go out of the frame
        y2 = frame_height - height
    if x2 <= 0:
        # a negative x is tricky, because swaprect can't work with negative values
        x2 = int(width / 2)
    if x2 < width:  # the region overlaps the left column, so we split it into non-overlapping slices
        output = ""
        i = 0
        last = 0
        while (i + 1) * x2 <= width + x2:  # just trust me on this, ok?
            output += "swaprect=%s:%s:%s:0:%s:%s:enable='between(n,%s,%s)',\n" % (
                x2, height, i * x2, (i + 1) * x2, y2, start, end)
            last = i * x2
            i += 1
        rest = width - last
        output += "swaprect=%s:%s:%s:0:%s:%s:enable='between(n,%s,%s)',\n" % (
            rest, height, i * x2, i * x2 + rest, y2, start, end)

        return output
    elif x2 + width > frame_width:
        x2 = frame_width - width
    return "swaprect=%s:%s:0:0:%s:%s:enable='between(n,%s,%s)',\n" % (width, height, x2, y2, start, end)

Smoothing the motion

The tracker moves a little on every frame, mostly with short, jittery movements. We will smooth the motion using a Gaussian filter.

import numpy as np
from scipy.ndimage import gaussian_filter1d


def smooth_centers(centers):
    # apply a gaussian filter to the x coordinates (we don't care about y)
    xs = np.array([c[0] for c in centers])
    smoothed_xs = gaussian_filter1d(xs, sigma=5)
    result = []
    for i in range(len(smoothed_xs)):
        # keep the original y and frame index, replace x with its smoothed value
        result.append((smoothed_xs[i], centers[i][1], centers[i][2]))
    return result
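
A minimal usage sketch (the centers list here is synthetic; in the real script it is built from the tracked rectangles, one (x, y, frame_index) tuple per frame):

# synthetic, jittery x positions, just to show the effect of the smoothing
centers = [(100 + 3 * i + (5 if i % 2 else -5), 540, i) for i in range(300)]
smoothed = smooth_centers(centers)  # x is smoothed, y and the frame index are kept as-is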

Generating the script

Now that we have all the pieces, we can generate the script that will crop the video. The first function is the output helper shown above (here with the coordinates cast to int); the second one walks the smoothed centers and writes the filter script.


def output(center_x, center_y, start, end, frame_width, frame_height, width, height):
    center_x = int(center_x)
    center_y = int(center_y)
    frame_width = int(frame_width)
    frame_height = int(frame_height)
    x2 = int(center_x - width / 2)
    y2 = int(center_y - height / 2)
    if y2 < 0:
        y2 = 0
    if y2 + height > frame_height:
        y2 = frame_height - height
    if x2 <= 0:
        x2 = int(width / 2)
    if x2 < width:
        output = ""
        i = 0
        last = 0
        while (i + 1) * x2 <= width + x2:
            output += "swaprect=%s:%s:%s:0:%s:%s:enable='between(n,%s,%s)',\n" % (
                x2, height, i * x2, (i + 1) * x2, y2, start, end)
            last = i * x2
            i += 1
        rest = width - last
        output += "swaprect=%s:%s:%s:0:%s:%s:enable='between(n,%s,%s)',\n" % (
            rest, height, i * x2, i * x2 + rest, y2, start, end)

        return output
    elif x2 + width > frame_width:
        x2 = frame_width - width
    return "swaprect=%s:%s:0:0:%s:%s:enable='between(n,%s,%s)',\n" % (width, height, x2, y2, start, end)


def write_to_file(centers, frame_width, frame_height, args: ProgramArguments):
    with open(args.output, "w") as file:
        total = len(centers)
        i = 0
        prev_i = 0
        height = frame_height  # the crop keeps the full height of the original video
        width = int(height * 9 / 16) + 1  # and has a 9:16 aspect ratio
        while i < total:
            if centers[i] is None:  # frames where the tracker lost the object
                i += 1
                continue
            center_x, center_y, frame_i = centers[i]
            try:
                prev_center_x, prev_center_y, prev_frame_i = centers[prev_i]
            except Exception:  # the previous anchor is missing, restart from the current frame
                prev_i = i
                prev_center_x, prev_center_y, prev_frame_i = centers[i]

            # only move the crop when the center has drifted more than args.delta pixels
            changed = abs(center_x - prev_center_x) > args.delta or abs(center_y - prev_center_y) > args.delta
            if changed:
                delta_frames = frame_i - prev_i
                center_step_x = (center_x - prev_center_x) / delta_frames
                center_step_y = (center_y - prev_center_y) / delta_frames
                # interpolate the center linearly between the two anchor frames,
                # writing one swaprect block per frame
                for j in range(prev_i, frame_i + 1):
                    prev_center_x += center_step_x
                    prev_center_y += center_step_y
                    # y is passed as 0 because the crop spans the full height anyway
                    file.write(output(prev_center_x, 0, j, j, frame_width, frame_height, width, height))
                prev_i = frame_i + 1
            i = i + 1
        # keep the last position until well past the end of the video, then crop the left column
        file.write(output(center_x, center_y, prev_i, total * 2, frame_width, frame_height, width, height))
        file.write("crop=%s:%s:0:0,\n" % (width, height))

The Result

We can now run the script to crop the video:

$ python3 tracker.py -f XPQlMmMDm-A-cut.mp4 -o crop.sh
$ ffmpeg -nostats -hide_banner -loglevel error -y -ss "00:00:00" -i XPQlMmMDm-A-cut.mp4 -an -filter_script:v:0 crop.sh result.mp4

Optimization

Todo (I’m lazy, full script will be available on GitHub)