Running AI on Qualcomm Phones: Pose Recognition

呼延冰枫 posted on 2025-9-26 10:56:49

(Original author: @CSDN_伊利丹~怒风)
Environment Setup

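Phone

Test phone model: Redmi K60 Pro
Processor: Snapdragon 8 Gen 2 mobile platform
Memory: 8.0 GB LPDDR5X-8400, 67.0 GB/s
Cameras: 16 MP front; 50 MP + 8 MP + 2 MP rear
AI compute: NPU 48 TOPS (INT8); GPU 1536 ALUs x 2 x 680 MHz = 2.089 TFLOPS
Note: any phone will work; the more powerful the phone, the faster the demo runs.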
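Software

App: AidLux 2.0
System environment: Ubuntu 20.04.3 LTS
Note: code runs more smoothly after you log in to AidLux. Keep the AidLux app in the foreground while code is running so that the system does not reclaim the process, and keep the screen on, because the phone usually goes to sleep some time after the screen turns off. To keep the app resident in the background, grant it the corresponding permission.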
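Algorithm Demo

Demo Code Overview

This demo is a real-time pose estimation application built on two cascaded models: one detects the person, and the other identifies keypoints on the detected person. The code, with detailed comments, is listed further below.
Code Features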

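[*]Two-model cascade: two models work together. The first detects the person, and the second identifies detailed keypoints on the detected person.
[*]Automatic camera selection: the code detects and prefers a USB camera; if none is present, it tries the device's built-in camera.
[*]Optimized image processing:

[*]Preprocessing includes resizing, padding, and normalization
[*]The original aspect ratio is preserved to avoid distortion
[*]Horizontal flipping is supported so the display feels natural to the user

[*]High-performance inference:

[*]Uses the aidlite framework for model inference
[*]The pose detection model runs on the CPU
[*]The keypoint model runs with GPU acceleration
[*]Multithreading support for higher throughput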

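[*]Accurate pose keypoint recognition:

[*]Detects 22 body keypoints (upper-body model)
[*]Connects keypoints to form a complete pose skeleton
[*]Filters by a confidence threshold to keep detections reliable

[*]Flexible ROI extraction:

[*]Extracts the region of interest dynamically from the detection result
[*]Handles rotation, so the ROI is extracted accurately even when the person is tilted
[*]Adjusts the ROI size automatically to suit people at different distances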

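[*]Clear visualization:

[*]Draws the bounding box of each detected person
[*]Draws keypoints and connecting lines to form a readable pose skeleton
[*]Supports custom colors and sizes to tell different poses apart

[*]Robust error handling:

[*]Retries automatically when the camera fails to open
[*]Checks for model loading and inference errors
[*]Handles exceptions gracefully so the program keeps running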
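As a rough illustration of the automatic camera selection described in the list above, the sketch below scans a video4linux sysfs directory and prefers a device whose name mentions "usb". The function name `pick_camera`, the name-matching rule, and the fallback order are assumptions made for illustration; they are not taken from the demo's actual logic.

```python
import os


def pick_camera(root_dir="/sys/class/video4linux/"):
    """Return the index of a preferred camera, or None if none is found.

    Hypothetical helper: prefers devices whose sysfs name mentions "usb",
    otherwise falls back to the lowest-numbered video device.
    """
    if not os.path.isdir(root_dir):
        return None

    usb_ids, other_ids = [], []
    for entry in sorted(os.listdir(root_dir)):
        if not entry.startswith("video"):
            continue
        suffix = entry[len("video"):]
        if not suffix.isdigit():
            continue
        name_path = os.path.join(root_dir, entry, "name")
        try:
            with open(name_path, "r") as f:
                name = f.read().strip().lower()
        except OSError:
            # Unreadable name file: treat the device as a non-USB camera.
            name = ""
        (usb_ids if "usb" in name else other_ids).append(int(suffix))

    if usb_ids:
        return min(usb_ids)
    if other_ids:
        return min(other_ids)
    return None
```

The returned index could then be passed to whatever capture API is in use; a `None` result is the caller's cue to retry or report an error.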

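This application suits scenarios such as fitness coaching, motion analysis, and human-computer interaction: by recognizing and tracking body keypoints, it can analyze posture in real time and provide feedback.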
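One simple way such feedback could be derived from keypoints is to measure a joint angle, for example the elbow angle formed by the shoulder, elbow, and wrist. The sketch below is an illustration only; `joint_angle` and the sample coordinates are hypothetical and are not part of the demo.

```python
import math


def joint_angle(a, b, c):
    """Angle at point b, in degrees, formed by segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1 = math.hypot(*v1)
    n2 = math.hypot(*v2)
    if n1 == 0 or n2 == 0:
        raise ValueError("coincident points: angle undefined")
    cos_t = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    # Clamp to guard against floating-point drift outside [-1, 1].
    cos_t = max(-1.0, min(1.0, cos_t))
    return math.degrees(math.acos(cos_t))


# Made-up pixel coordinates for shoulder, elbow, and wrist.
shoulder, elbow, wrist = (100, 100), (100, 200), (200, 200)
elbow_angle = joint_angle(shoulder, elbow, wrist)
```

A fitness application could compare such an angle against a target range for each exercise.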
Model Analysis

The code uses two models, run through the AidLite framework, for human pose detection and keypoint recognition:
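1. Pose detection model (pose_detection.tflite)

[*]Purpose: detect the approximate position and pose of a person in the input image.
[*]Input: a 128×128 RGB image.
[*]Output: predictions containing bounding boxes and keypoints (896 candidate boxes, each with 12 coordinate values).
[*]Characteristics:

[*]Lightweight design suited to real-time processing.
[*]Uses anchor boxes to improve detection accuracy.
[*]Outputs the person's coarse position and keypoints (such as eyes, ears, and shoulders).
[*]Runs on the CPU, balancing performance and accuracy.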
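To make the anchor mechanism concrete, here is a small worked example for a single candidate box. It follows the same arithmetic as the `_decode_boxes` function in the demo code: raw offsets are divided by the 128-pixel input size, scaled by the anchor size, and shifted by the anchor center. The anchor and raw values are made up for illustration.

```python
import numpy as np

# One hypothetical anchor: (x_center, y_center, width, height), normalized.
anchor = np.array([0.5, 0.5, 1.0, 1.0])
# One hypothetical raw prediction: (dx, dy, w, h) in units of input pixels.
raw = np.array([12.8, -6.4, 64.0, 32.0])

input_size = 128.0
x_center = raw[0] / input_size * anchor[2] + anchor[0]
y_center = raw[1] / input_size * anchor[3] + anchor[1]
w = raw[2] / input_size * anchor[2]
h = raw[3] / input_size * anchor[3]

# Box as (ymin, xmin, ymax, xmax), all in normalized coordinates.
box = np.array([y_center - h / 2, x_center - w / 2,
                y_center + h / 2, x_center + w / 2])
```

With these numbers the box center is (0.6, 0.45) and the box is 0.5 wide and 0.25 tall.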

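2. Upper-body pose landmark model (pose_landmark_upper_body.tflite)

[*]Purpose: precisely identify 22 upper-body keypoints within the detected person region.
[*]Input: a 256×256 RGB image (the ROI).
[*]Output: coordinates for 31 keypoints (each with x, y, z coordinates and a visibility value).
[*]Characteristics:

[*]Locates joints such as the shoulders, elbows, and wrists with high accuracy.
[*]Uses GPU acceleration to speed up inference for the more complex model.
[*]Supports pose estimation from multiple angles and under occlusion.
[*]Outputs a confidence value for each keypoint, used to filter out unreliable detections.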
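As a sketch of how the per-keypoint confidence might be used, the snippet below reshapes a flat output buffer into one row per keypoint and keeps the points above a threshold. It assumes the layout described above (31 points, 4 values each: x, y, z, visibility) and that visibility is already a probability in [0, 1]; the real tensor layout and score range may differ, and `filter_landmarks` is a hypothetical helper.

```python
import numpy as np

NUM_POINTS = 31       # per the description above
VALUES_PER_POINT = 4  # x, y, z, visibility (assumed layout)


def filter_landmarks(flat_output, min_visibility=0.5):
    """Return (indices, xyz) for keypoints whose visibility passes the threshold."""
    pts = np.asarray(flat_output, dtype=np.float32)
    pts = pts.reshape(NUM_POINTS, VALUES_PER_POINT)
    keep = pts[:, 3] >= min_visibility
    return np.nonzero(keep)[0], pts[keep, :3]
```

Returning the surviving indices alongside the coordinates lets the drawing code skip skeleton connections whose endpoints were filtered out.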

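How the Models Work Together

[*]Pose detection: the first model quickly locates the person.
[*]ROI extraction: the region of interest (ROI) is cropped and rotated based on the detection result.
[*]Keypoint recognition: the ROI is fed to the second model to obtain fine-grained upper-body keypoints.
[*]Coordinate mapping: the normalized keypoint coordinates are mapped back to the original image space.
This cascaded design balances efficiency and accuracy and is well suited to real-time video streams.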
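The coordinate-mapping step can be sketched as follows: landmarks predicted in the normalized ROI frame are scaled, rotated, and translated back into the original image. This is a minimal illustration under assumed conventions (a square ROI described by its center, side length, and rotation angle in radians); `roi_to_image` is a hypothetical helper, and the demo's own ROI representation may differ.

```python
import numpy as np


def roi_to_image(points_norm, center, size, theta):
    """Map (N, 2) points in ROI-normalized [0, 1] coords to image pixels.

    center: (cx, cy) of the ROI in image pixels
    size:   side length of the square ROI in pixels
    theta:  ROI rotation in radians (standard 2D rotation matrix)
    """
    # Shift so the ROI center is the origin, then scale to pixels.
    pts = (np.asarray(points_norm, dtype=np.float64) - 0.5) * size
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    # Rotate each row vector, then translate to the ROI center.
    return pts @ rot.T + np.asarray(center, dtype=np.float64)
```

With `theta = 0` the mapping reduces to a plain scale and shift, which is a convenient sanity check before rotation is enabled.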
Demo Code

import math
import numpy as np
from scipy.special import expit
import time
from time import sleep
import aidlite
import os
import subprocess
import aidcv as cv2

# Camera device path
root_dir = "/sys/class/video4linux/"


def resize_pad(img):
    """
    Resize and pad an image to fit the network inputs.

    The landmark and detector networks expect 256x256 and 128x128 input
    images, respectively. This function scales the image while keeping its
    aspect ratio and adds padding where needed.

    Returns:
        img1: 256x256 image
        img2: 128x128 image
        scale: scale factor between the original image and the 256x256 image
        pad: padding added, expressed in original-image pixels
    """
    size0 = img.shape
    if size0[0] >= size0[1]:
        # Taller than wide: fit the height, pad the width
        h1 = 256
        w1 = 256 * size0[1] // size0[0]
        padh = 0
        padw = 256 - w1
        scale = size0[1] / w1
    else:
        # Wider than tall: fit the width, pad the height
        h1 = 256 * size0[0] // size0[1]
        w1 = 256
        padh = 256 - h1
        padw = 0
        scale = size0[0] / h1
    padh1 = padh // 2
    padh2 = padh // 2 + padh % 2
    padw1 = padw // 2
    padw2 = padw // 2 + padw % 2
    img1 = cv2.resize(img, (w1, h1))
    img1 = np.pad(img1, ((padh1, padh2), (padw1, padw2), (0, 0)), 'constant', constant_values=(0, 0))
    pad = (int(padh1 * scale), int(padw1 * scale))
    img2 = cv2.resize(img1, (128, 128))
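    return img1, img2, scale, pad


def denormalize_detections(detections, scale, pad):
    """
    Map normalized detection coordinates back to original-image coordinates.

    The networks expect 256x256 and 128x128 input images, so the input image
    is padded and scaled. This function maps the normalized coordinates back
    to the original image.

    Inputs:
        detections: n x m array, where n is the number of detections and
            m is 4 + 2*k: the first 4 values are the bounding-box coordinates
            and k is the number of extra keypoints output by the detector.
        scale: scale factor used when resizing the image
        pad: padding along the y and x dimensions, as returned by resize_pad
    """
    # Box columns are (ymin, xmin, ymax, xmax)
    detections[:, 0] = detections[:, 0] * scale * 256 - pad[0]
    detections[:, 1] = detections[:, 1] * scale * 256 - pad[1]
    detections[:, 2] = detections[:, 2] * scale * 256 - pad[0]
    detections[:, 3] = detections[:, 3] * scale * 256 - pad[1]

    # Keypoint columns alternate x, y
    detections[:, 4::2] = detections[:, 4::2] * scale * 256 - pad[1]
    detections[:, 5::2] = detections[:, 5::2] * scale * 256 - pad[0]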
    return detections


def _decode_boxes(raw_boxes, anchors):
    """
    Convert predictions into actual coordinates.

    Uses the anchor boxes to turn the model predictions into bounding-box
    coordinates, processing the whole batch at once.
    """
    boxes = np.zeros_like(raw_boxes)
    x_center = raw_boxes[..., 0] / 128.0 * anchors[:, 2] + anchors[:, 0]
    y_center = raw_boxes[..., 1] / 128.0 * anchors[:, 3] + anchors[:, 1]

    w = raw_boxes[..., 2] / 128.0 * anchors[:, 2]
    h = raw_boxes[..., 3] / 128.0 * anchors[:, 3]

    boxes[..., 0] = y_center - h / 2.  # ymin
    boxes[..., 1] = x_center - w / 2.  # xmin
    boxes[..., 2] = y_center + h / 2.  # ymax
    boxes[..., 3] = x_center + w / 2.  # xmax

    # Decode the 4 keypoints that follow the box coordinates
    for k in range(4):
        offset = 4 + k * 2
        keypoint_x = raw_boxes[..., offset] / 128.0 * anchors[:, 2] + anchors[:, 0]
        keypoint_y = raw_boxes[..., offset + 1] / 128.0 * anchors[:, 3] + anchors[:, 1]
        boxes[..., offset] = keypoint_x
        boxes[..., offset + 1] = keypoint_y

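    return boxes


def _tensors_to_detections(raw_box_tensor, raw_score_tensor, anchors):
    """
    Convert the network outputs into detections.

    The network outputs box regression predictions for 896 candidates
    (12 values each, matching the indexing used in this code) and a
    classification confidence for each candidate. This function turns
    these two "raw" tensors into proper detections.

    Returns an array of shape (num_detections, 13): 12 coordinates plus
    the score.
    """
    detection_boxes = _decode_boxes(raw_box_tensor, anchors)

    # Clip the logits so the sigmoid does not overflow
    thresh = 100.0
    raw_score_tensor = np.clip(raw_score_tensor, -thresh, thresh)
    detection_scores = expit(raw_score_tensor)

    # Note: the score tensor is expected without its last dimension, since
    # there is only one class. A boolean mask then filters out boxes whose
    # confidence is too low.
    mask = detection_scores >= 0.75

    # This version handles a single image, so the mask is applied directly.
    boxes = detection_boxes[mask]
    scores = detection_scores[mask]
    scores = scores[..., np.newaxis]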
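    return np.hstack((boxes, scores))


def py_cpu_nms(dets, thresh):
    """
    Pure-Python non-maximum suppression.

    Filters overlapping detection boxes, keeping the highest-confidence ones.
    """
    x1 = dets[:, 0]
    y1 = dets[:, 1]
    x2 = dets[:, 2]
    y2 = dets[:, 3]
    scores = dets[:, 12]

    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    # Indices sorted by confidence, highest first
    order = scores.argsort()[::-1]
    # keep collects the boxes that survive suppression
    keep = []
    while order.size > 0:
        # order[0] is the highest-scoring remaining box, so it is always kept
        i = order[0]
        keep.append(dets[i])
        # Intersection of box i with all remaining boxes, vectorized
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])

        w = np.maximum(0.0, xx2 - xx1 + 1)
        h = np.maximum(0.0, yy2 - yy1 + 1)
        inter = w * h
        # IoU (intersection over union)
        ovr = inter / (areas[i] + areas[order[1:]] - inter)
        # Positions of the boxes whose IoU with box i is at or below the threshold
        inds = np.where(ovr <= thresh)[0]
        # inds indexes into order[1:], so add 1 to index into order;
        # this also drops box i before the next iteration
        order = order[inds + 1]
    return keep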
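The scale and padding arithmetic used by `resize_pad` and `denormalize_detections` can be checked without a camera. The sketch below recomputes the same quantities for a 480x640 frame and maps a normalized point at the center of the padded square back to the original image. `letterbox_params` is a hypothetical helper that mirrors the arithmetic only; it does no image resizing.

```python
def letterbox_params(height, width, target=256):
    """Return (new_h, new_w, scale, pad) for fitting an image into a square."""
    if height >= width:
        new_h = target
        new_w = target * width // height
        scale = width / new_w
    else:
        new_h = target * height // width
        new_w = target
        scale = height / new_h
    pad_h = target - new_h
    pad_w = target - new_w
    # Leading padding on each axis, expressed in original-image pixels.
    pad = (int(pad_h // 2 * scale), int(pad_w // 2 * scale))
    return new_h, new_w, scale, pad


new_h, new_w, scale, pad = letterbox_params(480, 640)

# A normalized point (y, x) = (0.5, 0.5) in the padded 256x256 square,
# mapped back with the same formula as denormalize_detections.
y = 0.5 * scale * 256 - pad[0]
x = 0.5 * scale * 256 - pad[1]
```

For this frame the resized image is 192x256 with a scale of 2.5 and a padding of (80, 0), and the point maps to (240, 320), the center of the original 480x640 image.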

Model Location

Model Results

