Knowledge-enhanced Vision-Language Models for Few-Shot Object Detection on Construction Sites
Keywords:
Construction sites, object recognition, knowledge-enhanced, vision-language models, few-shot learning

Abstract
Visual understanding of complex construction site objects is critical for project safety management and worker-robot collaboration in the construction domain. However, deploying deep learning algorithms on construction sites presents significant challenges due to high data annotation costs, substantial computational requirements, and the absence of large-scale training datasets. While large-scale pre-trained multimodal foundation models have shown success in natural language understanding and visual recognition, their application to construction safety management remains limited by the need for domain-specific knowledge. To address these challenges, this paper proposes a knowledge-enhanced multimodal learning approach for few-shot object detection in construction scenarios. The proposed method comprises two components: (1) leveraging existing semantic knowledge in the construction domain to detect candidate objects in construction scenes via template matching; and (2) a multimodal image semantic recognition method that integrates visual and textual knowledge specific to the construction field. We evaluate our approach on the AIMDataset. The results demonstrate that, without network training or large-scale construction samples, our method achieves effective object detection under few-shot conditions using only existing models and a small amount of provided visual and textual knowledge. These results highlight the method's potential for practical deployment in construction scenarios.
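To make the template-matching idea in component (1) concrete, the following is a minimal, illustrative sketch of sliding-window template matching scored by sum of squared differences. The function and variable names are our own illustration, not the paper's implementation, and real pipelines would operate on full-resolution images with vectorized or library routines rather than pure-Python loops.

```python
def match_template(image, template):
    """Return (row, col) where the template best matches the image,
    scored by sum of squared pixel differences (lower is better).

    Both inputs are 2-D lists of grayscale intensities; this is an
    illustrative toy, not the paper's actual detector."""
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    best_pos, best_score = None, float("inf")
    # Slide the template over every valid top-left position.
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            score = sum(
                (image[r + i][c + j] - template[i][j]) ** 2
                for i in range(th)
                for j in range(tw)
            )
            if score < best_score:
                best_score, best_pos = score, (r, c)
    return best_pos

# Example: locate a 2x2 bright patch inside a 4x4 image.
img = [
    [0, 0, 0, 0],
    [0, 9, 8, 0],
    [0, 9, 8, 0],
    [0, 0, 0, 0],
]
tpl = [[9, 8], [9, 8]]
print(match_template(img, tpl))  # best match at (1, 1)
```

In practice, matched regions would then be passed to the multimodal recognition stage of component (2), where visual and textual construction-domain knowledge disambiguates the object class.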