Recognizing safety violations in construction environments is critical yet remains underexplored in computer vision. Existing models predominantly rely on 2D object detection, which fails to capture the complexities of real-world violations due to: (i) an oversimplified task formulation that treats violation recognition merely as object detection, (ii) inadequate validation under realistic conditions, (iii) the absence of standardized baselines, and (iv) limited scalability caused by the unavailability of synthetic dataset generators for diverse construction scenarios. To address these challenges, we introduce Safe-Construct, the first framework to reformulate violation recognition as a 3D multi-view engagement task, leveraging scene-level worker-object context and 3D spatial understanding. We also propose the Synthetic Indoor Construction Site Generator (SICSG) to create diverse, scalable training data, overcoming these data limitations. Safe-Construct achieves a 7.6% improvement over state-of-the-art methods across four violation types. We rigorously evaluate our approach in near-realistic settings comprising four violations, four workers, and 14 objects, under challenging conditions such as occlusions (worker-object, worker-worker) and variable illumination (back-lighting, overexposure, sunlight). By integrating 3D multi-view spatial understanding with synthetic data generation, Safe-Construct sets a new benchmark for scalable and robust safety monitoring in high-risk industries.
We are the first to formulate violation recognition as a 3D multi-view engagement task. By leveraging geometry-based modeling and multi-view inputs, our approach achieves occlusion-robust, scene-level understanding that surpasses existing state-of-the-art methods.
Safe-Construct is the first framework to decouple violation criteria from training data, enabling scalable generalization to new violation types without the need to collect additional real-world datasets.
We introduce the Synthetic Indoor Construction Site Generator (SICSG), a novel custom engine that generates physically realistic scene variations, such as changes in illumination, occlusion, and perspective, imparting spatial awareness and physical common sense to the model.
We conduct the first evaluation in a 3D multi-camera indoor construction setup, comprising four safety violations, four workers, and 14 objects across diverse conditions: occlusions, lighting variations, and camera distances that produce significant scale changes in worker bodies, all of which substantially increase scene complexity. Safe-Construct consistently outperforms prior methods. Moreover, it is the first model tailored specifically for indoor construction settings.
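The worker-object engagement idea behind the formulation above can be illustrated with a minimal sketch. This is not the paper's actual implementation: the keypoint layout, distance threshold, and `engaged` helper are hypothetical, standing in for whatever geometric criterion the full method uses over triangulated 3D poses.

```python
import numpy as np

def engaged(worker_joints, object_center, thresh=0.5):
    """Hypothetical engagement test: True if any triangulated 3D worker
    joint lies within `thresh` meters of the object's 3D position.

    worker_joints: (J, 3) array of 3D joint positions (meters).
    object_center: (3,) array, 3D object position (meters).
    """
    dists = np.linalg.norm(worker_joints - object_center, axis=1)
    return bool(dists.min() < thresh)

# Example: a worker whose hand is ~0.14 m from a ladder counts as engaged.
joints = np.array([[0.0, 0.0, 1.7],   # head
                   [0.3, 0.1, 1.0],   # hand
                   [0.0, 0.0, 0.0]])  # foot
ladder = np.array([0.4, 0.1, 1.1])
print(engaged(joints, ladder))  # True
```

Because the test operates on 3D positions rather than 2D boxes, a rule like "a second worker must hold the step ladder" reduces to counting how many workers are engaged with the ladder object, independent of the camera view.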
We show the 2D re-projection of worker and object poses onto the image plane. Rows (a) and (b) show two safe scenarios, while (c) and (d) illustrate two violations: (a) a worker wearing a hard hat; (b) a second worker holding the Step Ladder while the first worker climbs it; (c) only one worker carrying a Large Window that should be carried by two workers, a violation scenario (the small window is shown in magenta); (d) two workers standing on the Platform simultaneously.
(a) A case where the hard hat is not detected. We mark this as a violation based on the previous frame; i.e., unless the worker wears the hard hat again, all subsequent frames are tagged as violations. (b) Increasing the number of views improves model prediction.
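The carry-forward rule described in (a) can be sketched as follows. The three-state `detections` encoding (hat seen / hat absent / no detection available) is an assumption made for illustration; the source only states that once the hat is not detected, frames stay tagged as violations until it is detected again.

```python
def tag_violations(detections):
    """Carry-forward tagging: once the hard hat is lost, every frame is a
    violation until the hat is detected again.

    detections: per-frame values, True (hat seen), False (hat absent),
    or None (no detection available; inherit the previous state --
    the None case is a hypothetical extension for missed detections).
    Returns a list of bools where True means "violation".
    """
    tags, wearing = [], True  # assume compliance before the first frame
    for d in detections:
        if d is not None:
            wearing = d  # update state only when a detection is available
        tags.append(not wearing)
    return tags

print(tag_violations([True, None, False, None, None, True]))
# → [False, False, True, True, True, False]
```

The state persists across frames with no detection, so a momentary detector dropout does not flip the tag back and forth.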
@misc{chharia2025safeconstructredefiningconstructionsafety,
title={Safe-Construct: Redefining Construction Safety Violation Recognition as 3D Multi-View Engagement Task},
author={Aviral Chharia and Tianyu Ren and Tomotake Furuhata and Kenji Shimada},
year={2025},
eprint={2504.10880},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2504.10880},
}