Comparison of Data Labelling Techniques for Automating Postcode Extraction in NLP-Supported Early-Stage Building Design
Keywords:
Natural Language Processing (NLP), Data Labelling, Rule-Based Models, Early-Stage Building Design, Postcode Extraction
Abstract
Data labelling is crucial to the success of Natural Language Processing (NLP) models, because the quality of labelled data directly affects model accuracy and performance. In early-stage building design, automating the extraction of textual data is essential for integrating physical and digital workflows. However, data labelling presents significant challenges and requires careful trade-offs between time, cost, and accuracy to meet project-specific needs. This paper compares three primary data labelling techniques for postcode extraction from project documents: manual, rule-based, and hybrid machine learning approaches. A review of the seminal literature reveals that manual labelling delivers high accuracy and quality but is labour-intensive and better suited to small datasets or to creating gold standards. Rule-based techniques, such as regular expressions (Regex), automate labelling of structured data using predefined patterns, offering efficiency but requiring domain expertise. Machine learning-driven methods, such as Named Entity Recognition (NER), scale to large datasets but often demand task-specific fine-tuning. Because NER performed poorly in initial testing, a hybrid approach combining Regex with NER was developed and implemented in Google Colab. In an empirical evaluation of postcode extraction from construction project documents, with manual labelling as the gold standard, the rule-based approach achieved 96.7% accuracy and the hybrid machine learning approach achieved 98% accuracy. This paper provides a comparative framework to guide practitioners in selecting the most appropriate data labelling technique for their specific needs, balancing accuracy, efficiency, and scalability to optimise workflows and enhance automation in early-stage building design.
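To make the rule-based technique and its evaluation concrete, the following is a minimal Python sketch rather than the implementation used in this study. It assumes UK-style postcodes and uses a deliberately simplified pattern; the names `POSTCODE_RE`, `extract_postcodes`, and `labelling_accuracy`, the sample documents, and the per-document exact-match definition of accuracy are all illustrative assumptions. The NER component of the hybrid approach is not shown.

```python
import re

# Assumption: UK-style postcodes such as "SW1A 2AA". This simplified pattern is
# illustrative only; it is not the pattern used in the study and will accept
# some strings that are not valid postcodes.
POSTCODE_RE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}\b", re.IGNORECASE)


def extract_postcodes(text):
    """Return postcodes found in text, normalised to upper case with one space
    before the three-character inward code."""
    found = []
    for match in POSTCODE_RE.finditer(text):
        compact = re.sub(r"\s+", "", match.group()).upper()
        found.append(f"{compact[:-3]} {compact[-3:]}")
    return found


def labelling_accuracy(documents, gold_labels):
    """Fraction of documents whose extracted postcodes exactly match the
    manually assigned (gold standard) labels."""
    correct = sum(
        extract_postcodes(doc) == gold
        for doc, gold in zip(documents, gold_labels)
    )
    return correct / len(documents)


# Hypothetical documents and manual labels, invented for illustration.
docs = [
    "Site address: 10 Downing Street, London SW1A 2AA.",
    "Proposed extension at 221B Baker Street, NW16XE, Greater London.",
    "No address has been confirmed for this plot.",
]
gold = [["SW1A 2AA"], ["NW1 6XE"], []]

if __name__ == "__main__":
    for doc in docs:
        print(extract_postcodes(doc))
    print(f"Accuracy against manual labels: {labelling_accuracy(docs, gold):.1%}")
```

Normalising spacing and case before comparison keeps formatting differences between the source text and the manual labels from being counted as labelling errors. A hybrid pipeline would add a further step after this one, for example using an NER model to confirm that each match occurs in an address context.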