Scaling Data Sourcing with ML-Powered Web Scraping
Customer
Construction Data Business
Partners Since 2005
Partnership Since 2017
Problem Statement
Data Loss
Robust System
Hindered team collaboration
LLM and ChatGPT
Solution Development
POC 1 employed a rule-based mapping approach, but it struggled because fixed rules could not cover the inconsistencies in data formats across different sources.
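For illustration, here is a minimal sketch of what such a rule-based mapping might look like; the field names and regular expressions are hypothetical rather than the customer's actual rules, and the example shows why a fixed rule set breaks down as soon as a source changes its format.

```python
import re

# Hypothetical rule-based mapper: each rule pairs a regex with a canonical field.
# Rules like these only work for formats they anticipate, which is why POC 1
# struggled once sources introduced new or inconsistent layouts.
FIELD_RULES = {
    "project_value": re.compile(r"\$\s?([\d,]+(?:\.\d{2})?)"),
    "square_feet":   re.compile(r"([\d,]+)\s*(?:sq\.?\s*ft|square feet)", re.I),
    "start_date":    re.compile(r"(\d{1,2}/\d{1,2}/\d{4})"),
}

def map_record(raw_text: str) -> dict:
    """Extract canonical fields from scraped text using fixed rules."""
    record = {}
    for field, pattern in FIELD_RULES.items():
        match = pattern.search(raw_text)
        if match:
            record[field] = match.group(1)
    return record

# A source that writes "12 000 m2" or "budget: 1.2M USD" silently yields nothing,
# so every new format requires another hand-written rule.
print(map_record("New warehouse, 45,000 sq ft, budget $2,500,000, start 04/15/2024"))
```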
POC 2 used NLP and named entity recognition (NER) for free-text extraction but lacked sufficient accuracy.

The final solution integrated rule-based processing with machine learning. Using AWS SageMaker for cost-efficient training and hosting, models such as BERT and decision forests addressed multilabel and multiclass classification problems. This hybrid approach significantly reduced manual intervention, streamlined operations, and lowered costs.
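A minimal sketch of how such a hybrid pass might be wired up, assuming one of the classifiers is already hosted behind a SageMaker endpoint; the endpoint name, payload shape, and field-merging logic below are illustrative assumptions, not the production implementation.

```python
import json
import boto3

# Hypothetical endpoint name; the case study only says models were hosted on SageMaker.
ENDPOINT_NAME = "construction-field-classifier"
runtime = boto3.client("sagemaker-runtime")

def classify_free_text(text: str) -> dict:
    """Send free text to the hosted model (e.g. a fine-tuned BERT classifier)
    and return its predicted labels."""
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps({"inputs": text}),
    )
    return json.loads(response["Body"].read())

def extract_record(raw_text: str, rule_based_fields: dict) -> dict:
    """Hybrid extraction: keep whatever deterministic rules already captured,
    and let the ML model fill in the free-text fields the rules cannot cover."""
    record = dict(rule_based_fields)
    predictions = classify_free_text(raw_text)
    for field, value in predictions.items():
        record.setdefault(field, value)  # rules win on conflicts; the model fills gaps
    return record
```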
AWS SageMaker
Collaborative Efforts
Scalable Infrastructure
Resource Optimization
Outcomes
Model Accuracy: Five machine learning models, trained and hosted using cost-efficient AWS SageMaker resources, achieved 95%-98% accuracy, surpassing human performance in some cases.
Efficiency Gains: Automation reduced manual data entry by 41%, eliminating roughly 500,000 manual entries annually and 2,000 person-days of effort.
Future Prospects: Machine learning models will be continuously retrained to adapt to evolving data patterns, ensuring sustained improvements in accuracy and efficiency while further surpassing the original 50% automation target.
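As one hedged illustration of what recurring retraining could look like with the SageMaker Python SDK: the training script, S3 path, instance type, hyperparameters, and framework versions below are assumptions and would need to match a supported SageMaker container combination.

```python
import sagemaker
from sagemaker.huggingface import HuggingFace

# Hypothetical retraining job: re-fit one of the BERT-based classifiers whenever a
# fresh labelled snapshot of scraped data lands in S3.
role = sagemaker.get_execution_role()

estimator = HuggingFace(
    entry_point="train.py",          # fine-tuning script (tokenize, train, save model)
    source_dir="scripts",
    instance_type="ml.g4dn.xlarge",  # modest GPU keeps recurring retraining inexpensive
    instance_count=1,
    role=role,
    transformers_version="4.26",     # must match a supported SageMaker DLC combination
    pytorch_version="1.13",
    py_version="py39",
    hyperparameters={"epochs": 3, "train_batch_size": 32, "model_name": "bert-base-uncased"},
)

# Each retraining run points at the latest labelled snapshot.
estimator.fit({"train": "s3://example-bucket/labels/latest/"})
```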