Scaling Data Sourcing with ML-Powered Web Scraping

By combining machine learning with a product-oriented approach, we enabled efficient, time-critical data sourcing, harmonization, and automation, cutting costs and improving accuracy with models hosted on AWS SageMaker.

Customer

The client provides research, market analysis, and consulting for the technology and business sectors. It relied on outsourced data harvesting and entry from diverse, complex sources, which made the process labor-intensive.

Construction Data Business

Partners Since 2005

Partnership Since 2017

Problem Statement

Manual data entry was costly and slow, and the diversity of unstructured data sources made automation difficult. The client set a target of 50% automation to reduce costs, improve efficiency, and streamline operations.

Data Loss

Robust System

Hindered Team Collaboration

LLM and ChatGPT

Solution Development

The automation journey included three phases:

POC 1 used a rule-based mapping approach, but it struggled because the rules could not cover the inconsistencies in data formats across different sources.

POC 2 used NLP and named entity recognition (NER) for free-text extraction but lacked sufficient accuracy.

The final solution integrated rule-based processing with machine learning. AWS SageMaker provided cost-efficient training and hosting, and models such as BERT and Decision Forest handled the multilabel and multiclass classification problems. This hybrid approach significantly reduced manual intervention, streamlined operations, and lowered costs.
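The hybrid idea can be illustrated with a minimal sketch. It is not the project's actual implementation: the date rule, the `stub_model` function, the 0.9 confidence threshold, and the fallback to manual entry are all assumptions made for illustration. In a real deployment, the model call would go to a hosted endpoint rather than a local stub.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Extraction:
    value: Optional[str]   # extracted field value, or None if unresolved
    source: str            # "rule", "model", or "manual"
    confidence: float


# Hypothetical rule: an ISO-style date such as 2020-05-17.
DATE_RULE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def rule_extract(text: str) -> Optional[str]:
    """Return the first rule match, or None when no rule applies."""
    match = DATE_RULE.search(text)
    return match.group(0) if match else None


def hybrid_extract(
    text: str,
    model_predict: Callable[[str], Tuple[str, float]],
    threshold: float = 0.9,  # assumed cut-off, not taken from the case study
) -> Extraction:
    """Try deterministic rules first, then the model, then manual entry."""
    value = rule_extract(text)
    if value is not None:
        return Extraction(value, "rule", 1.0)
    label, confidence = model_predict(text)
    if confidence >= threshold:
        return Extraction(label, "model", confidence)
    # Low-confidence predictions are routed to a human operator.
    return Extraction(None, "manual", confidence)


def stub_model(text: str) -> Tuple[str, float]:
    """Stand-in for a hosted model; returns (prediction, confidence)."""
    if "March" in text:
        return ("2021-03-01", 0.95)
    return ("unknown", 0.40)


if __name__ == "__main__":
    for sample in ["Start date: 2020-05-17", "Starts 1 March 2021", "Start TBD"]:
        print(sample, "->", hybrid_extract(sample, stub_model))
```

Sending only low-confidence cases to manual entry is one way such a pipeline can reduce manual work without accepting every model prediction.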

SageMaker + AWS

Collaborative Efforts

Scalable Infrastructure

Resource Optimization

Outcomes

The machine learning solution streamlined operations and delivered measurable results:

Model Accuracy: Five machine learning models, trained and hosted using cost-efficient AWS SageMaker resources, achieved 95%-98% accuracy, surpassing human performance in some cases.

Efficiency Gains: Automation reduced manual data entry by 41%, removing the need for 500,000 manual entries annually and eliminating 2,000 person-days of effort.

Future Prospects: Machine learning models will be continuously retrained to adapt to evolving data patterns, ensuring sustained improvements in accuracy and efficiency, with the goal of reaching and then surpassing the original 50% automation target.
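The efficiency figures above can be cross-checked with some simple arithmetic. The sketch below assumes that the 41% reduction and the 500,000 entries refer to the same annual baseline, which the case study does not state explicitly. The derived values are therefore rough estimates, not reported results.

```python
# Figures quoted in the outcomes above.
automated_entries = 500_000   # manual entries avoided per year
automated_share = 0.41        # share of manual entry that was automated
person_days_saved = 2_000     # effort eliminated per year
target_share = 0.50           # original automation target

# Assumption: 500,000 entries is exactly 41% of the annual volume.
total_entries = automated_entries / automated_share
entries_per_person_day = automated_entries / person_days_saved
additional_for_target = total_entries * target_share - automated_entries

print(f"Implied annual volume: {total_entries:,.0f} entries")
print(f"Implied throughput: {entries_per_person_day:,.0f} entries per person-day")
print(f"Extra entries to automate for 50%: {additional_for_target:,.0f}")
```

Under that assumption, the implied volume is about 1.22 million entries per year at roughly 250 entries per person-day, and reaching the 50% target would mean automating about 110,000 more entries annually.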