Smarter Web Data Collection | API-Powered and Compliance-Ready

From proxies to precision, we help modern businesses unlock insights faster with ethical, scalable, API-driven data extraction.

Customer

Our client is a leading UK-based construction market intelligence service provider, serving a diverse user base that includes building material suppliers, labor contractors, consultants, architects, designers, electricians, plumbers, and more.

Problem Statement

Business As-Is

The client gathered construction-related market intelligence primarily from publicly available regulatory processes. However, the diversity of forms and formats made each source unique, creating significant challenges in extracting relevant market information. Accessing individual portals, folders, and files was nearly impossible at scale.

Despite implementing crawling and scraping practices, success was minimal. At best, the client could extract basic pointers, which then required manual expansion to make the data market-ready. This time-consuming process delayed go-to-market strategies and compromised the freshness of insights, which is critical for maintaining a first-mover advantage.

Business To-Be

The client aimed to become a first mover in delivering AI-enabled market intelligence, requiring near real-time data ingestion with high levels of completeness and accuracy.

Identified Business Use Cases

  • GDPR-Compliant Crawling: Replace traditional proxy services with ethical, GDPR-compliant crawling and scraping techniques to ensure data privacy.
  • Infrastructure Optimization: Eliminate or minimize infrastructure bottlenecks and associated high costs.
  • Automation of Data Gathering: Reduce manual intervention by automating extraction from diverse sources and formats.
  • Intelligent Data Organization: Ensure data is logically structured, sequenced, and integrity-tracked for downstream processing.
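As a rough illustration of what "sequenced and integrity-tracked" could mean in practice, the sketch below records a sequence number and a SHA-256 checksum for each fetched document so that downstream steps can detect corruption or reordering. The record fields and function names are hypothetical and are not taken from the client's system.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class TrackedDocument:
    """Hypothetical record kept alongside each fetched document."""
    source_id: str    # which portal or source the document came from
    sequence: int     # position of the document within its source run
    sha256: str       # checksum of the raw bytes at fetch time
    size_bytes: int


def track(source_id: str, sequence: int, content: bytes) -> TrackedDocument:
    """Build an integrity record for freshly fetched content."""
    digest = hashlib.sha256(content).hexdigest()
    return TrackedDocument(source_id, sequence, digest, len(content))


def is_intact(record: TrackedDocument, content: bytes) -> bool:
    """Return True if the content still matches the recorded checksum."""
    return hashlib.sha256(content).hexdigest() == record.sha256
```

A downstream step would call `is_intact` before processing and re-fetch the document when the check fails.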

Solution Development

Given the complexity of existing systems, a single “first-time-right” solution risked disrupting revenue flows. Arrk recommended an iterative PoC strategy for rapid evaluation. After multiple iterations, we selected Zyte API as the core solution, complemented by additional services.

Adopted Tech Stack:

  • Python with Poetry
  • AWS Services – EventBridge, Step Functions, Lambda, SQS, S3, Batch, ECR
  • OpenSearch
  • Zyte API
  • New Relic
  • RDS (Aurora PostgreSQL)
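The case study does not show how Zyte API was called, so the following is only a minimal sketch of a single fetch using the Python standard library. It assumes the public Zyte API extract endpoint, HTTP Basic authentication with the API key as the username, and the `httpResponseBody` request field, whose value comes back base64-encoded. The function names are illustrative.

```python
import base64
import json
import urllib.request

ZYTE_ENDPOINT = "https://api.zyte.com/v1/extract"


def build_extract_request(api_key: str, url: str) -> urllib.request.Request:
    """Prepare a POST request asking Zyte API for the raw response body."""
    payload = json.dumps({"url": url, "httpResponseBody": True}).encode("utf-8")
    # Basic auth: the API key is the username and the password is empty.
    token = base64.b64encode(f"{api_key}:".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        ZYTE_ENDPOINT,
        data=payload,
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def decode_body(response_json: dict) -> bytes:
    """Decode the base64-encoded body returned in the JSON response."""
    return base64.b64decode(response_json["httpResponseBody"])


def fetch(api_key: str, url: str, timeout: float = 60.0) -> bytes:
    """Fetch one URL through Zyte API and return the raw bytes."""
    request = build_extract_request(api_key, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return decode_body(json.load(response))
```

Keeping request construction and response decoding separate from the network call makes both easy to unit test without a live API key.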

Key Challenges and Resolutions

  • Source Diversity & Failures: Multiple iterations were required to identify common failure patterns and address them.
  • Configuration Complexities: We replaced YAML configurations with JSON-based configurations so that test fetches could be run to verify that all filters and key areas were applied correctly and the expected data was fetched.
  • File Type & Format Issues: Managed challenges with uncommon websites, folder structures, and scanned documents.
  • Document Access & Reading: Accessing documents and extracting meaningful data from them was an uphill task. At every stage we had to handle CAPTCHAs and disclaimers while preserving data integrity. Implementing Mistral OCR allowed us to read a wide variety of document types and improved text extraction accuracy.
  • Infrastructure Sizing & Cost: Balanced resource allocation, cost, and processing time through meticulous planning.
  • Large Document Processing: Resolved gateway timeouts by implementing asynchronous ingestion architecture using AWS Batch.
  • Parallel System Runs: Maintained blue-green deployments for several weeks to ensure stability and maturity. Continuous improvement was achieved by engaging in-house subject matter experts to extensively test and refine the system.
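The asynchronous ingestion mentioned under Large Document Processing could look roughly like the sketch below: instead of processing a large document inside the request (and hitting a gateway timeout), the handler submits an AWS Batch job and returns immediately with a job ID. This is an assumption about the general pattern, not the project's actual code. The function name, the `DOCUMENT_URI` environment variable, and the queue and job definition names are hypothetical. The `batch_client` argument is expected to behave like `boto3.client("batch")`.

```python
import json
import uuid


def submit_ingestion_job(document_uri: str, batch_client,
                         job_queue: str, job_definition: str) -> dict:
    """Queue a document for background processing and respond right away."""
    # Batch job names allow letters, digits, hyphens, and underscores.
    job_name = "ingest-" + uuid.uuid4().hex
    response = batch_client.submit_job(
        jobName=job_name,
        jobQueue=job_queue,
        jobDefinition=job_definition,
        # Pass the document location to the container (hypothetical variable).
        containerOverrides={
            "environment": [{"name": "DOCUMENT_URI", "value": document_uri}]
        },
    )
    # 202 Accepted: the work is queued, not finished.
    return {
        "statusCode": 202,
        "body": json.dumps({"jobId": response["jobId"]}),
    }
```

The caller can then poll for job status or be notified on completion, so no request stays open for the full duration of a large document run.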

Outcomes

  • Reduced manual operational effort by automating repetitive, time-consuming tasks.
  • Significantly enhanced user experience and interaction quality.
  • Improved consistency, accuracy, and reliability of responses through contextual data grounding and workflow orchestration.
  • Established a foundational data extraction platform and practices that can be adopted in other areas of the business.
