How Arrk’s AI Engineering Team Automated Data Harvesting, Reducing Costs by 48% and Speeding Up Time to Market - Resulting in Increased Revenue and Customer Satisfaction.

Our customer is a well-established global business intelligence organization specialising in the collection and improvement of massive datasets, which are subsequently published for customers to use in individual transactions and market analysis.


They are especially known for their research reports, market analysis, and consulting services, delivered to technology, IT, and business-related sectors. Their research and insights are often used by businesses to make informed decisions about technology investments and business strategies.

Their data harvesting and data entry operations are managed by a remote outsourced team, while a local call centre team is responsible for research and final publication. Our customer wanted to automate the data harvesting activity across several disparate data sets.

Problem Statement

The main problem our client faced was the cost of remote outsourced data extraction and entry. The sheer volume of data in their operations required manual data entry, which resulted in significant expense. AI-enabled automation was identified as a possible solution, but the task was made harder by the nature of the source data. The sources, including various websites, presented data in different formats and standards, often with free-text fields, which made automating extraction and entry difficult.

The final aim was to achieve a 50% automation rate, a goal that appeared ambitious at the outset but was deemed attainable.

Costly Offshore Data Entry

Complex Data Sources

Free-Text Fields

Automation Goal

Solution Development

Having conducted a rapid EmbArrk™ discovery using an iterative, product-oriented approach, a team of around 15 Arrkitects used Arrk’s agile@arrk project engineering methodology to deliver the project goals established during the EmbArrk™ workshop. The journey towards automating data harvesting can be summarized in three key phases:

The first proof of concept (POC 1) took a rule-based data-mapping approach. This method required customisations for each website to handle its own terminology, classifications, and free-text fields. POC 1 did not deliver the expected results because data presentation varied too widely from one website to another.
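To illustrate why per-website mapping is hard to scale, here is a minimal sketch of the idea. The site names, source labels, and target field names are all hypothetical and are not taken from the project; the point is only that every new website needs its own hand-written table.

```python
# Hypothetical per-site mapping tables: source label -> target field.
# Each additional website requires another hand-maintained entry.
SITE_RULES = {
    "site_a": {"Company Name": "company", "HQ": "headquarters"},
    "site_b": {"Organisation": "company", "Head Office": "headquarters"},
}


def map_record(site: str, raw: dict) -> dict:
    """Rename the raw fields scraped from one site to the target schema."""
    rules = SITE_RULES[site]
    return {
        target: raw[source]
        for source, target in rules.items()
        if source in raw
    }
```

A record from a site whose labels are missing from its table simply maps to fewer fields, which is the kind of silent gap that grows as source formats diverge.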

POC 2 introduced Natural Language Processing (NLP) and Named Entity Recognition (NER) techniques to extract key entities from the free-text fields. This approach showed potential by pinpointing where data was located, but it was not accurate enough to extract the values with confidence. Efforts were made to combine NLP and NER with rule-based logic, but the complexity and cost of doing so soon became apparent.

In the final phase, a successful solution was created, blending rule-based processing and machine learning. The AI Engineering team precisely identified all the fields that the data entry team needed to populate, along with the associated validation rules. For each field, a decision was made to employ either rule-based processing or machine learning. The machine learning models, although time-consuming to build, provided the capability to generate data for fields that had previously necessitated human intervention.
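The field-by-field decision described above can be sketched as a small routing table. This is an illustration under stated assumptions, not the team's implementation: the field names, the validation rules, and the keyword-based `classify_sector` function (a stand-in for a trained model) are all hypothetical.

```python
import re
from typing import Callable, Dict, Optional

Extractor = Callable[[str], Optional[str]]


def extract_year(text: str) -> Optional[str]:
    """Rule-based extractor: first four-digit year starting with 19 or 20."""
    match = re.search(r"\b(?:19|20)\d{2}\b", text)
    return match.group(0) if match else None


def classify_sector(text: str) -> Optional[str]:
    """Stand-in for a machine learning model (hypothetical keyword lookup)."""
    keywords = {"software": "Technology", "bank": "Finance"}
    lowered = text.lower()
    for keyword, label in keywords.items():
        if keyword in lowered:
            return label
    return None


# One extractor and one validation rule per target field (assumed names).
FIELD_EXTRACTORS: Dict[str, Extractor] = {
    "founded_year": extract_year,
    "sector": classify_sector,
}
VALIDATORS: Dict[str, Callable[[str], bool]] = {
    "founded_year": lambda v: 1800 <= int(v) <= 2100,
    "sector": lambda v: v in {"Technology", "Finance"},
}


def populate(text: str) -> Dict[str, Optional[str]]:
    """Fill each field; None marks a field left for manual entry."""
    record: Dict[str, Optional[str]] = {}
    for field, extractor in FIELD_EXTRACTORS.items():
        value = extractor(text)
        valid = value is not None and VALIDATORS[field](value)
        record[field] = value if valid else None
    return record
```

Keeping the routing in a table means a field can move from rules to a model, or back, without touching the rest of the pipeline, and any value that fails validation falls back to the data entry team.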

Product-Oriented Approach

Agile Methodology

EmbArrk™ Workshop


This successful AI-enabled automation of data extraction and entry in our client’s research-based business stands as evidence of the power of combining rule-based processing and machine learning. By adapting to the unique challenges presented by diverse data sources, this project has not only substantially lowered costs but also increased accuracy, surpassing human capabilities in some instances. With the ongoing project and the introduction of generative AI, the client is ready to reap even greater benefits in the future, demonstrating the tremendous potential of AI-enabled solutions in data-intensive industries.

The results of this project have been transformative:

  • Machine Learning Algorithms: The project now uses five machine learning algorithms, with accuracies for AI-enabled data extraction and injection ranging from 95% to 98%. Remarkably, one of these algorithms outperformed human data entry accuracy.
  • Current Progress: While the project is still ongoing, it has already eliminated 41% of manual data extraction and entry. This equates to 500,000 data entries annually or 2,000 person-days of effort saved each year.
  • Future Prospects: The next phase of this AI Engineering project will involve the integration of generative AI alongside machine learning. This step is expected to exceed the original 50% automation target.
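The figures above imply a few derived quantities, worked through below. This is back-of-envelope arithmetic only, and it assumes the 41% refers to the share of annual data entries automated; the derived totals are not stated in the case study.

```python
# Figures quoted in the results list.
ENTRIES_AUTOMATED = 500_000   # data entries automated per year
PERSON_DAYS_SAVED = 2_000     # person-days saved per year
SHARE_AUTOMATED = 0.41        # assumed to be a share of annual entries
TARGET_SHARE = 0.50           # original automation target


def entries_per_person_day() -> float:
    """Implied manual throughput of one person-day."""
    return ENTRIES_AUTOMATED / PERSON_DAYS_SAVED


def implied_total_entries() -> float:
    """Implied total annual entries if 41% equals 500,000 entries."""
    return ENTRIES_AUTOMATED / SHARE_AUTOMATED


def extra_entries_for_target() -> float:
    """Additional entries per year needed to reach the 50% target."""
    return TARGET_SHARE * implied_total_entries() - ENTRIES_AUTOMATED
```

Under that assumption, one person-day corresponds to 250 entries, the total workload is roughly 1.22 million entries a year, and reaching the 50% target would mean automating roughly 110,000 more entries annually.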

Improved Accuracy

Reduced Manual Data Entry

AI-Driven Solutions

Efficiency Gains
