Case Study
How PVR enhanced the probability of trial success and helped reduce its overall
cost using custom-made machine learning solutions for text and image processing
Foreword
This case study describes a project PVR conducted as part of a clinical trial enhancement process for one of its clients. Under the agreement between PVR and the client (which kindly gave us the opportunity to share some of the developments made in the context of the project), the client's name is not disclosed; it is simply referred to as ‘the Company’ or ‘the Client’ henceforth.


Brief summary
PVR provided statistical analysis of existing clinical trials, as well as AI-powered image processing solutions for its Client to identify potential inefficiencies in their clinical trial design, thus allowing them to reduce their incurred costs and enter the trial better prepared.


Client profile
PVR was contacted by a US-based, midsize company that had successfully finished a Phase II clinical trial for its Drug X and wanted to launch another study, from the same Phase onward, for its Drug Y. Prior to this, the Company had already adjusted its trial protocol several times, which cost it a significant amount of money and other resources. This motivated the Client to prepare for the next trial as thoroughly as possible.


The request
The Client asked PVR to ensure the trial’s design was well-formed and its constituents were optimal from a statistical point of view, i.e., that no design constituent, such as the use of a particular exclusion criterion, was known to correlate with an overly lengthy or even unsuccessful clinical trial.
Furthermore, the Company decided to employ a deep analysis of all Statistical Analysis Plans (‘SAPs’) and protocols available for its type of condition. To achieve this, the Client requested that we develop a reliable image processing solution to perform Optical Character Recognition (‘OCR’), along with subsequent similarity search.


Procedures
After joint discussion, PVR and the Client decided to leverage both data management and AI services, focusing on data aggregation and analysis for the former and on image processing for the latter.
As a result, PVR provided a statistical analysis of clinical trials, highlighting the areas of the Company’s interest. This analysis yielded the required insights on the suggested trial design as far as its constituents were concerned. In addition, the Company received an in-house image processing solution dedicated to file analysis automation.


Statistical analysis
The conundrum we were faced with is simple to formulate, yet hard to answer, as it’s one of the main concerns in all clinical trials: ensuring the design of the trial is optimal.
To tackle this snag, we combined open-source information with data analysis solutions developed by PVR.
The Company was focusing on a certain condition related to Myocardial Infarction (‘MI’), which became the main subject of our research. PVR accessed information on 1,000 clinical trials aiming to solve the same problem and extracted everything relevant to this project, including ethnicity, gender, age, inclusion and exclusion criteria, type, as well as primary and secondary outcomes.
[Chart: Distribution of maximum age]
[Chart: Distribution of minimum age]
PVR analyzed all obtained information, including the most frequent parameters and values from the clinical trials, as well as the relative ratios of certain parameters per phase, with respect to whether or not a given trial was completed successfully or became too lengthy.
Age was one of the Company's main concerns, as it might significantly impact the enrollment deadlines. We therefore conducted a statistical analysis of the relation between the status of a clinical trial and its age parameters. Analysis of Variance (ANOVA), which tests whether group means differ by comparing between-group and within-group variance, showed no statistically significant relationship between age and status. This implies that age inclusion criteria can be quite flexible, even though initial analysis indicated that such a connection could in fact exist.
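To make the test concrete, here is a minimal sketch of a one-way ANOVA F statistic in plain Python. It is an illustration only, not the analysis code used in the project: the function name `one_way_anova_f`, the status labels, and the age values are all hypothetical.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    `groups` is a list of lists, one list of numeric values per group.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual values around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical minimum-age criteria (in years), grouped by trial status.
min_age_by_status = {
    "completed": [18, 18, 21, 40, 18, 30],
    "terminated": [18, 21, 18, 45, 18, 25],
    "recruiting": [18, 18, 30, 35, 21, 18],
}
f_stat, df_b, df_w = one_way_anova_f(list(min_age_by_status.values()))
print(f"F({df_b}, {df_w}) = {f_stat:.3f}")
```

A small F value means the group means differ little relative to the spread within each group. To obtain a p-value, the statistic is compared against the F distribution with the two degrees of freedom; in practice a library routine such as `scipy.stats.f_oneway` returns both the statistic and the p-value.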
[Chart: Timeframes and other primary outcomes per status]
One of the questions we were asked was how frequently ‘year’ appears as a time frame measure, since using it requires longer follow-ups and a larger budget to cover the matching expenses. Aside from obvious cases where the outcome can only be measured in years, this time frame turned out to be largely irrelevant. At the same time, judging by the acquired data, most ongoing trials intend to use years as a primary outcome time frame. This can be attributed to a simple observation: given that the budget per person is higher when the follow-up time scope is extended, the enrollment process becomes stricter.
[Chart: Relative percentage of the timeframes per status]
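A relative percentage per status of this kind can be computed along the following lines. This is a minimal sketch under stated assumptions: the record keys `status` and `time_frame_unit`, the function name, and the sample records are hypothetical and do not reflect the actual project data.

```python
from collections import Counter, defaultdict

def timeframe_share_by_status(trials):
    """Return {status: {time_frame_unit: percentage}} for a list of trial records."""
    counts = defaultdict(Counter)
    for trial in trials:
        counts[trial["status"]][trial["time_frame_unit"]] += 1

    shares = {}
    for status, unit_counts in counts.items():
        total = sum(unit_counts.values())
        # Percentages are relative to the number of trials with this status.
        shares[status] = {
            unit: 100.0 * n / total for unit, n in unit_counts.items()
        }
    return shares

# Hypothetical records; real ones would come from the aggregated trial data.
sample_trials = [
    {"status": "completed", "time_frame_unit": "days"},
    {"status": "completed", "time_frame_unit": "days"},
    {"status": "completed", "time_frame_unit": "months"},
    {"status": "completed", "time_frame_unit": "years"},
    {"status": "recruiting", "time_frame_unit": "years"},
    {"status": "recruiting", "time_frame_unit": "years"},
    {"status": "recruiting", "time_frame_unit": "months"},
]

for status, share in timeframe_share_by_status(sample_trials).items():
    print(status, {unit: round(pct, 1) for unit, pct in share.items()})
```

Normalizing within each status, rather than over all trials, keeps the comparison fair when some statuses contain far more trials than others.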
Another question we were asked was whether there were any recent clinical trials in which cardiac arrest was mentioned among other exclusion criteria and whether any of them had expanded access and were, therefore, recruiting. Unfortunately, no such studies were found in the data sample used for the project. Still, there were rare cases of a partial match, where the study met the exclusion criteria and status restrictions, such as NCT01864343.


Image processing
The problem of converting characters in an image into actual, editable text has long been known and is solved successfully in many cases.
At the same time, processing sensitive data imposes certain restrictions on which solutions can and cannot be used. An in-house model is generally a safer way to deal with such data.
Another issue lies in the fact that there is a myriad of data types, even when everything is packed into a PDF file. So, where ‘most solutions’ work ‘almost fine’, one wants a dedicated model capable of performing well on a specific set of data.
As a result, PVR took the following steps:

1. Required data was accessed automatically to save specialists’ time;
2. The format of the data was formalized and several image processors were fine-tuned on the accessed data, using the Google AI platform;
3. Each retrieved element was supplied with a confidence level to flag potentially problematic values, which let personnel focus their data review on those values;
4. A local LLM was launched to create summaries from the transcribed data;
5. Transcripts, summaries, and all relevant files were then uploaded to a vector database (i.e., a database where data is stored as numerical vectors rather than raw values) using open-source language encoders.
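
The confidence-based flagging described in step 3 can be sketched as follows. This is an illustrative example only: the `OcrElement` structure, the `split_by_confidence` helper, the 0.85 threshold, and the sample values are assumptions, not the schema or settings used in the project.

```python
from dataclasses import dataclass

@dataclass
class OcrElement:
    name: str          # hypothetical label of the extracted element
    text: str          # text recognized by the OCR model
    confidence: float  # model confidence in the range 0..1

def split_by_confidence(elements, threshold=0.85):
    """Split OCR output into accepted elements and ones flagged for review."""
    accepted = [e for e in elements if e.confidence >= threshold]
    flagged = [e for e in elements if e.confidence < threshold]
    return accepted, flagged

# Hypothetical OCR output for a single page of a SAP.
page = [
    OcrElement("sample_size", "240", 0.98),
    OcrElement("alpha_level", "0.05", 0.91),
    OcrElement("primary_endpoint", "MACE at l2 months", 0.62),
]

accepted, flagged = split_by_confidence(page)
for element in flagged:
    print(f"Review needed: {element.name} = {element.text!r} "
          f"(confidence {element.confidence:.2f})")
```

The threshold is a trade-off: a higher value sends more elements to manual review, a lower one risks letting misread values through.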


Results
As requested by the Client, PVR provided data aggregation and statistical analysis of clinical trials on the subject of MI. Although such work is commonly performed manually, the Client was able to receive the results within 24h. Furthermore, the Company received its own in-house solution for image processing, which can be used for any OCR task, yet is specifically fine-tuned to perform well on data similar to SAPs.
Moreover, all data was stored in a specifically created vector database, which allowed the Client to pose queries using natural language, thus saving time and money on personnel training or hiring dedicated specialists.
Aside from the immediate results received by the Company, the data storage infrastructure can be used to support future trials, allowing the Client to cut costs even further.
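The idea behind natural-language querying of a vector database can be illustrated with a small similarity search. The sketch below is a toy under stated assumptions: `embed` is a bag-of-words stand-in for a real language encoder, and the document IDs and texts are hypothetical. A production system would use learned embeddings and a dedicated vector store.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for a language encoder: bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def search(query, documents, top_k=2):
    """Return the top_k (doc_id, score) pairs most similar to the query."""
    q = embed(query)
    scored = [(cosine(q, embed(text)), doc_id)
              for doc_id, text in documents.items()]
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:top_k]]

# Hypothetical transcribed snippets keyed by document ID.
documents = {
    "sap_001": "primary outcome measured at 12 months",
    "protocol_007": "exclusion criteria include cardiac arrest",
    "sap_014": "secondary outcome is hospital readmission",
}

for doc_id, score in search("cardiac arrest exclusion", documents):
    print(doc_id, round(score, 3))
```

With a learned encoder, queries and documents that share meaning but not wording would also score highly, which is what makes natural-language queries practical.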
Following the successful launch of their trial, the Company is now planning on obtaining enough data to support its Drug for the next 24 months.