Article contributed by Avenga and authored by: Igor Kruglyak, Senior Advisor at the global IT service provider Avenga and a 35+ year veteran executive of global key development and deployment projects, and Michael DePalma, Founder and President of Pensare, LLC; Co-Founder of Hu-manity.co, The Human API. Holder of 3 US patents and a 2-time TED-speaker.
Natural Language Processing (NLP) is an area of artificial intelligence and computational linguistics concerned with enabling computers to extract meaning from written or spoken language. NLP systems have been around since the 1950s, but in the beginning they were comparatively simple and mainly based on sets of hand-written rules. The late 1980s saw the introduction of machine learning (ML) algorithms for language processing, which learn patterns from data instead of relying on hand-written rules. Yet even less than a decade ago, NLP algorithms did little more than count how often certain words occurred in order to work out what a text was about.

In recent years, NLP has made tremendous progress due to improvements in statistical methods, processing speed, and the ever-growing amount of available data. As described in the VentureBeat article “Language AI is really heating up” by Pieter Buteneers, important milestones include Google’s word2vec algorithm, which in 2013 mapped synonyms close to one another and was able to model aspects of meaning such as size, gender, and speed, and even functional relations such as countries and their capitals. However, the big breakthrough came in 2018 with the BERT model and its ability to detect the meaning of a word in relation to its context in a sentence. BERT now beats human performance on a broad range of language understanding tasks. One of its successors, the T5 model, performs even better than humans in labeling sentences and finding the right answers to questions.
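The country-and-capital relation mentioned above can be illustrated with simple vector arithmetic. The sketch below is not word2vec itself: the three-dimensional vectors are hand-picked toy values, and the names `VECTORS`, `cosine`, and `analogy` are assumptions made for this illustration. Real embeddings are learned from large text corpora and typically have hundreds of dimensions.

```python
import math

# Toy, hand-picked vectors (an assumption for illustration, not trained embeddings).
# The third coordinate plays the role of a "capital -> country" direction.
VECTORS = {
    "paris":   (1.0, 0.0, 0.0),
    "france":  (1.0, 0.0, 1.0),
    "berlin":  (0.0, 1.0, 0.0),
    "germany": (0.0, 1.0, 1.0),
    "rome":    (1.0, 1.0, 0.0),
    "italy":   (1.0, 1.0, 1.0),
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c):
    """Return the word d such that 'a is to b as c is to d'."""
    # Apply the offset (b - a) to c, then look for the closest remaining word.
    query = tuple(vb - va + vc
                  for va, vb, vc in zip(VECTORS[a], VECTORS[b], VECTORS[c]))
    candidates = [word for word in VECTORS if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(query, VECTORS[word]))


print(analogy("paris", "france", "berlin"))  # germany
```

The point of the example is only that a relation such as "capital of" can show up as a consistent direction in vector space, so that adding it to one capital lands near the matching country.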
Turning data into information
Advancements like these are of particular interest to the pharmaceutical industry because it produces large volumes of unstructured data from medical devices, wearables and sensors, health records, publications, articles, surveys, and so on. Yet about 80% of medical information remains unstructured after its creation. IDC research states that the amount of data generated by healthcare will grow at a Compound Annual Growth Rate (CAGR) of 36% through 2025.
Modern NLP techniques can extract valuable information from this ever-growing volume of data. They can be applied to virtually any type of textual document, such as electronic health records, clinical trial data, lab reports, whitepapers, medical and healthcare regulatory filings, or scientific publications and articles. While a standard keyword search only retrieves documents that researchers must then read, NLP reads the documents itself and can be used for automated entity recognition, categorization of topics and themes, summarization of long texts or of multiple documents, intention detection, or sentiment analysis.
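The difference between keyword search and entity recognition can be sketched in a few lines. The example below is a deliberately simple, rule-based stand-in: modern NLP systems use trained statistical models rather than a fixed word list. The lexicon, the sample notes, and the function names `keyword_search` and `extract_entities` are all hypothetical and exist only for this illustration.

```python
import re

# Hypothetical mini-lexicon; a production system would use a trained model
# or a curated medical vocabulary instead.
LEXICON = {
    "DRUG": ["metformin", "insulin"],
    "CONDITION": ["type 2 diabetes", "hypertension"],
}

# Made-up sample notes, not real patient data.
DOCS = [
    "Patient with type 2 diabetes was started on Metformin.",
    "Follow-up visit: blood pressure stable, no medication changes.",
    "History of hypertension; insulin therapy discussed.",
]


def keyword_search(docs, keyword):
    """Return indices of documents containing the keyword (a human still reads them)."""
    return [i for i, doc in enumerate(docs) if keyword.lower() in doc.lower()]


def extract_entities(text):
    """Return (label, term) pairs found in the text, in order of appearance."""
    found = []
    for label, terms in LEXICON.items():
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                found.append((match.start(), label, match.group(0).lower()))
    return [(label, term) for _, label, term in sorted(found)]


print(keyword_search(DOCS, "insulin"))
for doc in DOCS:
    print(extract_entities(doc))
```

Keyword search returns a list of documents to read, whereas the extraction step returns labeled items that can be counted, filtered, and aggregated across thousands of documents.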
Using NLP, unstructured medical data can be organized and summarized. This enables the fast review, analysis, and visualization of large amounts of data in an easy-to-digest format. For researchers, this often means enormous time savings and an improved basis for decision-making.
Speeding up patient recruitment for clinical trials
Another important challenge NLP can help to overcome is poor patient recruitment, which is the largest cost driver in clinical trials. Current estimates suggest that almost 85% of clinical trials fail to retain enough patients for successful study conduct. Patient recruitment and retention problems have been associated with massive delays: over 90% of clinical trials fail to meet their predetermined completion dates because of poor participant accrual and excessive subject dropout. For a blockbuster drug, this can easily mean millions of USD in capital losses per day. Given that the number of clinical trials is rapidly increasing, which means that more patients are needed, it comes as no surprise that the patient recruitment services market is expected to grow at an annualized rate of nearly 4% to reach $5.3 billion by 2030.
To address the issue of low patient recruitment rates, a proven solution is for clinical research organizations (CROs) to work with physicians and identify influencers who can help with patient enrollment. But how do CROs do this efficiently and ensure the timely market release of a pharmaceutical product?
Combining NLP and the social graph technique
The answer lies in combining NLP with an approach many brands use when marketing their products: the social graph technique. The term refers to a method of data analysis that uses social networks to find influencers, that is, people engaging with the largest and most relevant audience on social media. A social graph is most often represented as a map of nodes (influencers and followers) connected by lines (various kinds of subscriptions on social media).
The most famous social graph is the Facebook social graph, connecting over 2.7 billion monthly active users.
Similar social graphs can be created for different industries and organizations, including the pharmaceutical industry. To do so, multiple data sources can be used, including public databases such as PubMed, ClinicalTrials.gov, or H-CUP, and websites such as Google Scholar, vitals.com, and ratemds.com. They can be enriched, for instance, with anonymized records of overall patient flows from medical practices or with the social media activities of physicians. With the help of NLP, the data from these sources can be structured, semantically parsed, and pre-processed to extract keywords and the relationships between nodes.
Once the results are ranked by impact and visualized in the form of heat maps, CROs can easily find the doctors with the most relevant audiences, based on their area of expertise and geographical location, and then advertise to them directly. These doctors can then inform their patients about an opportunity to take part in a clinical trial that could help with their health issues. Because the CROs do not know any personal medical details and cannot address the patients directly, the patients’ privacy is preserved at all times.
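How physicians might be ranked by impact in such a graph can be sketched as follows. The article does not say which ranking measure is used; the sketch assumes PageRank, one common way to score nodes in a graph, implemented here as a plain power iteration. The node names, the edge list, the `PROFILES` metadata, and the function names are all hypothetical.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Score nodes of a directed graph given as (source, target) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for src in nodes:
            # A node without outgoing links spreads its score over all nodes.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


def rank_physicians(edges, profiles, specialty, region):
    """Return physicians matching specialty and region, highest score first."""
    scores = pagerank(edges)
    matches = [name for name, (spec, reg) in profiles.items()
               if spec == specialty and reg == region and name in scores]
    return sorted(matches, key=lambda name: scores[name], reverse=True)


# Hypothetical graph: an edge (a, b) means "a follows or refers to b".
EDGES = [
    ("follower_a", "dr_x"), ("follower_b", "dr_x"), ("follower_b", "dr_z"),
    ("follower_c", "dr_x"), ("follower_c", "dr_y"), ("dr_y", "dr_x"),
]
# Hypothetical metadata: physician -> (area of expertise, location).
PROFILES = {
    "dr_x": ("oncology", "Boston"),
    "dr_y": ("oncology", "Boston"),
    "dr_z": ("cardiology", "Boston"),
}

print(rank_physicians(EDGES, PROFILES, "oncology", "Boston"))
```

The follower nodes carry no personal details in this sketch; only the structure of the connections is used, which matches the privacy point made above.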
Improving market penetration and spending
Another way to make use of NLP and data science in the pharma sector is to screen publications for authors who have conducted a considerable amount of research on a topic and who can therefore make a valuable contribution to a study. Visualized in a heat map, the results let employees of clinical research organizations (CROs) see at a glance an author’s authority on a certain topic. The same analysis can be used to see the connections between investigators and to invite investigators who have not been approached before (for example, if they have conducted research on a corresponding topic) to participate in a clinical study. This knowledge can help sponsor companies increase their international and domestic market penetration and spend less on marketing, because they can allocate their resources more effectively.
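The data behind such a heat map can be thought of as an author-by-topic table of publication counts. The sketch below builds one from a handful of made-up publication records. Assigning topics by keyword matching is a simplification that stands in for NLP-based topic categorization, and the names `TOPIC_KEYWORDS`, `PUBLICATIONS`, `topics_of`, and `authority_matrix` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical topic keywords; a real pipeline would use a trained classifier.
TOPIC_KEYWORDS = {
    "oncology": ("tumor", "chemotherapy"),
    "cardiology": ("cardiac", "hypertension"),
}

# Made-up publication records for illustration only.
PUBLICATIONS = [
    {"authors": ["A. Author", "B. Author"],
     "title": "Chemotherapy response in solid tumor patients"},
    {"authors": ["A. Author"],
     "title": "Tumor markers in early diagnosis"},
    {"authors": ["B. Author"],
     "title": "Cardiac outcomes under hypertension treatment"},
]


def topics_of(title):
    """Return every topic with at least one keyword in the title."""
    text = title.lower()
    return [topic for topic, keywords in TOPIC_KEYWORDS.items()
            if any(keyword in text for keyword in keywords)]


def authority_matrix(publications):
    """Count publications per author and topic; each count is one heat map cell."""
    counts = defaultdict(Counter)
    for pub in publications:
        for topic in topics_of(pub["title"]):
            for author in pub["authors"]:
                counts[author][topic] += 1
    return counts


matrix = authority_matrix(PUBLICATIONS)
for author in sorted(matrix):
    print(author, dict(matrix[author]))
```

Higher counts correspond to "hotter" cells, so an author with many publications on one topic stands out immediately.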
If you find this topic interesting and would like to find out more, please have a look at our whitepaper “NLP for investigator recruitment”.
If you want to find out more about the opportunities of virtual clinical trials and patient portals, or how big data and predictive analytics can be used to improve the chances of clinical research success, we recommend another of our whitepapers: “Digital Clinical Trial Management”.