This post contains more details about the inner workings of ONCOgen, a software program that can generate New England Journal of Medicine-formatted clinical trials. You can read more on the ONCOgen page or at the first post in this series.
When you pick up a scientific paper the expectation is that a certain, specific set of information is included in the manuscript that allows the reader to understand why the experiment was performed, what experiment was performed, what the results of the experiment were, and how they were analyzed. While we expect the authors to provide their interpretation of the results and their significance, we as the readers rightfully expect to have enough information to judge the results for ourselves.
In broad strokes, the information in every clinical trial comes through in four major sections:
- Background & Introduction
Each of these can be subdivided further. For example, a Methods section may include any or all of the items below:
- Patient Population
- Study Design
- End points
- Statistical Analysis
- Study Oversight
While each section might be worded a little differently or subsections might be more or less inclusive, everything should be there. You can keep breaking this down further: a sentence or two about inclusion criteria; a sentence about the a priori significance level; and a few sentences describing the target disease with lab values or specific biomarker requirements.
For the purposes of writing a fake paper, the high-level structure of a paper seems like an easy target; even the paragraph level seems pretty straightforward. But there are practically a million ways to say, well, stuff. How do you make semi-convincing fake verbiage?
Well, we have to zoom in a little more for that.
Every elementary student learns about parts of speech—nouns, verbs, adjectives, pronouns—that kind of thing. We also learn intuitively by listening to the people around us speak and in this way we learn the rules of our language, how they work, and when they apply. We learn that—say, the sentence—
Suzy goes to the park.
—can just as easily be used to say that Bob is going to the park or that Suzy is going to the mall. In a broad sense we can say that this sentence is more generalizable:
PROPERNOUN VERB to the NOUN
We could include ‘to’ (a preposition) and ‘the’ (a definite article) but they are beyond the scope of this explanation. For now, let’s stick to the major ones.
Now, what if we had a list of PROPERNOUNs and VERBs and NOUNs? Now we do!
- Bruce Willis
- Sky city
By selecting a word at random from each list we can create sentences that technically fulfill the general rules that we laid out. All of them are the same type of word as their parent, but problems quickly become apparent:
Batman builds to the radio.
Suzy eats to the park.
Boeing bounces to the garbage.
Just like a Mad-Lib, once we replace a word in the basic sentence with a random word that is technically a VERB or a NOUN, all of our sentences just sound wrong.
You might be unsurprised to learn that context matters in language. We cannot reasonably expect to replace NOUN with any noun; it has to be a noun that makes sense in the context of the rest of the sentence. ‘Goes’ pairs with a place that you would go to (like a park or a beach) and ‘eats’ pairs with something you would eat (like a banana or a cake), not something intangible like ‘trust’ or ‘rules,’ as interesting as that possibility might be.
On top of this, ‘goes’ will usually be accompanied by a word that gives the reader more information about the VERB. Words like ‘around,’ ‘above,’ and ‘in-between’ are examples of adverbs, words that modify a VERB.
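The Mad-Lib approach can be sketched in a few lines of Python. The word lists below are an assumption pieced together from the example sentences in this post (they are not ONCOgen's actual vocabulary), and `fill_template` is a hypothetical helper name.

```python
import random

# Hypothetical word lists, assembled from the example sentences in this post.
WORDS = {
    "PROPERNOUN": ["Suzy", "Batman", "Boeing", "Bruce Willis"],
    "VERB": ["goes", "builds", "eats", "bounces"],
    "NOUN": ["park", "radio", "garbage", "sky city"],
}


def fill_template(template, rng=random):
    """Replace each placeholder token with a random word of that type."""
    out = []
    for token in template.split():
        if token in WORDS:
            out.append(rng.choice(WORDS[token]))
        else:
            out.append(token)  # literal words such as 'to' and 'the' pass through
    return " ".join(out) + "."


if __name__ == "__main__":
    for _ in range(3):
        print(fill_template("PROPERNOUN VERB to the NOUN"))
```

Every sentence this prints is grammatical by type, and most of them are nonsense, which is exactly the problem described above.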
Suzy goes around the park.
Which we can generalize to—
PROPERNOUN VERB ADVERB the NOUN.
Suzy goes sweetly the park.
Just as before, ‘sweetly’ is not a word that can modify ‘goes.’ Again, we intuit this as participants in our shared language, and the same holds for all other parts of speech. Context (our gradually accumulated knowledge of what a given noun or verb or whatever is and what properties it can have) determines the words we use, how we construct thoughts, and the way we communicate. It follows that context-free grammar is just what it sounds like: a basic consideration for the type of word rather than the meaning of the word.
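One way to picture what context adds is a lookup that restricts which nouns may follow a given verb. This is only an illustration of the idea, not a description of how ONCOgen works; the pairings and the `sensible_sentence` name are assumptions based on the examples above.

```python
import random

# Hypothetical pairings: each verb phrase lists only the nouns that make
# sense after it. A type-only grammar has no such table.
COMPATIBLE_NOUNS = {
    "goes to": ["park", "beach", "mall"],
    "eats": ["banana", "cake"],
    "builds": ["radio"],
}


def sensible_sentence(subject, rng=random):
    """Pick a verb phrase, then a noun only from that verb's compatible list."""
    verb = rng.choice(sorted(COMPATIBLE_NOUNS))
    noun = rng.choice(COMPATIBLE_NOUNS[verb])
    return f"{subject} {verb} the {noun}."


if __name__ == "__main__":
    print(sensible_sentence("Suzy"))
```

With this table ‘Suzy eats the park’ can never come out, at the cost of having to write down every sensible pairing by hand.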
Linking Things Together
Remember ‘to’ and ‘the’ from earlier? Writing to you right now is dependent on the proper use of articles, prepositions, and conjunctions (such as to, a, the, on, at, and, as a mixed list) that link all of the other words together properly.
‘Batman builds to the radio’ suddenly makes considerably more sense when the proper link between the verb and the noun is used: ‘Batman builds the radio.’ Like before, we intuitively lump verbs and nouns together along with the modifiers required for their use. Consider a non-native English speaker, my terrible attempts at verbal Spanish, or a child speaking for the first time: it is easy to miss those necessary linker words that explicitly lay out the meaning of the sentence. Consider traveling on a car, in a car, or to a car. All have drastically different meanings and, importantly, levels of comfort on the freeway.
Building it Up
Once we’ve linked our sentence together we can start to generalize even further. For instance take ‘goes.’ ‘Goes’ will almost always need an adverb that describes movement or position that immediately follows it. We can develop a simple rule that says something like:
NEWVERB = VERB + MOVEMENTADVERB
Now we have nested rules. Take our sentence from before and tweak it a bit to:
PROPERNOUN NEWVERB the NOUN
When we say we want a NEWVERB, we are really asking for the short phrase VERB + MOVEMENTADVERB. We can select one word from each of the nested lists to get something like ‘eats above’ that we insert in the place of NEWVERB. Take these examples:
Batman goes around the radio.
Batman eats in-between the radio.
Well, okay, so it isn’t perfect: now we have the problem of plurality. You can correct this by creating additional rules that govern singular and plural forms of words, but I will not cover that here. Let’s start thinking bigger, shall we, with a rule like:
CLAUSE = PROPERNOUN + NEWVERB + the + NOUN
Now, whenever I want a sentence I only need a ‘CLAUSE.’ Just like the lists of nouns and verbs we can flip a coin and get the entire example from above by writing ‘CLAUSE.’
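Nested rules like NEWVERB and CLAUSE map naturally onto a small recursive function. The sketch below is a minimal illustration under assumed word lists; the `GRAMMAR` table and the `expand` function are hypothetical names, not ONCOgen's code.

```python
import random

# Hypothetical grammar: each symbol maps to a list of alternatives, and each
# alternative is a sequence of other symbols or literal words.
GRAMMAR = {
    "PROPERNOUN": [["Suzy"], ["Batman"], ["Boeing"], ["Bruce Willis"]],
    "VERB": [["goes"], ["eats"], ["bounces"]],
    "MOVEMENTADVERB": [["around"], ["above"], ["in-between"]],
    "NOUN": [["park"], ["radio"], ["garbage"]],
    "NEWVERB": [["VERB", "MOVEMENTADVERB"]],
    "CLAUSE": [["PROPERNOUN", "NEWVERB", "the", "NOUN"]],
}


def expand(symbol, rng=random):
    """Recursively expand a symbol into a list of words."""
    if symbol not in GRAMMAR:
        return [symbol]  # a literal word, nothing left to expand
    alternative = rng.choice(GRAMMAR[symbol])
    words = []
    for part in alternative:
        words.extend(expand(part, rng))
    return words


if __name__ == "__main__":
    print(" ".join(expand("CLAUSE")) + ".")
```

Asking for a CLAUSE triggers the expansion of NEWVERB, which in turn triggers VERB and MOVEMENTADVERB, so one request yields a whole phrase.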
Next, consider these sentence patterns:
CLAUSE and CLAUSE
CLAUSE because CLAUSE
Suzy goes to the park and Batman builds the radio.
Bruce Willis bounces to the sky city because Boeing goes to the park.
Suddenly we are able to create complex sentences based on a few simple rules and a very basic vocabulary. While they are absolutely devoid of meaning they nonetheless fulfill the structural requirements of the language. We can even make rules out of the sentences above:
SENTENCE = CLAUSE + and + CLAUSE
SENTENCE = CLAUSE + because + CLAUSE
If we add one more rule, a list of REVERSAL phrases:
- In other words
- Despite this
We can construct rules for complex, compound sentences such as:
BIGSENTENCE = SENTENCE; + REVERSAL, + SENTENCE
Bruce Willis bounces to the sky city because Boeing goes to the park; in other words, Suzy goes to the park and Batman builds the radio.
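The SENTENCE and BIGSENTENCE rules can be sketched the same way: a symbol with two alternatives, plus one function that adds the punctuation. The vocabulary is again an assumption drawn from the examples in this post, and `expand` and `big_sentence` are hypothetical names.

```python
import random

# Hypothetical grammar; SENTENCE has two alternatives to choose between.
GRAMMAR = {
    "PROPERNOUN": [["Suzy"], ["Batman"], ["Boeing"], ["Bruce Willis"]],
    "VERB": [["goes"], ["bounces"]],
    "NOUN": [["park"], ["sky city"]],
    "REVERSAL": [["in other words"], ["despite this"]],
    "CLAUSE": [["PROPERNOUN", "VERB", "to", "the", "NOUN"]],
    "SENTENCE": [["CLAUSE", "and", "CLAUSE"], ["CLAUSE", "because", "CLAUSE"]],
}


def expand(symbol, rng=random):
    """Recursively expand a symbol into a string of words."""
    if symbol not in GRAMMAR:
        return symbol  # literal word
    parts = rng.choice(GRAMMAR[symbol])
    return " ".join(expand(part, rng) for part in parts)


def big_sentence(rng=random):
    """BIGSENTENCE = SENTENCE; + REVERSAL, + SENTENCE"""
    first = expand("SENTENCE", rng)
    reversal = expand("REVERSAL", rng)
    second = expand("SENTENCE", rng)
    return f"{first}; {reversal}, {second}."


if __name__ == "__main__":
    print(big_sentence())
```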
What does this mean for these clinical trials?
Even from this very basic example you can see how it is possible to create large volumes of somewhat diverse text from a fairly small vocabulary (just 16 words in this case) and a small number of grammar rules. When we consider our clinical trial format it is a matter of assembling the correct vocabulary, explicitly questioning how these terms interact, then designing the rules for each subsection. Medical terminology is fun; with the Greek and Latin roots, positional terms, prefixes, suffixes, and goofy intra- and interword linking requirements, there is the potential to generate the vocabulary dynamically as well. Enzyme names, disease states, and lab values are all examples of elements that are particularly amenable to this strategy.
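As a rough illustration of generating vocabulary dynamically, the sketch below glues word roots to suffixes and inserts a combining ‘o’ when the suffix starts with a consonant. The root and suffix lists, and the names `join_parts` and `make_term`, are assumptions for this example rather than anything taken from ONCOgen, and the output is not guaranteed to be a real medical term.

```python
import random

# Hypothetical, deliberately tiny lists of roots and suffixes.
ROOTS = ["hepat", "nephr", "gastr", "neur", "oste"]
SUFFIXES = ["itis", "oma", "pathy", "megaly", "logy"]


def join_parts(root, suffix):
    """Join a root and suffix, adding a combining 'o' before a consonant."""
    linker = "" if suffix[0] in "aeiou" else "o"
    return root + linker + suffix


def make_term(rng=random):
    """Build one plausible-sounding term from a random root and suffix."""
    return join_parts(rng.choice(ROOTS), rng.choice(SUFFIXES))


if __name__ == "__main__":
    print([make_term() for _ in range(5)])
```

Five roots and five suffixes already give 25 possible terms, which hints at how quickly a generated vocabulary can outgrow a hand-written list.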
Once the grammar rules are built (ONCOgen currently has several hundred for phrase, clause, and sentence types) and the vocabulary assembled it is a simple matter to link them together into paragraphs and format the document [side note: it took me nearly as long to get the formatting correct as it took to do everything else]. For example, the Methods section at its highest level looks like this:
METHOD METHODI METHODE METHOD METHOD METHOD METHOD METHOD
METHOD can be anything from a blank to a complex, multi-clause sentence. The beauty of context-free grammar is that you get a different, interesting result every single time for everything from the drugs used to the people “writing” the trial to the little details in the Acknowledgements.
Variants and other possibilities
Another advantage of the process is that, once built, it can be quickly adapted to different sub-specialties by altering the underlying vocabulary while retaining the grammar rules. NATgen, a work-in-progress variant inspired by the story of Britt Marie Hermes, can create naturopathy/homeopathy-themed papers in the same format by swapping out the medications for homeopathic preparations [work-in-progress example one, and example two].
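The vocabulary swap can be pictured as keeping one table of rules and passing in a different table of words. Everything in this sketch is a made-up placeholder (the rule, both vocabularies, and the `generate` function); it only shows the separation, not how ONCOgen or NATgen is actually organized.

```python
import random

# Grammar rules shared by every variant (hypothetical, for illustration only).
RULES = {
    "SENTENCE": [["Patients", "received", "TREATMENT", "SCHEDULE"]],
}

# Two interchangeable vocabularies; both are invented placeholders.
ONCOLOGY_VOCAB = {
    "TREATMENT": [["the study drug"], ["standard chemotherapy"]],
    "SCHEDULE": [["once daily"], ["every three weeks"]],
}
NATUROPATHY_VOCAB = {
    "TREATMENT": [["a homeopathic preparation"], ["a diluted tincture"]],
    "SCHEDULE": [["once daily"], ["twice weekly"]],
}


def generate(symbol, vocab, rng=random):
    """Expand a symbol using the shared rules plus the chosen vocabulary."""
    table = {**RULES, **vocab}
    if symbol not in table:
        return symbol  # literal text
    parts = rng.choice(table[symbol])
    return " ".join(generate(part, vocab, rng) for part in parts)


if __name__ == "__main__":
    print(generate("SENTENCE", ONCOLOGY_VOCAB) + ".")
    print(generate("SENTENCE", NATUROPATHY_VOCAB) + ".")
```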
Note: I am not a linguist nor a computer scientist and this is meant to be an enormous simplification of these concepts as I understand them. I encourage you to share your thoughts, any corrections, and comments below.