About Papereader

At Mana.bio, we use Papereader to automate and scale up our scientific paper screening process.

Papereader is a text analysis tool that automatically extracts relevant data from research papers and patents.

It currently identifies dozens of properties in the drug delivery domain, such as payloads, organs, formulation ratios, and types of experiments performed. It is customizable and extensible to support other domains and new categories; see the Configuration section below for more details.

We leverage it for two use cases:

  1. Single paper mode: we get a TL;DR version of a paper of interest before diving into it.
  2. Batch mode: we input a batch of candidate papers along with our priorities, and receive statistics and recommendations regarding which papers are the most relevant to our needs. For example, we can prioritize papers containing LNPs (among other carriers) reaching the lungs, along with the formulation ratios and modifications involved.

The web app currently supports single paper mode only; batch mode will be added in the future.


Configuration

Papereader has a structured configuration file in the TOML format. It contains a list of categories of interest (e.g., payload, organ, and N:P ratio), where each category can be shallow or nested:

  • Shallow means it contains a list of keywords (e.g., cell_analysis contains FACS, flow cytometry, cell sorting, etc.).
  • Nested means it contains a list of subcategories (e.g., organ contains lung, heart, liver, etc.), where each subcategory contains a list of keywords (e.g., lung contains lung, pulmonary, and bronchial).

Examples

Shallow category:

[category.cell_analysis]
keywords = [
    "FACS",
    "flow cytometry",
    "cell sorting",
    "cell separation",
    "cell purification",
    "cell sorter",
]

Nested category:

[category.organ]

[category.organ.heart]
keywords = ["heart", "aorta", "coronary artery"]

[category.organ.immune]
keywords = ["immune", "lymphocyte"]

Notes

  • Keywords are case-insensitive and are searched for as-is (e.g., T cell and t cell will yield the same result).
  • Plural keywords are implicitly added by default (e.g., heart will also match hearts, and mouse will also match mice). This can be disabled by setting allow_plural = false in the (sub)category.
  • Category names are converted to a readable form for the output by default (e.g., heart will become Heart, and nano_aggregate will become Nano aggregate). You can specify a different name using the name property. For example, to turn the n_to_p_ratio category into the more readable N:P ratio use:
[category.n_to_p_ratio]
name = "N:P ratio"
  • If no keywords are specified for a (sub)category, the readable form of the category name is implicitly used as a keyword. For example, the following category will implicitly contain size as its only keyword:
[category.physicochemical_property.size]
  • There are several special, custom-coded categories:
  1. most_common_words: identifies common words in the document, excluding English stopwords and stopwords specified in the configuration file, for example:
[stopwords]
words = [
    "figure",
    "fig",
    "experiment",
    "sample",
]
  2. expression_measurement_time: identifies the amount of time passed between the treatment and the expression measurement. Example configuration:
[category.expression_measurement_time]

[category.expression_measurement_time.hour]
keywords = ["hour", "h", "hr"]
  3. formulation_ratio: identifies formulation ratios. Currently, we output actual ratios, as well as passages where ratios are mentioned if we identify ratios that weren't identified before (e.g., ratios described in words instead of numbers).
  4. n_to_p_ratio: identifies N:P ratios. The specified keywords are used to identify these ratios (e.g., w/w). Currently, we output passages where N:P ratios are mentioned, but we do not extract the actual ratios.
  5. study_type: identifies mentions related to the type of study (i.e., in vitro, in vivo, and ex vivo).
  6. toxicity: identifies toxicity-related keywords.
  • If you want to merge the counts of all keywords (or subcategories) of a category into a single count, specify combine_keyword_counts = true under the category. For example, using this config:
[category.cell_analysis]
keywords = [
    "FACS",
    "flow cytometry",
    "cell sorting",
    "cell separation",
    "cell purification",
    "cell sorter",
]

Suppose the output includes FACS: 4 and Flow cytometry: 8. Changing the config to:

[category.cell_analysis]
combine_keyword_counts = true
keywords = [
    "FACS",
    "flow cytometry",
    "cell sorting",
    "cell separation",
    "cell purification",
    "cell sorter",
]

will result in Cell analysis: 12.
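
The notes above can be illustrated with a short Python sketch covering readable names, the implicit keyword fallback, case-insensitive matching with optional plurals, and combine_keyword_counts. It is only an illustration under stated assumptions, not Papereader's implementation: all function names are hypothetical, matching on whole words is an assumption, and the plural rule merely allows a trailing "s" (so it would not cover irregular forms such as mouse/mice).

```python
import re
from collections import Counter


def readable(name):
    """Turn an identifier into a display name, e.g. nano_aggregate -> Nano aggregate."""
    text = name.replace("_", " ")
    return text[:1].upper() + text[1:]


def effective_keywords(name, settings):
    """Use the configured keywords, or fall back to the readable category name."""
    return settings.get("keywords") or [readable(name)]


def count_keywords(text, keywords, allow_plural=True):
    """Count case-insensitive, whole-word matches of each keyword in text."""
    counts = Counter()
    for keyword in keywords:
        # Naive plural rule (an assumption): optionally allow a trailing "s".
        suffix = "s?" if allow_plural else ""
        pattern = r"\b" + re.escape(keyword) + suffix + r"\b"
        hits = len(re.findall(pattern, text, flags=re.IGNORECASE))
        if hits:
            counts[readable(keyword)] = hits
    return counts


def summarize(category, keywords, text, combine_keyword_counts=False):
    """Report per-keyword counts, or one merged count under the category name."""
    counts = count_keywords(text, keywords)
    if combine_keyword_counts:
        return {readable(category): sum(counts.values())}
    return dict(counts)


text = "FACS was used. Flow cytometry and facs confirmed it; flow cytometry again."
keywords = ["FACS", "flow cytometry", "cell sorting"]

separate = summarize("cell_analysis", keywords, text)
combined = summarize("cell_analysis", keywords, text, combine_keyword_counts=True)
print(separate)  # {'FACS': 2, 'Flow cytometry': 2}
print(combined)  # {'Cell analysis': 4}
```

As in the FACS: 4 plus Flow cytometry: 8 example, the combined figure is simply the sum of the per-keyword counts, reported under the readable category name.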