About Papereader
At Mana.bio, we use papereader to automate and scale up our scientific paper screening process.
Papereader is a text analysis tool that automatically extracts relevant data from research papers and patents.
It currently identifies dozens of properties in the drug delivery domain; a few examples are: payloads, organs, formulation ratios, and types of experiments performed. It is customizable and extensible to support other domains and new categories; see the Configuration section below for more details.
We leverage it for two use cases:
- Single paper mode, where we get a TL;DR version of a paper of interest before diving into it.
- Batch mode, where we input a batch of candidate papers along with our priorities, and receive statistics and recommendations regarding which papers are the most relevant to our needs. For example, we can prioritize papers containing LNPs (among other carriers) reaching the lungs, along with the formulation ratios and modifications involved.
The web app currently supports single paper mode only; batch mode will be added in the future.
Configuration
papereader has a structured configuration file in the TOML format. It contains a list of categories of interest (e.g., payload, organ, and N:P ratio), where each category can be shallow or nested:
- Shallow means it contains a list of keywords (e.g., cell_analysis contains FACS, flow cytometry, cell sorting, etc.).
- Nested means it contains a list of subcategories (e.g., organ contains lung, heart, liver, etc.), where each subcategory contains a list of keywords (e.g., lung contains lung, pulmonary, and bronchial).
Examples
Shallow category:
[category.cell_analysis]
keywords = [
"FACS",
"flow cytometry",
"cell sorting",
"cell separation",
"cell purification",
"cell sorter",
]
Nested category:
[category.organ]
[category.organ.heart]
keywords = ["heart", "aorta", "coronary artery"]
[category.organ.immune]
keywords = ["immune", "lymphocyte"]
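To make the shallow/nested distinction concrete, here is a minimal Python sketch that walks a configuration of this shape and lists every keyword group. It is not papereader's actual code: the `config` dictionary is a hand-written stand-in for what a TOML parser would return, the helper names `is_shallow` and `iter_keyword_groups` are hypothetical, and the rule "a category is shallow when it has its own keywords list" is an assumption that ignores categories declared without keywords.

```python
# Hypothetical in-memory form of the two example categories above,
# i.e. roughly what a TOML parser would return for the [category.*] tables.
config = {
    "category": {
        "cell_analysis": {"keywords": ["FACS", "flow cytometry", "cell sorting"]},
        "organ": {
            "heart": {"keywords": ["heart", "aorta", "coronary artery"]},
            "immune": {"keywords": ["immune", "lymphocyte"]},
        },
    }
}


def is_shallow(category):
    """Assumption: a category is shallow when it carries its own keyword list."""
    return "keywords" in category


def iter_keyword_groups(config):
    """Yield (category, subcategory or None, keywords) for every keyword list."""
    for cat_name, category in config["category"].items():
        if is_shallow(category):
            yield cat_name, None, category["keywords"]
        else:
            for sub_name, sub in category.items():
                if isinstance(sub, dict):  # skip scalar options set on the category
                    yield cat_name, sub_name, sub.get("keywords", [])


groups = list(iter_keyword_groups(config))
for group in groups:
    print(group)
```

The shallow category yields a single group with no subcategory, while the nested one yields one group per subcategory.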
Notes
- Keywords are case-insensitive and are otherwise searched for as-is (e.g., T cell and t cell will yield the same result).
- Plural forms of keywords are implicitly added by default (e.g., heart will also match hearts, and mouse will also match mice). This can be disabled by setting allow_plural = false in the (sub)category.
- Category names are converted to a readable form for the output by default (e.g., heart will become Heart, and nano_aggregate will become Nano aggregate). You can specify a different name using the name property. For example, to turn the n_to_p_ratio category into the more readable N:P ratio, use:
[category.n_to_p_ratio]
name = "N:P ratio"
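The naming rule can be sketched in a few lines of Python. This is an illustration inferred from the examples in this README, not papereader's implementation: the function `readable_name` and its `options` argument are hypothetical, and the rule (underscores become spaces, only the first character is upper-cased, an explicit name wins) is an assumption.

```python
def readable_name(key, options=None):
    """Return the display name for a category key.

    Assumed rule: an explicit `name` option wins; otherwise underscores
    become spaces and only the first character is upper-cased.
    """
    options = options or {}
    if "name" in options:
        return options["name"]
    text = key.replace("_", " ")
    return text[:1].upper() + text[1:]


print(readable_name("heart"))                                # Heart
print(readable_name("nano_aggregate"))                       # Nano aggregate
print(readable_name("n_to_p_ratio", {"name": "N:P ratio"}))  # N:P ratio
```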
- In case no keywords are specified for a (sub)category, the category name's readable form is implicitly used as a keyword. For example, the following category will implicitly contain size as its only keyword:
[category.physicochemical_property.size]
- There are several special, custom-coded categories:
  - most_common_words: identifies common words in the document, excluding English stopwords and stopwords specified in the configuration file, for example:
[stopwords]
words = [
"figure",
"fig",
"experiment",
"sample",
]
  - expression_measurement_time: identifies the amount of time passed between the treatment and the expression measurement. Example configuration:
[category.expression_measurement_time]
[category.expression_measurement_time.hour]
keywords = ["hour", "h", "hr"]
  - formulation_ratio: identifies formulation ratios. Currently, we output actual ratios, as well as passages where ratios are mentioned if we identify ratios that weren't identified before (e.g., ratios described in words instead of numbers).
  - n_to_p_ratio: identifies N:P ratios. The specified keywords are used to identify these ratios (e.g., w/w). Currently, we output passages where N:P ratios are mentioned, but we do not extract the actual ratios.
  - study_type: identifies mentions related to the type of study (i.e., in vitro, in vivo, and ex vivo).
  - toxicity: identifies toxicity-related keywords.
- If you want to merge the counts of all subcategories into a single one, specify combine_keyword_counts = true under the category. For example, using this config:
[category.cell_analysis]
keywords = [
"FACS",
"flow cytometry",
"cell sorting",
"cell separation",
"cell purification",
"cell sorter",
]
Let's say we got FACS: 4 and Flow cytometry: 8.
Changing the config to:
[category.cell_analysis]
combine_keyword_counts = true
keywords = [
"FACS",
"flow cytometry",
"cell sorting",
"cell separation",
"cell purification",
"cell sorter",
]
will result in Cell analysis: 12.
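The effect of combine_keyword_counts can be illustrated with a short Python sketch. It is a simplified stand-in rather than papereader's real matching logic: `count_keywords`, its parameters, and the sample sentence are all hypothetical, matching is plain case-insensitive whole-word search, and plural handling is left out.

```python
import re
from collections import Counter


def count_keywords(text, keywords, combine=False, category_label="Cell analysis"):
    """Count case-insensitive whole-word keyword hits in text.

    With combine=True, the per-keyword counts are summed into one entry
    under the category's readable name (assumed behaviour).
    """
    counts = Counter()
    for kw in keywords:
        pattern = r"\b" + re.escape(kw) + r"\b"
        hits = len(re.findall(pattern, text, flags=re.IGNORECASE))
        if hits:
            # Display the keyword with its first character upper-cased.
            counts[kw[:1].upper() + kw[1:]] += hits
    if combine:
        return {category_label: sum(counts.values())}
    return dict(counts)


# Made-up sentence for illustration only.
sample = "FACS and flow cytometry were used. Flow cytometry confirmed the facs result."
keywords = ["FACS", "flow cytometry", "cell sorting"]

separate = count_keywords(sample, keywords)
combined = count_keywords(sample, keywords, combine=True)
print(separate)
print(combined)
```

On the sample sentence, the separate counts are reported per keyword, and the combined result is their sum under the single category name.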