explorer

A tool for creating rule-based text classifiers.

input template

Make a copy of this Google Sheet.

supported rules

In spaCy, a token is an individual unit of text produced by the library’s tokenizer. Tokenization typically splits a text into words, punctuation marks, and other meaningful units. Each rule below tests a property of a single token.
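To make the idea of a token concrete, here is a minimal sketch of spaCy tokenization. It assumes spaCy is installed and uses a blank English pipeline (`spacy.blank("en")`) so that no trained model has to be downloaded; the sample sentence is made up for illustration.

```python
import spacy

# A blank English pipeline contains only the tokenizer (no tagger, parser, or NER).
nlp = spacy.blank("en")

# Illustrative sentence; any text works here.
doc = nlp("Hello, world! It costs 42 dollars.")

# Each element of the Doc is a token: a word, a number, or a punctuation mark.
tokens = [token.text for token in doc]
print(tokens)
```

Note that punctuation is split off into tokens of its own, so a rule such as IS_PUNCT applies to the comma separately from the word before it.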

rules you’ll probably use

Rule Description
ORTH, ORTH_IN, ORTH_NOT_IN The original text of the token. For _IN and _NOT_IN, provide a comma-separated list.
TEXT, TEXT_IN, TEXT_NOT_IN Same as ORTH: the original text of the token. For _IN and _NOT_IN, provide a comma-separated list.
LOWER, LOWER_IN, LOWER_NOT_IN The lowercase form of the token. For _IN and _NOT_IN, provide a comma-separated list.
LENGTH The number of characters in the token.
IS_ALPHA Indicates if the token consists of alphabetic characters.
IS_ASCII Indicates if the token consists of ASCII characters.
IS_DIGIT Indicates if the token consists of digits.
IS_LOWER Indicates if the token is in lowercase.
IS_UPPER Indicates if the token is in UPPERCASE.
IS_TITLE Indicates if the token is in Title Case.
IS_PUNCT Indicates if the token consists of punctuation characters.
IS_SPACE Indicates if the token consists of whitespace characters.
IS_STOP Indicates if the token is a stop word.
IS_SENT_START Indicates if the token is the start of a sentence.
LIKE_NUM Indicates if the token resembles a number.
LIKE_URL Indicates if the token resembles a URL.
LIKE_EMAIL Indicates if the token resembles an email address.
LEMMA, LEMMA_IN, LEMMA_NOT_IN The lemma form of the token. For example, the lemma “build” represents “builds”, “building”, “built”, etc. For _IN and _NOT_IN, provide a comma-separated list.
OP Operator that controls how many times the token pattern may match. One of ? (optional: 0 or 1 times), + (1 or more times), * (0 or more times), ! (negated: must match 0 times).
ENT_TYPE The named entity type assigned by spaCy. Uses the out-of-the-box NER labels: CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC, MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, WORK_OF_ART.
REGEX A regular expression pattern.
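The rule names in this table mirror the token attributes of spaCy's `Matcher`, so a row of rules can be pictured as one token pattern. The sketch below is only an illustration under that assumption: how explorer itself translates a sheet into patterns is not shown here, the mapping of LOWER_IN to spaCy's `{"LOWER": {"IN": [...]}}` form is a guess, and the pattern name `REFUND_AMOUNT` and the sample sentence are hypothetical.

```python
import spacy
from spacy.matcher import Matcher

# Blank English pipeline: enough for lexical rules such as LOWER, IS_PUNCT, LIKE_NUM.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One dict per token, in order:
#   1. LOWER_IN "refund, reimbursement" (assumed to map to spaCy's IN set operator)
#   2. an optional punctuation token (IS_PUNCT combined with OP "?")
#   3. a token that looks like a number (LIKE_NUM)
pattern = [
    {"LOWER": {"IN": ["refund", "reimbursement"]}},
    {"IS_PUNCT": True, "OP": "?"},
    {"LIKE_NUM": True},
]
matcher.add("REFUND_AMOUNT", [pattern])  # hypothetical pattern name

doc = nlp("Refund: 40 dollars were returned.")

# Each match is (match_id, start, end); slice the Doc to get the matched span.
matches = [doc[start:end].text for _, start, end in matcher(doc)]
print(matches)
```

Rules that rely on model predictions, such as LEMMA, ENT_TYPE, or IS_SENT_START, would need a trained pipeline (for example one loaded with `spacy.load`) rather than a blank one.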

rules you probably won’t use (but are supported anyway)

Rule Description
NORM The normalized form of the token. !TODO
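SHAPE The shape of the token: its characters mapped to X (uppercase), x (lowercase), and d (digit). For example, “Apple” has the shape “Xxxxx”.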
POS The coarse-grained part-of-speech tag assigned by spaCy.
TAG The fine-grained part-of-speech tag assigned by spaCy.
MORPH The morphological features of the token.
DEP The dependency label of a token to its head assigned by spaCy.
SPACY Whether the token has a trailing space.
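As a rough illustration of two of the less common rules, the sketch below inspects SHAPE and SPACY directly on spaCy tokens and then matches on a shape. It assumes a blank English pipeline, which is sufficient for these two attributes; POS, TAG, MORPH, and DEP would require a trained pipeline. The sentence and the pattern name `FOUR_DIGITS` are made up for the example.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
doc = nlp("Apple paid $300 in 2021.")

# SHAPE: letters become X or x, digits become d.
# SPACY: True when the token is followed by a space (token.whitespace_ is non-empty).
for token in doc:
    print(token.text, token.shape_, bool(token.whitespace_))

# Match any token whose shape is exactly four digits, such as a year.
matcher = Matcher(nlp.vocab)
matcher.add("FOUR_DIGITS", [[{"SHAPE": "dddd"}]])  # hypothetical pattern name

years = [doc[start:end].text for _, start, end in matcher(doc)]
print(years)
```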