A tool for creating rule-based text classifiers.
Make a copy of this Google Sheet.
In the spaCy NLP library, a token is an individual unit of text produced by the library’s tokenizer. Tokenization typically splits a text into words, punctuation marks, and other meaningful units.
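
To make the idea of a token concrete, here is a minimal sketch of spaCy's tokenizer at work. It assumes spaCy is installed and uses a blank English pipeline, which contains only the tokenizer. The sample sentence is made up for illustration.

```python
import spacy

# A blank English pipeline contains only the tokenizer, so no trained
# model needs to be downloaded for this illustration.
nlp = spacy.blank("en")
doc = nlp("Hello, world!")

# Collect the text of every token the tokenizer produced.
tokens = [token.text for token in doc]

# Each token exposes lexical attributes such as its lowercase form
# and flags for alphabetic or punctuation content.
for token in doc:
    print(token.text, token.lower_, token.is_alpha, token.is_punct)
```

The sentence is split into four tokens: two words and two punctuation marks.
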

| Rule | Description |
|---|---|
| `ORTH`, `ORTH_IN`, `ORTH_NOT_IN` | The original text of the token. For `_IN` and `_NOT_IN`, provide a comma-separated list. |
| `TEXT`, `TEXT_IN`, `TEXT_NOT_IN` | Same as `ORTH`. |
| `LOWER`, `LOWER_IN`, `LOWER_NOT_IN` | The lowercase form of the token. For `_IN` and `_NOT_IN`, provide a comma-separated list. |
| `LENGTH` | The number of characters in the token. |
| `IS_ALPHA` | Whether the token consists of alphabetic characters. |
| `IS_ASCII` | Whether the token consists of ASCII characters. |
| `IS_DIGIT` | Whether the token consists of digits. |
| `IS_LOWER` | Whether the token is in lowercase. |
| `IS_UPPER` | Whether the token is in UPPERCASE. |
| `IS_TITLE` | Whether the token is in Title Case. |
| `IS_PUNCT` | Whether the token consists of punctuation characters. |
| `IS_SPACE` | Whether the token consists of whitespace characters. |
| `IS_STOP` | Whether the token is a stop word. |
| `IS_SENT_START` | Whether the token is the start of a sentence. |
| `LIKE_NUM` | Whether the token resembles a number. |
| `LIKE_URL` | Whether the token resembles a URL. |
| `LIKE_EMAIL` | Whether the token resembles an email address. |
| `LEMMA`, `LEMMA_IN`, `LEMMA_NOT_IN` | The lemma (base form) of the token. For example, the lemma “build” covers “builds”, “building”, “built”, etc. For `_IN` and `_NOT_IN`, provide a comma-separated list. |
| `OP` | Operator that controls how often the token may occur. One of `?` (optional), `+` (occurs 1+ times), `*` (occurs 0+ times), `!` (does not occur). |
| `ENT_TYPE` | The named entity type assigned by spaCy. Uses the out-of-the-box NER classes: CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC, MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, WORK_OF_ART. |
| `REGEX` | A regular expression pattern. |
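
The rules above correspond closely to the token-pattern dictionaries accepted by spaCy's `Matcher`. The following is a rough sketch of how sheet rules could be translated into that format. It is not the tool's actual implementation. The helper names `parse_rule` and `build_token_pattern`, the accepted truthy strings, and the choice to apply `REGEX` to the token text are all assumptions made for illustration.

```python
from typing import Any, Dict

# Rules whose sheet value is read as a boolean flag.
BOOLEAN_RULES = {
    "IS_ALPHA", "IS_ASCII", "IS_DIGIT", "IS_LOWER", "IS_UPPER", "IS_TITLE",
    "IS_PUNCT", "IS_SPACE", "IS_STOP", "IS_SENT_START",
    "LIKE_NUM", "LIKE_URL", "LIKE_EMAIL",
}


def parse_rule(rule: str, value: str) -> Dict[str, Any]:
    """Translate one (rule, value) pair into a spaCy-style pattern entry."""
    rule = rule.strip().upper()
    # Check the longer suffix first so "_NOT_IN" is not mistaken for "_IN".
    for suffix in ("_NOT_IN", "_IN"):
        if rule.endswith(suffix):
            attr = rule[: -len(suffix)]
            items = [item.strip() for item in value.split(",") if item.strip()]
            return {attr: {suffix.lstrip("_"): items}}
    if rule == "REGEX":
        # Assumption: the regular expression is matched against the token text.
        return {"TEXT": {"REGEX": value}}
    if rule in BOOLEAN_RULES:
        # Assumption: these strings count as "true" in the sheet.
        return {rule: value.strip().lower() in {"true", "yes", "1"}}
    if rule == "LENGTH":
        return {rule: int(value)}
    return {rule: value.strip()}


def build_token_pattern(row: Dict[str, str]) -> Dict[str, Any]:
    """Merge all rules that describe a single token into one dictionary."""
    pattern: Dict[str, Any] = {}
    for rule, value in row.items():
        pattern.update(parse_rule(rule, value))
    return pattern


# Hypothetical rows: a greeting, optional punctuation, then any form of "build".
pattern = [
    build_token_pattern({"LOWER_IN": "hello, hi, hey"}),
    build_token_pattern({"IS_PUNCT": "TRUE", "OP": "?"}),
    build_token_pattern({"LEMMA": "build"}),
]
```

Each dictionary in `pattern` describes one token, and the list as a whole describes a token sequence. A list of this form can be registered with spaCy's `Matcher`.
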

| Rule | Description |
|---|---|
| `NORM` | The normalized form of the token. !TODO |
| `SHAPE` | The shape of the token, for example `Xxxxx` for “Apple” or `dddd` for “2024”. |
| `POS` | The coarse-grained part-of-speech tag assigned by spaCy. |
| `TAG` | The fine-grained part-of-speech tag assigned by spaCy. |
| `MORPH` | The morphological features of the token. |
| `DEP` | The dependency label that spaCy assigns to the relation between a token and its head. |
| `SPACY` | Whether the token has a trailing space. |
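
To illustrate what a `SHAPE` value looks like, the sketch below approximates the transformation in plain Python. It is a simplified illustration based on how spaCy's shape feature is commonly described (letters become `x` or `X`, digits become `d`, and runs of the same shape character are cut off after four). The function name `token_shape` is hypothetical, and the spaCy documentation remains the authority on the exact behavior.

```python
def token_shape(text: str) -> str:
    """Approximate a spaCy-style shape string for a single token."""
    shape = []
    last = ""
    run = 0  # how many times the current shape character has repeated
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            # Punctuation and other characters are kept as they are.
            shape_char = char
        if shape_char == last:
            run += 1
        else:
            run = 0
            last = shape_char
        # Keep at most four consecutive copies of the same shape character.
        if run < 4:
            shape.append(shape_char)
    return "".join(shape)


shapes = {word: token_shape(word) for word in ["Apple", "2024", "U.S.", "12345"]}
```

Because long runs are truncated, tokens of different lengths can share a shape: both “2024” and “12345” map to `dddd`.
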