explorer

A tool for creating rule-based text classifiers.

input template

supported rules

In spaCy, a token is an individual unit of text produced by the library’s tokenizer: typically a word, a punctuation mark, or another meaningful unit. Each rule below tests an attribute of a single token.
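
As a rough illustration of what a token is, the sketch below runs spaCy's tokenizer on a short string. It assumes spaCy is installed and uses a blank English pipeline (tokenizer only), so no trained model is required; the sample text is made up for the example.

```python
import spacy

# A blank pipeline contains only the tokenizer, so no model download is needed.
nlp = spacy.blank("en")

doc = nlp("Hello, world!")

# Words and punctuation marks become separate tokens.
tokens = [token.text for token in doc]
print(tokens)
```

Note that punctuation is split off from the neighbouring words, which is why rules such as `IS_PUNCT` apply to tokens of their own.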

rules you’ll probably use

Rule	Description
`ORTH`, `ORTH_IN`, `ORTH_NOT_IN`	The original text of the token. For `_IN` and `_NOT_IN`, provide a comma-separated list.
`TEXT`, `TEXT_IN`, `TEXT_NOT_IN`	Same as `ORTH`: the original text of the token. For `_IN` and `_NOT_IN`, provide a comma-separated list.
`LOWER`, `LOWER_IN`, `LOWER_NOT_IN`	The lowercase form of the token. For `_IN` and `_NOT_IN`, provide a comma-separated list.
`LENGTH`	The number of characters in the token.
`IS_ALPHA`	Indicates if the token consists of alphabetic characters.
`IS_ASCII`	Indicates if the token consists of ASCII characters.
`IS_DIGIT`	Indicates if the token consists of digits.
`IS_LOWER`	Indicates if the token is in lowercase.
`IS_UPPER`	Indicates if the token is in UPPERCASE.
`IS_TITLE`	Indicates if the token is in Title Case.
`IS_PUNCT`	Indicates if the token consists of punctuation characters.
`IS_SPACE`	Indicates if the token consists of whitespace characters.
`IS_STOP`	Indicates if the token is a stop word.
`IS_SENT_START`	Indicates if the token is the start of a sentence.
`LIKE_NUM`	Indicates if the token resembles a number.
`LIKE_URL`	Indicates if the token resembles a URL.
`LIKE_EMAIL`	Indicates if the token resembles an email address.
`LEMMA`, `LEMMA_IN`, `LEMMA_NOT_IN`	The lemma form of the token. For example, the lemma “build” represents “builds”, “building”, “built”, etc. For `_IN` and `_NOT_IN`, provide a comma-separated list.
`OP`	Operator that controls how many times the token rule may match. One of `?` (optional, occurs 0 or 1 times), `+` (occurs 1+ times), `*` (occurs 0+ times), `!` (does not occur).
`ENT_TYPE`	The named entity type assigned by spaCy. Uses the out-of-the-box NER classes: CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC, MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, WORK_OF_ART.
`REGEX`	A regular expression pattern.
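
The rule names above mirror the token attributes accepted by spaCy's `Matcher`. The sketch below shows how a few of them combine in a spaCy pattern. It is an illustration only: it assumes that explorer's `LOWER_IN` corresponds to spaCy's `{"LOWER": {"IN": [...]}}` form, and the pattern name `GREETING` and the sample text are made up. A blank English pipeline is used, so only lexical rules (such as `LOWER` and `IS_PUNCT`) are shown; `LEMMA` and `ENT_TYPE` need a trained pipeline.

```python
import spacy
from spacy.matcher import Matcher

# Tokenizer-only pipeline: enough for lexical attributes such as LOWER and IS_PUNCT.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One dict per token: a greeting word, optional punctuation, then "world".
pattern = [
    {"LOWER": {"IN": ["hello", "hi"]}},  # assumed equivalent of LOWER_IN "hello, hi"
    {"IS_PUNCT": True, "OP": "?"},       # OP "?" makes this token optional
    {"LOWER": "world"},
]
matcher.add("GREETING", [pattern])

doc = nlp("Hello, world! Hi world")

# Each match is (match_id, start, end); slice the doc to get the matched text.
matches = sorted(doc[start:end].text for _, start, end in matcher(doc))
print(matches)
```

Because `LOWER` compares against the lowercase form, both "Hello" and "Hi" match regardless of capitalization, and the optional punctuation rule lets the same pattern cover text with and without the comma.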

rules you probably won’t use (but are supported anyway)

Rule	Description
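`NORM`	The normalized form of the token text.
`SHAPE`	The shape of the token, with uppercase letters mapped to `X`, lowercase letters to `x`, and digits to `d`. For example, “Apple” has the shape `Xxxxx`.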
`POS`	The coarse-grained part-of-speech tag assigned by spaCy.
`TAG`	The fine-grained part-of-speech tag assigned by spaCy.
`MORPH`	The morphological features of the token.
`DEP`	The dependency label, assigned by spaCy, that relates the token to its head.
`SPACY`	Whether the token has a trailing space.
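
To make `SHAPE` and `SPACY` more concrete, the sketch below prints the corresponding spaCy token attributes (`shape_` and `whitespace_`) for a made-up sentence. It assumes spaCy is installed and uses a blank English pipeline; `POS`, `TAG`, `MORPH`, and `DEP` are not shown because they are only populated by a trained pipeline.

```python
import spacy

# Tokenizer-only pipeline: shape and trailing whitespace are available without a model.
nlp = spacy.blank("en")
doc = nlp("Apple raised 2024 targets.")

# shape_ abstracts characters; whitespace_ is " " when a space follows the token.
shapes = {token.text: token.shape_ for token in doc}
trailing_space = {token.text: bool(token.whitespace_) for token in doc}

for token in doc:
    print(token.text, shapes[token.text], trailing_space[token.text])
```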