Quick demonstration

This demonstration focuses on Linguistica 5 as a Python library. There are two other use modes: Graphical user interface (GUI) and Command line interface (CLI).

For the nature and format of datasets, see Data.

For more details about Linguistica 5 as a Python library: Python library and Full API documentation

The Basics

Importing the linguistica library
Using a built-in dataset (the English Brown corpus)
Creating a Linguistica object for a given dataset

import linguistica as lxa
from linguistica.datasets import brown

lxa_object = lxa.read_corpus(brown)

brown (str) is the file path of the English Brown corpus text file built in the Linguistica 5 library. If you would like to use a corpus text from the local drive, the argument for linguistica.read_corpus() should take the file path (as a str) of your desired text file.

Methods with a Linguistica object

Sample uses: (1) word trigrams, (2) signatures to stems

(1) Word trigrams

trigrams = lxa_object.word_trigram_counter()

trigrams is a dict with word trigrams (each as a tuple) mapped to their respective counts.

for trigram, count_ in sorted(trigrams.items(), key=lambda x: x[1], reverse=True):
    print(trigram, count_)

    if count_ < 100:
        break

(',', 'and', 'the') 662
('one', 'of', 'the') 403
('the', 'united', 'states') 328
(',', 'however', ',') 321
(',', 'in', 'the') 266
('.', 's', '.') 266
(',', 'he', 'said') 257
('as', 'well', 'as') 238
('u', '.', 's') 235
(',', 'it', 'is') 234
(',', 'and', 'he') 225
('of', 'course', ',') 220
(',', 'of', 'course') 189
('some', 'of', 'the') 179
('the', 'u', '.') 176
('out', 'of', 'the') 174
('the', 'fact', 'that') 167
(',', 'but', 'the') 161
(',', 'mr', '.') 159
(',', 'and', 'a') 158
('for', 'example', ',') 153
('.', 'm', '.') 153
('the', 'end', 'of') 149
(',', 'but', 'he') 148
('part', 'of', 'the') 144
('he', 'said', ',') 143
('it', 'was', 'a') 143
('there', 'was', 'a') 142
('it', 'is', 'not') 136
('to', 'be', 'a') 133
('there', 'was', 'no') 132
(',', 'and', 'i') 132
(',', 'too', ',') 131
(',', 'it', 'was') 129
('there', 'is', 'a') 128
('of', 'the', 'united') 127
(',', 'with', 'the') 124
('a', 'number', 'of') 123
(',', 'mrs', '.') 121
('in', 'order', 'to') 120
(',', 'and', 'that') 120
(',', 'but', 'it') 120
(',', 'and', 'in') 119
('it', 'is', 'a') 114
('most', 'of', 'the') 114
('members', 'of', 'the') 110
(',', 'and', 'it') 109
(',', 'he', 'was') 109
('end', 'of', 'the') 108
('of', 'the', 'new') 107
('it', 'would', 'be') 107
(',', 'for', 'the') 106
('the', 'number', 'of') 104
('there', 'is', 'no') 104
('he', 'did', 'not') 103
('at', 'the', 'same') 103
('.', 'c', '.') 102
(',', 'and', 'then') 102
(',', 'she', 'said') 102
('the', 'use', 'of') 102
('in', 'fact', ',') 101
('on', 'the', 'other') 100
('he', 'said', '.') 100
(',', 'on', 'the') 99

Given trigrams is a dict that maps something to counts, it is natural to convert it to a collections.Counter instance and take advantage of the methods available (e.g., collections.Counter.most_common() for accessing the most common k items).

(2) Signatures to stems

sigs_to_stems = lxa_object.signatures_to_stems()

for sig, stems in sorted(sigs_to_stems.items(), key=lambda x: len(x[1]), reverse=True):
    print(sig, len(stems))

    if len(stems) < 50:
        break

('NULL', 's') 2327
("'s", 'NULL') 813
('NULL', 'ly') 587
('NULL', 'd', 's') 346
('NULL', 'd') 314
('ed', 'ing') 197
("'", 'NULL') 190
("'s", 'NULL', 's') 181
('d', 's') 175
('ies', 'y') 173
('NULL', 'ed', 'ing', 's') 151
('NULL', 'ed') 134
('NULL', 'ed', 'ing') 130
('e', 'ed', 'es', 'ing') 130
('NULL', 'ing') 105
('d', 'r') 98
('e', 'y') 95
('e', 'ed', 'ing') 88
('ng', 'on') 85
('NULL', 'ed', 's') 82
('NULL', 'ly', 'ness') 74
("'", 'g') 72
('d', 'r', 'rs') 66
('NULL', 'es') 63
('NULL', 'ness') 60
('ng', 'on', 'ons') 57
('NULL', 'e') 51
('NULL', 'ally') 47

For all methods available to a Linguistica objects: Full API documentation