Quick demonstration
This demonstration focuses on Linguistica 5 as a Python library. There are two other use modes: Graphical user interface (GUI) and Command line interface (CLI).
For the nature and format of datasets, see Data.
For more details about Linguistica 5 as a Python library: Python library and Full API documentation
The Basics
Importing the
linguisticalibraryUsing a built-in dataset (the English Brown corpus)
Creating a Linguistica object for a given dataset
import linguistica as lxa
from linguistica.datasets import brown
lxa_object = lxa.read_corpus(brown)
brown (str) is the file path of the English Brown corpus text file
built in the Linguistica 5 library. If you would like to use a corpus
text from the local drive, the argument for linguistica.read_corpus() should
take the file path (as a str) of your desired text file.
Methods with a Linguistica object
Sample uses: (1) word trigrams, (2) signatures to stems
(1) Word trigrams
trigrams = lxa_object.word_trigram_counter()
trigrams is a dict with word trigrams (each as a tuple) mapped to
their respective counts.
for trigram, count_ in sorted(trigrams.items(), key=lambda x: x[1], reverse=True):
print(trigram, count_)
if count_ < 100:
break
(',', 'and', 'the') 662
('one', 'of', 'the') 403
('the', 'united', 'states') 328
(',', 'however', ',') 321
(',', 'in', 'the') 266
('.', 's', '.') 266
(',', 'he', 'said') 257
('as', 'well', 'as') 238
('u', '.', 's') 235
(',', 'it', 'is') 234
(',', 'and', 'he') 225
('of', 'course', ',') 220
(',', 'of', 'course') 189
('some', 'of', 'the') 179
('the', 'u', '.') 176
('out', 'of', 'the') 174
('the', 'fact', 'that') 167
(',', 'but', 'the') 161
(',', 'mr', '.') 159
(',', 'and', 'a') 158
('for', 'example', ',') 153
('.', 'm', '.') 153
('the', 'end', 'of') 149
(',', 'but', 'he') 148
('part', 'of', 'the') 144
('he', 'said', ',') 143
('it', 'was', 'a') 143
('there', 'was', 'a') 142
('it', 'is', 'not') 136
('to', 'be', 'a') 133
('there', 'was', 'no') 132
(',', 'and', 'i') 132
(',', 'too', ',') 131
(',', 'it', 'was') 129
('there', 'is', 'a') 128
('of', 'the', 'united') 127
(',', 'with', 'the') 124
('a', 'number', 'of') 123
(',', 'mrs', '.') 121
('in', 'order', 'to') 120
(',', 'and', 'that') 120
(',', 'but', 'it') 120
(',', 'and', 'in') 119
('it', 'is', 'a') 114
('most', 'of', 'the') 114
('members', 'of', 'the') 110
(',', 'and', 'it') 109
(',', 'he', 'was') 109
('end', 'of', 'the') 108
('of', 'the', 'new') 107
('it', 'would', 'be') 107
(',', 'for', 'the') 106
('the', 'number', 'of') 104
('there', 'is', 'no') 104
('he', 'did', 'not') 103
('at', 'the', 'same') 103
('.', 'c', '.') 102
(',', 'and', 'then') 102
(',', 'she', 'said') 102
('the', 'use', 'of') 102
('in', 'fact', ',') 101
('on', 'the', 'other') 100
('he', 'said', '.') 100
(',', 'on', 'the') 99
Given trigrams is a dict that maps something to counts, it is
natural to convert it to a collections.Counter instance
and take advantage of the methods
available (e.g., collections.Counter.most_common() for accessing the most common k
items).
(2) Signatures to stems
sigs_to_stems = lxa_object.signatures_to_stems()
for sig, stems in sorted(sigs_to_stems.items(), key=lambda x: len(x[1]), reverse=True):
print(sig, len(stems))
if len(stems) < 50:
break
('NULL', 's') 2327
("'s", 'NULL') 813
('NULL', 'ly') 587
('NULL', 'd', 's') 346
('NULL', 'd') 314
('ed', 'ing') 197
("'", 'NULL') 190
("'s", 'NULL', 's') 181
('d', 's') 175
('ies', 'y') 173
('NULL', 'ed', 'ing', 's') 151
('NULL', 'ed') 134
('NULL', 'ed', 'ing') 130
('e', 'ed', 'es', 'ing') 130
('NULL', 'ing') 105
('d', 'r') 98
('e', 'y') 95
('e', 'ed', 'ing') 88
('ng', 'on') 85
('NULL', 'ed', 's') 82
('NULL', 'ly', 'ness') 74
("'", 'g') 72
('d', 'r', 'rs') 66
('NULL', 'es') 63
('NULL', 'ness') 60
('ng', 'on', 'ons') 57
('NULL', 'e') 51
('NULL', 'ally') 47
For all methods available to a Linguistica objects: Full API documentation