NLTK Functions:
s = stem.{Porter,etc.}; s.stem(token)
list(tokenize.whitespace(text))
t = tag.{Default,Regexp(patterns),Unigram(backoff=t)}; t.tag(tokens); t.train(corpus);
g = cfg.parse_grammar(gmr); chart.ChartParse(g,METHOD);
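The shorthand above appears to follow an older NLTK interface. As a rough guide, the sketch below shows how the same four steps might look with current NLTK 3.x class names. The sample sentence, tag patterns, training data, and toy grammar are invented for illustration and need no corpus downloads.

```python
from nltk import CFG
from nltk.parse import ChartParser
from nltk.stem import PorterStemmer
from nltk.tag import DefaultTagger, RegexpTagger, UnigramTagger
from nltk.tokenize import WhitespaceTokenizer

# Stemming: the counterpart of s = stem.Porter; s.stem(token)
stemmer = PorterStemmer()
stem = stemmer.stem("running")

# Whitespace tokenization
tokens = WhitespaceTokenizer().tokenize("the dog barks")

# Taggers chained through backoff: unigram -> regexp -> default
default_tagger = DefaultTagger("NN")
regexp_tagger = RegexpTagger([(r".*s$", "VBZ")], backoff=default_tagger)
training = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")]]  # toy tagged corpus
tagger = UnigramTagger(train=training, backoff=regexp_tagger)
tagged = tagger.tag(tokens)

# Grammar and chart parser (toy grammar, made up for this example)
grammar = CFG.fromstring("""
    S -> NP VP
    NP -> Det N
    VP -> V
    Det -> 'the'
    N -> 'dog'
    V -> 'barks'
""")
trees = list(ChartParser(grammar).parse(tokens))
```

In this version the tagger is trained through the constructor's train argument rather than a separate t.train(corpus) call.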
TGTK:
tier = TG.readTier(fn1); // given a plain text file, builds a tier by splitting on whitespace
tier.write(fn2); // writes the tier to a file
doc = readDoc(fn3); // initial tier
for (fnI;;) doc.AddTier(fnI); // returns true if the tier is consistent and well-formed
for (seg = doc.FirstSegment(tiername); seg; seg = seg.NextSegment(tiername)) {
// a segment is an object that selects one segment (arc) between nodes on a tier;
// it thus defines the predecessor and successor nodes on that tier,
// the including arcs on larger-segment tiers,
// and the included arc sequences on shorter-segment tiers
// handle the segment
}
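The loop above can be sketched in Python as follows. This is only a minimal in-memory model under stated assumptions, not TGTK itself: the Tier and Segment classes, the integer node ids, and a read_tier that takes text rather than a filename are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Tier:
    name: str
    nodes: List[int]  # n + 1 node ids for n arcs
    data: List[str]   # data[i] labels the arc nodes[i] -> nodes[i + 1]

    def first_segment(self) -> Optional["Segment"]:
        return Segment(self, 0) if self.data else None


@dataclass
class Segment:
    """One arc on a tier: it spans two adjacent nodes and carries data."""
    tier: Tier
    index: int  # position of this arc within the tier

    @property
    def start_node(self) -> int:
        return self.tier.nodes[self.index]

    @property
    def end_node(self) -> int:
        return self.tier.nodes[self.index + 1]

    def access(self) -> str:
        return self.tier.data[self.index]

    def next_segment(self) -> Optional["Segment"]:
        nxt = self.index + 1
        return Segment(self.tier, nxt) if nxt < len(self.tier.data) else None


def read_tier(name: str, text: str) -> Tier:
    """Build a tier from plain text, one arc per whitespace-separated token."""
    tokens = text.split()
    return Tier(name, list(range(len(tokens) + 1)), tokens)


tier = read_tier("WordsTier", "Yes Father")
seen = []
seg = tier.first_segment()
while seg is not None:  # visits every segment, including the last one
    seen.append((seg.start_node, seg.end_node, seg.access()))
    seg = seg.next_segment()
print(seen)  # prints [(0, 1, 'Yes'), (1, 2, 'Father')]
```

Ending the loop when next_segment returns None, rather than testing for a following segment first, ensures the final segment is handled too.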
Each tier has a name and an access method, and it holds data for each arc between adjacent nodes within the tier.
A caller obtains the underlying data through the access method, which is given the stored data and returns it.
So the tier data might be:
Yes
Father
Yes Father
Then calling code could call
doc.FirstSegment("WordsTier").access("WordsTier");
to retrieve the ASCII string "Yes" which we thereby take as the actual first word,
or it could call
doc.FirstSegment("WordsTier").NextSegment("WordsTier").access("WordsTier");
to retrieve the ASCII string "Father", the actual second word.
As another example:
IMG1.jpg
IMG2.jpg
Then calling code could call doc.FirstSegment("ImagesTier").access("ImagesTier")
to retrieve the filename "IMG1.jpg".
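The two examples above can be modeled with a small Python sketch. Everything here is an assumption made for illustration: the Document and Tier classes, the name "PhrasesTier" for the tier holding "Yes Father", the integer node ids, the choice to store words as bytes that the access method decodes, and the particular checks used for "consistent and well-formed".

```python
from typing import Any, Callable, Dict, List, Optional


class Tier:
    def __init__(self, name: str, nodes: List[int], data: List[Any],
                 access: Callable[[Any], Any] = lambda raw: raw):
        self.name = name
        self.nodes = nodes    # node ids bounding the arcs, in order
        self.data = data      # data[i] belongs to the arc nodes[i] -> nodes[i + 1]
        self.access = access  # turns stored data into what callers see

    def well_formed(self) -> bool:
        increasing = all(a < b for a, b in zip(self.nodes, self.nodes[1:]))
        return len(self.nodes) == len(self.data) + 1 and increasing


class Document:
    def __init__(self) -> None:
        self.tiers: Dict[str, Tier] = {}

    def add_tier(self, tier: Tier) -> bool:
        """Add the tier; return False if it is malformed or spans other nodes."""
        if not tier.well_formed():
            return False
        for other in self.tiers.values():
            if (other.nodes[0], other.nodes[-1]) != (tier.nodes[0], tier.nodes[-1]):
                return False
        self.tiers[tier.name] = tier
        return True

    def access(self, tier_name: str, index: int) -> Any:
        tier = self.tiers[tier_name]
        return tier.access(tier.data[index])

    def including_arc(self, tier_name: str, index: int, larger: str) -> Optional[int]:
        """Index of the arc on `larger` that covers arc `index` of `tier_name`."""
        small, big = self.tiers[tier_name], self.tiers[larger]
        start, end = small.nodes[index], small.nodes[index + 1]
        for i in range(len(big.data)):
            if big.nodes[i] <= start and end <= big.nodes[i + 1]:
                return i
        return None


doc = Document()
decode = lambda raw: raw.decode("ascii")  # access method for byte-string data
ok_words = doc.add_tier(Tier("WordsTier", [0, 1, 2], [b"Yes", b"Father"], decode))
ok_phrases = doc.add_tier(Tier("PhrasesTier", [0, 2], [b"Yes Father"], decode))
bad = doc.add_tier(Tier("BrokenTier", [0, 1], ["a", "b"]))  # 2 arcs need 3 nodes

first = doc.access("WordsTier", 0)
second = doc.access("WordsTier", 1)
phrase = doc.access("PhrasesTier", doc.including_arc("WordsTier", 1, "PhrasesTier"))

album = Document()
album.add_tier(Tier("ImagesTier", [0, 1, 2], ["IMG1.jpg", "IMG2.jpg"]))
image = album.access("ImagesTier", 0)
```

Here first is "Yes", second is "Father", phrase is "Yes Father" (the larger arc that includes the second word), and image is "IMG1.jpg".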
To automatically process a tier of data, generating a second tier of data to be incorporated into the same Document,
code along these lines might do:
stemstier = doc.AddTier(CopyNodesFromTier="WordsTier","StemsTier");
for (s = doc.FirstSegment("WordsTier"); s; s = s.NextSegment("WordsTier"))
    stemstier.replace_with(s, stemmer(s.access("WordsTier"))); // store the stem on the copied arc
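As a rough Python sketch of the same idea, the code below copies the node sequence of a words tier and fills a new tier with one stem per arc. The tuple-based tier layout, the derive_tier helper, the sample words, and the toy_stemmer stand-in are all hypothetical; a real stemmer such as NLTK's Porter stemmer could be passed in its place.

```python
from typing import Callable, Dict, List, Tuple

# A tier here is just (nodes, data): data[i] labels the arc nodes[i] -> nodes[i + 1].
TierData = Tuple[List[int], List[str]]


def toy_stemmer(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def derive_tier(doc: Dict[str, TierData], source: str, target: str,
                transform: Callable[[str], str]) -> None:
    """Copy the node sequence of `source` and fill `target` arc by arc."""
    nodes, data = doc[source]
    doc[target] = (list(nodes), [transform(item) for item in data])


doc: Dict[str, TierData] = {"WordsTier": ([0, 1, 2, 3], ["walking", "dogs", "barked"])}
derive_tier(doc, "WordsTier", "StemsTier", toy_stemmer)
print(doc["StemsTier"])  # prints ([0, 1, 2, 3], ['walk', 'dog', 'bark'])
```

Because the new tier reuses the source tier's nodes, each stem stays aligned with the word it was derived from.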