Translation Graph Markup Language Applications

Tom Veatch, 2018

Automatic and manual methods, either or both, can reasonably be used to operate on and use Translation Graphs. Reasonable operations include segmenting, arc-labelling, translating, linking segments to audio files of their pronunciations, etc.

An initial formatter might, for example:

* add begin/end nodes.
* optionally make the implicit arc explicit: a single arc with the whole document's text as its label.

A segmenter might:

* select an arc of a specified type with a complex label
* split the label into sequential or simultaneous components of a second specified type
* add nodes between sequential components, and add arcs labelled with the respective components.
* This segmentation process could be iterated over all arcs of a given type in the file.
* Either manual or automatic methods could be used.
* For example, an automatic segmentation could be done in a script-teaching application, where each letter in the script gets its own arc of type "letter", parallel to ("simultaneous" with) an arc referring to a teaching resource for that letter (e.g., IPA or Roman equivalent, audio form).
* Or a document could be manually segmented with the aid of emacs macros &c.

A dictionary-connector might:

* add arcs of a "dictionary-index" type, parallel to and between the same node endpoints as arcs whose labels are found in a dictionary lookup. (A hash or other index could save repetitious dictionary lookups for additional instances of a word in a document.)

A word-by-word, phrase-by-phrase, or sentence-by-sentence translator might:

* add a new type and charset to the header.
* add arcs of that type parallel to each word, phrase, or sentence arc, each with contents being the word's/phrase's/sentence's translation.
* Here "translation" may mean a mapping into any other linguistic level; for example, translate orthographic Sanskrit (where graphemes at a word boundary are derived from both words) into sequences of separated, "underlying" morphemes.

A parameterizable display system might:

* read and parse one or more TG files, constructing a (probably not very human-readable) TG data structure internally, as a set of nodes with labels and arcs of various types between the segment-anchoring nodes. (Note that overlapping chunks, as for example in non-agglutinative languages, may require multiple nodes at a finer level of representation, with arcs covering more than a single node-to-node segment: one node preceding the first influence of a later form, and another following the last influence of the previous form.)
* display the types available in a configuration UI for selection.
* display the selected types in a linguistics-style, multilinear, tabular display, comprising
  * a line in the table for each selected type
  * links on selected types leading to a selected alternate type
  * e.g., click on a word in the word line to hear the audio of the word, not shown in the text display but referenced in the TG as an audio arc corresponding to that word arc.
  * or, e.g., click to pop up and choose from a menu of alternative data types available:

  A: L2Surface-Script2 (target document)
  B: L2Surface-Script1 (transliterated for readability)
  C: L2-Morphs/Stems-Script1 (analysed to show/learn structure)
  D: L1-Morphological translations (translated to learn meaning bits)
  E: L1-phrase translations (translated to support/guide learner study)

A|B is a transliteration system, a little script-based FSM within UTF-8.
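To make the A|B idea concrete, here is a minimal JavaScript sketch of such a script-to-script transliterator, as a greedy longest-match-first mapping over UTF-8 strings. The mapping table is a tiny invented fragment for illustration, not a real transliteration scheme:

    // A|B sketch: transliterate by longest match first.
    // The table is illustrative only; a real one would cover the whole script.
    const table = new Map([
      ['aa', 'ā'], ['ii', 'ī'], ['uu', 'ū'], ['sh', 'ś'],
    ]);

    function transliterate(s) {
      let out = '';
      let i = 0;
      while (i < s.length) {
        const digraph = s.slice(i, i + 2);   // try the longest entries first (here, length 2)
        if (table.has(digraph)) { out += table.get(digraph); i += 2; }
        else { out += s[i]; i += 1; }        // pass unknown characters through
      }
      return out;
    }

    console.log(transliterate('puurnam shaanti'));  // "pūrnam śānti"

Since each rule consumes input strictly left to right with no lookbehind, the mapping is a finite-state transduction, and where the table is one-to-one an inverse table runs it in the other direction.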
A' should be a recorded spoken reading of the target document segment. B' should be recordings of each stem in isolated/pedantic/underlying form.

B|C is the L2 morphological analysis into stems, inflections, and tags: undo the sandhi (mechanical/surface/phonological) operations, and tag their application for learners to see; detect unit boundaries; do dictionary lookup of the units, showing the entry if found and prompting for one if not. The units are post-inflectional morphology, so lookup means finding the surface form's root morphemes, inflections, and tags. So C is its own level, a sequence of tagged stems; but more than that, it sits between levels: it provides the map to morphemes.

C|D is lexical information, the output of the dictionary lookup, namely the translated stems and inflectional content.

D|E is phrasal translations made by a translator.

The display engine should take a target document and try to display it: looking up audio bits, applying undo-sandhi, applying dictionary lookup, showing morpheme translations, showing a phrasal translation. It should show blanks or buttons where data is missing or analysis choices need to be made, prompting the user to input the relevant data, offering an input UX, and providing full-loop methods for storage, access, editing, and rating. So write that. Maybe inside Teachionary, for the self-publishing stuff, with a reversible surface-phonology module; with MySQL for dictionary lookup, derivational morphology, and translation to various L2s; with audio clip references; with modes for document browse/view and for item-set training-data entry UX; and with a multilinear display mode for found content, gaps to fill, and errors to tag/fix. Debug through enter/edit/view. User-id authentication and entry/edit authorship tracking. Start with no more than doc, dict, multilinearity, and blanks; then add the enter/edit UX, etc. Build it as a JS web app.

* Implementation could be via PHP or JavaScript mapping TG files to HTML with the intended UI functionalities. Dictionary lookup might be to a cloud-located, globally shared, perhaps many-language resource. Video access might be to YouTube or another (universally) accessible video document store.

An editing system might:

* display a selected subset of the TG's types in multilinear tabular form
* provide for creation of a new type, with its header tag, with arcs optionally exhausting the document (comments don't, translations do), and with a derived-from type for automatically generating a first-draft set of arcs of the new type.
* provide a line in the multilinear tables for entering translations (data for that new type) parallel to a selected other type; it should have a charset, input method, and display font.
* provide means of inserting nodes, e.g., for a click on a character to be interpreted as inserting a new boundary node before it, and inserting arcs of the current type on its left and right.
* provide for automatic pre-filling of arc contents via some perl-ish substitution (s///) mapping from another type (e.g., orthography to phonology by some rule system)
* provide for editing boundary locations (deleting and adding), e.g., if a boundary was automatically inserted in the wrong place.

A language-teaching system might:

* select a parameterization for the display system,
* drive the user's reading through the system by highlighting displayed text bits (e.g., bouncing-ball) simultaneously with playback.
* do a read-aloud game: have a layer of arcs for L2 ASR grammar resources, and highlight the next bit after the previous one succeeds (or doesn't).
* ask the user what s/he wants to learn:
  * To learn an alphabet, provide:
    * links within the text
    * bouncing-ball read-aloud, one letter at a time
  * To read content with translations shown for only some selection of new words/morphemes (e.g., randomly selected at a certain percentage or frequency in the text, or selected by a teaching algorithm based on a model of the user's knowledge level, which could be maintained by the system at a fine or gross level, or configured by the user, also at a fine or gross level):
    * bouncing-ball read-aloud, one word at a time
    * this needs sub-sententially-aligned audio arcs
    * enable an isolated-word pronunciation mode, via dictionary pronunciation audio arcs or via reference to a carefully pronounced rendition of the text.
  * To learn isolated-word vs. vernacular-conversational pronunciation.

A Context of Application

An extended example might help show the utility of this quite abstract system presentation. Consider a context of historical document preservation such as that being carried out by the Muktabodha Indological Research Institute, which is saving disintegrating, family-stored document archives from oblivion. They have discovered in India ancient palm leaves covered with hand-copied historical texts, misused and often in bad condition, and they are committed to preserving these resources.

For MIRI, the first step (after fundraising, hiring, training, advertising and networking, locating, persuading, travelling, unpacking, and setting up the equipment) is the scanning of the found materials. From this a primitive TG could be produced as simply a sequence of scan filenames in a text file. After slight processing, it could be reformatted into a proper TG file with head, type, class, body, node, and arc tags, in which the relevant TG layer type might be "original_scan_jpeg", a sample node id might be "Hejamadi_Sanjeeva_Kunder_box_3_scan_3209", and the arc immediately after that node a filename reference to the particular scan. (If the order of the scanned pages relative to one another is not known, then both start and end nodes could be given for a floater, with empty arcs specified as entering from 0 or leaving to -1.)

An upcontrasted image set could be integrated into the TG formatting by adding another TG layer, with type "contrast+120_scan_jpeg", whose arcs refer to separate, corresponding image files. In this way workflow can be carried out, tracked, and reintegrated as TG layer files.

Although readable in their direct image form by specialists, these scans then need to be processed into something useful for the rest of us. Does this situation suggest Translation Graphs? I hope so. For example, passes made by improving, purpose-trained OCR systems over the images might produce a lot of segmentations: top down into line_areas, string_areas, char_areas, and feature- or glyph-stroke areas with their extracted parameters, and then bottom up into probability- or confidence-weighted character/word/morph hypotheses (perhaps multiple "simultaneous" hypotheses for a given single area, or multiple overlapping areas).

A human-edited OCR transcription might be derived from the above: copied, reviewed, and approved after editing by a competent editor, with hypotheses confirmed/deleted/modified, content added/subtracted/changed, and segmentation endpoints moved, removed, added, or multiplied where the OCR produced bad segmentation.
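The "simultaneous hypotheses" idea can be sketched directly: competing weighted arcs over the same node span, from which a first-draft transcription tier keeps the best arc per span for the human editor to confirm or fix. Names and weights here are invented for illustration:

    // Competing OCR character hypotheses as "simultaneous" arcs: same start
    // and end nodes, one arc per hypothesis, weighted by OCR confidence.
    const hypotheses = [
      { start: 'char_17', end: 'char_18', type: 'ocr_char_hyp', label: 'ta', weight: 0.61 },
      { start: 'char_17', end: 'char_18', type: 'ocr_char_hyp', label: 'na', weight: 0.27 },
      { start: 'char_17', end: 'char_18', type: 'ocr_char_hyp', label: 'la', weight: 0.12 },
    ];

    // Keep the best hypothesis per span as a draft transcription tier.
    function bestPerSpan(arcs) {
      const best = new Map();
      for (const a of arcs) {
        const key = a.start + '->' + a.end;
        if (!best.has(key) || a.weight > best.get(key).weight) best.set(key, a);
      }
      return [...best.values()].map(a => ({ ...a, type: 'ocr_char_draft' }));
    }

    console.log(bestPerSpan(hypotheses).map(a => a.label));  // [ 'ta' ]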
Obviously more work produces better results, and many drafts each provide their separate TG layer of translation of the now-multiplying forms, or glimpses, of a theoretical, implied, underlying, intended document which the author of these ancient, perhaps disintegrating palm leaves bequeathed to us in that form. Such an edited transcription might then be built upon, as added TG layers on the same document:

* transliterated from its perhaps obscure script into a more accessible script such as devanagari or Roman
* translated word by word into dictionary references, or
* referenced to a growing concordance
* translated at a line/word/paragraph level into some L1 (type L1, charset ..., class doc_name author ...)
* rendered into audio by a reader, thence recorded into a digital file, made accessible to the system, and linked to by assigning segments of audio to start/end node spans.

In short: scan them, then build up what you have into parsed, understandable TG documents readable by all. With the constellation of tools and operations described here, it is imaginable that ultimately any interested human could access and penetrate these preserved archives -- could, with minimal (if large) effort, learn to read them in the original. And the same systems could be used to give learners of a target language teaching access through movies in that language, suitably supported by transcriptions and dictionaries and translations, all displayed and prompted into the viewer's attention so that learning and understanding can be made as effortless as possible.

-----------------------------------------------------------------------

Sometimes a differentially carefully spoken rendition might be a tier. The more renditions the better, indeed, since translations are so variable. According to Dave Graff, back in the 2000's, a corpus of news reporting in Chinese or Arabic, translated into English ten times over, produced results that were always different! Only a rare short sentence came out the same across translators. Word choice, word order, pronominalization: all different, and all the native speakers' reactions were different. Usually not very significant, but frequent and subtle. Everyone has their own take.

Now, the purpose here is the data and interfaces to support a language-teaching, or language-learning-supportive, browser. Of course machine learning, in multiple iterations, has its role to play. Based on initial work by a linguist, the machine learning algorithms will improve their transductions to preliminarily populate added tiers. Then linguists will improve the machine-generated drafts. Then the machines will continue learning. Presented with a gray box for a word proposed by algorithm, a human decider could click it to see alternatives, or type (or push a button and speak aloud) to enter a new one, and select a correct form, which the algorithm will learn from and use to continue improving its hypotheses in that tier. Machine learning can help partially automate segmentation, lookup, and translation, and also vocabulary sorting based on frequency, to help decide what learners should learn first, etc.

The resulting picture is a workflow encompassing an ongoing process of sustained translation into another language. One step might be called "transcribe": convert jpegs to an ordered sequence of bounding boxes, then to characters by OCR, then correct those classifications by human, feeding the corrections back into the OCR. Another step does morpheme translation, simultaneously with dictionary construction.
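The morpheme-translation step with its growing dictionary might be pictured like this (a sketch only; the names are hypothetical, and the #-marking of stems vs. suffixes used in the worked example later is elided):

    // Populate a draft morpheme-gloss tier from the dictionary built so far.
    // Unknown forms come back separately, so the UI can prompt a human for
    // entries, which then feed forward into the next pass.
    const dictionary = new Map([
      ['om', 'om'],
      ['puurn', 'whole,complete,perfect'],
      ['am', 'nom.sg.'],
    ]);

    function glossTier(morphArcs, dict) {
      const glossed = [];
      const unknown = new Set();
      for (const arc of morphArcs) {
        const gloss = dict.get(arc.label.toLowerCase());
        if (gloss === undefined) unknown.add(arc.label);  // a blank/button in the display
        glossed.push({ ...arc, type: 'L1_morph_gloss', label: gloss ?? '' });
      }
      return { glossed, unknown };  // unknown drives the prompt-for-entry loop
    }

Each pass yields both a draft gloss tier and the set of forms still needing entries; entering those and re-running is exactly the feed-forward loop of a living dictionary.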
A dictionary process that feeds the labels learned so far forward would start labelling unlabelled words. The dictionary is not fixed but is a process: a growing, living dictionary. As Dave says, 80% of words may have no ambiguity, but the other 20% will be 80% of the work: multiply ambiguous, highly context dependent, related to Zipf's law. There is a long tail of infrequent words which are relatively clear, and of very rare words which may be quite unknown. In this workflow toolset, human users will label and work away until achieving some kind of critical mass that makes it useful to others.

Dave: Building the browser is going in a different direction from archiving. Archiving, with the raw source material and analysis/translation, is just the bare facts. Mediating that to a learner/reader is more: a tool that serves as an instructor, carrying on an ongoing dialog with the individual to know what they got out of it -- how comfortable are you with this?

Tom: Have an Apple TV remote control that you can click when you don't understand something in the recent history of the current media playback. If it's media annotated with this kind of data, and the remote understands the meaning of the click as "Explain that to me", then the video can pause, an IGT be displayed, and the user can browse until they learn what they want, and click onward to continue.

Dave: Useful in teaching learners is an intelligent use of concordances. Just consider the vocabulary building issue; the biggest issue in language learning is vocabulary access. Once you have a database with occurrences of each word, then for each word show them all in context, with all the conjugated forms of it. Maybe you could expand with a rulebook and a grammar and go further, but the actual contextualized found forms tell so much. The concordance is crucial. (A toy sketch follows at the end of this exchange.)

Consider learner support for an utterance-sized bit of linguistic data in the form of a two-way IGT from LS (language of source) to LT (target language, reader's language) back to LS, each direction comprising several tiers including morphemic analysis, word-level translations, and full translations. Enable tagging/editing by permitted contributors -- such as (a) a linguist, (b) the author, or even (c) an interested person -- to mark errors and questionables, or to introduce corrections at suitable layers. Provide concordances for one or more items in the structure, and enable concordance of the others by a menu operation on the item.

Dave: Apply this to multi-lingual Twitter feeds. People who want to understand the L2 twitter data (and authors who want to be understood) might contribute a lot of data checking and editing to such a system. Getting the community usefulness going, after some critical mass is achieved, with live dictionaries, live algorithms, and humans involved, it could become quite useful to all.

Tom: I want anyone to be able to go into an L2 situation and be maximally supported in learning and understanding what they don't know. This is far more ambitious than the Star Trek universal translator. It applies equally to multi-lingual Twitter, to foreign watchers of previously unsubtitled English movies, to audio concordances for learning dialect features, to archiving and study of ancient religious texts -- to any form of language, whether textual, audio, or video, that is of sufficient interest to be worth the work of making it accessible to another language.
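Dave's concordance point is concrete enough to sketch. This is a toy keyword-in-context version over a tier of word arcs; a real one would also fold the conjugated forms of each stem together, using the morphological tier:

    // Keyword-in-context concordance over a word tier: for each word,
    // every occurrence, shown with a window of surrounding words.
    function concordance(wordArcs, window = 3) {
      const index = new Map();
      wordArcs.forEach((arc, i) => {
        const key = arc.label.toLowerCase();
        const left = wordArcs.slice(Math.max(0, i - window), i).map(a => a.label);
        const right = wordArcs.slice(i + 1, i + 1 + window).map(a => a.label);
        if (!index.has(key)) index.set(key, []);
        index.get(key).push(left.join(' ') + ' [' + arc.label + '] ' + right.join(' '));
      });
      return index;  // word -> list of contextualized occurrences
    }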
Put an app into your iPad and watch the TV with it; when it recognizes a place in the film where someone has made a tutorial out of it, the app provides for the user to click a button and see and go through an IGT to learn -- on the iPad, if the TV isn't smart enough to show it on the TV. Or have it be knowledgeable enough about you to pause and give you a translation of something it thinks will be helpful to you, once in a while. And you can click "?" here or there, as a question about what that meant, and it can help. Even partly-understanding native speakers can use similar controls over the presentation of the Translation Graphs, to turn the subtitles on and off.

-------------------------------------------------------------------

After a request for a translation of a paper I wrote into French:

I imagine providing a web UI for crowdsourcing translation tasks, exposing, to begin with, some tiers of the original document: document, sections, paragraphs, sentences, words. Then I imagine populating some added French tiers with Google Translate data. I guess one could only see a sentence or two at a time within the UI; that's fine to begin with. The underlying data form would be tiers in different files. Emacs would do as an editor to convert higher-level segments into finer-grained segments. Then some background processes would populate the French tier by making Google Translate robo-requests. Another might cut the TGs in the files into bits and pump them into the MySQL database, so that various forms thereof could be accessed using various SQL queries, like SELECT...JOIN....

Next, a web UI in HTML, probably enhanced with JavaScript or some other DataTable system -- some kind of editable, displayable, automatically populatable tables -- to show and provide for editing/correction/entry of the various tiers. The editing process, when a change is made in some box of the table, would trigger code to send changes not to a text file formatted as a Translation Graph, but to the MySQL database, storing correspondences suggested by the user: a table saying that French_sentence_by_Google maps to French_corrected_sentence, with a column in the table for the contributing user's ID. More tables for dictionary entries. Etc. (A sketch of such tables follows below.) Maybe the UI allows a click to expand a part where the translation seems queer to the reader, offering that as a filter on the automatic translations, so readers can pick out the queer bits and just fix them, and meanwhile read on.

Now, why didn't I notice that the segmentation of words in French is not consistent in ordering with the segmentation of words in English, when the word order is different? I suppose that's okay. Or is it? IGTs use a base language, the language of the linguist, for the morphemic translations, but given in the observed sequence of the L2 morphemes. Then the base language morphemes are scrambled from that ordering into a base-language phrase or sentence translation. If you build it the other way around, the scramblings won't match up. But perhaps the phrases/sentences will, at a higher level. Some aspect of ordering will remain, and that's what the TG node structure will expose. And the correspondences will work, but by matching longer segments together between two shared boundaries, rather than by corresponding directly in order at the smaller-segment level. Perhaps some language of permutation could be encoded in the graph so that word correspondences could be read directly out of it. Meanwhile, not.
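Those correction tables might look like the following (a sketch only: the table and column names are invented, and picking the latest of several corrections per segment is elided):

    // Correction store for the crowdsourced French tier (names invented).
    const schema = `
      CREATE TABLE sentences (
        doc_id       VARCHAR(64) NOT NULL,
        segment_id   VARCHAR(32) NOT NULL,   -- a sentence arc's node span
        google_draft TEXT        NOT NULL,   -- French_sentence_by_Google
        PRIMARY KEY (doc_id, segment_id)
      );
      CREATE TABLE corrections (
        id             INT AUTO_INCREMENT PRIMARY KEY,
        doc_id         VARCHAR(64) NOT NULL,
        segment_id     VARCHAR(32) NOT NULL,
        corrected      TEXT        NOT NULL, -- French_corrected_sentence
        contributor_id INT         NOT NULL, -- the contributing user's ID
        created_at     TIMESTAMP   DEFAULT CURRENT_TIMESTAMP
      );`;

    // The display query: the corrected sentence where one exists,
    // otherwise the Google draft (the SELECT...JOIN mentioned above).
    const displayQuery = `
      SELECT s.segment_id,
             COALESCE(c.corrected, s.google_draft) AS display_text
      FROM sentences s
      LEFT JOIN corrections c
             ON c.doc_id = s.doc_id AND c.segment_id = s.segment_id
      WHERE s.doc_id = ?
      ORDER BY s.segment_id;`;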
-------------------------------------------------------------------

Consider an example workflow:

* Create the document.

* Block it out as 5 lines or paragraphs, so far without content:

[#p1][#p2][#p3][#p4][#p5][p5#]

* (Perhaps apply image processing or OCR to create some intermediate forms to focus and support manual transcription.)

* Fill in the paragraphs (here as Roman character glyphs encoded as ASCII, but use your own charset & editor/text-entry method):

[#p1]Om puurnam adah puurnam idam
[#p2]puurnaat puurnam udachyate
[#p3]purnasya puurnam aadaayaa
[#p4]purnam eva vashishyate
[#p5]Om shaanti shaanti shaanti
[p5#]

* Segment into "words":

[#p1][#w1]Om[#w2]puurnam[#w3]adah[#w4]puurnam[#w5]idam[w5#]
[#p2][#w6]puurnaat[#w7]puurnam[#w8]udachyate[w8#]
[#p3][#w9]purnasya[#w10]puurnam[#w11]aadaayaa[w11#]
[#p4][#w12]purnam[#w13]eva[#w14]vashishyate[w14#]
[#p5][#w15]Om[#w16]shaanti[#w17]shaanti[#w18]shaanti[w18#]
[p5#]

* Segment inflectional morphemes:

[#p1][#w1]Om[#w2]puurn[#m1]am[#w3]adah[#w4]puurn[#m2]am[#w5]idam[w5#]
[#p2][#w6]puurn[#m3]aat[#w7]puurn[#m4]am[#w8]udachya[#m5]te[w8#]
[#p3][#w9]purn[#m6]asya[#w10]puurn[#m7]am[#w11]aat[#m8]aayaa[w11#]
[#p4][#w12]purn[#m9]am[#w13]eva[#w14]vashishya[#m10]te[w14#]
[#p5][#w15]Om[#w16]shaanti[#w17]shaanti[#w18]shaanti[w18#]
[p5#]

* Enter dictionary entries (L1: Sanskrit; L2: English):

om -> om
puurn# -> whole, complete, perfect
#am -> nom.sg.
#aat -> ablative
#asya -> genitive
#aayaa -> subjunctive
adah -> that
idam -> this
eva -> only
vashishi -> remain
#ate -> present
udachi -> arise
shaanti -> peace

* Populate the morpheme translation tier automatically from the dictionary:

[#p1][#w1]Om[#w2]whole,complete,perfect[#m1]nom.sg.[#w3]that[#w4]whole,complete,perfect[#m2]nom.sg.[#w5]this[w5#]
[#p2][#w6]whole,complete,perfect[#m3]ablative[#w7]whole,complete,perfect[#m4]nom.sg.[#w8]arise[#m5]pres.[w8#]
[#p3][#w9]whole,complete,perfect[#m6]genitive[#w10]whole,complete,perfect[#m7]nom.sg.[#w11]abl.[#m8]subj.[w11#]
[#p4][#w12]whole,complete,perfect[#m9]nom.sg.[#w13]only[#w14]remain[#m10]pres.[w14#]
[#p5][#w15]Om[#w16]peace[#w17]peace[#w18]peace[w18#]
[p5#]

* Manually select dictionary entries from the list given in a context. The display should show words with multiple entries in a highlighted form, with a menu representing the options, making it easy for the transcriber to select the preferred option:

[#p1][#w1]Om[#w2]perfect[#m1]nom.sg.[#w3]that[#w4]perfect[#m2]nom.sg.[#w5]this[w5#]
[#p2][#w6]perfect[#m3]ablative[#w7]perfect[#m4]nom.sg.[#w8]arise[#m5]pres.[w8#]
[#p3][#w9]perfect[#m6]genitive[#w10]perfect[#m7]nom.sg.[#w11]abl.[#m8]subj.[w11#]
[#p4][#w12]perfect[#m9]nom.sg.[#w13]only[#w14]remain[#m10]pres.[w14#]
[#p5][#w15]Om[#w16]peace[#w17]peace[#w18]peace[w18#]
[p5#]

* Manually translate from translated morphemes to English phrasing:

[#p1][#w1]Om.[#w2]That is perfect.[#w4]This is perfect[w5#]
[#p2][#w6]From the perfect[#w7]The perfect arises[w8#]
[#p3][#w9]From the perfect[#w10]If the perfect is taken[w11#]
[#p4][#w12]The perfect, only, remains[w14#]
[#p5][#w15]Om![#w16]Peace![#w17]Peace![#w18]Peace![w18#]
[p5#]

* Automatic editing procedures (such as emacs macros or eLisp functions) should be made available and easily invoked to:
  * construct, or add to, the dictionary the words not presently found therein.
  * carry out segmentation of words, inflectional morphemes, etc., using some expanding/trainable ruleset, into a new tier.
  * copy a tier to be a new tier (pick from an inventory, enter a new tier name)
  * substitute within a tier per dictionary mappings
  * enable text editing: click to select a segment, control-+ to expand the selection to include the next segment, type to replace the selection with new text.
* A multi-tier editorial display should be provided, to see other tiers while editing a tier.
* Presentation for learning may be computer controlled based on a model of the reader/learner's knowledge, or manually parameterized.

Anyway, we now have data to support the learner. A map to the meanings of the grammatical encodings like "abl" (ablative, 'away from') and "subj" (subjunctive, 'possibly') should be a click away. A map to a concordance for any morpheme should be a click away. A map to an IPA reference, and to a pronunciation guide and script description, should be a click away. An audio format where the text is performed in a recording with a bouncing-ball display should be a click away. All this may be hidden and the image/video/audio media (dis-)played, with the display of all tiers, or of a parameterized, selected subset, a click away during playback, for when the audience is puzzled and wants to understand the part they just heard but didn't understand.

-----------------------------------------------------------------------

I had a vision while sleeping, evidently fitfully, of translation scholars as being like Uber drivers, and of this system, scaled into the cloud to support cloud-published documents, as being like Uber -- though a multi-use publication model rather than a single-use service model, of course. In short, it sort of enables a craft industry of translators, who would get paid by the readership based on use. Naturally there will be a popularity-contest effect, where desired content will pay certain scholars more (a very few a very great deal more), and obscure languages -- worse yet, doubly obscure language pairs -- will see insignificant use, and therefore insignificant income for the craft linguists doing the translations. But still, a market is powerful in getting what people want created, and virtuous obscurities can be paid for by virtue-funding patrons and sponsors. In any case, a supporting system like this, which makes the jobs both of active translators and of the learners accessing the L2 content so much easier, will support the maximum and fastest possible proliferation of translations, just by the power of ease of content creation within it.

Because it seems the lowly translator has not got enough of their due. They are paid their stipend or pittance by the publisher, as an overhead of the main publication business for its minor L2 market, and the state of translations of all things is poor: weak, hidden, inaccessible, underproliferated. Maybe a general way to monetize the translator's labors would change that. At the same time, translation -- here being the creation of content for learnability -- in a way reduces the translator/linguist's long-term power, significance, and position, since pretty soon lots of others know that language, and once it is learned they don't need that learning content any more; they can understand L2 in its own form without the translator's support. There are various ramifications. But I think we are heading in a virtuous direction.

A question that is coming soon: whether baseline documents are just URLs, perhaps locked and stored, with TGs an overlay from a different cloud service, and learners using the TG UI to access that content more easily.
Versus: a developed TG for a given audience and language pair, self-published in TG form. The former tends toward SaaS, the latter toward freely available internet standards, such as a definition of TGML. I think both are required, but the mix is unclear. JS UI code and modified browsers, probably some mix of open source and proprietary, to keep some advantage for a supervised market. The best would be whatever enables the most active market of tech-enabled and tech-enabling craft linguists, creating content that pays them the most by teaching the most to the most. SaaS seems to win on this front. It is not inconvenient that an RDB be involved. So perhaps this has the chance to become an internet startup. On the other hand, perhaps this is all old tech inside some MT division, and we have nothing.

Whatever; it needs to go, just to save some Sanskrit documents -- and also I need to see the world through this lens, because I'm so frustrated that I don't understand every single language when I hear it! Indeed, an ultimate target is live tech, where people use an L2 access and learning appliance to explore the world. Maybe it will reside in earbuds -- for auditory learners! For visual learners it will be JS-enhanced web browsers with a cloud service, providing learnability to increasing fractions of the L2 world. For action-based learners, what do we offer? There is a concept of shared performance of a play: a family or troupe downloads a shared play or document and, perhaps not physically co-present but as orchestrated by some TG-ish system, each takes their turn and records their part into the document. Or the TG UI can demand interactivity from the user, to learn what they know. Repeat, and get some phonetic training for your accent, etc. A world of things here.