Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

JSON Output Reference

Status: Reference Last updated: 2026-05-11 23:45 EDT

This document describes the structure of JSON produced by chatter to-json. For the formal JSON Schema, see JSON Schema.

Quick Start

# Default: parse + validate + align, pretty-printed, schema-checked
chatter to-json file.cha

# Write to file
chatter to-json file.cha -o file.json

# Skip validation (parse only, faster)
chatter to-json file.cha --skip-validation

# Skip alignment only
chatter to-json file.cha --skip-alignment

Validation and alignment are on by default. Use --skip-validation or --skip-alignment to opt out.

Top-Level Structure

{
  "lines": [ ... ]
}

A ChatFile is a flat list of lines. Each line has a line_type discriminator:

line_typeDescription
"header"File header (@Begin, @Languages, @Participants, etc.)
"utterance"Main tier + dependent tiers + alignment
"comment"@Comment: lines

Word Fields

Words are the fundamental unit. Every word in the main tier content array carries these fields:

FieldTypeAlways?Description
type"word"yesDiscriminator
raw_textstringyesExact text from the transcript, including all CHAT markers
cleaned_textstringyesNLP-ready text (shortenings restored, markers stripped)
contentarrayyesStructured breakdown of word parts (see below)
categorystringno"omission", "filler", "nonword", "fragment", "ca_omission"
form_typestringnoSpecial form code: "c", "d", "f", "x", etc.
langobjectnoLanguage marker (see Language-Switched example)
untranscribedstringno"unintelligible" (xxx), "phonetic" (yyy), "untranscribed" (www)

Word content items use "content" for the text value:

{ "type": "text", "content": "dog" }

Computed Fields

cleaned_text and untranscribed are computed from content during serialization. They do not exist as stored fields in the data model.

  • cleaned_text: Concatenates Text and Shortening elements from content. Excludes lengthening markers (:), stress markers, CA elements, overlap points, compound markers, and underline markers. Example: sit(ting)"sitting".

  • untranscribed: Present only when cleaned_text is "xxx", "yyy", or "www".

Word Examples

Simple Word

dog
{
  "type": "word",
  "raw_text": "dog",
  "cleaned_text": "dog",
  "content": [{ "type": "text", "content": "dog" }]
}

Filler

&-uh
{
  "type": "word",
  "raw_text": "&-uh",
  "cleaned_text": "uh",
  "content": [{ "type": "text", "content": "uh" }],
  "category": "filler"
}

Untranscribed

xxx
{
  "type": "word",
  "raw_text": "xxx",
  "cleaned_text": "xxx",
  "content": [{ "type": "text", "content": "xxx" }],
  "untranscribed": "unintelligible"
}

Compound

ice+cream
{
  "type": "word",
  "raw_text": "ice+cream",
  "cleaned_text": "icecream",
  "content": [
    { "type": "text", "content": "ice" },
    { "type": "compound_marker", "content": { "span": { "start": 0, "end": 1 } } },
    { "type": "text", "content": "cream" }
  ]
}

Omission

0she
{
  "type": "word",
  "raw_text": "0she",
  "cleaned_text": "she",
  "content": [{ "type": "text", "content": "she" }],
  "category": "omission"
}

Nonword

&~baba
{
  "type": "word",
  "raw_text": "&~baba",
  "cleaned_text": "baba",
  "content": [{ "type": "text", "content": "baba" }],
  "category": "nonword"
}

Special Form

doggy@c
{
  "type": "word",
  "raw_text": "doggy@c",
  "cleaned_text": "doggy",
  "content": [{ "type": "text", "content": "doggy" }],
  "form_type": "c"
}

Language-Switched

maison@s:fra
{
  "type": "word",
  "raw_text": "maison@s:fra",
  "cleaned_text": "maison",
  "content": [{ "type": "text", "content": "maison" }],
  "lang": { "type": "explicit", "code": "fra" }
}

The lang field has variants: {"type": "shortcut"} (bare @s), {"type": "explicit", "code": "fra"} (@s:fra), and {"type": "multiple", "code": ["eng", "zho"]} (@s:eng+zho).

Utterances

An utterance line contains:

{
  "line_type": "utterance",
  "main": {
    "speaker": "CHI",
    "content": {
      "content": [ ... ],
      "terminator": { "type": "period" },
      "bullet": { "start_ms": 0, "end_ms": 3042 }
    }
  },
  "dependent_tiers": [ ... ],
  "alignments": { ... },
  "utterance_language": { "status": "resolved_default", "code": "eng" },
  "language_metadata": { ... }
}

Key structural points:

  • The utterance body is under "main", not "utterance".
  • content, terminator, and bullet are nested inside main.content.
  • terminator is an object with a type field ("period", "question", "exclamation", etc.), not a bare string.
  • bullet (utterance-level timing) is inside main.content, omitted when absent (not present as null).
  • dependent_tiers, alignments, utterance_language, and language_metadata are top-level siblings of main. Empty dependent_tiers and alignments are omitted when there is nothing to report.

Content Items

main.content.content is a heterogeneous array. Each item has a type discriminator:

TypeDescription
"word"A word token (see Word Fields above)
"event"Non-verbal action (&=laughs)
"pause"Timed or untimed pause ((.), (0.5))
"group"Bracketed group (<word word>)
"separator"Tag markers, linkers, etc.

Dependent Tiers

When present, dependent_tiers is an array of tagged objects:

"dependent_tiers": [
  {
    "type": "Mor",
    "data": {
      "tier_type": "Mor",
      "items": [
        {
          "main": { "pos": "pron", "lemma": "I" }
        },
        {
          "main": { "pos": "verb", "lemma": "want", "features": ["Fin", "Ind", "Pres"] }
        }
      ],
      "terminator": "."
    }
  },
  {
    "type": "Gra",
    "data": {
      "tier_type": "Gra",
      "relations": [
        { "index": 1, "head": 2, "relation": "NSUBJ" },
        { "index": 2, "head": 0, "relation": "ROOT" }
      ]
    }
  }
]
typeTierDescription
"Mor"%morMorphological analysis (POS tags, lemmas, features, clitics)
"Gra"%graGrammatical relations (dependency arcs)
"Pho"%phoPhonological transcription
"Sin"%sinSyntax tier
"Wor"%worWord-level timing (items with inline_bullet)
Other%xxxUser-defined dependent tiers

%wor Tier

The Wor tier contains word items with timing:

{
  "type": "Wor",
  "data": {
    "items": [
      {
        "kind": "word",
        "raw_text": "hello",
        "cleaned_text": "hello",
        "content": [{ "type": "text", "content": "hello" }],
        "inline_bullet": { "start_ms": 100, "end_ms": 300 }
      }
    ],
    "terminator": { "type": "period" }
  }
}

Note that %wor items use "kind" instead of "type" for their discriminator, since "type" is used by the tier envelope.

Alignment Data

When validation runs (the default), the alignments object contains:

  • units: per-tier index arrays (for internal bookkeeping)
  • Named tier pairs (e.g., mor, gra) with alignment mappings
"alignments": {
  "units": {
    "main_mor": [{"index": 0}, {"index": 1}],
    "main_pho": [{"index": 0}, {"index": 1}],
    "main_sin": [{"index": 0}, {"index": 1}],
    "main_wor": [{"index": 0}, {"index": 1}],
    "mor": [{"index": 0}, {"index": 1}]
  },
  "mor": {
    "pairs": [
      { "source_index": 0, "target_index": 0 },
      { "source_index": 1, "target_index": 1 }
    ],
    "errors": []
  }
}

Alignment links each main-tier word (source_index) to its corresponding dependent-tier item (target_index) by position. errors contains any alignment-level diagnostics (count mismatches, etc.) and is [] when alignment validated cleanly.

Headers

Headers use the header object with a type discriminator:

TypeHeaderKey Fields
"utf8"@UTF8,
"begin"@Begin,
"end"@End,
"languages"@Languagescodes
"participants"@Participantsentries (speaker_code, name, role)
"id"@IDlanguage, corpus, speaker, role, age, sex, …
"media"@Mediafilename, media_type, status
"comment"@Commenttext
"date"@Datedate
"options"@Optionsoptions (array of strings)

See the JSON Schema for the complete list of header types and fields.

Timing

Utterance-level timing appears in main.content.bullet:

"bullet": {
  "start_ms": 1234,
  "end_ms": 5678
}

Word-level timing (from %wor tier) appears in inline_bullet on individual words within the Wor dependent tier.