Adding Language Support
This guide outlines the steps required to add parsing support for a new programming language to CodeGraphContext.
1. Architectural Integration
CGC uses a modular parsing system based on Tree-sitter:
- TreeSitterParser (graph_builder.py): The primary generic wrapper that dispatches files to specific language sub-parsers.
- Language Parser Modules (src/codegraphcontext/tools/languages/): Individual Python modules containing:
  - Tree-sitter AST tags queries (<LANG>_QUERIES).
  - A <Lang>TreeSitterParser class inheriting from the parser interface.
  - A pre_scan_<lang> method for rapid initial symbol caching.
- GraphBuilder: Dispatches files to language parsers, resolves imports, and feeds nodes/relationships to the persistence drivers.
2. Step-by-Step Implementation
Step A: Create the Language Parser Module
Create a new file under src/codegraphcontext/tools/languages/ (e.g., typescript.py).
Add standard parser imports:
from pathlib import Path
from typing import Dict, Any, List
from codegraphcontext.tools.languages.base import BaseParser
Step B: Define AST Tag Queries
AST tags are parsed using Tree-sitter query expressions. Define queries to target:
- functions: Standard functions, methods, arrow assignments.
- classes: Class and interface boundaries.
- imports: Syntax specifying external file or module dependencies.
- calls: Function or method invocations.
- variables: Variable declarations and assignments.
Tip: Use the tree-sitter parse CLI command to inspect a sample source file's Concrete Syntax Tree (CST) and find the correct node type names.
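As an illustration of the five query groups listed above, a queries dictionary for TypeScript might look like the sketch below. The dictionary name follows the <LANG>_QUERIES convention, but the capture names (@name, @function, and so on) are assumptions made for this example, and the node types are what the tree-sitter-typescript grammar is believed to expose; confirm them against the output of tree-sitter parse before relying on them.

```python
# Hypothetical query set; capture names are illustrative, not the project's
# confirmed convention. Verify node types against the real grammar.
TYPESCRIPT_QUERIES = {
    "functions": """
        (function_declaration name: (identifier) @name) @function
        (method_definition name: (property_identifier) @name) @function
        (variable_declarator
            name: (identifier) @name
            value: (arrow_function)) @function
    """,
    "classes": """
        (class_declaration name: (type_identifier) @name) @class
        (interface_declaration name: (type_identifier) @name) @class
    """,
    "imports": """
        (import_statement source: (string) @source) @import
    """,
    "calls": """
        (call_expression function: (identifier) @name) @call
        (call_expression
            function: (member_expression
                property: (property_identifier) @name)) @call
    """,
    "variables": """
        (variable_declarator name: (identifier) @name) @variable
    """,
}
```

Keeping one query string per category means each extraction routine in the parser class can compile and run only the query it needs.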
Step C: Implement the Parser Class
Inherit from the base parser and implement AST extraction routines:
class TypescriptTreeSitterParser(BaseParser):
    def __init__(self, generic_parser):
        super().__init__(generic_parser, "typescript")
        self.queries = self.load_queries()

    def parse(self, path: Path, is_dependency: bool = False) -> Dict[str, Any]:
        # Read with an explicit encoding so the bytes passed to Tree-sitter match
        content = path.read_text(encoding="utf-8")
        tree = self.parser.parse(bytes(content, "utf8"))
        # Populate and return standardized AST data structures
        return {
            "functions": self._find_functions(tree, content),
            "classes": self._find_classes(tree, content),
            "calls": self._find_calls(tree, content),
            "imports": self._find_imports(tree, content),
            "variables": self._find_variables(tree, content),
        }
Step D: Implement the Fast Pre-Scan
Define a fast pre-scan routine to map declaration locations before linking call relationships:
def pre_scan_typescript(files: List[Path], parser_wrapper) -> Dict[str, Path]:
    # Returns a dictionary mapping class/function symbol names to file paths.
    ...
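To make the shape of the returned mapping concrete, here is a minimal stand-in for the pre-scan. It deliberately uses a regular expression instead of Tree-sitter queries so that it runs without any grammar installed; a real implementation would presumably use parser_wrapper. The function name pre_scan_typescript_sketch, the pattern, and the first-definition-wins rule are all assumptions made for this example.

```python
import re
import tempfile
from pathlib import Path
from typing import Dict, List

# Hypothetical pattern: matches top-level `function foo` and `class Foo`
# declarations, with optional export/default/async/abstract modifiers.
_DECLARATION = re.compile(
    r"^\s*(?:export\s+)?(?:default\s+)?(?:async\s+)?(?:abstract\s+)?"
    r"(?:function|class)\s+([A-Za-z_$][\w$]*)",
    re.MULTILINE,
)


def pre_scan_typescript_sketch(files: List[Path]) -> Dict[str, Path]:
    """Map declared class/function names to the file that declares them."""
    symbols: Dict[str, Path] = {}
    for path in files:
        try:
            source = path.read_text(encoding="utf-8")
        except (OSError, UnicodeDecodeError):
            continue  # Skip files that cannot be read as UTF-8 text
        for match in _DECLARATION.finditer(source):
            # Assumption: the first file that declares a name wins
            symbols.setdefault(match.group(1), path)
    return symbols
```

A regex misses many declaration forms (generators, arrow assignments, methods), which is one reason to prefer query-based extraction in the actual parser module.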
Step E: Register the Parser
Map the file extension to the new parser class in parser_factory.py:
# Map extension inside the registry
SUPPORTED_LANGUAGES = {
    ".ts": "typescript",
    ".tsx": "typescript",
}
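A registry of this shape can be consulted with a simple suffix lookup. The sketch below assumes a helper named language_for, which is hypothetical and not taken from parser_factory.py; the lower-casing of the suffix is likewise an assumption about how mixed-case extensions should be treated.

```python
from pathlib import Path
from typing import Optional

# Copy of the registry entries shown above, repeated so the sketch runs alone
SUPPORTED_LANGUAGES = {
    ".ts": "typescript",
    ".tsx": "typescript",
}


def language_for(path: Path) -> Optional[str]:
    """Return the registered language name for a file, or None if unsupported."""
    # Path.suffix yields only the final extension, so "types.d.ts" gives ".ts"
    return SUPPORTED_LANGUAGES.get(path.suffix.lower())
```

Returning None for unknown extensions lets the caller skip unsupported files without raising.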
3. Verification & Diagnostic Queries
Once the parser is registered, verify graph extraction using sample source files:
- Index a test codebase:
  cgc index ./tests/fixtures/sample_ts_project/ --force
- Execute verification queries using Cypher:
  - Verify files are parsed:
    cgc query "MATCH (f:File) RETURN f.path, f.language"
  - Verify functions are identified:
    cgc query "MATCH (f:File)-[:CONTAINS]->(fn:Function) RETURN f.path, fn.name"
  - Verify caller links:
    cgc query "MATCH (caller:Function)-[:CALLS]->(callee:Function) RETURN caller.name, callee.name"
Emacs Lisp smoke check
Emacs Lisp support uses the elisp grammar already distributed by tree-sitter-language-pack; no external Emacs process or manual grammar compilation is required for the Tree-sitter path.
To smoke-test the checked-in two-file fixture against an isolated Kuzu database:
tmpdir=$(mktemp -d)
export PYTHONPATH=src
export DEFAULT_DATABASE=kuzudb
export CGC_RUNTIME_DB_TYPE=kuzudb
export CGC_RUNTIME_DB_PATH="$tmpdir/kuzu.db"
uv run python -m codegraphcontext index tests/fixtures/sample_projects/sample_project_elisp --force
uv run python -m codegraphcontext query "MATCH (f:File) WHERE f.path ENDS WITH '.el' RETURN f.name AS file ORDER BY file"
uv run python -m codegraphcontext query "MATCH (fn:Function) WHERE fn.lang = 'elisp' RETURN fn.name AS function ORDER BY function"
uv run python -m codegraphcontext query "MATCH (v:Variable) WHERE v.lang = 'elisp' RETURN v.name AS variable ORDER BY variable"
uv run python -m codegraphcontext query "MATCH (f:File)-[:IMPORTS]->(m:Module) RETURN f.name AS file, m.name AS module ORDER BY file, module"
uv run python -m codegraphcontext query "MATCH (caller:Function)-[:CALLS]->(callee:Function) WHERE caller.lang = 'elisp' RETURN caller.name AS caller_name, callee.name AS callee_name ORDER BY caller_name, callee_name"
rm -rf "$tmpdir"
Expected results include foo-core.el and foo-ui.el, function nodes such as foo-core-greet and foo-ui-render, variable nodes such as foo-core-count and foo-core-loud, module nodes for cl-lib, foo-core, and foo-ui, and direct call edges including foo-ui-render -> foo-core-greet and foo-core-greet -> foo-core-format.
Emacs Lisp SCIP follow-up
The initial Emacs Lisp implementation intentionally stays on the Tree-sitter pipeline. There is no standard scip-elisp indexer to register in EXTENSION_TO_SCIP. The commonly used elisp-refs package is designed as an interactive Emacs reference finder rather than a batch indexer: it searches files recorded in the running Emacs load-history, renders results in a special buffer instead of emitting JSON or SCIP data, and exposes useful Lisp-2 function/variable heuristics only through internal APIs.
A future semantic indexer could reuse those heuristics in a dedicated batch wrapper, but it would still need directory discovery, side-effect-safe loading or buffer creation, line/column conversion from character offsets, structured output, and explicit handling for macro expansion and indirect calls. Until that exists, .el files should continue to use Tree-sitter indexing with documented limitations around arbitrary macro semantics and dynamic dispatch.
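One of the pieces listed above, converting character offsets to line/column pairs, is small enough to sketch. The example assumes the offsets arrive as 1-based Emacs-style point values and that the consumer wants 0-based line and column numbers; both conventions, and the function names, are assumptions for illustration and should be adjusted to whatever the eventual output format requires.

```python
from bisect import bisect_right
from typing import List, Tuple


def line_starts(text: str) -> List[int]:
    """Return the 0-based character offset at which each line begins."""
    starts = [0]
    for index, char in enumerate(text):
        if char == "\n":
            starts.append(index + 1)
    return starts


def point_to_line_col(starts: List[int], point: int) -> Tuple[int, int]:
    """Convert a 1-based point to a 0-based (line, column) pair."""
    offset = point - 1
    if offset < 0:
        raise ValueError("point must be >= 1")
    # The containing line is the last one whose start is <= offset
    line = bisect_right(starts, offset) - 1
    return line, offset - starts[line]
```

Computing the line-start table once per file keeps each conversion to a binary search, which matters when a wrapper reports many references per buffer.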