feat: MCP server refactor with fine-grained tools and incremental update support #66
mambo-wang wants to merge 27 commits
🤖 Generated with [Qoder](https://qoder.com)
- Add IDE_DRIVEN_GUIDE.md with complete walkthrough for using CodeWiki with AI IDEs (CodeBuddy, Cursor, Claude Desktop) via MCP
- Update README with IDE-Driven Mode section and navigation link
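- 8 module documents (Agent tools, CLI tools, CLI core, MCP service, dependency analyzer, shared configuration, frontend service, backend core)
- Repository overview: overview.md
- Module clustering tree: module_tree.json
- All documents include Mermaid architecture diagrams and pass syntax validation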
- Add _detect_changes() with git diff + mtime dual-strategy detection
- Add _find_affected_modules() to map changed files to affected modules
- analyze_repo now returns a 'changes' field with affected/cascade modules
- Decouple codewiki/__init__.py from CLI imports for lightweight MCP startup
- Update skill and IDE_DRIVEN_GUIDE.md with incremental update docs
Previously, CLIDocumentationGenerator never received or forwarded the git commit SHA, so metadata.json always had commit_id: null. This made --update fall back to full regeneration every time. Now the commit hash is obtained before generator creation and threaded through to the backend DocumentationGenerator, matching the behavior already present in Web mode (background_worker.py).
Square logo for GitHub repo avatar and wide banner for README header. Design follows the blue-purple-green gradient palette from the original CodeWiki framework diagram, with a red CN badge for branding.
Square logo corners and banner surrounding area are now transparent instead of white/light-gray, suitable for any background color.
Remove transparent padding around the rounded rectangle by filling corners with a matching dark navy gradient, making the banner a solid rectangle.
- Reduce component_index from 500 to 100 items per page (max 200), drop depends_on from each entry (available via read_code_components)
- Add offset/limit params to analyze_repo for pagination
- Add list_components tool for browsing components without re-analysis
- Reduce leaf_nodes from 100 to 50
- Remove IDE rule files, consolidate into skill files
… hang
- Wrap synchronous tool handlers in asyncio.to_thread() to avoid blocking the event loop (analyze_repo excluded: Tree-sitter C extensions are not thread-safe)
- Disable mermaid-py validation by default (set MERMAID_VALIDATE=1 to enable), add 15s timeout to prevent indefinite hangs
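The event-loop change in this commit can be illustrated with a minimal sketch. It is not the PR's code: `slow_tool_handler` and `handle_tool_call` are hypothetical names, and the 15-second default is borrowed from the Mermaid timeout purely as an example value.

```python
import asyncio
import time


def slow_tool_handler(name: str) -> str:
    """Hypothetical synchronous tool handler that blocks for a short while."""
    time.sleep(0.05)
    return f"done:{name}"


async def handle_tool_call(name: str, timeout: float = 15.0) -> str:
    """Run the blocking handler in a worker thread so the event loop stays free."""
    try:
        return await asyncio.wait_for(
            asyncio.to_thread(slow_tool_handler, name), timeout
        )
    except asyncio.TimeoutError:
        # Surface the timeout to the caller instead of hanging indefinitely.
        return f"timeout:{name}"


async def main() -> list[str]:
    # The two calls overlap because each handler runs in its own worker thread.
    return await asyncio.gather(handle_tool_call("a"), handle_tool_call("b"))


results = asyncio.run(main())
```

A handler that relies on non-thread-safe native code (the commit names analyze_repo and Tree-sitter) would be left out of this wrapper and kept on the loop thread.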
- Fix shell injection in view_repo_file: replace shell=True subprocess with pathlib iteration
- Add path traversal guards in view_repo_file, write_doc_file, edit_doc_file (reject paths escaping repo/output dir)
- Add threading.Lock to SessionStore for concurrent access safety
- Cap max sessions to 10, evict oldest when full
- Cap read_code_components to 50 IDs per call
- Cap edit history to 20 entries per file
- Store edit history as native dict instead of JSON string
- Fix undo to run Mermaid validation after reverting content
- Fix Mermaid validation to report "skipped" instead of false success
- Fix Mermaid timeout to warn instead of silent pass
- Fix pagination hint to point to list_components instead of analyze_repo
- Clamp offset to non-negative in _build_component_index
- Add smoke test covering all critical paths (25 assertions)
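A path traversal guard of the kind listed above could look roughly like this. The sketch assumes a hypothetical helper name, `resolve_inside`, and is not taken from the PR; it only shows the resolve-then-compare idea.

```python
import tempfile
from pathlib import Path


def resolve_inside(base_dir: str, user_path: str) -> Path:
    """Resolve user_path under base_dir, rejecting anything that escapes it."""
    base = Path(base_dir).resolve()
    # Joining an absolute user_path discards base, so the check below rejects it too.
    candidate = (base / user_path).resolve()
    if not candidate.is_relative_to(base):
        raise ValueError(f"path escapes base directory: {user_path}")
    return candidate


with tempfile.TemporaryDirectory() as tmp:
    inside = resolve_inside(tmp, "docs/overview.md")
    try:
        resolve_inside(tmp, "../outside.md")
        escaped = True
    except ValueError:
        escaped = False
```

Resolving both sides before comparing means `..` segments and symlinks are normalized first, which a plain string-prefix check would miss.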
- Reduce _MAX_RESPONSE_LEN 32000→24000, _MAX_COMPONENTS_PER_CALL 50→20
- Add per-component source truncation at 8000 chars
- Write metadata.json (git commit_id + timestamp) on close_session to enable incremental update detection on next analyze_repo
- Update smoke test assertions to match new caps
- Rewrite MCP 服务.md: add list_components tool, thread-safe SessionStore, path traversal guards, incremental update mechanism, multi-layer truncation
- Update 后端核心.md: document mermaid validation degradation strategy
- Update overview.md: tool count 9→10, MCP module component count 27→38
- Refresh module_tree.json with new MCP components
Replace large data transmission through stdio MCP protocol with
file-based side channels, enabling support for larger codebases.
Key changes:
- Add SessionWorkspace for per-session disk workspace management
- Write component index, leaf nodes, source files to {repo}/.codewiki/sessions/
- Remove list_components and view_repo_file tools (agent uses native file reading)
- Remove all truncation/pagination limits from MCP responses
- Fix Windows GBK encoding issues with explicit utf-8 encoding
- Update skill SKILL.md to v2.0.0 reflecting new 8-tool architecture
- Regenerate wiki docs with updated module structure (19 docs)
anhnh2002 left a comment
Nice refactor overall, the session/tools split is clean and the CLI commit_id fix is correct. A few things to address before merge.
Blockers: the mcp SDK is still not a declared dependency (the pyproject change only registers packages), and view_repo_file shells out with an interpolated path which breaks on spaces and allows command injection.
Should-fix: path traversal on agent-supplied filenames in the read/write tools, and the incremental-update feature only works for CLI-generated docs since the MCP flow never writes metadata.json. Details inline.
- Add mcp>=1.0.0 to pyproject.toml dependencies (fixes ModuleNotFoundError on fresh install)
- Add explicit utf-8 encoding to FileManager file operations
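✅ Issue 2 (command injection in code_reader.py): the current code_reader.py has been completely rewritten. It contains no subprocess or shell=True calls and writes files using pure-Python Path operations, so the command injection vulnerability is eliminated.
Overall: ✅ the mcp SDK is now a declared dependency (just added).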
anhnh2002 left a comment
Nice work on the MCP refactor and the incremental-update support, the file side-channel design is clean. Two things to sort before merge.
First, unrelated content from the fork needs to come out: CodeWiki介绍.md and img/logo-banner.png are unrelated, and IDE_DRIVEN_GUIDE.md + skills/codewiki-wiki-generator/SKILL.md hardcode the fork name CodeWiki-CN and a mambo-wang/CodeWiki-CN clone URL that should point at upstream. The guide also references .codebuddy/.../RULE.mdc and .cursorrules rule files that aren't in the PR.
Second, a few code issues worth fixing: the off-by-start line numbers in edit_doc_file snippets, the Windows path-separator mismatch in mtime detection, the substring over-matching in _find_affected_modules, the missed staged changes in git detection, and the Mermaid validation being off by default while the docs advertise it as automatic. Details inline.
This is a standalone marketing article and isn't related to the MCP refactor. It shouldn't land in the upstream repo, please drop it from the PR.
This banner asset is unrelated to the feature. Please remove it from the PR.
```bash
# 1. Clone the project
git clone https://github.com/mambo-wang/CodeWiki-CN.git
```
This points at the fork (mambo-wang/CodeWiki-CN), and CodeWiki-CN is used throughout this file (the cd here and the three cwd: "/path/to/CodeWiki-CN" configs below). Please scrub these to the upstream FSoft-AI4Code/CodeWiki. Also, the CodeBuddy/Cursor sections reference .codebuddy/rules/codewiki-wiki-generator/RULE.mdc and .cursorrules, but neither file is included in this PR, so that setup won't work as written.
```diff
@@ -0,0 +1,181 @@
---
name: codewiki-wiki-generator
description: "Generate Wiki documentation for code repositories using CodeWiki-CN MCP tools. Use this skill when the user asks to generate a Wiki, code documentation, repository documentation, or analyze codebase structure. Requires CodeWiki-CN MCP server to be configured."
```
The fork name CodeWiki-CN is hardcoded here ("using CodeWiki-CN MCP tools", "Requires CodeWiki-CN MCP server") and again in the body ("Use CodeWiki-CN's MCP tools"). Please change these to the upstream CodeWiki name.
```python
lines = new_content.split("\n")
start = max(0, replacement_line - 4)
end = min(len(lines), replacement_line + new_str.count("\n") + 5)
snippet = "\n".join(f"{i + start + 1:6}\t{lines[i]}" for i in range(start, end))
```
The line numbers in this snippet are wrong. i already runs over absolute indices from range(start, end), so i + start + 1 double-counts start (with start=10 the first line gets labeled 21 instead of 11). It should just be i + 1. Same issue in the insert branch at line 193. The agent reads this snippet back, so a wrong number can send its next insert to the wrong line.
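To make the numbering fix concrete, here is a small standalone sketch. `numbered_snippet` is a hypothetical helper written for this comment, not a function from the PR; it keeps the same `range(start, end)` loop and labels each line with `i + 1`.

```python
def numbered_snippet(lines: list[str], start: int, end: int) -> str:
    """Label lines[start:end] with their 1-based positions in the full file."""
    # i is already an absolute index, so the label is i + 1 (not i + start + 1).
    return "\n".join(f"{i + 1:6}\t{lines[i]}" for i in range(start, end))


lines = [f"line {n}" for n in range(1, 31)]
snippet = numbered_snippet(lines, start=10, end=13)
first_row = snippet.splitlines()[0]
first_label = int(first_row.split("\t")[0])
```

With `start=10`, the first row is labeled 11 and contains the file's eleventh line, matching the example in the comment above.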
```python
        continue
    try:
        if filepath.stat().st_mtime > prev_time:
            rel_path = str(filepath.relative_to(repo_path))
```
On Windows relative_to yields backslash paths while component IDs use forward slashes, so _find_affected_modules never matches and incremental update silently touches nothing for non-git repos. filepath.relative_to(repo_path).as_posix() would fix it.
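A minimal sketch of the mtime scan with the `as_posix()` normalization applied. `files_modified_since` is a hypothetical stand-in for the PR's detection loop, and the timestamps are arbitrary values chosen for the demonstration.

```python
import os
import tempfile
from pathlib import Path, PureWindowsPath


def files_modified_since(repo_path: Path, prev_time: float) -> list[str]:
    """Return repo-relative paths (forward slashes) of files newer than prev_time."""
    changed = []
    for filepath in repo_path.rglob("*"):
        if filepath.is_file() and filepath.stat().st_mtime > prev_time:
            # as_posix() yields forward slashes on every platform, Windows included.
            changed.append(filepath.relative_to(repo_path).as_posix())
    return sorted(changed)


# The normalization itself, shown on a Windows-style path:
normalized = PureWindowsPath(r"src\pkg\main.py").as_posix()

with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "src").mkdir()
    old_file, new_file = repo / "src" / "old.py", repo / "src" / "new.py"
    old_file.write_text("x = 1\n", encoding="utf-8")
    new_file.write_text("y = 2\n", encoding="utf-8")
    os.utime(old_file, (1_000, 1_000))  # older than the reference time
    os.utime(new_file, (3_000, 3_000))  # newer than the reference time
    changed = files_modified_since(repo, prev_time=2_000)
```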
```python
components = mod_info.get("components", [])
hit = False
for comp in components:
    if any(cf in comp or comp in cf for cf in changed_files):
```
cf in comp over-matches on short names, e.g. changing a.py flags any module containing data.py since "a.py" is a substring. Matching on a path/separator boundary would avoid the false hits.
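One way to match on a path boundary is sketched below. It assumes component IDs of the form `path::Symbol` (as in the `src/main.py::MyClass` example from the sanitizer docstring in this PR); `component_file` and `touches` are hypothetical names.

```python
def component_file(component_id: str) -> str:
    """Return the file part of an ID such as 'src/main.py::MyClass'."""
    return component_id.split("::", 1)[0]


def touches(changed_file: str, component_id: str) -> bool:
    """True when the changed file and the component's file agree on whole path segments."""
    cf = changed_file.replace("\\", "/")
    comp_path = component_file(component_id)
    return (
        comp_path == cf
        or comp_path.endswith("/" + cf)   # changed path is a suffix on a "/" boundary
        or cf.endswith("/" + comp_path)   # component path is a suffix on a "/" boundary
    )


old_behaviour = "a.py" in "pkg/data.py::Loader"      # substring test: false hit
new_behaviour = touches("a.py", "pkg/data.py::Loader")
exact_match = touches("pkg/a.py", "pkg/a.py::Parser")
```

The suffix rule still lets a bare `a.py` match `pkg/a.py`, so a stricter variant could require full repo-relative equality if both sides are known to be normalized.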
```python
for item in repo.untracked_files:
    if item not in changed:
        changed.append(item)
for file_path in [d.a_path for d in repo.index.diff(None)]:
```
index.diff(None) only catches unstaged edits, so staged-but-uncommitted changes (git add without a commit) are missed. index.diff('HEAD') would cover those too.
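A sketch of collecting untracked, unstaged, and staged paths together. `collect_changed_files` is a hypothetical name; the `repo` argument is expected to behave like a GitPython `git.Repo`, and a stand-in object is used below so the example runs without a real repository.

```python
from types import SimpleNamespace


def collect_changed_files(repo) -> list[str]:
    """Gather untracked, unstaged, and staged paths without duplicates."""
    changed: list[str] = []

    def add(path) -> None:
        if path and path not in changed:
            changed.append(path)

    for item in repo.untracked_files:
        add(item)
    for diff in repo.index.diff(None):     # index vs. working tree: unstaged edits
        add(diff.a_path)
    for diff in repo.index.diff("HEAD"):   # index vs. HEAD: staged, uncommitted edits
        add(diff.a_path)
    return changed


# Stand-in for git.Repo, for illustration only.
def _fake_diff(other):
    if other is None:
        return [SimpleNamespace(a_path="unstaged.py")]
    return [SimpleNamespace(a_path="staged.py")]


fake_repo = SimpleNamespace(
    untracked_files=["new.py"],
    index=SimpleNamespace(diff=_fake_diff),
)
changed = collect_changed_files(fake_repo)
```

In a repository with no commits yet, `HEAD` likely does not resolve, so real code would probably need to guard the second diff call.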
```python
# mermaid-py spawns a Node.js subprocess that can hang indefinitely (e.g. when
# Node.js is missing or the mermaid CLI is misconfigured). Default to
# disabled; set MERMAID_VALIDATE=1 to enable.
_MERMAID_PY_BROKEN = os.environ.get("MERMAID_VALIDATE", "0") != "1"
```
With this default, validation is off unless MERMAID_VALIDATE=1, and since _PYTHONMONKEY_BROKEN is also true on 3.12+ (the minimum version), every write/edit returns "validation skipped". The tool descriptions and SKILL.md promise automatic Mermaid validation though, so either flip the default or update the docs so agents don't trust a check that isn't running.
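Whichever default is chosen, the result an agent sees should distinguish "skipped" from "valid". A minimal sketch, assuming a hypothetical `check_mermaid` wrapper and a toy validator in place of mermaid-py:

```python
import os
from typing import Callable, Mapping, Optional


def check_mermaid(
    diagram: str,
    validator: Callable[[str], None],
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return 'skipped', 'valid', or 'invalid: <reason>' for a diagram."""
    env = os.environ if env is None else env
    if env.get("MERMAID_VALIDATE", "0") != "1":
        # Validation is off: say so explicitly rather than reporting success.
        return "skipped"
    try:
        validator(diagram)
    except ValueError as exc:
        return f"invalid: {exc}"
    return "valid"


def reject_empty(diagram: str) -> None:
    """Toy validator used only for this illustration."""
    if not diagram.strip():
        raise ValueError("empty diagram")


off = check_mermaid("graph TD; A-->B", reject_empty, env={})
on = check_mermaid("graph TD; A-->B", reject_empty, env={"MERMAID_VALIDATE": "1"})
bad = check_mermaid("", reject_empty, env={"MERMAID_VALIDATE": "1"})
```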
```python
    Component IDs look like ``src/main.py::MyClass``. We replace any
    character that is not a word char, hyphen, or dot with ``__``.
    """
    return re.sub(r"[^\w\-.]", "__", component_id) + ".src"
```
Different component IDs can sanitize to the same filename (mod/sub::X and mod__sub::X both become mod__sub____X.src), so one silently overwrites the other and read_code_components under-reports written. A short hash suffix would avoid the collision.
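The hash-suffix idea could look like this. `component_filename` is a hypothetical name, and SHA-1 truncated to 8 hex characters is just one reasonable choice for a short, stable suffix.

```python
import hashlib
import re


def component_filename(component_id: str) -> str:
    """Sanitize the ID for readability and append a digest of the original ID."""
    readable = re.sub(r"[^\w\-.]", "__", component_id)
    # The digest is computed from the unsanitized ID, so distinct IDs stay distinct.
    digest = hashlib.sha1(component_id.encode("utf-8")).hexdigest()[:8]
    return f"{readable}-{digest}.src"


first = component_filename("mod/sub::X")
second = component_filename("mod__sub::X")
```

Both names keep the same readable prefix (`mod__sub____X`) but end in different digests, so neither file overwrites the other.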
Summary
- … (--update/commit_id) to MCP analyze_repo, so only changed modules are re-analyzed
- … commit_id passthrough to metadata.json in CLI mode for --update support
- … pyproject.toml

Test plan
- … python -m codewiki.mcp.server and all new tools are listed
- … analyze_repo tool, confirm session is created and dependency graph is built
- … analyze_repo again with commit_id to verify incremental update works correctly
- … --update flag passes commit_id to metadata.json as expected
- … (generate_docs, get_module_tree) still function