feat: apply PyPI heuristics to project-URL classification (#800) #1066

Draft
arun2dot0 wants to merge 1 commit into CycloneDX:main from arun2dot0:feat/issue-800-url-heuristics

@arun2dot0

Description

This is a draft / work-in-progress opened early (as requested in the issue) to get feedback on the approach before completing it.

This extends how project URLs are classified into CycloneDX external-reference types, adopting PyPI's documented heuristics (https://docs.pypi.org/project_metadata/#icons). Today, classification checks only the URL label against an exact-match dict in cyclonedx_py/_internal/utils/cdx.py. This PR moves toward classifying by both label (exact + prefix) and URL host (domain/subdomain), so emitted external references follow the de-facto standard.

Design (agreed in the issue thread)

  • Precedence (label-first): exact label → label prefix (PyPI * semantics) → host suffix → host subdomain-prefix → OTHER. The label is the author's explicit intent; the host only fills gaps. For example, a Funding label on a github.com URL stays OTHER; it does not become VCS.
  • Data/logic separation: all mapping rules live in a new data-only module cyclonedx_py/_internal/utils/url_classifiers.py (four declarative tables). The matcher in cdx.py is pure logic and never changes when rules are added — extending classification is a one-line data edit.
  • Mapping judgment calls (CycloneDX's enum is narrower than PyPI's icon set):
    • Funding / Sponsor / Donation / Donate → OTHER (no CycloneDX funding type; OTHER is more honest than forcing WEBSITE).
    • Chat vs Social split (refines PyPI's flat "social" bucket): Discord/Slack/Gitter → CHAT; Reddit/YouTube/Twitter-X/Mastodon/Bluesky → SOCIAL.
    • CI services (AppVeyor/CircleCI/Codecov/Coveralls/Travis) → BUILD_SYSTEM.
    • google.com left unmapped (too ambiguous) → falls through to OTHER.
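
To make the precedence chain above concrete, here is a minimal sketch of a label-first matcher over four declarative tables. It is not the code in this PR: `RefType` is a local stand-in for CycloneDX's external-reference type enum, the table names and entries are hypothetical, and the label normalization (trim + lowercase) is an assumption.

```python
from enum import Enum
from typing import Optional
from urllib.parse import urlsplit


class RefType(Enum):
    """Local stand-in for the CycloneDX external-reference type enum."""
    VCS = 'vcs'
    DOCUMENTATION = 'documentation'
    CHAT = 'chat'
    SOCIAL = 'social'
    BUILD_SYSTEM = 'build-system'
    OTHER = 'other'


# Four data-only tables. Names and entries are illustrative assumptions.
LABEL_EXACT = {'funding': RefType.OTHER, 'source': RefType.VCS}
LABEL_PREFIX = {'docs': RefType.DOCUMENTATION, 'sponsor': RefType.OTHER}
HOST_SUFFIX = {
    'github.com': RefType.VCS,
    'discord.com': RefType.CHAT,
    'reddit.com': RefType.SOCIAL,
    'codecov.io': RefType.BUILD_SYSTEM,
}
HOST_SUBDOMAIN_PREFIX = {'docs.': RefType.DOCUMENTATION}


def classify(label: str, url: Optional[str] = None) -> RefType:
    """Exact label, then label prefix, then host suffix, then subdomain prefix."""
    key = label.strip().lower()  # assumed normalization
    if key in LABEL_EXACT:
        return LABEL_EXACT[key]
    for prefix, ert in LABEL_PREFIX.items():
        if key.startswith(prefix):
            return ert
    # Host rules only apply when no label rule matched.
    host = (urlsplit(url).hostname or '') if url else ''
    for suffix, ert in HOST_SUFFIX.items():
        if host == suffix or host.endswith('.' + suffix):
            return ert
    for prefix, ert in HOST_SUBDOMAIN_PREFIX.items():
        if host.startswith(prefix):
            return ert
    return RefType.OTHER
```

In this sketch, `classify('Funding', 'https://github.com/sponsors/example')` stops at the exact-label table and never consults the host, which is the label-first behaviour described above. Adding a rule means adding one entry to a table; `classify` itself stays unchanged.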

Status / checklist

  • Task 1 — label classification (this commit): data-only module + exact & prefix label matching; url_label_to_ert(label, url=None) gains an optional url arg (unused until Task 2, keeps back-compat). New table-driven unit tests; flake8/isort/mypy clean.
  • Task 2 — host classification: host-suffix + subdomain-prefix tables, label-first precedence in the matcher, unit tests.
  • Task 3 — wiring: pass the URL through the three callers (poetry.py, pep621.py, packaging.py) and reconcile snapshots.
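
As a rough illustration of the Task 1 shape, the sketch below shows how an optional `url` parameter keeps existing one-argument callers working, together with a table-driven check. Only the `url_label_to_ert(label, url=None)` signature comes from this PR; the table contents, the plain-string return values (the real function returns a CycloneDX enum member), and the helper `failing_cases` are assumptions for illustration.

```python
from typing import List, Optional, Tuple

# Illustrative stand-ins: plain strings instead of CycloneDX enum members.
_EXACT = {'documentation': 'documentation', 'source': 'vcs'}
_PREFIX = {'doc': 'documentation'}


def url_label_to_ert(label: str, url: Optional[str] = None) -> str:
    """Label-only matching; `url` is accepted but ignored at this stage."""
    key = label.strip().lower()
    if key in _EXACT:
        return _EXACT[key]
    for prefix, ert in _PREFIX.items():
        if key.startswith(prefix):
            return ert
    return 'other'


# Table-driven cases: (label, url, expected type).
CASES: List[Tuple[str, Optional[str], str]] = [
    ('Source', None, 'vcs'),
    ('Docs', None, 'documentation'),
    ('Funding', 'https://github.com/sponsors/example', 'other'),
]


def failing_cases() -> List[Tuple[str, Optional[str]]]:
    """Return the (label, url) pairs whose result differs from the table."""
    return [(label, url) for label, url, want in CASES
            if url_label_to_ert(label, url) != want]
```

Because `url` defaults to `None`, a caller that still passes only the label gets the same result as before, so the three callers can be migrated separately in Task 3.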

Feedback on the mapping table and the OTHER-for-funding choice is especially welcome before I finish Tasks 2–3.

Resolves or fixes issue: #800

AI Tool Disclosure

  • My contribution does not include any AI-generated content
  • My contribution includes AI-generated content, as disclosed below:
    • AI Tools: Claude Code
    • LLMs and versions: Claude Opus 4.8
    • Prompts: Find a good-first-issue, then design and implement #800: apply PyPI's project-URL classification heuristics (label + host based) to CycloneDX external-reference type detection. Constraints I directed: full port of PyPI heuristics mapped to the nearest CycloneDX type; label-first precedence; keep the mapping expandable and kept in a data module separate from the matching logic. TDD with table-driven unit tests.

Affirmation

Move URL-label -> external-reference-type mapping into a dedicated
data-only module and add PyPI-style label prefix matching. Adds an
optional `url` argument to url_label_to_ert (unused for now) for the
upcoming host-based classification.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Arun Selvamani <arunselvamani@gmail.com>
@codacy-production

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

🟢 Metrics 0 complexity · 0 duplication

Metric Results
Complexity 0 (≤ 20 complexity)
Duplication 0
