Automatic Metadata Generation for New Cards
Francesco Paterni
This proposal introduces automatic metadata creation for every new card generated within Recall AI.
When a card is created from external content, or when content is added to a blank card, the system should automatically generate and attach relevant metadata. This may include:
- Creation date
- Last updated date
- Source-related information (such as URL, author, and any metadata that may already be embedded in the original document)
The extracted metadata should be structured, editable, and consistently stored across all cards.
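To make "structured, editable, and consistently stored" a little more concrete, here is a minimal sketch of what a per-card record might look like. Everything in it is an assumption for illustration: `CardMetadata`, `new_card_metadata`, `edit_metadata` and the field names are hypothetical and do not describe anything Recall currently has.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class CardMetadata:
    """Hypothetical metadata record attached to one card."""
    created_at: datetime
    updated_at: datetime
    source_url: Optional[str] = None
    author: Optional[str] = None
    # Anything already embedded in the original document (e.g. HTML meta tags).
    embedded: dict[str, str] = field(default_factory=dict)


def new_card_metadata(source_url=None, author=None, embedded=None) -> CardMetadata:
    """Build the record at card creation; both dates start out equal."""
    now = datetime.now(timezone.utc)
    return CardMetadata(
        created_at=now,
        updated_at=now,
        source_url=source_url,
        author=author,
        embedded=dict(embedded or {}),
    )


def edit_metadata(meta: CardMetadata, **changes) -> CardMetadata:
    """Apply user edits to known fields and bump the last-updated date."""
    for name, value in changes.items():
        if not hasattr(meta, name):
            raise AttributeError(f"unknown metadata field: {name}")
        setattr(meta, name, value)
    meta.updated_at = datetime.now(timezone.utc)
    return meta
```

Using one record type for every card, whether it came from a URL or a blank card, is what would keep storage consistent: fields that cannot be extracted simply stay empty until the user fills them in.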
Additionally, metadata should be:
- Searchable and filterable within the Recall home interface
- Usable for sorting and organizing cards (e.g., by date, source, or author)
- Linkable or associable with cards already stored in Recall, enabling better content relationships and deduplication
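As a rough illustration of how metadata could support deduplication, the sketch below compares two cards on two cheap signals: a normalized source URL and a hash of the whitespace-collapsed text. The function names, the card layout (a dict with `url` and `text`), and the choice to strip `utm_` query parameters are all assumptions, and neither signal catches near-duplicates.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of tracking-parameter prefixes to ignore when comparing URLs.
TRACKING_PREFIXES = ("utm_",)


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop tracking params, fragment and trailing slash."""
    parts = urlsplit(url.strip())
    query = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if not k.lower().startswith(TRACKING_PREFIXES)
    ]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(sorted(query)), "")
    )


def content_hash(text: str) -> str:
    """SHA-256 of the text after collapsing whitespace and lowercasing."""
    collapsed = " ".join(text.split()).lower()
    return hashlib.sha256(collapsed.encode("utf-8")).hexdigest()


def match_signals(card_a: dict, card_b: dict) -> list[str]:
    """Return which signals say the two cards are the same item."""
    signals = []
    if card_a.get("url") and card_b.get("url"):
        if normalize_url(card_a["url"]) == normalize_url(card_b["url"]):
            signals.append("url")
    if content_hash(card_a["text"]) == content_hash(card_b["text"]):
        signals.append("content_hash")
    return signals
```

Returning the list of matching signals, rather than a single yes/no, leaves room for the interface to treat a URL-only match differently from an exact content match.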
To enhance usability, the system could also:
- Allow users to manually edit or enrich metadata fields
- Suggest connections between cards based on shared metadata
- Provide metadata-based views (e.g., timeline, source grouping)
This feature would improve organization, discoverability, and knowledge linking across the platform.
Pete Bell
MyRecall's competitive position arguably depends on getting metadata right. If the pitch is knowledge intake at scale then, IMHO, weak metadata isn't a missing feature but a structural gap: search, linking, dedupe and views all rest on it. Treating this as a nice-to-have understates what's actually at stake.
A few suggestions to fold in:
- Schema before extraction: Without a controlled vocabulary, "filter by author" returns John Smith, Smith J., Smith John, and "by John Smith for The Atlantic" as four separate values. Anchor to Dublin Core or schema.org and extend from there (prevents the lock-in problem most users only notice when they try to leave).
- Separate the metadata classes: Bibliographic, system, provenance, semantic and relational fields have different capture mechanics and different trust profiles. Lumping them together produces a schema that's either bloated or vague, and vague schemas don't filter.
- Capture provenance per field: "Author: Jane Doe" pulled from an HTML meta tag, scraped from page text, inferred by an LLM, or entered by a user are four different reliability classes. Storing the origin of each field lets downstream features know what to trust and what to flag.
- Edit history, not just editability: If a user overwrites an extracted value, the original needs to survive somewhere with a change log. Otherwise corrections destroy evidence and there's no way to audit, revert, or distinguish a fix from a mistake.
- Source integrity: URLs rot. A snapshot captured at intake, or at minimum a Wayback fallback stored alongside the live URL, is what keeps "search by source" working past the link-rot horizon.
- Migration for existing cards: Most users already have hundreds or thousands of cards with nothing structured attached. New-card-only metadata leaves the back catalogue unsearchable on the same dimensions. A bulk retroactive pass with a review queue is harder than greenfield and needs its own design.
- Dedupe: URL match, content hash, title similarity and embedding-based near-duplicate detection have different failure modes. URL match misses syndicated content; title similarity over-merges; content hash misses near-duplicates. Pick which signal does which job rather than shipping one "dedupe" button.
- Views scale non-linearly: A timeline of fifty cards is a list. A timeline of five thousand needs aggregation and faceting, probably a different UX. Worth flagging now so v1 of this feature doesn't paint v2 into a corner.
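To illustrate the provenance and edit-history points above, here is a minimal sketch of one way a single metadata field could carry both. It is an assumption-laden example, not a proposed implementation: `Origin`, `Revision` and `MetadataField` are hypothetical names, and the four origin values simply mirror the four reliability classes listed in the provenance bullet.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Origin(Enum):
    """Where a value came from (assumed set of reliability classes)."""
    META_TAG = "meta_tag"
    PAGE_TEXT = "page_text"
    LLM_INFERRED = "llm_inferred"
    USER_ENTERED = "user_entered"


@dataclass(frozen=True)
class Revision:
    """One immutable entry in a field's change log."""
    value: str
    origin: Origin
    recorded_at: datetime


@dataclass
class MetadataField:
    """A named field whose history is append-only, so overwrites keep the original."""
    name: str
    history: list[Revision] = field(default_factory=list)

    def record(self, value: str, origin: Origin) -> None:
        """Append a new value instead of replacing the old one."""
        self.history.append(Revision(value, origin, datetime.now(timezone.utc)))

    @property
    def current(self) -> Revision:
        """Latest revision; raises IndexError if nothing was ever recorded."""
        return self.history[-1]

    def revert(self) -> None:
        """Restore the previous value by appending it again, keeping the log intact."""
        if len(self.history) < 2:
            raise ValueError("nothing to revert to")
        previous = self.history[-2]
        self.record(previous.value, previous.origin)
```

With this shape, a user correcting "Jane Do" to "Jane Doe" adds a `USER_ENTERED` revision on top of the extracted one, and downstream features can read `current.origin` to decide what to trust or flag.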