Documentation — Cognir Research

01 / Philosophy

The Problem We Solve

Research begins in chaos. A researcher does not wake up with a perfectly formed hypothesis — they wake up with a hunch, a contradiction, a pattern noticed in passing, or a frustration with existing literature. The gap between that raw cognitive state and a rigorous, defensible research question is where most projects die.

Traditional tools treat research as a search problem. You type a query, you get papers. But the researcher does not yet know what to query. They do not yet have the vocabulary. They have not yet articulated the boundary between what they know and what they need to know.

The COGNIR ONTOLOGY™ treats research as a transformation problem. It accepts unstructured, half-formed, emotionally charged human thought as input. It outputs ranked, evidence-grounded research questions and a curated literature pathway. The messy stage of research — the stage where most people quit — is compressed from weeks to hours.

Pipeline Stages: per phase, fully autonomous

Enrichment APIs: Semantic Scholar, CrossRef, arXiv

Query Variations: ∞ (synonym expansion + snowballing)

02 / Capabilities

What the System Does

Intent Extraction

Parses unstructured, stream-of-consciousness researcher notes into structured semantic components: core problem, knowledge gap, key concepts, research domains, and notable themes.

Question Generation

Generates 10 candidate research questions derived solely from extracted intent. No hallucination. No external injection. Every question is traceable to the user's original input.

Evidence Collection

Executes multi-query Serper searches, crawls priority academic domains (arXiv, Nature, PubMed, IEEE), and extracts structured metadata including abstracts, publication dates, and citation counts.

Viability Scoring

Multi-dimensional scoring across five axes: Research Activity, Academic Coverage, Specificity, Novelty, and Practicality. Weighted composite produces a final 0-100 viability score.

Literature Curation

Organizes discovered papers into six taxonomic categories: Foundations, Core Evidence, Frontiers, Methodology, Reviews & Meta-Analyses, and Controversies. Each paper is tagged with relevance score and reading priority.

Citation Snowballing

Recursively searches citations and references of top-scored papers to discover seminal works and recent developments that initial queries may have missed.

Phase 1

From Unstructured Ideas to Researchable Questions

The first engine accepts raw researcher cognition — notes, ramblings, half-formed hypotheses — and transforms it into a ranked set of 3 validated research questions. This is not keyword extraction. It is semantic archaeology: digging beneath the surface text to find what the researcher actually means.

Raw Text Understanding

The system performs foundational semantic extraction on the user's raw input. It does not generate questions yet. It does not infer beyond the text. It extracts exactly what is present: the core problem, the apparent goal, key concepts, research domains, and notable themes.

{
  "main_problem": "string describing the core problem",
  "main_goal": "string describing what the user wants to understand",
  "key_concepts": ["concept1", "concept2", "concept3"],
  "research_domains": ["domain1", "domain2"],
  "notable_themes": ["theme1", "theme2"]
}

Temperature: 0.2. Max Tokens: 2048. Model: Laguna M.1.

Intent Compression

This is the critical interpretive layer. The system identifies what the user is actually trying to understand — not what they said, but what they meant. It detects the knowledge gap they are circling around and articulates the high-level research objective that would genuinely serve them.

Input: Extraction from Stage 1 (concepts, domains, themes)

Output: Research intent, knowledge gap, objective, primary domain

Constraint: All inference is grounded in the provided extraction. No hallucinated domains. No invented concepts. The system is explicitly prohibited from adding external knowledge.

Candidate Question Generation

Generates exactly 10 candidate research questions derived solely from the compressed intent and extracted concepts. Questions are varied in approach — some empirical, some theoretical, some applied — but all are researchable, specific, and traceable to the user's original input.

{
  "questions": [
    "How does X mediate the relationship between Y and Z in population P?",
    "What is the comparative efficacy of A versus B under condition C?",
    "..."
  ]
}

Search Query Generation

For each candidate question, the system generates 3 targeted academic search queries. These are not the full question text. They are short (3-7 words), academic-focused, and varied in angle. The queries use Google-style search operators: intitle:, author:, source:, filetype:, quoted phrases, and boolean AND/OR.

Example Query Set

intitle:"cognitive load" AND "working memory"

author:Smith "working memory capacity"

"cognitive load theory" filetype:pdf

Evidence Collection

Executes Serper searches for each query set, then crawls priority academic domains with concurrency control. The crawler extracts title, abstract, headings, body text, publication date, and keywords. Priority domains are queried first: arXiv, Nature, Springer, IEEE, ACM, PubMed, NIH, ResearchGate, Frontiers, ScienceDirect, JSTOR, and others.

Crawl limits: URLs per question, max concurrent requests, timeout, and delay (500ms).

Resilience: If a crawl fails, the pipeline continues. Evidence is additive, not blocking. The system degrades gracefully.
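The concurrency control and graceful degradation described above can be sketched as follows. This is a minimal illustration, not the system's actual crawler: `crawl_all`, `fake_fetch`, and the `max_concurrent` default are hypothetical, and the 500ms figure is assumed to be the per-request delay.

```python
import asyncio


async def crawl_all(urls, fetch, max_concurrent=3, delay_s=0.5):
    """Fetch every URL with bounded concurrency; failed crawls are skipped, not raised."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def crawl_one(url):
        async with semaphore:
            try:
                page = await fetch(url)
            except Exception:
                page = None  # a failed crawl simply contributes no evidence
            await asyncio.sleep(delay_s)  # pause before releasing the slot
            return page

    pages = await asyncio.gather(*(crawl_one(u) for u in urls))
    return [p for p in pages if p is not None]


async def fake_fetch(url):
    # Hypothetical stand-in for the real page fetcher.
    if "broken" in url:
        raise TimeoutError(url)
    return {"url": url, "title": "stub"}


results = asyncio.run(
    crawl_all(
        ["https://arxiv.org/abs/1", "https://broken.example", "https://nature.com/x"],
        fake_fetch,
        max_concurrent=2,
        delay_s=0,
    )
)
```

The broken URL is dropped and the other two pages are returned in their original order, so evidence stays additive even when individual fetches fail.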

Viability Scoring

Each candidate question receives a composite viability score (0-100) computed across five weighted dimensions. This is deterministic scoring, not LLM inference. It runs fast and produces reproducible results.

Research Activity (25%): Recent papers, priority domain hits, abstract quality

Academic Coverage (20%): Priority domain hits × total source volume

Specificity (20%): Optimal word count (10-35), structural precision

Novelty (20%): Recency of evidence, ongoing research signals

Practicality (15%): Measurable indicators, empirical testability
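As a worked example of the weighted composite, the sketch below combines five sub-scores using the weights listed above. It assumes each dimension has already been scored on a 0-100 scale; how the individual sub-scores are computed is not shown here, and the function and key names are hypothetical.

```python
WEIGHTS = {
    "research_activity": 0.25,
    "academic_coverage": 0.20,
    "specificity": 0.20,
    "novelty": 0.20,
    "practicality": 0.15,
}


def viability_score(subscores):
    """Weighted composite of five 0-100 sub-scores, clamped to 0-100."""
    missing = set(WEIGHTS) - set(subscores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    total = sum(WEIGHTS[name] * subscores[name] for name in WEIGHTS)
    return round(min(100.0, max(0.0, total)), 1)


example = {
    "research_activity": 80,
    "academic_coverage": 60,
    "specificity": 70,
    "novelty": 50,
    "practicality": 40,
}
score = viability_score(example)  # 20 + 12 + 14 + 10 + 6 = 62.0
```

Because the weights sum to 1.0, the composite stays on the same 0-100 scale as its inputs, and the same inputs always produce the same score.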

LLM Validation Pass

The top 5 candidates are submitted to a second LLM pass for evidence-backed validation. The model evaluates whether the question is active (ongoing research exists), viable (answerable with current methods), and researchable (empirically or theoretically investigable). The model is instructed to base its judgment only on the compressed evidence summary provided.

{
  "is_active": true,
  "is_viable": true,
  "is_researchable": true,
  "evidence_assessment": "brief assessment",
  "confidence": 0.0 to 1.0,
  "key_finding": "most important finding"
}

Question Improvement

Each validated question is refined for specificity and measurability. The system preserves the exact original topic and intent — it does not change the subject matter. It adds specificity (who, where, what population, what context) and makes the question measurable where possible. If improvement fails, the original question is retained.

Ranking & Selection

Final composite ranking combines the deterministic viability score with validation bonuses (active +10, viable +10, researchable +5) and confidence-weighted adjustments. The top 3 questions are selected and presented with full provenance: original draft, improved version, evidence sources, scoring breakdown, and a direct handoff to Phase 2.

Handoff Protocol

Clicking "Research This Question" on any ranked result automatically populates Phase 2 with the refined question, bypassing the raw-input stage entirely. The two engines are designed to chain seamlessly.

Early Access

See the Ontology in action.

The documentation is comprehensive. The system is more so. Request access to experience the full pipeline on your own research.

Phase 2

LLM-Guided Comprehensive Search of the Entire Web

The second engine accepts a refined research question (from Phase 1 or direct input) and produces a structured, categorized reading list with full provenance. It does not just find papers. It understands the topology of a research field and maps the user's position within it.

Question Decomposition

The system decomposes the research question into structured components for systematic literature search. This includes domain classification, MeSH term extraction, sub-question generation, population/exposure/outcome identification, confounder detection, and controversy mapping.

{
  "domain": "primary scientific domain",
  "summary": "1-2 sentence restatement",
  "key_concepts": ["concept1", ...],
  "mesh_terms": ["MeSH term 1", ...],
  "sub_questions": ["specific sub-question 1", ...],
  "population": "who/what is studied",
  "exposure_intervention": "what is being tested",
  "outcome": "what is being measured",
  "confounders": ["potential confounder 1", ...],
  "known_controversies": ["controversy 1", ...],
  "inclusion_criteria": "what papers to include",
  "exclusion_criteria": "what papers to exclude",
  "sparse_areas": ["underrepresented topics"],
  "suggested_journals": ["journal name 1", ...]
}

Search Strategy Planning

The system generates 14-20 varied search queries across nine strategic categories. This is not brute force. It is surgical precision: primary queries for direct hits, secondary queries for foundational papers, review queries for meta-analyses, methodology queries for study design, author queries for key researchers, date-range queries for recency or classics, synonym expansions for coverage, grey literature queries for reports, and preprint queries for cutting-edge work.

Primary (6-8): intitle:, quotes, MeSH terms

Secondary (3-4): Broader, foundational focus

Review (2-3): Systematic review, meta-analysis

Methodology (1-2): Study design focus

Author (1-2): author:Lastname operator

Date (1-2): after:YYYY, before:YYYY

Synonym (2-3): term1 OR term2 OR term3

Preprint (1-2): site:arxiv.org, site:biorxiv.org

Multi-Query Discovery

Executes all planned queries against Serper with intelligent caching. Results are deduplicated in real time. Each query returns up to 10 organic results. The system respects rate limits with 250ms delays between calls and serves cached responses when available (48-hour TTL).

Caching Strategy

Query results are hashed and stored in localStorage with a 48-hour expiry. This dramatically reduces API costs and improves response times for repeated or similar research topics. Cache keys are content-addressed: serper:<hash(query)>.
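The caching scheme above can be sketched as a small key-value wrapper. This is an illustration only: a plain dictionary stands in for the browser's localStorage, and the choice of SHA-256 and the whitespace and case normalization are assumptions, since the text does not say which hash or normalization is used.

```python
import hashlib
import json
import time

TTL_SECONDS = 48 * 60 * 60  # 48-hour expiry, as described above


def cache_key(query):
    """Content-addressed key of the form serper:<hash(query)>."""
    normalized = " ".join(query.lower().split())
    return "serper:" + hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def cache_put(store, query, results, now=None):
    """Store results as a JSON string together with the time they were saved."""
    now = time.time() if now is None else now
    store[cache_key(query)] = json.dumps({"stored_at": now, "results": results})


def cache_get(store, query, now=None):
    """Return cached results, or None if the entry is absent or expired."""
    now = time.time() if now is None else now
    key = cache_key(query)
    raw = store.get(key)
    if raw is None:
        return None
    entry = json.loads(raw)
    if now - entry["stored_at"] > TTL_SECONDS:
        del store[key]  # evict the stale entry
        return None
    return entry["results"]


store = {}  # stands in for localStorage
cache_put(store, "Cognitive  Load theory", [{"title": "stub"}], now=0)
fresh = cache_get(store, "cognitive load theory", now=3600)
stale = cache_get(store, "cognitive load theory", now=TTL_SECONDS + 1)
```

Values are serialized to JSON strings because localStorage stores only strings. Normalizing the query before hashing lets trivially different spellings of the same query share one cache entry.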

Deduplication & Filtering

A three-layer deduplication pipeline: (1) Fast fuzzy title matching in JavaScript with 82% similarity threshold and prefix hashing. (2) Academic domain filtering against an expanded whitelist of 25+ domains and title pattern matching. (3) LLM semantic deduplication for near-duplicates and off-topic removal on the top 45 candidates.
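The first layer, fuzzy title matching, might look roughly like the sketch below. It uses Python's `difflib.SequenceMatcher` as a stand-in similarity measure, which is an assumption: the text gives the 82% threshold but not the metric. Treating an identical normalized prefix as an immediate duplicate is likewise only one possible reading of "prefix hashing", and `prefix_len` is a hypothetical parameter.

```python
import re
from difflib import SequenceMatcher

THRESHOLD = 0.82  # similarity threshold from the text


def normalize(title):
    """Lowercase and collapse punctuation and whitespace into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def dedupe_titles(titles, prefix_len=40):
    """Keep the first occurrence of each title; drop exact-prefix and fuzzy duplicates."""
    kept = []  # (original title, normalized title) pairs
    seen_prefixes = set()
    for title in titles:
        norm = normalize(title)
        prefix = norm[:prefix_len]
        if prefix in seen_prefixes:
            continue  # fast path: same normalized prefix already seen
        if any(SequenceMatcher(None, norm, k).ratio() >= THRESHOLD for _, k in kept):
            continue  # slow path: fuzzy match against titles kept so far
        kept.append((title, norm))
        seen_prefixes.add(prefix)
    return [original for original, _ in kept]


unique = dedupe_titles(
    [
        "Cognitive Load Theory and Working Memory: A Review",
        "Cognitive load theory and working memory - a review.",
        "Cognitive Load Theory & Working Memory: Review",
        "Deep Learning for Protein Folding",
    ]
)
```

The second title is caught by the prefix check, the third by the fuzzy comparison, and the fourth is kept as a distinct paper.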

Metadata Enrichment

The top 40 deduplicated candidates are enriched via three free academic APIs in priority order: Semantic Scholar (abstracts, citations, venues, PDF links), CrossRef (DOI resolution, publication dates, author lists), and arXiv (preprint abstracts, submission dates). Title similarity matching ensures the correct paper is retrieved even when exact titles differ slightly.

Semantic Scholar: Abstracts, Citations, PDFs

CrossRef: DOI, Dates, Authors

arXiv: Preprints, Summaries

Relevance Scoring

Enriched papers are scored in batches of 12 via LLM evaluation. The model receives the research question, decomposition, and full enriched metadata (abstract, year, venue, citations, authors). It returns relevance (0-100), quality signal, paper type, category, priority, and a 1-2 sentence justification. Papers scoring below 40 are excluded.

Scoring Rubric

85-100: Directly answers a sub-question, high quality, strong venue or high citations.
65-84: Relevant, good background or methods.
40-64: Tangentially useful.
<40: Excluded.
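The rubric and batching rules above can be expressed as a few small helpers. This is a sketch under the assumption that the LLM has already returned a numeric `relevance` for each paper; the function names and band labels are illustrative.

```python
def rubric_band(relevance):
    """Map a 0-100 relevance score to its rubric band; None means excluded."""
    if relevance >= 85:
        return "directly answers a sub-question"
    if relevance >= 65:
        return "relevant background or methods"
    if relevance >= 40:
        return "tangentially useful"
    return None


def batches(items, size=12):
    """Split papers into scoring batches of at most `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def keep_relevant(scored):
    """Drop papers scoring below 40."""
    return [p for p in scored if rubric_band(p["relevance"]) is not None]


batch_sizes = [len(b) for b in batches(list(range(40)))]
kept = keep_relevant([{"relevance": 90}, {"relevance": 39}, {"relevance": 40}])
```

With 40 enriched candidates, this yields three batches of 12 and one of 4, and a paper scoring exactly 40 is retained while one scoring 39 is excluded.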

Citation Snowballing

The top 6 must-read and high-relevance papers become seeds for recursive discovery. The system generates snowball queries targeting citations, references, and co-author networks. Novel candidates are deduplicated against existing results, enriched, and scored. This closes the coverage gap that initial queries inevitably leave open.

Snowball Query Types

"[Title]" cited by

"[Title]" references

author:[Author] [concepts]

"[Title]" filetype:pdf
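As an illustration, the four query templates above could be filled in for a seed paper like this. The seed record, its field names, and the rule of taking the first author's last name are assumptions made for the sketch.

```python
def snowball_queries(seed, concepts):
    """Build the snowball query types for one seed paper."""
    title = seed["title"]
    queries = [f'"{title}" cited by', f'"{title}" references']
    if seed.get("authors"):
        last_name = seed["authors"][0].split()[-1]
        concept_text = " ".join(concepts)
        queries.append(f"author:{last_name} {concept_text}")
    queries.append(f'"{title}" filetype:pdf')
    return queries


# Hypothetical seed paper, for illustration only.
seed = {"title": "Working Memory Capacity", "authors": ["Jane Smith"]}
queries = snowball_queries(seed, ["cognitive load"])
```

A seed with no author metadata simply produces three queries instead of four, so missing enrichment data does not block the snowball pass.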

Reading List Curation

The final curation stage structures the scored papers into a logical reading sequence. A senior-researcher LLM persona evaluates the full pool and produces: an overview synthesis, key themes, a 5-10 paper reading order, a single "start here" recommendation, and six taxonomic categories. The system is ruthless: maximum 25 papers total. Every paper must earn its place.

Foundations: Classic papers & background

Core Evidence: Direct answers to the question

Frontiers: Recent & emerging research

Methodology: Study design & measurement

Reviews & Meta-Analyses: Systematic reviews

Controversies: Conflicting evidence & nulls

Gap Analysis & Synthesis

The final stage identifies what is missing. A methodologist LLM evaluates the curated list against the original decomposition, flags sparse areas, and suggests next search strategies. It reports coverage confidence (high/medium/low) with specific reasoning. This is not a generic disclaimer — it is a targeted assessment of whether the current list adequately covers the research question.

{
  "gaps": [
    {"area": "gap topic", "description": "what's missing and why it matters"}
  ],
  "coverage_summary": "1 sentence on what the list covers well",
  "next_steps": ["suggested next search strategy 1"],
  "confidence": "high|medium|low",
  "confidence_reason": "brief reason"
}

03 / Architecture

API & Infrastructure

LLM Provider: OpenRouter

The system routes all LLM calls through OpenRouter, enabling multi-key rotation for resilience. Four API keys are maintained in a round-robin pool with automatic failover. If one key exhausts its rate limit or fails, the next key is attempted immediately. After two full passes through the pool, the system backs off with exponential delay.

Models: Poolside Laguna M.1 (Phase 1), GPT-OSS 120B (Phase 2). Temperature: 0.15-0.2. Max Tokens: 900-2400.
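The rotation policy described above (round-robin keys, immediate failover, exponential backoff after two full passes) can be sketched as follows. `KeyPool`, the base delay, and the cap on backoff rounds are hypothetical. The `request` callable stands in for an actual OpenRouter call and is expected to raise on failure.

```python
import time


class KeyPool:
    """Round-robin API key pool with failover and exponential backoff."""

    def __init__(self, keys, base_delay=1.0, max_backoffs=3, sleep=time.sleep):
        self.keys = list(keys)
        self.cursor = 0  # persists across calls so load is spread over keys
        self.base_delay = base_delay
        self.max_backoffs = max_backoffs
        self.sleep = sleep

    def call(self, request):
        backoffs = 0
        while True:
            # Two full passes through the pool, failing over immediately.
            for _ in range(2 * len(self.keys)):
                key = self.keys[self.cursor]
                self.cursor = (self.cursor + 1) % len(self.keys)
                try:
                    return request(key)
                except Exception:
                    continue
            if backoffs >= self.max_backoffs:
                raise RuntimeError("all keys failed after backoff")
            self.sleep(self.base_delay * 2 ** backoffs)  # 1s, 2s, 4s, ...
            backoffs += 1


def flaky(key):
    # Hypothetical request: the first two keys are rate limited.
    if key in ("key-1", "key-2"):
        raise RuntimeError("rate limited")
    return f"ok:{key}"


pool = KeyPool(["key-1", "key-2", "key-3", "key-4"], sleep=lambda seconds: None)
first = pool.call(flaky)   # key-1 and key-2 fail, key-3 succeeds
second = pool.call(flaky)  # rotation continues from key-4
```

Keeping the cursor on the pool rather than restarting from the first key on every call is what prevents a single key from bearing the full load.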

Search Provider: Serper

Google Search API via Serper.dev. Returns organic results with title, snippet, URL, and position. Supports up to 10 results per query. All responses are cached locally for 48 hours to minimize API usage and improve latency on repeated topics.

Enrichment APIs

Three free, no-key academic APIs provide metadata enrichment. Each has a 168-hour (7-day) cache TTL. Title similarity matching prevents false positives when exact titles differ.

Semantic Scholar: graph/v1/paper/search

CrossRef: /works

arXiv: export.arxiv.org/api

Web Crawler

Uses allorigins.win CORS proxy for cross-origin page fetching. DOMParser extracts structured content: title, meta description, abstract selectors, headings (h1-h3), body text (max 1500 chars), publication dates, and keywords. Noise elements (scripts, nav, ads, sidebars) are stripped before extraction. Priority domain sorting ensures academic sources are crawled first.
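The extraction step can be approximated outside the browser with Python's standard `html.parser`, as in the sketch below. It is not the DOMParser-based implementation described above: it covers only the title, meta description, h1-h3 headings, and truncated body text, it identifies noise purely by tag name, and the class and constant names are hypothetical.

```python
from html.parser import HTMLParser

NOISE_TAGS = {"script", "style", "nav", "aside", "footer"}
HEADING_TAGS = {"h1", "h2", "h3"}
MAX_BODY_CHARS = 1500  # body text limit from the text above


class PageExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.headings = []
        self._body_parts = []
        self._noise_depth = 0  # > 0 while inside a noise element
        self._current = None   # "title", a heading tag, or None

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._noise_depth += 1
            return
        if self._noise_depth:
            return
        if tag == "meta":
            attr = dict(attrs)
            if attr.get("name") == "description":
                self.description = attr.get("content") or ""
        elif tag == "title" or tag in HEADING_TAGS:
            self._current = tag
            if tag in HEADING_TAGS:
                self.headings.append("")

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS:
            self._noise_depth = max(0, self._noise_depth - 1)
        elif tag == self._current:
            self._current = None

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text or self._noise_depth:
            return
        if self._current == "title":
            self.title += text
        elif self._current in HEADING_TAGS:
            self.headings[-1] += text
        else:
            self._body_parts.append(text)


def extract(html):
    """Return title, description, headings, and body text capped at 1500 chars."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title,
        "description": parser.description,
        "headings": parser.headings,
        "body": " ".join(parser._body_parts)[:MAX_BODY_CHARS],
    }


page = extract(
    "<html><head><title>Sample Paper</title>"
    '<meta name="description" content="A short summary.">'
    '<script>var x = "ignored";</script></head>'
    '<body><nav><a href="/">Home</a></nav>'
    "<h1>Sample Paper</h1><p>Abstract text here.</p>"
    "<aside>Ad banner</aside></body></html>"
)
```

Stripping noise before collecting text keeps navigation links and ad copy out of the 1500-character body budget.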

04 / Trust

Security & Ethics

No Data Retention

Research inputs are processed in real time and never stored on Cognir servers. All caching is local to the user's browser via localStorage. No training data is collected from user queries.

No Hallucination Policy

Every output is traceable to either the user's input or retrieved evidence. The system is explicitly instructed to not infer beyond provided text. When evidence is insufficient, the system reports low confidence rather than inventing sources.

API Key Rotation

OpenRouter keys are rotated automatically in a round-robin pool, with exponential backoff once two full passes through the pool have failed. No single key bears full load. Failed keys are logged but never exposed to the user interface. The system degrades to partial results rather than failing entirely.

Academic Integrity

The system does not write original research, fabricate data, or generate citations that do not exist. It is a discovery and curation tool, not a content generator. All paper links are direct to source publishers or preprint servers.

05 / Future

Roadmap

Q3 2026 — Private Beta

200 researchers. Full two-phase pipeline. Export to Zotero, Mendeley, and BibTeX.

Q4 2026 — Collaborative Workspaces

Shared research projects, annotation layers, advisor review workflows, institutional licenses.

Q1 2027 — Live Literature Monitoring

Automated alerts for new papers matching your research questions. Weekly digest of frontier developments.

Q2 2027 — Causal Inference Layer

Automated identification of causal claims, confounder analysis, and study design quality assessment.

Early Access

You have read the documentation.

You now understand exactly what the system does, how it does it, and why it is built this way. The only thing left is to use it. We are accepting 200 researchers for the private beta. If you are serious about your research, this is where you request access.

No commitment. No credit card. Just research.
