Configuration

💽 Dataset selection
  • 🏟️ LMArena (Llama4 special)

  • 🏟️ LMArena (2024)

    • 10k subsample of the LMArena text dataset (100k) released alongside the Arena Explorer work. Crowdsourced human annotations in English, collected between June and August 2024, with topic labels generated automatically by the Arena Explorer pipeline. LMArena is also known as Chatbot Arena. (Note: cross-annotated in 3 runs)
    • Source: https://huggingface.co/datasets/lmarena-ai/arena-human-preference-100k
  • 🦙 AlpacaEval

  • 🕊️ Anthropic harmless

    • 5k subsample of human preference pairs favouring harmless responses, taken from Anthropic's RLHF dataset. (Note: legacy annotations, produced with an older version of the annotation pipeline)
    • Source: https://github.com/anthropics/hh-rlhf
  • 🚑 Anthropic helpful

    • 5k subsample of human preference pairs favouring helpful responses, taken from Anthropic's RLHF dataset. (Note: legacy annotations, produced with an older version of the annotation pipeline)
    • Source: https://github.com/anthropics/hh-rlhf
  • 💎 PRISM

    • ~8k human preference pairs from the PRISM dataset, focused on controversial topics and including extensive annotator information. Originally four-way annotations; pairwise preferences were obtained by keeping 1 of the 3 rejected responses. (Note: legacy annotations, produced with an older version of the annotation pipeline)
    • Source: https://huggingface.co/datasets/HannahRoseKirk/prism-alignment
  • 🏋️ OLMo-2 0325 pref-mix

  • 🔄 MultiPref

    • 10k preference pairs, each annotated by 4 human annotators as well as GPT-4-based AI annotators. The 4 human annotators are not the same for every pair (i.e. more than four annotators worked on the dataset overall). (Note: legacy annotations, produced with an older version of the annotation pipeline)
    • Source: https://huggingface.co/datasets/allenai/multipref
  • 🎭 Model Personality Comparison

    • Model Personality Comparison dataset covering the following models: openrouter/openai/gpt-4o-2024-11-20, openrouter/openai/gpt-4.1-mini, openrouter/x-ai/grok-4, openrouter/google/gemini-2.5-pro, openrouter/moonshotai/kimi-k2, openrouter/meta-llama/llama-4-maverick, openrouter/mistralai/magistral-medium-2506, openrouter/anthropic/claude-sonnet-4, openrouter/openai/gpt-oss-20b, openrouter/openai/gpt-5, openrouter/mistralai/mistral-medium-3.1, openrouter/anthropic/claude-sonnet-4.5, openrouter/z-ai/glm-4.6, openrouter/anthropic/claude-haiku-4.5, openrouter/google/gemini-3-pro-preview, openrouter/openai/gpt-5.1, openrouter/openai/gpt-5.1-chat, openrouter/allenai/olmo-3-32b-think, openrouter/allenai/olmo-3-7b-instruct, openrouter/allenai/olmo-3-7b-think, openrouter/mistralai/mistral-large-2512, openrouter/google/gemini-3.1-pro-preview, openrouter/mistralai/mistral-medium-3-5, openrouter/anthropic/claude-opus-4.7. Reference model(s): openrouter/openai/gpt-4o-2024-11-20. Created and annotated using Feedback Forensics; see https://huggingface.co/datasets/rdnfn/ff-model-personality for more details.
    • Source: https://huggingface.co/datasets/rdnfn/ff-model-personality
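
The PRISM entry above says its four-way annotations were reduced to pairwise preferences by keeping 1 of the 3 rejected responses. A minimal Python sketch of that step follows, assuming each record holds one preferred response and three rejected ones, and that the kept response is drawn uniformly at random. The field names (prompt, chosen, rejected) and the helper to_pairwise are hypothetical, chosen for illustration; they are not taken from the PRISM dataset or the Feedback Forensics code.

```python
import random
from typing import TypedDict


class FourWayRecord(TypedDict):
    # Hypothetical schema: one preferred response plus three rejected ones.
    prompt: str
    chosen: str
    rejected: list[str]


class PreferencePair(TypedDict):
    prompt: str
    chosen: str
    rejected: str


def to_pairwise(record: FourWayRecord, rng: random.Random) -> PreferencePair:
    """Keep the chosen response and sample one rejected response (assumed uniform)."""
    return {
        "prompt": record["prompt"],
        "chosen": record["chosen"],
        "rejected": rng.choice(record["rejected"]),
    }


# Toy record, invented for illustration only.
record: FourWayRecord = {
    "prompt": "Example prompt",
    "chosen": "response A",
    "rejected": ["response B", "response C", "response D"],
}

# A seeded generator keeps the subsample reproducible across runs.
pair = to_pairwise(record, random.Random(0))
```

Each four-way record yields exactly one pair, so the number of pairs equals the number of source records.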
🔎 Analysis mode
⚠️ Some configuration options (grouping by column, selecting multiple column annotators) only work correctly when a single dataset is selected. Select a single dataset to use these features.

Results

🎛️ View

Numerical overview

Overall statistics

See guide here for metric details


Annotation metrics

👉 Click on values to view example datapoints | See guide here to learn how each metric is computed and how to interpret it

Metric
Sort by
Sort order

Datapoint viewer

Controls

👥 Annotator 1
👥 Annotator 2
🔍 Filter subset

Datapoint

Feedback Forensics app v0.5.0