Judge Specifications

The Martian SDK provides several types of judges that can be used to evaluate model outputs. Each judge type has its own specification class that defines its behavior.

RubricJudgeSpec

class martian_apart_hack_sdk.judge_specs.RubricJudgeSpec(model_type, rubric, model, min_score, max_score, prescript=None, postscript=None, extract_variables=None, extract_judgement=None)

A specification for a rubric-based judge that evaluates submissions against defined criteria.

This class defines the configuration for a judge that uses a rubric and a language model to evaluate submissions. The judge applies the rubric using the specified model to generate a numerical score within the defined range.

Parameters:
  • model_type (Literal["rubric_judge"]) – The type of judge, must be "rubric_judge".

  • rubric (str) – The evaluation criteria or rubric text that the judge will use to assess submissions.

  • model (str) – The identifier of the language model to be used for evaluation.

  • min_score (float) – The minimum possible score that can be assigned.

  • max_score (float) – The maximum possible score that can be assigned.

  • prescript (Optional[str]) – Optional instructions or context included in the prompt sent to the judge, before the rubric.

  • postscript (Optional[str]) – Optional instructions or context included in the prompt sent to the judge, after the rubric.

  • extract_variables (Optional[Dict[str, Any]]) – Optional configuration for extracting variables from the evaluation.

  • extract_judgement (Optional[Dict[str, Any]]) – Optional configuration for extracting the final judgement details.

Notes

The default prescript is:

You are a helpful assistant that scores responses between ${min_score} and ${max_score} according to the following rubric:

The ${min_score} and ${max_score} are replaced with the min_score and max_score args.

The default postscript is:

Here's the conversation you are judging:
<content>
${content}
</content>

Please evaluate the assistant's response in the conversation above according to the rubric.
Think step-by-step to produce a score, and please provide a rationale for your score.
Your score should be between ${min_score} and ${max_score}.

Your response MUST include:
1. A <rationale>...</rationale> tag containing your explanation
2. A <score>...</score> tag containing your numerical score

The ${content} placeholder is replaced with the conversation being judged; ${min_score} and ${max_score} are replaced with the min_score and max_score args, as in the prescript.

The full judging prompt looks like:

{filled_prescript}

<rubric>
{filled_rubric}
</rubric>

{filled_postscript}
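
The ${...} placeholders use the same syntax as Python's string.Template, and the substitution happens inside the SDK, not in your code. Purely as a minimal sketch of how the pieces fit together (build_prompt is illustrative, not an SDK function):

from string import Template

# Default prescript and postscript, as documented above (postscript abbreviated).
PRESCRIPT = (
    "You are a helpful assistant that scores responses between "
    "${min_score} and ${max_score} according to the following rubric:"
)
POSTSCRIPT = (
    "Here's the conversation you are judging:\n"
    "<content>\n${content}\n</content>\n\n"
    "Your score should be between ${min_score} and ${max_score}."
)

def build_prompt(rubric, content, min_score, max_score):
    # Illustrative only: fill the placeholders, then join the three parts.
    scores = {"min_score": min_score, "max_score": max_score}
    filled_prescript = Template(PRESCRIPT).substitute(scores)
    filled_postscript = Template(POSTSCRIPT).substitute(content=content, **scores)
    return f"{filled_prescript}\n\n<rubric>\n{rubric}\n</rubric>\n\n{filled_postscript}"

print(build_prompt("Score politeness from 1 to 5.", "user: hi\nassistant: hello!", 1, 5))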

Warning

If you override the default prescript or postscript, you must still include the ${min_score}, ${max_score}, and ${content} placeholders in the prompt, and instruct the judge to include the <rationale> and <score> tags in its response. We do not recommend overriding the defaults.
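
If you do override them, a custom postscript can be shaped like the sketch below. The wording is an illustrative assumption; only the RubricJudgeSpec parameters come from the API above, and the required placeholders and tags are preserved:

from martian_apart_hack_sdk import judge_specs

# Hypothetical custom postscript -- keeps ${content}, ${min_score}, and
# ${max_score}, and still demands the <rationale> and <score> tags.
custom_postscript = """Conversation under review:
<content>
${content}
</content>

Score the assistant's final reply against the rubric, from ${min_score} to ${max_score}.
Respond with a <rationale>...</rationale> tag followed by a <score>...</score> tag."""

judge_spec = judge_specs.RubricJudgeSpec(
    model_type="rubric_judge",
    rubric="Score how helpful the response is.",
    model="openai/openai/gpt-4o",
    min_score=1,
    max_score=5,
    postscript=custom_postscript,
)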

Examples

>>> rubric = '''
... You are tasked with evaluating whether a restaurant recommendation is good.
... The scoring is as follows:
... - 1: If the recommendation doesn't meet any of the criteria.
... - 2: If the recommendation meets only some small part of the criteria.
... - 3: If the recommendation is reasonable, but not perfect.
... - 4: If the recommendation is almost perfect.
... - 5: If the recommendation is perfect.
... '''
>>> rubric_judge_spec = RubricJudgeSpec(
...     model_type="rubric_judge",
...     rubric=rubric,
...     model="openai/openai/gpt-4o",
...     min_score=1,
...     max_score=5,
... )


Other Judge Types

The following judge types are also available but are primarily used internally or in advanced use cases:

  • GoldMatchJudge: Similar to RubricJudge but specialized for comparing responses against known good answers.

  • MaxScoreJudge: Takes multiple judges and returns the highest score among them.

  • MinScoreJudge: Takes multiple judges and returns the lowest score among them.

  • ConstantJudge: Always returns a fixed score and reason.

  • AverageScoreJudge: Takes multiple judges and returns their average score.

  • SumJudge: Takes multiple judges and returns the sum of their scores.

  • ExactMatchJudge: Checks if responses exactly match a set of known answers.

  • CaseJudge: Applies different judges based on conditional logic.

For most use cases, the RubricJudgeSpec is recommended as it provides the most flexibility and natural language understanding capabilities.
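
The aggregate judges (Max/Min/Average/Sum) are easiest to read as score combinators over their sub-judges. A conceptual sketch in plain Python, assuming only the behavior described in the list above and none of the SDK's actual classes or signatures:

from statistics import mean

# Toy model of the aggregate judges; the real specs have their own schemas.
def combine(sub_scores, how):
    reducers = {
        "max": max,        # MaxScoreJudge: highest score wins
        "min": min,        # MinScoreJudge: lowest score wins
        "average": mean,   # AverageScoreJudge: arithmetic mean
        "sum": sum,        # SumJudge: total of all scores
    }
    return reducers[how](sub_scores)

sub_scores = [3.0, 4.5, 2.0]  # e.g., three rubric judges scoring one response
assert combine(sub_scores, "min") == 2.0
assert combine(sub_scores, "max") == 4.5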