Judge Specifications

The Martian SDK provides several types of judges that can be used to evaluate model outputs. Each judge type has its own specification class that defines its behavior.

RubricJudgeSpec

class martian_apart_hack_sdk.judge_specs.RubricJudgeSpec(model_type, rubric, model, min_score, max_score, prescript=None, postscript=None, extract_variables=None, extract_judgement=None)

A specification for a rubric-based judge that evaluates submissions against defined criteria.

This class defines the configuration for a judge that uses a rubric and a language model to evaluate submissions. The judge applies the rubric using the specified model to generate a numerical score within the defined range.

Parameters:
  • model_type (Literal["rubric_judge"]) – The type of judge, must be "rubric_judge".

  • rubric (str) – The evaluation criteria or rubric text that the judge will use to assess submissions.

  • model (str) – The identifier of the language model to be used for evaluation.

  • min_score (float) – The minimum possible score that can be assigned.

  • max_score (float) – The maximum possible score that can be assigned.

  • prescript (Optional[str]) – Optional instructions or context included in the prompt sent to the judge, before the rubric.

  • postscript (Optional[str]) – Optional instructions or context included in the prompt sent to the judge, after the rubric.

  • extract_variables (Optional[Dict[str, Any]]) – Optional configuration for extracting variables from the evaluation.

  • extract_judgement (Optional[Dict[str, Any]]) – Optional configuration for extracting the final judgement details.

Notes

The default prescript is:

You are a helpful assistant that scores responses between ${min_score} and ${max_score} according to the following rubric:

The ${min_score} and ${max_score} are replaced with the min_score and max_score args.

The default postscript is:

Here's the conversation you are judging:
<content>
${content}
</content>

Please evaluate the assistant's response in the conversation above according to the rubric.
Think step-by-step to produce a score, and please provide a rationale for your score.
Your score should be between ${min_score} and ${max_score}.

Your response MUST include:
1. A <rationale>...</rationale> tag containing your explanation
2. A <score>...</score> tag containing your numerical score

The ${content} placeholder is replaced with the conversation being judged; ${min_score} and ${max_score} are replaced with the min_score and max_score args, as in the prescript.

The full judging prompt looks like:

{filled_prescript}

<rubric>
{filled_rubric}
</rubric>

{filled_postscript}
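
The ${...} placeholders use the same syntax as Python's string.Template, and the substitution happens inside the SDK, not in your code. Purely as a minimal sketch of how the pieces fit together (build_prompt is illustrative, not an SDK function):

from string import Template

# Default prescript and postscript, as documented above (postscript abbreviated).
PRESCRIPT = (
    "You are a helpful assistant that scores responses between "
    "${min_score} and ${max_score} according to the following rubric:"
)
POSTSCRIPT = (
    "Here's the conversation you are judging:\n"
    "<content>\n${content}\n</content>\n\n"
    "Your score should be between ${min_score} and ${max_score}."
)

def build_prompt(rubric, content, min_score, max_score):
    # Illustrative only: fill the placeholders, then join the three parts.
    scores = {"min_score": min_score, "max_score": max_score}
    filled_prescript = Template(PRESCRIPT).substitute(scores)
    filled_postscript = Template(POSTSCRIPT).substitute(content=content, **scores)
    return f"{filled_prescript}\n\n<rubric>\n{rubric}\n</rubric>\n\n{filled_postscript}"

print(build_prompt("Score politeness from 1 to 5.", "user: hi\nassistant: hello!", 1, 5))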

Warning

If you override the default prescript or postscript, you must still include the ${min_score}, ${max_score}, and ${content} placeholders in the prompt, and instruct the judge to include the <rationale> and <score> tags in its response. We do not recommend overriding the defaults.
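
If you do override them, a custom postscript can be shaped like the sketch below. The wording is an illustrative assumption; only the RubricJudgeSpec parameters come from the API above, and the required placeholders and tags are preserved:

from martian_apart_hack_sdk import judge_specs

# Hypothetical custom postscript -- keeps ${content}, ${min_score}, and
# ${max_score}, and still demands the <rationale> and <score> tags.
custom_postscript = """Conversation under review:
<content>
${content}
</content>

Score the assistant's final reply against the rubric, from ${min_score} to ${max_score}.
Respond with a <rationale>...</rationale> tag followed by a <score>...</score> tag."""

judge_spec = judge_specs.RubricJudgeSpec(
    model_type="rubric_judge",
    rubric="Score how helpful the response is.",
    model="openai/openai/gpt-4o",
    min_score=1,
    max_score=5,
    postscript=custom_postscript,
)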

Examples

>>> rubric = '''
... You are tasked with evaluating whether a restaurant recommendation is good.
... The scoring is as follows:
... - 1: If the recommendation doesn't meet any of the criteria.
... - 2: If the recommendation meets only some small part of the criteria.
... - 3: If the recommendation is reasonable, but not perfect.
... - 4: If the recommendation is almost perfect.
... - 5: If the recommendation is perfect.
... '''
>>> rubric_judge_spec = RubricJudgeSpec(
...     model_type="rubric_judge",
...     rubric=rubric,
...     model="openai/openai/gpt-4o",
...     min_score=1,
...     max_score=5,
... )


Other Judge Types

The following judge types are also available but are primarily used internally or in advanced use cases:

  • GoldMatchJudge: Similar to RubricJudge but specialized for comparing responses against known good answers.

  • MaxScoreJudge: Takes multiple judges and returns the highest score among them.

  • MinScoreJudge: Takes multiple judges and returns the lowest score among them.

  • ConstantJudge: Always returns a fixed score and reason.

  • AverageScoreJudge: Takes multiple judges and returns their average score.

  • SumJudge: Takes multiple judges and returns the sum of their scores.

  • ExactMatchJudge: Checks if responses exactly match a set of known answers.

  • CaseJudge: Applies different judges based on conditional logic.

For most use cases, the RubricJudgeSpec is recommended as it provides the most flexibility and natural language understanding capabilities.
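
The aggregate judges (Max/Min/Average/Sum) are easiest to read as score combinators over their sub-judges. A conceptual sketch in plain Python, assuming only the behavior described in the list above and none of the SDK's actual classes or signatures:

from statistics import mean

# Toy model of the aggregate judges; the real specs have their own schemas.
def combine(sub_scores, how):
    reducers = {
        "max": max,        # MaxScoreJudge: highest score wins
        "min": min,        # MinScoreJudge: lowest score wins
        "average": mean,   # AverageScoreJudge: arithmetic mean
        "sum": sum,        # SumJudge: total of all scores
    }
    return reducers[how](sub_scores)

sub_scores = [3.0, 4.5, 2.0]  # e.g., three rubric judges scoring one response
assert combine(sub_scores, "min") == 2.0
assert combine(sub_scores, "max") == 4.5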