How to Test your AI Assistants



Requirements

  • Node.js and npm (or equivalent package manager)
  • A Twilio Account
  • A configured Twilio AI Assistant

Create a new project

mkdir evals
cd evals
npm init -y

npm install -D promptfoo
npm install -D @twilio-alpha/assistants-eval

Setup your environment variables

  1. Set the following environment variables for your test:

    • TWILIO_ACCOUNT_SID → The Account SID for your Twilio Account that holds your AI Assistant.
    • TWILIO_AUTH_TOKEN → The Auth Token linked to your respective Twilio Account.

    If you test only one AI Assistant, you can also store its ID as TWILIO_ASSISTANT_ID. Alternatively, you can define the ID later in your test configuration.

    To set these variables, you have two options:

    1. Set these environment variables on your system or

    2. Create a .env file and use a library like env-cmd to load it. This guide uses this approach later.

      You can install it by running:

      npm install -D env-cmd
  2. Add a .env file to your repository and add it to your .gitignore file.
    Your env file should look like this:

    TWILIO_ACCOUNT_SID=<your account sid>
    TWILIO_AUTH_TOKEN=<your auth token>
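If you prefer option 1 instead, the same variables can be exported directly in your shell; a minimal sketch with placeholder values:

```shell
# Placeholder credentials — replace with the values from your Twilio Console
export TWILIO_ACCOUNT_SID=ACxxxxxxxx
export TWILIO_AUTH_TOKEN=your_auth_token
# Optional: only needed if you test a single AI Assistant
export TWILIO_ASSISTANT_ID=aia_asst_1111111111
```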

Add the following line to your package.json file in the "scripts" section:

"scripts": {
  "eval": "cd config && env-cmd -f ../.env promptfoo eval"
}

This script lets you run npm run eval, which loads the environment variables from your .env file.
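For reference, the resulting package.json should now look roughly like this (the name and version fields come from the npm init -y defaults; the dev dependency versions will vary):

```json
{
  "name": "evals",
  "version": "1.0.0",
  "scripts": {
    "eval": "cd config && env-cmd -f ../.env promptfoo eval"
  },
  "devDependencies": {
    "@twilio-alpha/assistants-eval": "…",
    "env-cmd": "…",
    "promptfoo": "…"
  }
}
```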


  1. Create a folder called config:

    mkdir config
  2. Create a file called promptfooconfig.yml inside this config folder.

    touch ./config/promptfooconfig.yml
  3. In the promptfooconfig.yml, add the following code.

    description: LLM Evaluation Test Suite
    maxConcurrency: 5
    targets:
      - id: package:@twilio-alpha/assistants-eval:TwilioAgentProvider
        config:
          assistantId: aia_asst_1111111111
    tests:
      - vars:
          prompt: Ahoy!
        assert:
          - type: contains-any
            value:
              - Ahoy
              - Hello
              - Hi

Replace aia_asst_1111111111 with the ID of the AI Assistant you want to test.

  4. Run the test.

    npm run eval
  5. This first test performs the following actions:

    • Send a message: "Ahoy!" to your AI Assistant.
    • Check that the output returns "Ahoy," "Hello," or "Hi."

    The result displays in the terminal. For a richer view, spin up the browser interface:

    npx promptfoo view

Testing Knowledge and Tools


To test if the AI Assistant called Knowledge or a Tool, you have two options.

Option 1

If the AI Assistant can only access specific data through a tool call, validate that the Assistant returned that data in the output.

- vars:
    prompt: Hey I just landed and I can't see my bags
    identity: 'email:demo@example.com'
    sessionId: test-lost-bag
  assert:
    - type: contains-all
      value:
        - 'George'
        - 'IAH81241D'
        - 'IAH89751D'
        - 'Nashville'
        - 'missed transfer'

Option 2 (experimental)


To test the tool without relying on the output, use the usedTool assertion.

- vars:
    prompt: Hey I just landed and I can't see my bags
    identity: 'email:demo@example.com'
    sessionId: test-lost-bag
  assert:
    - type: javascript
      value: package:@twilio-alpha/assistants-eval:assertions.usedTool
      config:
        expectedTools:
          - name: 'Fetch status of checked bag'
            input: 'IAH81241D'
            output: 'missed transfer'
          - name: 'Fetch status of checked bag'
            input: 'IAH89751D'
            output: 'missed transfer'
          - name: 'non existent tool'
            input: 'invalid'

For the test to pass, every entry in the expectedTools block must match a tool call. Each name, input, and output value uses a contains check, so the values don't have to match the complete answer.


To use an LLM to judge the performance of your prompt, use the llm-rubric assertion. Add an OpenAI API key to your .env file.

TWILIO_ACCOUNT_SID=<your account sid>
TWILIO_AUTH_TOKEN=<your auth token>
OPENAI_API_KEY=<your openAI API key>
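With the key in place, llm-rubric can be used like any other assertion. A minimal illustrative sketch (the prompt and rubric text below are examples, not part of the guide's test suite):

```yaml
tests:
  - vars:
      prompt: What compensation can I expect for my delayed luggage?
    assert:
      - type: llm-rubric
        value: The response explains reimbursement for delayed luggage.
```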

Handling multi-turn conversations


To the top of your promptfooconfig.yml, add the following block.

defaultTest:
  vars:
    runId: package:@twilio-alpha/assistants-eval:variableHelpers.runId

This addition is necessary because promptfoo doesn't make the eval run ID available to extensions.

To chain tests into the same conversation, add two name-value pairs: sessionId: <session-id> and runSerially: true. For example, the following YAML block defines two tests that execute in the same conversation.

tests:
  - vars:
      prompt: What compensation can I expect for my delayed luggage?
      sessionId: test-1234
    options:
      runSerially: true
    assert:
      - type: llm-rubric
        value: Is the response talking about reimbursement of reasonable expenses such as the purchase of essential items like toiletries and clothing?
  - vars:
      prompt: What about if it's permanently lost?
      sessionId: test-1234
    options:
      runSerially: true
    assert:
      - type: llm-rubric
        value: Is this a helpful answer for {{question}}?

Optional: To reuse the output of a previous message, set the storeOutputAs: <variable-name> name-value pair. You can then use that variable in subsequent messages.

tests:
  - vars:
      prompt: What about if it's permanently lost?
      sessionId: test-1234
    options:
      runSerially: true
      storeOutputAs: policy
    assert:
      - type: llm-rubric
        value: Is this a helpful answer for {{question}}?
  - vars:
      prompt: 'Can you simplify this policy to me: {{policy}}'
      sessionId: test-1234
    options:
      runSerially: true
    assert:
      - type: llm-rubric
        value: Is this a simplified version of {{policy}}?

To override the default random identity and test an existing customer account, set the identity: <customer-identifier> in a test case.

tests:
  - vars:
      prompt: Hello!
      identity: 'email:demo@example.com'
    assert:
      - type: contains-all
        value:
          - 'George'

To test the Perception Engine, use the {{runId}} variable as part of your identity. In this case, add runSerially: true to your options block of the test.

tests:
  - vars:
      prompt: 'You are Esteban and are reaching out to order a new pair of shoes. Make sure you introduce yourself with your name. You should ask what is available and order one pair of Nike that is available in size 9. If asked you live at 123 Fake St in Brooklyn, NYC.'
      maxTurns: 6
      sessionId: shoe-order
      identity: 'user_id:esteban-{{runId}}'
    assert:
      - type: llm-rubric
        value: The assistant should confirm the order placed and specify the order number.
      - type: contains-all
        value:
          - 'FedEx'
          - 'Esteban'
      - type: contains-any
        value:
          - '12345678uZ91011'
          - '1 2 3 4 5 6 7 8 u Z 9 1 0 1 1'
    options:
      runSerially: true

  - vars:
      prompt: 'Hi there!'
      sessionId: shoe-order-checkin
      identity: 'user_id:esteban-{{runId}}'
    assert:
      - type: contains-all
        value:
          - 'Esteban'
    options:
      runSerially: true

To have another AI play the role of a customer, add the maxTurns: <int> name-value pair to your test. The value of your prompt parameter instructs the AI model acting as a customer and is not a prompt sent to the AI Assistant.

tests:
  - vars:
      prompt: "You are Peter Parker and are reaching out because it's been 22 days since your luggage got lost during a flight. You want to know what kind of compensation you can expect based on policies. You don't have time to find your reference number and don't care. The regular policy is fine."
      maxTurns: 4
      sessionId: test-knowledge
    assert:
      - type: llm-rubric
        value: You got informed that the policy involves compensation based on the weight of the luggage.


Copyright © 2025 Twilio Inc.