```bash
mkdir evals
cd evals
npm init -y
```
```bash
npm install -D promptfoo
npm install -D @twilio-alpha/assistants-eval
```
Set at least the first two of the following environment variables for your test:

- `TWILIO_ACCOUNT_SID` → The Account SID of the Twilio Account that holds your AI Assistant.
- `TWILIO_AUTH_TOKEN` → The Auth Token linked to that Twilio Account.
- `TWILIO_ASSISTANT_ID` → Optional. To test only one AI Assistant, store its ID here; you can also define it later in your test configuration.
To set these variables, create a `.env` file and use a library like env-cmd to load it. You use env-cmd later in this guide.

Install it by running:

```bash
npm install -D env-cmd
```

Add a `.env` file to your repository and add it to your `.gitignore` file.
Your `.env` file should look like this:
```
TWILIO_ACCOUNT_SID=<your account sid>
TWILIO_AUTH_TOKEN=<your auth token>
```
Add the following to the `"scripts"` section of your `package.json` file:
1"scripts": {2"eval": "cd config && env-cmd -f ../.env promptfoo eval"3}
This script allows you to run `npm run eval`, which uses the `.env` file for your environment variables.
Create a folder called `config`:

```bash
mkdir config
```
Create a file called `promptfooconfig.yml` inside this `config` folder:

```bash
touch ./config/promptfooconfig.yml
```
In `promptfooconfig.yml`, add the following code:
```yaml
description: LLM Evaluation Test Suite
maxConcurrency: 5
targets:
  - id: package:@twilio-alpha/assistants-eval:TwilioAgentProvider
    config:
      assistantId: aia_asst_1111111111
tests:
  - vars:
      prompt: Ahoy!
    assert:
      - type: contains-any
        value:
          - Ahoy
          - Hello
          - Hi
```
Replace `aia_asst_1111111111` with the ID of the AI Assistant you want to test, then run the evaluation:
```bash
npm run eval
```
This first test sends the prompt `Ahoy!` to your AI Assistant and checks that the response contains at least one of the expected greetings: `Ahoy`, `Hello`, or `Hi`.
The result displays in the terminal. For a richer interface, spin up the browser UI:
```bash
npx promptfoo view
```
To test whether the AI Assistant called Knowledge or a Tool, you have two options.
If the AI Assistant can only access specific data through a tool call, validate that the Assistant returned that data in the output.
```yaml
- vars:
    prompt: Hey I just landed and I can't see my bags
    identity: 'email:demo@example.com'
    sessionId: test-lost-bag
  assert:
    - type: contains-all
      value:
        - 'George'
        - 'IAH81241D'
        - 'IAH89751D'
        - 'Nashville'
        - 'missed transfer'
```
To test the tool without relying on the output, use the `usedTool` assertion.
```yaml
- vars:
    prompt: Hey I just landed and I can't see my bags
    identity: 'email:demo@example.com'
    sessionId: test-lost-bag
  assert:
    - type: javascript
      value: package:@twilio-alpha/assistants-eval:assertions.usedTool
      config:
        expectedTools:
          - name: 'Fetch status of checked bag'
            input: 'IAH81241D'
            output: 'missed transfer'
          - name: 'Fetch status of checked bag'
            input: 'IAH89751D'
            output: 'missed transfer'
          - name: 'non existent tool'
            input: 'invalid'
```
For the test to pass, every entry in the `expectedTools` block must match a tool call made by the Assistant. The `name`, `input`, and `output` values are compared with a contains check, so they don't have to be complete matches.
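As an illustration of that contains behavior, a shortened variant of the first two entries above could rely on partial values only. This is a sketch, assuming the substrings below appear in the actual tool name, input, and output of your Assistant's tool call:

```yaml
- type: javascript
  value: package:@twilio-alpha/assistants-eval:assertions.usedTool
  config:
    expectedTools:
      # Partial values: 'checked bag' matches 'Fetch status of checked bag',
      # 'IAH81241' matches 'IAH81241D', and 'missed' matches 'missed transfer'.
      - name: 'checked bag'
        input: 'IAH81241'
        output: 'missed'
```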
To use an LLM to judge the performance of your prompt, use the `llm-rubric` assertion. Add an OpenAI API key to your `.env` file:
```
TWILIO_ACCOUNT_SID=<your account sid>
TWILIO_AUTH_TOKEN=<your auth token>
OPENAI_API_KEY=<your openAI API key>
```
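As a minimal sketch, an `llm-rubric` assertion pairs a prompt with a grading question that the judge LLM answers. For example, you could rephrase the greeting test from the start of this guide like this; the rubric wording here is only an illustration:

```yaml
tests:
  - vars:
      prompt: Ahoy!
    assert:
      # The judge LLM grades the Assistant's response against this question.
      - type: llm-rubric
        value: Does the response greet the user in a friendly way?
```

The conversation-chaining examples later in this guide show `llm-rubric` assertions applied across multiple turns.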
Add the following block to the top of your `promptfooconfig.yml`:
```yaml
defaultTest:
  vars:
    runId: package:@twilio-alpha/assistants-eval:variableHelpers.runId
```
Your configuration file needs this addition because promptfoo doesn't make the eval run ID available to extensions.
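With this in place, and assuming you keep the configuration from earlier in this guide, the top of `promptfooconfig.yml` could look like the following sketch (the assistant ID is still the placeholder you replace with your own):

```yaml
description: LLM Evaluation Test Suite
maxConcurrency: 5

# Makes the eval run ID available to every test as {{runId}}.
defaultTest:
  vars:
    runId: package:@twilio-alpha/assistants-eval:variableHelpers.runId

targets:
  - id: package:@twilio-alpha/assistants-eval:TwilioAgentProvider
    config:
      assistantId: aia_asst_1111111111
```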
To chain tests into the same conversation, give each test the same `sessionId: <session-id>` value and set `runSerially: true` in its `options` block. For example, the following test block runs two tests in the same conversation.
```yaml
tests:
  - vars:
      prompt: What compensation can I expect for my delayed luggage?
      sessionId: test-1234
    options:
      runSerially: true
    assert:
      - type: llm-rubric
        value: Is the response talking about reimbursement of reasonable expenses such as the purchase of essential items like toiletries and clothing?
  - vars:
      prompt: What about if it's permanently lost?
      sessionId: test-1234
    options:
      runSerially: true
    assert:
      - type: llm-rubric
        value: Is this a helpful answer for {{question}}?
```
Optional: To reuse the output of a previous message, set the `storeOutputAs: <variable-name>` name-value pair. You can then use that variable in subsequent messages.
```yaml
tests:
  - vars:
      prompt: What about if it's permanently lost?
      sessionId: test-1234
    options:
      runSerially: true
      storeOutputAs: policy
    assert:
      - type: llm-rubric
        value: Is this a helpful answer for {{question}}?
  - vars:
      prompt: 'Can you simplify this policy to me: {{policy}}'
      sessionId: test-1234
    options:
      runSerially: true
    assert:
      - type: llm-rubric
        value: Is this a simplified version of {{policy}}?
```
To override the default random identity and test an existing customer account, set `identity: <customer-identifier>` in a test case.
```yaml
tests:
  - vars:
      prompt: Hello!
      identity: 'email:demo@example.com'
    assert:
      - type: contains-all
        value:
          - 'George'
```
To test the Perception Engine, use the `{{runId}}` variable as part of your identity. In this case, also add `runSerially: true` to the `options` block of the test.
```yaml
tests:
  - vars:
      prompt: 'You are Esteban and are reaching out to order a new pair of shoes. Make sure you introduce yourself with your name. You should ask what is available and order one pair of Nike that is available in size 9. If asked you live at 123 Fake St in Brooklyn NYC.'
      maxTurns: 6
      sessionId: shoe-order
      identity: 'user_id:esteban-{{runId}}'
    assert:
      - type: llm-rubric
        value: The assistant should confirm the order placed and specify the order number.
      - type: contains-all
        value:
          - 'FedEx'
          - 'Esteban'
      - type: contains-any
        value:
          - '12345678uZ91011'
          - '1 2 3 4 5 6 7 8 u Z 9 1 0 1 1'
    options:
      runSerially: true

  - vars:
      prompt: 'Hi there!'
      sessionId: shoe-order-checkin
      identity: 'user_id:esteban-{{runId}}'
    assert:
      - type: contains-all
        value:
          - 'Esteban'
    options:
      runSerially: true
```
To have another AI play the role of a customer, add the `maxTurns: <int>` name-value pair to your test. In this mode, the value of the `prompt` parameter instructs the AI model acting as the customer; it is not a prompt sent to the AI Assistant.
1tests:2- vars:3prompt: "You are Peter Parker and are reaching out because it's been 22 days since your luggage got lost during a flight. You want to know what kind of compensation you can expect based on policies. You don't have time to find your reference number and don't care. The regular policy is fine."4maxTurns: 45sessionId: test-knowledge6assert:7- type: llm-rubric8value: You got informed that the policy is involves compensation of the weight.