```bash
mkdir evals
cd evals
npm init -y
```
```bash
npm install -D promptfoo
npm install -D @twilio-alpha/assistants-eval
```
Set at least the first two of the following environment variables for your test:

- `TWILIO_ACCOUNT_SID` → The Account SID of the Twilio Account that holds your AI Assistant.
- `TWILIO_AUTH_TOKEN` → The Auth Token linked to that Twilio Account.
- `TWILIO_ASSISTANT_ID` → Optional. To test only one AI Assistant, store its ID here; you can also define it later in your test configuration.
To set these variables, create a `.env` file and use a library like env-cmd to load it. You use env-cmd later in this guide.

Install it by running:

```bash
npm install -D env-cmd
```

Add a `.env` file to your repository and add it to your `.gitignore` file.
Your `.env` file should look like this:
```
TWILIO_ACCOUNT_SID=<your account sid>
TWILIO_AUTH_TOKEN=<your auth token>
```
Add the following to the `"scripts"` section of your `package.json` file:
1"scripts": {2"eval": "cd config && env-cmd -f ../.env promptfoo eval"3}
This script allows you to run `npm run eval`, which uses the `.env` file for your environment variables.
Create a folder called `config`:

```bash
mkdir config
```
Create a file called `promptfooconfig.yml` inside this `config` folder:

```bash
touch ./config/promptfooconfig.yml
```
In `promptfooconfig.yml`, add the following code:
```yaml
description: LLM Evaluation Test Suite
maxConcurrency: 5
targets:
  - id: package:@twilio-alpha/assistants-eval:TwilioAgentProvider
    config:
      assistantId: aia_asst_1111111111
tests:
  - vars:
      prompt: Ahoy!
    assert:
      - type: contains-any
        value:
          - Ahoy
          - Hello
          - Hi
```
Replace `aia_asst_1111111111` with the ID of the AI Assistant you want to test, then run the evaluation:
```bash
npm run eval
```
This first test sends the prompt `Ahoy!` to your AI Assistant and checks that the response contains at least one of the expected greetings: `Ahoy`, `Hello`, or `Hi`.
The result displays in the terminal. For a richer interface, spin up the browser UI:
```bash
npx promptfoo view
```
To test whether the AI Assistant called Knowledge or a Tool, you have two options.
If the AI Assistant can only access specific data through a tool call, validate that the Assistant returned that data in the output.
```yaml
- vars:
    prompt: Hey I just landed and I can't see my bags
    identity: 'email:demo@example.com'
    sessionId: test-lost-bag
  assert:
    - type: contains-all
      value:
        - 'George'
        - 'IAH81241D'
        - 'IAH89751D'
        - 'Nashville'
        - 'missed transfer'
```
To test the tool without relying on the output, use the `usedTool` assertion.
```yaml
- vars:
    prompt: Hey I just landed and I can't see my bags
    identity: 'email:demo@example.com'
    sessionId: test-lost-bag
  assert:
    - type: javascript
      value: package:@twilio-alpha/assistants-eval:assertions.usedTool
      config:
        expectedTools:
          - name: 'Fetch status of checked bag'
            input: 'IAH81241D'
            output: 'missed transfer'
          - name: 'Fetch status of checked bag'
            input: 'IAH89751D'
            output: 'missed transfer'
          - name: 'non existent tool'
            input: 'invalid'
```
For the test to pass, every entry in the `expectedTools` block must match a tool call made by the Assistant. The `name`, `input`, and `output` values are compared with a contains check, so they don't have to be complete matches.
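As an illustration of that contains behavior, a shortened variant of the first two entries above could rely on partial values only. This is a sketch, assuming the substrings below appear in the actual tool name, input, and output of your Assistant's tool call:

```yaml
- type: javascript
  value: package:@twilio-alpha/assistants-eval:assertions.usedTool
  config:
    expectedTools:
      # Partial values: 'checked bag' matches 'Fetch status of checked bag',
      # 'IAH81241' matches 'IAH81241D', and 'missed' matches 'missed transfer'.
      - name: 'checked bag'
        input: 'IAH81241'
        output: 'missed'
```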
To use an LLM to judge the performance of your prompt, use the `llm-rubric` assertion. Add an OpenAI API key to your `.env` file:
```
TWILIO_ACCOUNT_SID=<your account sid>
TWILIO_AUTH_TOKEN=<your auth token>
OPENAI_API_KEY=<your openAI API key>
```
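As a minimal sketch, an `llm-rubric` assertion pairs a prompt with a grading question that the judge LLM answers. For example, you could rephrase the greeting test from the start of this guide like this; the rubric wording here is only an illustration:

```yaml
tests:
  - vars:
      prompt: Ahoy!
    assert:
      # The judge LLM grades the Assistant's response against this question.
      - type: llm-rubric
        value: Does the response greet the user in a friendly way?
```

The conversation-chaining examples later in this guide show `llm-rubric` assertions applied across multiple turns.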
Add the following block to the top of your `promptfooconfig.yml`:
```yaml
defaultTest:
  vars:
    runId: package:@twilio-alpha/assistants-eval:variableHelpers.runId
```
Your configuration file needs this addition because promptfoo doesn't make the eval run ID available to extensions.
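With this in place, and assuming you keep the configuration from earlier in this guide, the top of `promptfooconfig.yml` could look like the following sketch (the assistant ID is still the placeholder you replace with your own):

```yaml
description: LLM Evaluation Test Suite
maxConcurrency: 5

# Makes the eval run ID available to every test as {{runId}}.
defaultTest:
  vars:
    runId: package:@twilio-alpha/assistants-eval:variableHelpers.runId

targets:
  - id: package:@twilio-alpha/assistants-eval:TwilioAgentProvider
    config:
      assistantId: aia_asst_1111111111
```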
To chain tests into the same conversation, give each test the same `sessionId: <session-id>` value and set `runSerially: true` in its `options` block. For example, the following test block runs two tests in the same conversation.
```yaml
tests:
  - vars:
      prompt: What compensation can I expect for my delayed luggage?
      sessionId: test-1234
    options:
      runSerially: true
    assert:
      - type: llm-rubric
        value: Is the response talking about reimbursement of reasonable expenses such as the purchase of essential items like toiletries and clothing?
  - vars:
      prompt: What about if it's permanently lost?
      sessionId: test-1234
    options:
      runSerially: true
    assert:
      - type: llm-rubric
        value: Is this a helpful answer for {{question}}?
```
Optional: To reuse the output of a previous message, set the `storeOutputAs: <variable-name>` name-value pair. You can then use that variable in subsequent messages.
```yaml
tests:
  - vars:
      prompt: What about if it's permanently lost?
      sessionId: test-1234
    options:
      runSerially: true
      storeOutputAs: policy
    assert:
      - type: llm-rubric
        value: Is this a helpful answer for {{question}}?
  - vars:
      prompt: 'Can you simplify this policy to me: {{policy}}'
      sessionId: test-1234
    options:
      runSerially: true
    assert:
      - type: llm-rubric
        value: Is this a simplified version of {{policy}}?
```
To override the default random identity and test an existing customer account, set `identity: <customer-identifier>` in a test case.
```yaml
tests:
  - vars:
      prompt: Hello!
      identity: 'email:demo@example.com'
    assert:
      - type: contains-all
        value:
          - 'George'
```
To test the Perception Engine, use the `{{runId}}` variable as part of your identity. In this case, also add `runSerially: true` to the `options` block of the test.
```yaml
tests:
  - vars:
      prompt: 'You are Esteban and are reaching out to order a new pair of shoes. Make sure you introduce yourself with your name. You should ask what is available and order one pair of Nike that is available in size 9. If asked you live at 123 Fake St in Brooklyn NYC.'
      maxTurns: 6
      sessionId: shoe-order
      identity: 'user_id:esteban-{{runId}}'
    assert:
      - type: llm-rubric
        value: The assistant should confirm the order placed and specify the order number.
      - type: contains-all
        value:
          - 'FedEx'
          - 'Esteban'
      - type: contains-any
        value:
          - '12345678uZ91011'
          - '1 2 3 4 5 6 7 8 u Z 9 1 0 1 1'
    options:
      runSerially: true

  - vars:
      prompt: 'Hi there!'
      sessionId: shoe-order-checkin
      identity: 'user_id:esteban-{{runId}}'
    assert:
      - type: contains-all
        value:
          - 'Esteban'
    options:
      runSerially: true
```
To have another AI play the role of a customer, add the `maxTurns: <int>` name-value pair to your test. In this mode, the value of the `prompt` parameter instructs the AI model acting as the customer; it is not a prompt sent to the AI Assistant.
1tests:2- vars:3prompt: "You are Peter Parker and are reaching out because it's been 22 days since your luggage got lost during a flight. You want to know what kind of compensation you can expect based on policies. You don't have time to find your reference number and don't care. The regular policy is fine."4maxTurns: 45sessionId: test-knowledge6assert:7- type: llm-rubric8value: You got informed that the policy is involves compensation of the weight.