Guiscard is a French startup working on an open source testing framework for large language models. It can alert developers to the risks of biases, vulnerabilities, and the model’s ability to create malicious or toxic content.
While there is a lot of hype around AI models, machine learning testing systems will quickly become a hot topic as regulation is about to be introduced in the European Union through the Artificial Intelligence Act, and in other countries. Companies developing AI models will have to prove they comply with a set of rules and mitigate risks so they don’t have to pay heavy fines.
Giskard is an AI startup that embraces regulation and one of the first examples of a developer tool that specifically focuses on testing in a more efficient way.
“I’ve worked at Dataiku before, especially in the area of NLP model integration. I can see that when I was in charge of testing, there were a couple of things that didn’t work well when I wanted to apply them to practical cases, and it was very difficult to compare vendor performance on Among them,” Alex Compisi, co-founder and CEO of Giskard, told me.
There are three components behind the Giskard testing framework. First, the company issued Open source Python library Which can be integrated into an LLM project – more specifically Retrieval Augmented Generation (RAG) projects. It’s very popular on GitHub already and is compatible with other tools in machine learning ecosystems, such as Hugging Face, MLFlow, Weights & Biases, PyTorch, Tensorflow, and Langchain.
After the initial setup, Giskard helps you create a test suite that will be used regularly on your model. These tests cover a wide range of issues, such as performance, hallucinations, misinformation, unrealistic output, biases, data leakage, malicious content generation, and spot injection.
“And there are several aspects: You will have the performance aspect, which will be the first thing that comes to a data scientist’s mind. But more and more, you have the ethical aspect, both from a brand image point of view and now from a regulatory point of view,” Compisi said.
Developers can then integrate the tests into a continuous integration and continuous delivery (CI/CD) pipeline so that the tests are run every time there is a new iteration on the codebase. If there is an error, developers receive a scan report in their GitHub repository, for example.
Tests are customized based on the end use case of the model. Companies working on RAG can give access to Giskard’s vector databases and knowledge repositories so that the test suite is as relevant as possible. For example, if you are creating a chatbot that can provide you with information about climate change based on the latest IPCC report and using OpenAI’s LLM, Giskard tests will check whether the model can generate false information about climate change, or contradict itself etc.
Image credits: Guiscard
Giskard’s second product is the AI Quality Center which helps you debug a large language model and compare it to other models. This quality center is part of Giskard Outstanding offer. In the future, the startup hopes to be able to create documentation proving the model’s compliance with regulatory rules.
“We have started selling the AI Quality Hub to companies like Banque de France and L’Oréal – to help them debug errors and find the causes of errors. In the future, this is where we will put all the organizational features,” said Compisi.
The company’s third product is called LLMon. It is a real-time monitoring tool that can evaluate LLM answers to the most common problems (toxicity, hallucinations, fact checking…) before sending the response back to the user.
It currently works with companies that use OpenAI’s APIs and LLMs as their core model, but the company is working on integration with Hugging Face, Anthropic, etc.
Organizing use cases
There are several ways to organize AI models. Based on conversations with people in the AI ecosystem, it’s still unclear whether AI law will apply to basic models from OpenAI, Anthropic, Mistral, and others, or only to applied use cases.
In the latter case, Guiscard seems particularly well placed to alert developers about the potential misuse of LLMs enriched with external data (or as AI researchers call them, Retrieval Augmented Generation, RAG).
There are currently 20 people working at Giskard. “We see a very clear market fit for LLM customers, so we will almost double the size of the team to be the best LLM antivirus on the market,” Compisi said.