Part one: using Large Language Models for Failure Logic Analysis

News | Posted on Tuesday 30 April 2024

In the first of this three-part blog series, Research and Innovation Fellow Dr Kester Clegg explores Large Language Models’ (LLMs’) ability to ‘explain’ complex texts and asks whether their encoded knowledge is sufficient to reason about system failures in a similar way to human analysts.

This image was created using AI image generator Dashtoon Studio.

Traditionally, such reasoning has required real-world knowledge of failure mechanisms and the subsequent loss of higher-level functionality, i.e. how faults propagate through systems.

An LLM makes an interesting work colleague; you can instruct, interrogate, dispute, correct and repeat its task as many times as you like (or at least until you exceed your usage cap). 

Let’s start with the premise: an LLM can do stuff quickly but badly. Is it still useful for safety analysis?

Generative AI is lightning quick at ‘understanding’ / parsing / explaining complex stuff, generating code, pictures, etc. Why not get an LLM to parse a system description for how the components could fail and generate a fault tree diagram to visualise it? In this scenario, we’re imagining a PSSA (preliminary system safety analysis) stage where there may only be high-level system descriptions available.

Fault tree analysis often requires proprietary tools that use their own graphical formats, and diagrams are slow and inefficient to draw manually. However, some graphical tools have script options, e.g. Visio, LaTeX, etc., that are known to LLMs, enabling us to consider a text-(to-code)-to-diagrams pipeline.

For this we assume the LLM will create a ‘straw man analysis’ that the safety analyst can knock about to improve. The chat-style interface to many LLMs allows you to iterate (and improve) over many answers until you get an answer that is either good enough to finish and polish up yourself or good enough to use directly. So the process pipeline looks like this:

A Large Language Model process pipeline
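As a rough illustration, the pipeline above can be sketched in a few lines of Python. Note this is a minimal sketch with our own placeholder names: the `llm_generate_tikz` stub stands in for a real GPT-4 request, and in practice the resulting `.tex` file would be compiled (e.g. with `pdflatex`) to produce the diagram.

```python
# Sketch of the text-(to-code)-to-diagram pipeline described above.
# The model call is stubbed out so the example runs without an API key;
# in a real pipeline the stub would be a GPT-4 chat request.
from pathlib import Path

def llm_generate_tikz(system_description: str) -> str:
    """Stand-in for the LLM step: return TikZ code for a fault tree."""
    return (
        "\\begin{tikzpicture}\n"
        "\\node {Loss of function};\n"
        "\\end{tikzpicture}\n"
    )

def pipeline(system_description: str, out: Path) -> Path:
    """System description -> (LLM) fault tree code -> .tex input file."""
    tikz = llm_generate_tikz(system_description)
    out.write_text(tikz)
    return out  # ready to compile into a diagram

tex = pipeline("Dual-redundant pump feeding one engine", Path("fault_tree.tex"))
```

The analyst then reviews the generated diagram, corrects the LLM in chat, and re-runs the generation step until the fault tree is acceptable.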

For this preliminary work we decided to pick the best overall performing LLM at the time, GPT-4, and see what the options were for creating a bespoke LLM to do failure logic analysis and visualisation. GPT-4 isn’t a specialist in failure logic; it requires some form of additional fine-tuning for the task. There are various possibilities:

  • Create our own GPT using OpenAI’s GPT Builder.
  • Use an API with vector embeddings for RAG-type retrieval (not really helpful in this instance).
  • Use an ‘agentic’ AI approach, with the task broken up and delegated to agents, perhaps knitted together with Python within a framework such as Ollama, LM Studio, LangChain or similar.

Given this was preliminary work, with no budget and even less time allocated to it, we chose to use OpenAI’s GPT Builder. In theory, this requires no coding ability: just describe to Builder what you want, and it automagically does the rest.

But what is ‘the rest’? Unfortunately, that’s probably the bit you want to do ‘by hand’. Why? Well, in our experience, letting Builder’s chat interface write the system instructions automagically produced weak GPTs, meaning they tended to generalise their answers towards standard GPT-4 responses instead of focusing on our particular use case.

So what do you do if the Builder interface doesn’t work for you? Then you need to write your own ‘system instructions’. A word of caution here: using the ‘Create’ interface in GPT Builder can randomly overwrite your system instructions.

To give your GPT a ‘role’ you need to describe that role in the system instructions and try to force GPT-4 towards a specific behaviour. This is something of a dark art, very similar to prompt engineering, except the user never gets to see it. This is when we started to realise that bespoke GPTs struggle to override their underlying model behaviour. That is perhaps unsurprising when you consider what a GPT is: essentially a user-defined prompt (the ‘system instructions’) layered over the model’s own system prompts. There’s no fine-tuning involved in creating a GPT, and so there’s no realistic chance of overriding any issues you might have with the underlying model performing a specific task.
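To make the layering concrete, here is a minimal sketch of how a ‘role’ sits above the user’s request in a chat-completion message list. The instruction text and names below are illustrative placeholders, not the actual FLAGPT system instructions.

```python
# Illustrative only: a user-defined 'role' is just another prompt
# prepended to the conversation, which is why it cannot fully override
# the base model's own (hidden) system prompts or trained behaviour.

def build_messages(system_instructions: str, user_text: str) -> list[dict]:
    """Compose the message list sent to a chat-completion endpoint."""
    return [
        {"role": "system", "content": system_instructions},
        {"role": "user", "content": user_text},
    ]

# Hypothetical role description nudging the model towards failure logic.
INSTRUCTIONS = (
    "You are a failure logic analyst. Given a system description, "
    "identify component failure modes and how they propagate, then "
    "emit a fault tree as simple node declarations."
)

messages = build_messages(
    INSTRUCTIONS,
    "Dual-redundant fuel pump feeding a single engine.",
)
```

Everything the GPT ‘is’ lives in that first system message; there are no changed model weights, which is why its behaviour can drift back towards stock GPT-4.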

You can test your GPT in the playground without ‘publishing’ it, but you still use up quota (max 40 queries per three hours). Frustratingly, the fact that your GPT works in the playground doesn’t mean you won’t find out the day after publishing it that it has reverted back to standard GPT-4 output. Sure, you can tell it to follow its system instructions and that might work. But how would a user know to do that if they don’t know what is in those instructions? More importantly, why should they need to?

FLAGPT: a Failure Logic GPT

We created FLAGPT in two parts:

  1. We told it how to do failure logic analysis, including how to analyse for redundancy, common cause and subtle cases of failure logic. This part includes instructions on level of detail and chain-of-thought reasoning.
  2. We gave it example code to visualise its failure logic. Initially we tried Visual Basic and C# for Visio, but the generated code wouldn’t play nicely with the Visio fault tree template. Eventually we ended up using the TikZ picture environment in LaTeX, as this could be reduced to simple node declarations in an input file (much less code needed means less chance of errors via hallucinations as output token length grows).
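For readers unfamiliar with TikZ, the sort of output we mean looks like the sketch below. This is our own illustrative example of a small fault tree as nested node declarations, not actual FLAGPT output; the gate and event names are placeholders.

```latex
% Illustrative sketch: a minimal fault tree as TikZ node declarations.
% Compile with pdflatex; node names and layout are placeholders.
\documentclass[tikz]{standalone}
\begin{document}
\begin{tikzpicture}[
    every node/.style={draw, rectangle, align=center},
    level distance=18mm,
    level 1/.style={sibling distance=55mm},
    level 2/.style={sibling distance=45mm},
    level 3/.style={sibling distance=30mm}]
  \node {Loss of function}
    child { node {OR gate}
      child { node {Component A fails} }
      child { node {AND gate (redundancy)}
        child { node {Channel 1 fails} }
        child { node {Channel 2 fails} } } };
\end{tikzpicture}
\end{document}
```

Because the tree is expressed as short, regular declarations like these, the LLM has far fewer tokens to get wrong than it would generating Visio automation code.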

See the next blog, Part two: continuing an exploration of Large Language Models’ (LLMs’) ability to ‘explain’ complex texts, for how we implemented the system instructions.