Behind the Magic: GenAI in CyberTalent Bridge

Behind the magic of retrieval augmented generation

AI |

artifical intelligence |

By Anderson Wiese, Boss

Everything’s about AI these days, right? Our focus at 2wav is on building advanced applications with practical tools, and our perspective on AI follows that track. In 2024, we helped with a research project that provided a trove of useful insights that will help us build practical AI assisted applications in many knowledge domains. I’m finally getting around to sharing some highlights of this experience.

For the TL;DR, here are some useful takeaways, subject to many caveats and conditions:

We used RAG (retrieval augmented generation) techniques to enhance CyberTalent Bridge™ with new mappings between two published cybersecurity frameworks, comparing popular models and APIs.
Claude 3.5 Sonnet access via AWS Bedrock appeared to perform our task better than GPT-4o mini.
AWS Bedrock model costs appear to be considerably less than OpenAI.
AWS Bedrock knowledge base costs (for serverless OpenSearch) are prohibitively high for many casual applications.
Both models had a slight tendency to hallucinate or ignore instructions, and they were surprisingly persistent in those deviations.
We think ontologies have an important role in LLM applications—no surprise, coming from me. ;-)

Credit Where It’s Due

This opportunity is thanks to Dr. Dippakamur Pravin from the University of North Texas, and his students Benjamin Edgar and Tyrell Richardson. We’ve been working with Dr. Pravin and many of his students for several years as part of the DHS Summer Research Team program. These teams have consistently contributed to CyberTalent Bridge™ with research, implementation techniques, and working product enhancements. We are hugely grateful to Dr. Pravin and our generous benefactors at the US Dept. of Homeland Security. CyberTalent Bridge™ is developed with support from the U.S. Department of Homeland Security under Grant Award Number, 2015-ST-061-CIRC01.

The Mission

CyberTalent Bridge™ allows organizations to define cybersecurity tasks associated with Controls from a requirements framework such as the NIST 800-53, and then receive recommendations about the qualification of the talent pool. Each worker has a CyberTalent Passport™ that associates talent experience, certifications, and education with capabilities described by the NICE Workforce Framework for Cybersecurity (NICE Framework).

screen shot of the CyberTalent Bridge showing a task on the left and qualified talent on the right

Ontologies are at the heart of CyberTalent Bridge’s ability to bridge organizational requirements to talent capabilities. Custom ontologies map the relevance of 800-53 Controls to NICE Framework Work Roles, Tasks, Knowledge, and Skills. These mappings derive from choosable algorithms, and can be adjusted by authorized users.

In this project, we used LLMs with Retrieval Augmented Generation (RAG) techniques to create new ontologies that map controls from the NIST SP 800-53 Security and Privacy Controls [https://csrc.nist.gov/pubs/sp/800/53/r5/upd1/final] to relevant Work Roles from the NICE Framework. Beta customers now have the option to use any of four different base mappings, including new mappings from Claude 3.5 Sonnet or GPT-4o mini.

Screen shot showing choice of mapping, including Claude 3.5 Sonnet

Models, APIs, Knowledge Bases, & Hosting

We sampled a variety of models from the usual suspects: OpenAI, Anthropic, Meta, and AWS. Considering cost as well as performance, we narrowed our focus to Anthropic’s Claude 3.5 Sonnet and OpenAI GPT-4o mini.

We looked briefly at self-hosting open weight models like Llama, but decided that cloud hosted models and APIs are the most practical for projects like this and most 2wav clients. We used OpenAI APIs and cloud-hosted knowledge base (KB) for GPT-4o mini, and Amazon Bedrock for Claude, with a serverless OpenSearch KB.

OpenAI has a friendlier API and documentation. Bedrock is typical of AWS APIs—designed by those in the fold for those in the fold, a nightmare of IAM service roles and permissions. However, Bedrock provides deep access to an enormous range of services, and dozens of different models. 2wav has deep experience with AWS since 2008. Now that we have the foundation laid, Bedrock will probably be our choice for a wider range of future projects.

Cost Comparison

This is not a strictly controlled comparison between OpenAI and Bedrock. My subjective impression is that we accomplished similar work, with similar effort, in a similar number of computational runs.

The cost of OpenAI models appears to be much higher than the Bedrock cost for Anthropic models, but AWS billing doesn’t tell us the exact number of tokens processed. The methodology we used with GPT-4o mini involved several prompts for each control that was processed, so it is likely that our OpenAI approach had higher total input tokens. That said, in our peak months of development the OpenAI model cost was 20x the cost of Claude 3.5 Sonnet on Bedrock.

On the other hand, the cost of maintaining the OpenSearch knowledge base on Bedrock is surprisingly high—about $180/month for a KB with just the 800-53 and NICE ontologies. This is a static, always-on cost, regardless of usage. For the five most active months of the project, this pushed the AWS cost to 50% more than OpenAI. In the future, 2wav will explore hosting our own vectorized KBs, probably in MongoDB.

Stubborn Hallucinations & Inconsistencies

Our results were promising overall, but there are occasional problems evident in the output from both models. Our basic methodology follows:

A knowledge base was constructed for each platform containing the textual content of each 800-53 Control, and the Work Roles, Tasks, Knowledge and Skills from the NICE Framework.
Each model was instructed to consider only the material in the knowledge base. Various prompt engineering techniques were used to discourage the model from using its own inherent training—a basic premise of RAG application programming.
For each of 322 800-53 Controls, the model was asked to name the most relevant Work Roles from NICE 1.0.0, ranking each work role with a relevancy percentage.

We repeated the same prompts for each of 322 800-53 Controls, looking for consistently relevant results in a machine readable format. We were surprised that the models would be well behaved most of the time, but occasionally produce erratic results, and then return to the expected behavior in a later iteration. Some examples of this erratic behavior included:

For some Controls, the model would cite the same relevancy for every Work Role. Frequently this would be 0 or 1.
Occasionally, relevancy would be expressed in a decimal instead of a percentage.
Sometimes the model suggested work roles which do not exist, which is a clear hallucination.
Claude had a tendency to cite Work Roles from the 2017 framework, which violated the instruction to consider only Work Roles from the NICE 1.0.0 knowledge base.

We made efforts to correct these mistakes with further prompt adjustments, and asking for a “redo” when invalid answers were detected. Both models were surprisingly stubborn about repeating the same errors, often on the same controls, even when told why the errors were incorrect and asked to try again. Our limited research project concluded without a 100% solution to these problems.

Ontologize!

Using the models to create new ontologies is subtly different from using the models directly within the CyberTalent Bridge application. This approach has several advantages for us:

There is no ongoing cost of AI integration once the ontology has been created.
Recommendations made by CyberTalent Bridge using the ontology are deterministic and repeatable.
We can use ontology reasoning to sanity check the results. For example, we can easily verify that recommended work roles exist.

Formal Reasoners with AI

We are optimistic about the use of ontologies to check the consistency of AI results. Reasoners can check that a knowledge graph is consistent with the relationships expressed in an ontology. If GenAI output can be transformed into formal vocabularies, as we did in this project, then a reasoner can confirm that the output does not contain obvious logical flaws.

At present we do this outside the model, but in the future these practices could be trained into the models or embedded tools. We currently have the option with both platforms to request output as JSON that conforms to a JSON Schema. It would be useful to request output as JSON-LD conforming to a JSON-LD @context. Ultimately, it would be fantastic if the models could produce output that is consistent with the ontologies named by the JSON-LD @context.

NICE 1.0.0 Ontology

We chose to use the NICE Framework 1.0.0, released in 2024. CyberTalent Bridge previously used the NICE Framework published in 2017. Thanks to built-in ontology reasoning in CyberTalent Bridge, the new NICE components are easily compatible with older data.

The latest version is available as an OLIR JSON document, along with a mapping of the new framework to the previous 2017 version. Whatever the motivation for creating OLIR (or OSCAL), it certainly wasn’t to facilitate practical software development. With significant effort, we transformed the NICE Framework OLIR to OWL, with which we updated our NICE ontology to include both 2017 and 1.0.0 versions, including linkages between the new and old components. Using an owl:sameAs linkage, our ontology reasoner was able to apply the new components to all CyberTalent Passports that were previously created with the old framework. Please contact me if you are interested in using our unified NICE ontology, available in JSON-LD.

The following image shows a snippet from a CyberTalent Passport with the work role Cybersecurity Architecture, which is inferred from the 2017 work role Security Architect via the owl:sameAs property, which is associated with a job position held by this worker. Confidence is a unique weighting factor of CyberTalent Bridge—contact me for details.

Screen shot showing a work role with confidence factor

Wrap Up

We had a great time exploring the practical ins and outs of building an effective RAG application with popular tools and models. We’ve seen firsthand that error detection and mitigation will be necessary features of most applications. For experienced application developers, this is a different sort of experience—prompt engineering is more coaxing than programming.

Our work in this project is the beginning of a widespread transformation of CyberTalent Bridge. The tools we have developed are already being incorporated into other applications that 2wav develops. We will be very active in this realm in coming months and years.

We’re ready to discuss how practical, affordable AI techniques could benefit your application. Let your ideas inspire us.