Exploring German LLM Expert Bots

Author

Moritz Gueck

Published

December 11, 2023

In this post we explore how to provide a large language model with context information (“grounding”) and have it answer questions about that information in German. We will then compare how different models perform at answering those questions.

Intro

What we need:

  1. A large language model that can answer questions. -> Question Answering LLM (QA-LLM)
  2. A large language model that can map text to vectors. -> Vector-LLM
  3. A database to store the vectors. -> Vector-DB

We will be using the LangChain framework for splitting texts, grounding, and connecting to models and sources. LangChain provides a couple of benefits over the provider-specific APIs:

  • Standardised interface for different document-loaders, LLMs and vector-DBs
  • Easy integration of different LLMs into different vector-DBs
  • A high-level API that avoids boilerplate code

Here is a good intro to the topic: langchain: Retrieval.

How to do it:

  1. Create a grounding database for your QA-LLM.
    1. Gather the information you want your QA-LLM to answer questions about.
    2. Cut the data into snippets that are small enough to be processed by your QA-LLM.
    3. Map each snippet to a vector using your Vector-LLM. The vector represents the meaning/topic of the snippet.
    4. Store the vectors in a database.
  2. Answer questions using the grounding database.
    1. Given a question, map it to a vector using your Vector-LLM. Then search the database for the snippets with the most similar vectors.
    2. Prepend the retrieved snippets to your question.
    3. Feed the augmented question to your QA-LLM.

graph TB
  subgraph SB["build grounding database"]
  A(Data Sources) -->|Load| B(Text-Files)
  B -->|Chunk| C(Snippets)
  C -->|"vector-map (Vector-LLM)"| D[(Vector DB)]

  end

  subgraph SU["use grounding database"]
  D -->|retrieve| E(Relevant snippets)
  E -->|insert| F(Augmented Prompt)
  F -->|"query (QA-LLM)"| G(Result)
  end

  style SB fill:#F2F2F2
  style SU fill:#F2F2F2
  linkStyle 0,1,2,3,4,5 stroke:#BFBFBF 

Show code: Base libraries to import
# Importing libraries
import os
import urllib.request, urllib.error, urllib.parse
import random
import textwrap
Show code: Print results more nicely
def nice_print(text):
    print(textwrap.fill(text, 120))

1. Creating a grounding database

1.1. Gather the information

In this step we will load the data that we want our model to answer questions about. We will use data from the public German-language website of Helsana, the largest health insurer in Switzerland.

from urllib.parse import urlparse
# Required functions for loading the data:
def download_webpage(url):
    response = urllib.request.urlopen(url)
    webContent = response.read().decode("UTF-8")

    os.makedirs("data/html", exist_ok=True)  # make sure the target folder exists
    file_path = "data/html/" + get_page_name(url)
    with open(file_path, "w") as f:  # context manager closes the file properly
        f.write(webContent)

def get_page_name(url):
    parsed_url = urlparse(url)
    page_name = parsed_url.path.split("/")[-1]
    return page_name
Show code: Web-pages to crawl
# List of web-pages where we will find the information for our knowledge base
urls = [
    "https://www.helsana.ch/de/private/versicherungen/grundversicherung.html",
    "https://www.helsana.ch/de/private/versicherungen/grundversicherung/basis.html",
    "https://www.helsana.ch/de/private/versicherungen/grundversicherung/benefit-plus-hausarzt.html",
    "https://www.helsana.ch/de/private/versicherungen/grundversicherung/benefit-plus-telmed.html",
    "https://www.helsana.ch/de/private/versicherungen/grundversicherung/benefit-plus-flexmed.html",
    "https://www.helsana.ch/de/private/versicherungen/grundversicherung/premed-24.html",
    "https://www.helsana.ch/de/private/versicherungen/zusatzversicherungen.html",
    "https://www.helsana.ch/de/private/versicherungen/zusatzversicherungen/ambulant.html",
    "https://www.helsana.ch/de/private/versicherungen/grundversicherung/uebersicht-grundversicherungen.html",
    "https://www.helsana.ch/de/private/versicherungen/spezialversicherungen.html",
    "https://www.helsana.ch/de/private/versicherungen/zusatzversicherungen/ambulant/leistungsuebersicht.html",
    "https://www.helsana.ch/de/private/versicherungen/zusatzversicherungen/ambulant/top.html",
    "https://www.helsana.ch/de/private/versicherungen/zusatzversicherungen/ambulant/sana.html",
    "https://www.helsana.ch/de/private/versicherungen/zusatzversicherungen/ambulant/completa.html",
    "https://www.helsana.ch/de/private/versicherungen/zusatzversicherungen/ambulant/world.html",
    "https://www.helsana.ch/de/private/versicherungen/zusatzversicherungen/ambulant/primeo.html",
    "https://www.helsana.ch/de/private/versicherungen/zusatzversicherungen/spitalversicherung.html",
    "https://www.helsana.ch/de/private/versicherungen/zusatzversicherungen/spitalversicherung/hospital-eco.html",
    "https://www.helsana.ch/de/private/versicherungen/zusatzversicherungen/spitalversicherung/hospital-halbprivat.html",
    "https://www.helsana.ch/de/private/versicherungen/zusatzversicherungen/spitalversicherung/hospital-privat.html",
    "https://www.helsana.ch/de/private/versicherungen/zusatzversicherungen/spitalversicherung/hospital-flex.html",
    "https://www.helsana.ch/de/private/versicherungen/zusatzversicherungen/zahnversicherung.html",
    "https://www.helsana.ch/de/private/versicherungen/zusatzversicherungen/weitere/advocare-plus.html",
    "https://www.helsana.ch/de/private/versicherungen/zusatzversicherungen/weitere/advocare-extra.html",
]
# Downloading 
for url in urls:
    download_webpage(url)

1.2. Cut the data into snippets

Now we need to cut the page contents into snippets that are small enough to be processed by our QA-LLM but large enough to contain the relevant context.

1.2.1. Minimalist approach (not used)

This is probably the simplest way to cut the webpages into snippets.

from langchain.text_splitter import RecursiveCharacterTextSplitter
from bs4 import BeautifulSoup
website_texts = []
for html_document_path in os.listdir("data/html"):
    soup = BeautifulSoup(
        open("./data/html/" + html_document_path), features="html.parser"
    )

    website_text = soup.get_text()
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=800,
        chunk_overlap=50,
        length_function=len,
        add_start_index=True,
    )
    page_texts = text_splitter.create_documents([website_text])
    website_texts = website_texts + page_texts
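
You can inspect a few of the resulting chunks to see what this splitter produces (each chunk is a langchain Document):

# Inspect a few random chunks from the minimalist splitter:
for doc in random.sample(website_texts, 3):
    nice_print(doc.page_content)
    print()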

Regrettably, the Vector-LLM had a hard time mapping the snippets to meaningful vectors, and the QA-LLM was not able to answer questions based on them. The likely reason is that the snippets often started in the middle of paragraphs and lacked the paragraphs' titles. As a result, the QA-LLM answered questions using snippets from the wrong topics.

Therefore, I did not use this approach, but used approach 1.2.2. instead.

1.2.2. Subsection based splitting (used)

To capture the meaning and context more clearly, we prepend the title and subtitles of the relevant section to each snippet.

# This algorithm might be stupid but it works. :-D
def parse_website_texts(soup):
    """Cut website texts into snippets.

    Args:
        soup (bs4.BeautifulSoup): Soup object from the Beautiful Soup web scraper.
         It contains the different elements of the html-code of the website.

    Returns:
        list(str): List of strings.
         Each string contains the title, subtitles and text of a snippet.
    """
    level_dict = {
        "h1": 0,
        "h2": 1,
        "h3": 2,
        "p": 3,
        "li": 3,
    }  # hierarchical level from html tags
    element_list = soup.find_all(["h1", "h2", "h3", "p", "li"])
    prev_level = 9999
    webpage_snippet_list = []
    snippet_texts_list = []
    # Pattern of each snippet: h1-title, h2-subtitle, h3-subsubtitle, paragraph-text:
    base_element_list = ["", "", "", ""]

    def save_snippet(texts):
        # Join the collected texts of one snippet into a single string.
        snippet = " ".join(texts).replace("\n", " ").strip()
        if snippet:
            webpage_snippet_list.append(snippet)

    for element in element_list:
        current_level = level_dict[element.name]

        if current_level < prev_level:  # i.e. we are at a new topic
            # save the previous snippet as one string
            save_snippet(snippet_texts_list)
            # clear the headings at and below the current level ...
            for level in range(current_level, len(base_element_list)):
                base_element_list[level] = ""
            # ... and store the new heading at its level
            base_element_list[current_level] = element.text
            # copy (not alias) so later heading updates don't alter this snippet
            snippet_texts_list = base_element_list.copy()
        else:
            # same or deeper level: append the element's text to the current snippet
            snippet_texts_list = snippet_texts_list + [element.text]
        prev_level = current_level
    save_snippet(snippet_texts_list)  # don't drop the last snippet of the page
    return webpage_snippet_list
from bs4 import BeautifulSoup

website_texts = []
for html_document_path in os.listdir("data/html"):
    soup = BeautifulSoup(
        open("./data/html/" + html_document_path), features="html.parser"
    )
    website_texts_page = parse_website_texts(soup)

    website_texts = website_texts + website_texts_page

Here is an example of the snippets (heading + subheadings + paragraphs):

for text in random.sample(website_texts, 5):
    print(text)
Helsana Advocare PLUS Häufig gestellte Fragen    Wer kann diese Versicherung abschliessen?    Sie können die Versicherung abschliessen, wenn Sie folgende Voraussetzungen erfüllen: Sie leben in der Schweiz (offizieller Wohnsitz). Sie haben bereits eine der Zusatzversicherungen TOP, OMNIA oder COMPLETA oder beantragen diese zeitgleich mit Helsana Advocare PLUS.
SANA Weitere Zusatzversicherungen TOP  Ihr Zusatz zur Grundversicherung: Wichtige ambulante Leistungen sind gedeckt.
BeneFit PLUS Telmed    Bei gesundheitlichen Problemen rufen Sie immer zuerst das unabhängige Zentrum für Telemedizin an: 0800 800 090. Sie erhalten rund um die Uhr medizinische Unterstützung und einen attraktiven Prämienrabatt.    24/7 kostenlose, verbindliche medizinische Telefonberatung     Digitale Services, wie z. B. Symptom-Checker und Videokonsultation     Attraktiver Prämienrabatt  
BeneFit PLUS Telmed  Prämie berechnen  Ihre Prämie CHF 0 CHF 500 CHF 300 CHF 500 CHF 1000 CHF 1500 CHF 2000 CHF 2500 eingeschlossen ausgeschlossen
BeneFit PLUS Flexmed Weitere Modelle der Grundversicherung BeneFit PLUS Hausarzt  Der Hausarzt oder die HMO-Gruppenpraxis ist Ihre erste Anlaufstelle.

1.3. Map each snippet to a vector

Here we use another large language model to map each snippet to a vector. This vector represents the meaning/topic of the snippet. I used the FastEmbed model for practical reasons. However, I would recommend using a more powerful model such as this one: https://huggingface.co/sentence-transformers/all-mpnet-base-v2

from langchain.embeddings.fastembed import FastEmbedEmbeddings
# Choosing a suitable embedding model can make a big difference in retrieval performance.
# For simplicity, we use the FastEmbedEmbeddings model off-the-shelf.
embedder = FastEmbedEmbeddings() 

# Just to show a vector, we embed the first snippet:
embeddings = embedder.embed_documents([website_texts[0]])  # expects a list of texts
print(embeddings[0][0:5])  # The first 5 dimensions of the vector
[-0.013905464671552181, 0.038332026451826096, 0.01669456996023655, 0.010435071773827076, -0.01078716292977333]
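
Swapping in the recommended model would be a small change in LangChain. Here is a sketch (it assumes the sentence-transformers package is installed; I have not benchmarked it here):

from langchain.embeddings import HuggingFaceEmbeddings

# Drop-in alternative to FastEmbedEmbeddings, using the model recommended above:
embedder = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)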

1.4. Store embeddings in a vector database

We could just keep the embeddings in memory or store them in a file. It is, however, more efficient to store them in a database. I used the open-source Chroma database for this purpose.

from langchain.vectorstores import Chroma
chroma_db = Chroma.from_texts(website_texts, embedder)
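
As a quick sanity check, we can query the database for the snippets most similar to a question (the question below is just an illustrative example):

# Retrieve the two snippets whose vectors are closest to the question's vector:
docs = chroma_db.similarity_search("Was deckt die Spitalversicherung Hospital ECO ab?", k=2)
for doc in docs:
    nice_print(doc.page_content)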

2. Answering questions using the grounding database
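
The core of this step is short: embed the question, retrieve the most similar snippets from the Vector-DB, and feed them together with the question to the QA-LLM. Here is a minimal sketch using LangChain's RetrievalQA chain with GPT-3.5 as the QA-LLM; the model choice and the retrieval depth k=3 are illustrative defaults, not tuned settings:

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI  # assumes OPENAI_API_KEY is set

# Chain that retrieves the top-3 snippets and stuffs them into the prompt:
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0),
    retriever=chroma_db.as_retriever(search_kwargs={"k": 3}),
)
answer = qa_chain.run("Welche Leistungen deckt die Zusatzversicherung TOP ab?")
nice_print(answer)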

Conclusions

  • Grounding:
    • Grounding your model helps it give correct answers to very specific questions.
    • Formatting the text in a way that the model can understand it is crucial.
  • Larger models perform better:
    • GPT-3.5 is better at answering questions off-the-shelf.
    • Llama-2 70B is able to answer the questions correctly. However, it struggles to produce correct German text.
    • Mistral 7B struggles to answer the questions correctly; it would require more effort than I invested here.
  • Different models need different prompt templates.
  • With few-shot learning you can improve the quality of the responses dramatically (see the sketch below).
  • A high-level framework like LangChain makes switching models and sources easy.
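
To illustrate the prompt-template and few-shot points, here is a sketch of what a German few-shot prompt for the QA-LLM could look like (the wording is illustrative, not the exact template from the experiments):

from langchain.prompts import PromptTemplate

few_shot_template = """Beantworte die Frage auf Deutsch und nur anhand des Kontexts.

Beispiel:
Kontext: TOP deckt wichtige ambulante Leistungen ab.
Frage: Was deckt TOP ab?
Antwort: TOP deckt wichtige ambulante Leistungen ab.

Kontext: {context}
Frage: {question}
Antwort:"""

prompt = PromptTemplate(
    template=few_shot_template, input_variables=["context", "question"]
)

Such a template can then be passed to the RetrievalQA chain via chain_type_kwargs={"prompt": prompt}.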

Outlook

  • New Models:
    • Just today Mistral released a new multilingual model with very impressive reported performance. Its size is between Mistral 7B and Llama-2 70B. This might solve our problems with Mistral: Mistral: Mixtral of experts
    • There are Llama models that have been pretrained on German texts and might perform better with our German prompts and responses: Huggingface: LLama2 German Assistant
  • Fine-tuning models: Since the models from Mistral are substantially smaller, they can more easily be retrained (fine-tuned) to achieve higher accuracy. Furthermore, parameter tuning could help reduce the model's hallucinations (i.e. making things up).
  • Benchmarking: To properly evaluate the performance of the different approaches (Vector-LLM and QA-LLM), a larger test set and defined requirements for correct answers would be needed.

Acknowledgments

A special thank you to Moritz Settele and Koen Tersago from morrow ventures for their helpful feedback on this blog post.