
Top 5 Large Language Models and How to Use Them Effectively

Large language models are not equally suited to all tasks in generative AI.

Modern Large Language Models (LLMs) are pre-trained on a large corpus of self-supervised textual data, then tuned to human preferences via techniques such as reinforcement learning from human feedback (RLHF).

LLMs have seen rapid advances over the last decade, particularly since the introduction of the transformer architecture in 2017 and the first generative pre-trained transformers (GPTs) the following year. Google’s BERT, introduced in 2018, represented a significant advance in capability and architecture, and was followed by OpenAI’s release of GPT-3 in 2020 and GPT-4 in 2023.

While open sourcing AI models is controversial given the potential for widespread abuse—from generating spam and disinformation, to misuse in synthetic biology—we have seen a number of open source alternatives that can be cheaper than, and competitive with, their proprietary counterparts.

Use Cases for LLMs

Given how new this all is, we’re still getting to grips with what may or may not be possible with the technology. But the capabilities of LLMs are undoubtedly interesting, with a wide range of potential applications in business. These include customer support chatbots, code generation for developers and business users, audio transcription and summarization, paraphrasing, translation, and content generation.

You can imagine, for example, that customer meetings could be both transcribed and summarized by a suitably-trained LLM in near real time, with the results shared with the sales, marketing and product teams. Or an organization’s web pages might automatically be translated into different languages. In both cases, the results would be imperfect but could be quickly reviewed and fixed by a human reviewer.

In a coding context, many of the popular integrated development environments (IDEs) now support some level of AI-powered code completion, with GitHub Copilot, Sourcegraph, and Amazon CodeWhisperer among the leading examples in enterprises. Other related applications, such as natural language database querying, also show promise. LLMs might also be able to generate developer documentation from source code.

LLMs could prove useful when working with other forms of unstructured data in some industries. “In wealth management,” Madhukar Kumar, CMO of SingleStore, a relational database company, told the New Stack, “we are working with customers who have a huge amount of unstructured and structured data, such as legal documents stored in PDFs and user details in database tables, and we want to be able to query them in plain English using a Large Language Model.”

SingleStore is seeing clients using LLMs to perform both deterministic and non-deterministic querying at the same time.

“In wealth management, I might want to say, ‘Show me the income statements of everyone aged 45 to 55 who recently quit their job,’ because I think they are right for my 401(k) product,” Kumar said.

“This requires both database querying via SQL, and the ability to work with that corpus of unstructured PDF data. This is the sort of use case we are seeing a lot.”
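The hybrid pattern Kumar describes, a deterministic SQL filter combined with a semantic search over unstructured documents, can be sketched in a few lines. Everything below (the client records, the term-overlap scoring in place of a real embedding search) is illustrative, not SingleStore’s actual API:

```python
import sqlite3

# Deterministic side: structured client records in a SQL table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clients (id INTEGER, name TEXT, age INTEGER, employed INTEGER)")
conn.executemany("INSERT INTO clients VALUES (?, ?, ?, ?)", [
    (1, "Ana", 48, 0),  # 48, recently left her job
    (2, "Bo", 30, 1),
    (3, "Cy", 52, 0),   # 52, recently left his job
])

# Unstructured side: text extracted from PDFs, keyed by client id.
documents = {
    1: "Income statement: salary ceased in March, severance paid.",
    2: "Income statement: steady salary, no changes.",
    3: "Income statement: final paycheck issued, pension rollover pending.",
}

def hybrid_query(min_age, max_age, terms):
    """Run the deterministic SQL filter first, then rank the surviving clients'
    documents by naive term overlap (a toy stand-in for vector similarity)."""
    rows = conn.execute(
        "SELECT id, name FROM clients WHERE age BETWEEN ? AND ? AND employed = 0",
        (min_age, max_age),
    ).fetchall()
    scored = [
        (sum(t in documents[cid].lower() for t in terms), cid, name)
        for cid, name in rows
    ]
    return [name for score, cid, name in sorted(scored, reverse=True)]

print(hybrid_query(45, 55, ["severance", "salary"]))
```

A production system would swap the term-overlap scoring for embeddings and a vector index, but the shape of the query, structured predicate first, semantic ranking second, is the same.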

An emerging application of AI is for agentic systems. “We’re seeing a number of new AI companies amongst our customers who are looking to make their data immediately available to build agentic systems,” Kumar told us. “In cybersecurity, for example, you might take several live video feeds and give that to an AI to make decisions very quickly.”

Large language models have been applied to areas such as sentiment analysis. This can be useful for organizations looking to gather data and feedback to improve customer satisfaction. Sentiment analysis is also helpful for identifying common themes and trends in a large body of text, which may assist with both decision-making and more targeted business strategies.

As we’ve noted elsewhere, one significant challenge with using LLMs is that they make things up. For example, the winning solution in a Retrieval-Augmented Generation (RAG) benchmarking competition organized by Meta was wrong about half the time. These findings are similar to those from NewsGuard, a rating system for news and information websites, which showed that the 10 leading chatbots made false claims 40% of the time and gave no answer to 22% of questions. RAG and a variety of other techniques can help, but eliminating errors completely looks to be impossible. In view of this, LLMs should not be used without human review in any situation where accuracy matters.
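RAG itself is conceptually simple: retrieve the passages most relevant to a question, then ground the prompt in them so the model answers from retrieved text rather than from memory. A minimal sketch, using naive word-overlap retrieval as a stand-in for a real embedding-based vector search:

```python
def retrieve(question, passages, k=2):
    """Rank passages by how many words they share with the question
    (a toy proxy for embedding similarity) and return the top k."""
    q_words = set(question.lower().split())
    scored = sorted(
        passages,
        key=lambda p: len(q_words & set(p.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(question, passages):
    """Ground the prompt: instruct the model to answer only from the context."""
    context = "\n".join(f"- {p}" for p in retrieve(question, passages))
    return (
        "Answer using ONLY the context below. If the answer is not there, say so.\n"
        f"Context:\n{context}\nQuestion: {question}"
    )

passages = [
    "The refund window is 30 days from delivery.",
    "Shipping to the EU takes 5 business days.",
    "Gift cards are not eligible for refund.",
]
print(build_prompt("What is the refund window?", passages))
```

The “answer only from the context” instruction is what reduces (but, as the benchmark results show, does not eliminate) fabrication.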

Training an LLM from scratch remains a major undertaking, so it makes more sense to build on top of an existing model where possible. We should also note that the environmental costs of both training and running an LLM are considerable; because of this we recommend only using an LLM where there isn’t a smaller, cheaper alternative. We would also encourage you to ask the vendor or OSS project to disclose their figures for training and running the model, though at the time of writing this information is increasingly hard to obtain. 

With Kumar’s help, we’ve compiled a list of what we think are the five most important LLMs at the moment. If you are looking to explore potential uses for LLMs yourself, these are the ones we think you should consider.

The Top 5 LLMs

Best ‘Reasoning’ Models

Reasoning models produce responses incrementally, simulating to a certain extent how humans grapple with problems or ideas.

OpenAI o3-mini-high

OpenAI’s o3-mini-high has been fine-tuned for STEM problems, specifically programming, math and science. As such, and with all the usual caveats that apply to benchmarking, it currently scores highest on the GPQA benchmark commonly used for comparing reasoning performance.

Developers can choose between three reasoning effort options—low, medium and high—to optimize for their specific use cases. This flexibility allows o3-mini to ‘think harder’ when tackling complex challenges, or to prioritize speed when latency is a concern. It is also OpenAI’s first small reasoning model to support function calling, structured outputs, and developer messages.
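At the API level, the effort setting is a single request parameter. The sketch below assembles such a request following the conventions of OpenAI’s Python SDK at the time of writing (parameter names may change between releases); it deliberately stops short of sending the request so it stays self-contained, but in real use you would pass the dictionary to `client.chat.completions.create(**request)`:

```python
def make_request(prompt, effort="medium"):
    """Build an o3-mini chat completion request with a given reasoning effort.
    Higher effort trades latency and cost for deeper reasoning."""
    assert effort in ("low", "medium", "high")
    return {
        "model": "o3-mini",
        "reasoning_effort": effort,
        "messages": [{"role": "user", "content": prompt}],
    }

request = make_request("Prove that the sum of two even numbers is even.", effort="high")
print(request["reasoning_effort"])
```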

OpenAI no longer discloses carbon emissions, though model size does make a difference, and claimed improvements to response times imply a lower overall carbon running cost.

DeepSeek-R1

DeepSeek’s reasoning models were, the company claims, trained on a GPU cluster a fraction of the size of those used by the major western AI labs. DeepSeek has also released a paper explaining what it did, though some of the details are sparse. The model is free to download and use under an MIT license.

R1 scores highly on the GPQA benchmark, though it is now beaten by o3-mini. DeepSeek says it has been able to do this cheaply—the researchers behind it claim it cost $6m (£4.8m) to train, a fraction of the “over $100m” alluded to by OpenAI boss, Sam Altman, when discussing GPT-4. It also uses less memory than its rivals, ultimately reducing the carbon and other associated costs for users.

DeepSeek is trained to avoid politically sensitive questions—for example, it will not give any details about the Tiananmen Square massacre on 4 June 1989. 

You don’t necessarily need to stick to the version DeepSeek provides, of course. “You could use it to distill a model like Qwen 2.5 or Llama 3.1, and it is much cheaper than OpenAI,” Kumar admitted. 

Best for Coding Tasks

Anthropic Claude 3.7 Sonnet

While speed of typing or lines of code have long since been debunked as a good measure of developer performance—and many experienced developers have expressed reservations about using AI-generated code—coding is one of the areas where GenAI appears to have early product market fit. It works well because mistakes are typically easy to spot or test for, meaning that the aforementioned accuracy problems are less of an issue.

While most developers will likely favor the code-completion system built into their IDE, such as JetBrains AI or GitHub Copilot, the current best-in-class on the HumanEval benchmark is Claude 3.5 Sonnet from Anthropic. “When it comes to coding, Claude is still the best,” Kumar told us. “I’ve personally used it for hours and hours, and there is very little debate around it.”

This proprietary model also scores well on agentic coding and tool use tasks. On TAU-bench, an agentic tool use task, it scores 69.2% in the retail domain, and 46% in the airline domain. It also scores 49% on SWE-bench Verified.

At the time of writing, Anthropic have just released Claude 3.7 Sonnet which, the vendor claims, “shows particularly strong improvements in coding and front-end web development.” Claude 3.7 Sonnet with extended thinking—letting you see Claude’s thought process alongside its response—is offered as part of a Pro plan. Anthropic also offers a GitHub integration across all Claude plans, allowing developers to connect code repositories directly to Claude.

Best General Purpose

Meta Llama 3.1 405b

Both OpenAI’s o3 and DeepSeek’s R1 models score highly as general purpose models, but we’re fans of Meta’s Llama family of open source models, which come close. Unlike DeepSeek’s models, which use a Mixture of Experts (MoE) architecture—an ensemble-style technique that scales model capacity, and thus parameter count, without a proportional increase in training or inference cost—Llama 3.1 uses a standard dense transformer architecture, a choice Meta says was made to maximize training stability.
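The MoE idea can be illustrated with a toy router: a gating function scores every expert, but only the top-k actually run for a given input, so compute per token stays roughly constant as experts are added. A deliberately tiny sketch, not any production architecture:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Toy "experts": scalar functions standing in for feed-forward blocks.
experts = [lambda x, w=w: w * x for w in (0.5, 1.0, 2.0, 4.0)]

def moe_forward(x, gate_logits, k=2):
    """Route input x to the top-k experts by gate score; only those run.
    The output is the gate-weighted sum of the selected experts' outputs."""
    gates = softmax(gate_logits)
    top = sorted(range(len(experts)), key=lambda i: gates[i], reverse=True)[:k]
    total = sum(gates[i] for i in top)  # renormalize over the selected experts
    return sum((gates[i] / total) * experts[i](x) for i in top)

# Logits strongly favor experts 2 and 3, so only those contribute.
print(moe_forward(1.0, gate_logits=[0.0, 0.0, 5.0, 5.0], k=2))
```

With four experts but k=2, only half the expert parameters are exercised per input; scaled up, this is how MoE models grow total parameter count far faster than per-token compute.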

Llama 3.1 405b scores 88.6% on the MMLU benchmark, putting it a hair’s-breadth behind the considerably more computationally expensive alternatives.

Google Gemini Flash 2.0

Google’s experimental Gemini Flash 2.0 scores lower than Llama on the MMLU benchmark, at 76.2%, but it has other capabilities that make it interesting. It supports multimodal output, such as natively generated images mixed with text, and steerable, multilingual text-to-speech (TTS) audio. It can also natively call tools like Google Search and code execution, as well as third-party user-defined functions. It is also impressively fast and has one of the largest context windows available, at 1 million tokens.

Google is also actively exploring agentic systems through Project Astra and Project Mariner, and Flash 2.0 is built with the intention of making it particularly suitable for agentic systems. 

Picking an LLM

Once you’ve drawn up a shortlist of LLMs, and identified one or two low-risk use cases to experiment with, you have the option of running multiple tests using different models to see which one works best for you—as you might do if you were evaluating an observability tool or similar.

It’s also worth considering whether you can use multiple LLMs in concert. “I think that the future will involve not just picking one, but an ensemble of LLMs that are good at different things,” Kumar told us.

Of course, none of this is particularly useful to you unless you have timely access to data. During our conversation, Kumar suggested that this was where contextual databases like SingleStore come in.

“To truly use the power of LLMs,” he said, “you need the ability to do both lexical and semantic searches, manage structured and unstructured data, handle both metadata and the vectorized data, and handle all of that in milliseconds, as you are now sitting between the end user and the LLM’s response.”

 

The post Top 5 Large Language Models and How to Use Them Effectively appeared first on The New Stack.

Published in Kubernetes, Technology
