Ghidra Tip 0x02: BSim

This article is based on the public release of Ghidra 11.0.1.

In 2023, just before Christmas, the NSA released a new feature for Ghidra called BSim. This feature is best explained by stating the feature’s name in full: Behavior Similarity. The comparison of functions is useful for a variety of purposes, such as but not limited to malware analysis and vulnerability research. This article will focus on BSim’s purpose, advantages, background, and usage.

Purpose

When reversing binaries, one might encounter the same code multiple times. This can be due to the use of the same library code within different files, or because one is looking at an update of one file while comparing it to a previous version.

The version comparison can occur when patch diffing files to find out what changed, thus showing the fixes to a vulnerability. It can also be useful when tracking a malware family of which a previously analysed version is available, and one would like to look into the changes in the updated version.

A side-by-side comparison of the assembly instructions of two functions, and the automation thereof, is useful, but it makes it hard to compare files compiled for different architectures. BSim works by creating feature vectors from the decompiler's high P-code. P-code is Ghidra's internal intermediate language, used for all supported architectures.

As such, it is possible to compare code of different architectures, which the Ghidra team also kept in mind when developing BSim. When creating a database, one can specify nosize for the database type, which ensures that the size of varnodes (used within P-code) of four bytes and larger is not used in BSim's features.

To illustrate, while cutting corners in various places: an integer on a 32-bit architecture is 4 bytes in size, while an integer on a 64-bit architecture is 8 bytes. As such, comparing two integers of different architectures would be difficult if the size played a (significant) role.
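The effect of ignoring sizes can be shown with a toy sketch. The operation tuples and the feature helper below are hypothetical and do not mirror BSim's real feature extraction. They only show why collapsing all sizes of four bytes and larger lets the 32-bit and 64-bit variant of the same operation end up with the same feature.

```python
def feature(opcode, operand_sizes, nosize=False):
    """Turn one simplified operation into a comparable feature (hypothetical helper)."""
    if nosize:
        # Collapse every size of four bytes or more into a single bucket,
        # while smaller sizes (1 and 2 bytes) stay distinguishable.
        operand_sizes = [min(size, 4) for size in operand_sizes]
    return (opcode, tuple(operand_sizes))


# The same integer addition, as it might look for a 32-bit and a 64-bit target.
add_32 = ("INT_ADD", [4, 4, 4])
add_64 = ("INT_ADD", [8, 8, 8])

sized_match = feature(*add_32) == feature(*add_64)
nosize_match = feature(*add_32, nosize=True) == feature(*add_64, nosize=True)

print("match with sizes:", sized_match)
print("match without sizes:", nosize_match)
```

With sizes included, the two additions yield different features. Once the sizes are collapsed, they are identical.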

Advantages

While this feature is not unique per se, given the existence of other open-source tools such as BinDiff and Diaphora, it does excel at several key features. Listed below are several advantages, in no particular order.

Firstly, it is developed by the Ghidra team, letting it work optimally with said tool, and ensuring it is kept up to date over time. Given that it is part of the framework, one can utilise this feature when writing scripts, making it easier to integrate in custom use-cases.

Secondly, BSim scales excellently. There is no shortage of files that one would want to include when searching for functions similar in behaviour to one or more function(s) of interest. As such, there is the ever present tradeoff between query speed and the size of the database(s).

The query speed depends on the used hardware, but also on the chosen type of BSim database, as some can work in a cluster while others cannot. The file size of the BSim database can grow to tens of gigabytes, if not more, if desired. The Ghidra projects from which the signatures are generated take up space as well, and they are required to compare a given function with one from the sample at hand.

Thirdly, one can generate BSim databases ahead of time, meaning that the lookup time is significantly lower when compared to generating the signatures of all potentially interesting files when executing a query. This saves time, and allows one to share BSim databases.

Lastly, BSim’s signatures are based on the decompiler’s high P-code. As a result, any architecture that is supported by Ghidra can be used to find code overlap in. Additionally, if one were to add support for a new architecture to Ghidra, BSim can immediately be used with it, as the new architecture's code is represented as P-code as well.

Background

Each BSim signature contains vectors, which are compared with another signature. The outcome of this comparison consists of two variables: the similarity and the significance or confidence.

The similarity is a value between zero and one and can be interpreted as a percentage of the similarity between two functions, where a value of 1 is an exact match. Ghidra’s documentation states “the similarity of a match is the cosine of the angle between the vectors”.
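The cosine of the angle between two vectors can be computed as sketched below. The feature names and weights are made up for illustration and the vectors are stored as plain dictionaries. This is an assumption for the sake of the example, not BSim's internal representation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (feature -> weight)."""
    dot = sum(weight * b.get(name, 0.0) for name, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # An empty vector has no direction to compare.
    return dot / (norm_a * norm_b)


# Hypothetical feature vectors for three functions.
original = {"f1": 1.0, "f2": 1.0, "f3": 1.0, "f4": 1.0}
patched = {"f1": 1.0, "f2": 1.0, "f3": 1.0, "f5": 1.0}  # one feature changed
unrelated = {"f6": 1.0, "f7": 1.0}

print(cosine_similarity(original, original))
print(cosine_similarity(original, patched))
print(cosine_similarity(original, unrelated))
```

Identical vectors score 1, vectors without any shared feature score 0, and the patched function, which shares three of its four features with the original, lands in between.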

The significance or confidence states how significant a given match is, regardless of how similar the two functions are. To use the example from Ghidra’s documentation: a small function which simply returns a constant value might occur in multiple files, but due to its small size, it is unlikely to be significant. Having said that, large functions which have a lot of overlap are more significant, simply because they contain more code.
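The difference between the two values can be made tangible with a stand-in score. The scoring below is an assumption made purely for illustration (the summed weight of the features that both functions share) and is not the formula BSim uses. The feature names are hypothetical too.

```python
def toy_significance(a, b):
    """Stand-in significance: summed weight of shared features (assumption, not BSim's formula)."""
    shared = set(a) & set(b)
    return sum(min(a[name], b[name]) for name in shared)


# A tiny function that only returns a constant: an exact match with itself,
# yet there is very little code to base that match on.
tiny = {"return_const": 1.0}

# Two large functions which overlap in 35 of their 40 features.
large_a = {f"f{i}": 1.0 for i in range(40)}
large_b = {f"f{i}": 1.0 for i in range(5, 45)}

tiny_score = toy_significance(tiny, tiny)
large_score = toy_significance(large_a, large_b)

print("tiny exact match:", tiny_score)
print("large partial match:", large_score)
```

Even though the tiny function matches exactly and the large functions only partially, the large pair ends up with the far higher score, which is the intuition behind treating similarity and significance as separate values.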

Databases

As mentioned before, BSim can be used with several database back-ends: H2, PostgreSQL, and Elastic. The latter two are to be set up as servers which need to be connected to, while the H2 database is a file. Depending on your operating system, hardware specifications, and preferred set-up, you might be able to run the database servers on your analysis machine.

The advantage of the H2 database is that it functions on any operating system which is supported by Ghidra. It is, however, not indexed. As such, too many entries within the database will slow it down significantly, although this depends on the minimum thresholds for the similarity and significance. The higher these values, the more exact the match needs to be, thus yielding fewer results.

Additionally, one can set the maximum number of function matches to be gathered from the database per function that is searched for. The higher this number, the more potential the query has to slow down, depending on how many entries the database contains. The Ghidra team elaborated on several of my questions in a GitHub issue.
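How the two thresholds and the maximum number of matches narrow down a result set can be sketched as follows. The Match class, the select_matches helper, the candidate names, and the ordering of the results are all hypothetical. The linear scan merely mimics a lookup without an index and does not reflect how BSim queries its back-ends.

```python
from dataclasses import dataclass


@dataclass
class Match:
    name: str
    similarity: float
    significance: float


def select_matches(candidates, min_similarity, min_significance, max_matches):
    """Keep candidates above both thresholds, best first, capped at max_matches."""
    kept = [
        m for m in candidates
        if m.similarity >= min_similarity and m.significance >= min_significance
    ]
    kept.sort(key=lambda m: (m.similarity, m.significance), reverse=True)
    return kept[:max_matches]


# Made-up candidate matches for a single queried function.
candidates = [
    Match("memcpy_like", 0.98, 25.0),
    Match("parse_header", 0.91, 40.2),
    Match("return_zero", 1.00, 1.5),   # exact, but insignificant
    Match("log_error", 0.72, 18.0),    # significant, but not similar enough
    Match("init_table", 0.88, 30.0),
]

selected = select_matches(candidates, min_similarity=0.85,
                          min_significance=10.0, max_matches=2)
for match in selected:
    print(match.name, match.similarity, match.significance)
```

Raising either threshold shrinks the list of kept candidates, while raising the maximum number of matches means more rows have to be gathered and returned per queried function.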

The other two databases are indexed, and can be set up in a cluster for high availability, as is explained in the documentation of both PostgreSQL and Elastic. The indexing allows for more optimised query handling, and the high availability cluster allows one to utilise multiple servers to handle queries, which avoids overloading a given server. This might be too much for a given set-up, but for those who need it, the scalability of this feature is tremendously helpful.

Usage

One can use BSim via Ghidra’s graphical user interface (GUI), via scripts, and via Ghidra’s headless execution. Any of those options can be used to create an H2 database, populate it with signatures, and scan a program based on the signatures within said database. The BSim tutorial that is included in Ghidra’s documentation provides hands-on steps to follow. No tutorial is included here, given that BSim is bound to change over time, and the NSA’s documentation is kept up to date with pushed changes.


To contact me, you can e-mail me at [info][at][maxkersten][dot][nl], or DM me on BlueSky @maxkersten.nl.