Obtaining malware samples to analyse can be difficult, depending on the type of access one has. Several free online malware sample databases exist, but visiting them one by one to check whether a specific sample is present is a tedious and time-consuming task. MalPull was created to automate the search across multiple platforms, and to download samples from whichever database contains the sample. The program’s source code and a precompiled Java Archive can be found on GitHub. The latest release is also available on GitHub.
MalPull uses the APIs of MalShare, Malware Bazaar, Koodous, VirusTotal, and Triage to search for a sample based on a given MD5, SHA-1, or SHA-256 hash. MalShare, Koodous, and Triage require an API key that can be obtained by creating a free account. The API key of VirusTotal is only usable if one has a paid account. Note that not all services have to be used when using MalPull.
To optimally use the query limits one has with these services, MalPull queries free services without an API request limit first. After that, free services with an API request limit are queried, with paid services last. The order, if all services are in use, is given below.
- Malware Bazaar
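The fallback order described above can be sketched as a simple loop over an ordered collection of services, stopping at the first hit. This is an illustrative sketch, not MalPull’s actual code; the service names and lookup functions are stand-ins.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch: query the services in a fixed priority order and
// stop at the first one that has the sample.
public class OrderedLookup {
    public static Optional<String> findSample(
            String hash, Map<String, Function<String, Optional<byte[]>>> services) {
        for (Map.Entry<String, Function<String, Optional<byte[]>>> e : services.entrySet()) {
            if (e.getValue().apply(hash).isPresent()) {
                return Optional.of(e.getKey()); // the service that holds the sample
            }
        }
        return Optional.empty(); // the hash ends up on the "missing" list
    }

    public static void main(String[] args) {
        // LinkedHashMap preserves insertion order: free services first, paid last.
        Map<String, Function<String, Optional<byte[]>>> services = new LinkedHashMap<>();
        services.put("MalwareBazaar", h -> Optional.empty());          // miss
        services.put("MalShare", h -> Optional.of(new byte[]{1, 2}));  // hit
        services.put("VirusTotal", h -> Optional.of(new byte[]{3}));   // never reached
        System.out.println(findSample("abcd1234", services).orElse("not found"));
    }
}
```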
To run the program, one needs a recent version of the Java Runtime Environment. It has been tested with OpenJRE 8, but the code is not dependent on a specific Java version. MalPull requires no further installation, as the dependencies are embedded within the JAR. The required command-line arguments provide MalPull with the API keys, hashes, and the location to save all downloads to.
The compilation for this project is done using Maven. To compile the Java code with its dependencies, one can use the command that is given below. Note that the current working directory needs to be in the MalPull folder for the exact command to work.
mvn clean compile assembly:single
After the compilation, the compiled JAR is placed inside the target folder.
To use MalPull, one has to provide four command-line arguments. The first argument is the number of threads that can be used by MalPull when downloading samples. The minimum is one, and the maximum is left up to the user. Using more threads than you have (virtual) cores is unlikely to give an advantage due to the way threads are scheduled.
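A caller who wants to follow the advice above can query the number of virtual cores at runtime and cap the requested thread count accordingly. MalPull itself leaves the maximum up to the user; the cap below is a sketch of what a wrapper could do.

```java
// Sketch: cap a user-supplied thread count at the number of virtual cores,
// since extra threads mostly add scheduling overhead for this workload.
public class ThreadCap {
    public static int effectiveThreads(int requested, int cores) {
        if (requested < 1) {
            throw new IllegalArgumentException("At least one thread is required");
        }
        return Math.min(requested, cores);
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("Using " + effectiveThreads(6, cores) + " of " + cores + " cores");
    }
}
```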
The second argument is the location to a file that contains API keys for all services, of which an example is given below.
virustotal=abcd1234
malshare=abcd1234
koodous=abcd1234
malwarebazaar=enabled
triage=abcd1234
The order of the services in this file does not matter. If you do not wish to use some of the services, simply remove them from this file. Note that Malware Bazaar does not require an API key, meaning that any value can be used. Malware Bazaar is represented this way to offer the user the option to include or exclude the service.
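Parsing such a file boils down to splitting each line on the first equals sign, where a missing line simply disables the corresponding service. The sketch below illustrates this; the class and method names are hypothetical, not MalPull’s actual parser.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of parsing the key file: one "service=key" pair per
// line; services absent from the file are treated as disabled.
public class KeyFileParser {
    public static Map<String, String> parse(List<String> lines) {
        Map<String, String> keys = new LinkedHashMap<>();
        for (String line : lines) {
            int eq = line.indexOf('=');
            if (eq > 0) { // skip malformed or empty lines
                keys.put(line.substring(0, eq).trim(), line.substring(eq + 1).trim());
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        Map<String, String> keys = parse(List.of("malshare=abcd1234", "malwarebazaar=enabled"));
        // A service that is not listed in the file is simply not queried.
        System.out.println(keys.containsKey("virustotal"));
    }
}
```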
The third argument is the location of a file that contains the hashes, one per line. The hashes are deduplicated by MalPull, meaning duplicate entries are only downloaded once in total.
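The newline-separated format and the deduplication can be illustrated with an order-preserving set, so each hash triggers at most one download. This is a sketch under that assumption, not MalPull’s actual reader.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Sketch: read hashes one per line and deduplicate them while keeping the
// original order, so each sample is only downloaded once.
public class HashList {
    public static Set<String> dedupe(List<String> lines) {
        Set<String> hashes = new LinkedHashSet<>();
        for (String line : lines) {
            String hash = line.trim().toLowerCase();
            if (!hash.isEmpty()) {
                hashes.add(hash); // a LinkedHashSet silently drops duplicates
            }
        }
        return hashes;
    }

    public static void main(String[] args) {
        System.out.println(dedupe(List.of("ABCD1234", "abcd1234", "", "ffff0000")).size());
    }
}
```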
The fourth argument is the folder to store all downloaded samples in. The file name of each sample is equal to the file’s hash, as given in the list of hashes. If a file with the same name in the given location already exists, it is overwritten without warning. If the output folder does not exist, it is created, including any of the missing parent directories.
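The behaviour for the output folder (create missing parents, overwrite silently) maps directly onto `java.nio.file`. The sketch below shows one way to implement it; the names are illustrative, not MalPull’s actual code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch: create the output folder (including any missing parent
// directories) and write a sample under its hash as the file name,
// overwriting any existing file without warning.
public class OutputWriter {
    public static Path writeSample(Path outputDir, String hash, byte[] data) throws IOException {
        Files.createDirectories(outputDir); // no-op if the folder already exists
        Path target = outputDir.resolve(hash);
        return Files.write(target, data);   // Files.write truncates existing files
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("malpull-demo").resolve("nested").resolve("output");
        Path written = writeSample(dir, "abcd1234", new byte[]{0x4d, 0x5a});
        System.out.println(Files.size(written));
    }
}
```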
An example of how to run MalPull is given below.
java -jar /path/to/MalPull.jar 6 ~/Downloads/malpull_test/keys.txt ~/Downloads/malpull_test/hashes.txt ~/Downloads/malpull_test/output/
Hashes that cannot be found on any of the services are printed once all hashes have been iterated through.
The modularisation of MalPull
Since version 1.3-stable, MalPull consists of two main modules, both of which are located within the same project. The main module, located at malpull.MalPull, is the multi-threaded downloader, which downloads the files for the given hashes. The other module is the command-line interface, which is located at malpull.cli. This module handles the interpretation of the given command-line arguments, and the file handling for the API keys and hashes files.
One can use the main module in any project, be it to create a new command-line or graphical user interface, or to use the downloader functionality within a project as an object. To do so, create an instance of malpull.MalPull, which contains the logic for the downloader, and can log its output to any given PrintWriter. To easily use the standard output, one can use System.out as the writer, or one can create a custom object to log the output to a different stream or file.
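The PrintWriter pattern described above can be illustrated with a self-contained sketch: the downloader logs to whatever writer the caller supplies, be it standard output or an in-memory buffer. The class and method names below are hypothetical stand-ins, not MalPull’s actual API.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Illustrative sketch of logging to a caller-supplied PrintWriter. Real
// logic would query the services; here only the logging pattern is shown.
public class DownloaderSketch {
    private final PrintWriter writer;

    public DownloaderSketch(PrintWriter writer) {
        this.writer = writer;
    }

    public void download(String hash) {
        writer.println("Downloading " + hash);
        writer.flush();
    }

    public static void main(String[] args) {
        // Log to standard output...
        new DownloaderSketch(new PrintWriter(System.out)).download("abcd1234");
        // ...or capture the log output in memory instead.
        StringWriter buffer = new StringWriter();
        new DownloaderSketch(new PrintWriter(buffer)).download("abcd1234");
        System.out.print(buffer);
    }
}
```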
Below, a list of planned updates is given in no particular order:
- Optionally save the sample in a password protected ZIP archive
- Add more malware database repositories (such as URLHaus, Hybrid-Analysis, or AnyRun)
In the list below, all changes are kept together with the release date of the given version.
List of features
- Changed the project’s internals into a modular structure. The multi-threaded downloader module is now separate from the command-line interface. As such, one can easily use MalPull with a different interface (be it command-line based or a graphical user interface), or it can be used within a different project. More information can be found in the linked patch notes.
List of features
- Moved Triage to be queried as the first service, rather than the last.
- Added Triage support.
- Minor fixes (removed spelling mistakes in the JavaDoc, added missing JavaDoc entries, fixed the incorrect download count)
List of features
- Added VirusTotal support.
- Added multi-threading support to download multiple samples at the same time. The maximum thread count is configurable as a command-line setting.
- The input now requires a file that contains all hashes that are to be downloaded, separated by a newline. The command-line requires an argument that specifies the location of the input file.
- The API keys are stored in a separate file, allowing for a more efficient use of the command-line arguments.
- If a hash cannot be found on any of the enabled services, it is added to a list of missing hashes. This list is printed once all samples have been downloaded.
- The total time spent downloading all samples is given once all samples have been downloaded.
- Duplicate entries in the download list are filtered prior to downloading, thus avoiding double API queries that would impact the query limit of any of the used services.
- The output folder, to which each file is written, is given via the command-line interface. File names of samples are based upon the hash in the list of samples that are to be downloaded. Existing files are overwritten without warning.