Automatic string formatting deobfuscation

Malware that is not compiled has one property that benefits the analyst: the raw source code is always available. Malicious actors know this, and therefore obfuscate the code. This makes it harder to analyse the sample or to create a generic detection rule. Technical details regarding this case’s PowerShell sample are given below. The sample can be downloaded here. On VirusTotal, only 2 out of 57 antivirus engines detected the sample. The deobfuscated sample is detected by 11 out of 56 antivirus engines. This indicates that the obfuscation is effective at evading detection.

MD5: 907dbc3048f75bb577ff9c064f860fc5

SHA-1: 667b8fa87c31f7afa9a7b32c3d88f365a2eeab9c

SSDeep: 6144:F7EhX4jlKpvFnMt8NKKfoIEFtUVETlqds6YGTC9HIN5Tao0jCGIop1Y6aiiNelyb:pQ39oIpyK+HI+3Npi6aiiNeewudtv

File size: 368.42 KB

File length: 2763 lines

In this case, a specific type of obfuscation was used: string formatting obfuscation. First, the obfuscation is explained, together with a quick analysis of the other obfuscated parts of the script. Afterwards, a script that automatically deobfuscates the sample is created step by step.

Used obfuscation techniques

String formatting
A string can be made up of multiple other strings. By using flags (placeholders) within the string literal, the formatting becomes easier to read and write. In C, the flag %d is used for decimal integers. An example is given below in C.

int age = 100;
printf("My age equals %d\n", age);

The output of the code is given below.

My age equals 100

In C#, the type is derived from the referenced variable, as can be seen in the first example below. Note that one can still specify how a value is formatted, as is done in the second example.

//Types are derived from the variable
string s = String.Format("{0}'s age is {1}", "Libra", 100);
Console.WriteLine(s);

//The format of the second value is specified within the format string
s = String.Format("{0}'s age is {1:f}", "Libra", 100.5);
Console.WriteLine(s);

To obfuscate text using this method, one simply has to split a string and scramble the parts. These parts are assembled in the correct order at runtime, so the program still works. This is unfavourable for an analyst: the code is harder to understand, and names cannot be refactored with a simple search and replace unless the whole sample is deobfuscated first. An example of the obfuscation technique is demonstrated below.

string something = String.Format("{4}{2}{3}{1}{0}", "ion", "at", "fu", "sc", "ob");

The value of something is equal to the string obfuscation.
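The reordering can be reproduced outside of C# as well. The sketch below is a minimal Python illustration, in which the helper name apply_format is hypothetical and not part of the sample: it pulls the indices out of the format string and joins the parts in that order.

```python
import re

def apply_format(format_string, parts):
    """Rebuild a string from a "{i}{j}..." format string and its parts."""
    # Extract every index between curly brackets, in the order they appear
    indices = [int(i) for i in re.findall(r"\{(\d+)\}", format_string)]
    # Pick the parts in that order and glue them together
    return "".join(parts[i] for i in indices)

something = apply_format("{4}{2}{3}{1}{0}", ["ion", "at", "fu", "sc", "ob"])
print(something)  # obfuscation
```

This is the core idea that the automated deobfuscation later in this article relies on.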

String concatenation
A string can be made up of multiple characters or multiple strings. By adding them together, the name of a variable can be obfuscated. Below, an example is given.

string x = "ab" + "cd";
string y = x + "ef";

In the example, x equals abcd and y equals abcdef. This method also prevents refactoring of code if the name of the variable is created like this. In normal code, this is not possible, but when using reflection to invoke methods, it is.
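To illustrate why a concatenated name defeats renaming, the sketch below uses Python's getattr as a stand-in for reflection (an assumption for illustration only, the sample itself is PowerShell). The method name never appears as a single literal in the source, so a search for it finds nothing.

```python
# The method name is assembled at runtime and never written out in full
method_name = "up" + "per"
# getattr resolves the attribute by its runtime name, much like reflection does
method = getattr("libra", method_name)
result = method()
print(result)  # LIBRA
```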

Back ticks
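In PowerShell, one can use back ticks (`) within names, such as function and variable names. The back tick is the escape character, and in front of a character that has no special escape meaning it is simply ignored. The name still resolves, but a plain search and replace on it fails, which prevents refactoring.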

Regular expressions

To find matches of a pattern in text (and thus code), one can use regular expressions. Regular expressions are widely used and there are numerous online tools to aid in writing them. In this sample, the strings that are unreadable and cannot be refactored will be matched and replaced with a readable version. Regular expressions are also known as regexes.

Automated deobfuscation

To automate the deobfuscation, Python3 is used. First, the patterns that need to be matched are analysed, after which the regular expressions are written. Lastly, the matched data is replaced with the deobfuscated data in order to create a readable output.

Finding patterns
Upon inspecting the sample, the three mentioned types of string obfuscation are found. The back tick obfuscation is given below.

function iN`VokE`-r`F`BuxmE`HAEmZbhI

As can be seen above, the order of the characters is already correct, but renaming is impossible without removing or ignoring the back ticks.

The second method that is found within the sample is obfuscation based on string formatting. It is sometimes quite lengthy, as can be seen in the example below.

$I7KUHX  =[tYpe]("{7}{2}{5}{10}{13}{1}{4}{9}{14}{12}{0}{3}{11}{6}{8}"-f 'mAR','opse','yst','sh','R','eM.RuNti','T','S','E','v','ME.','ALAsATtrIbu','ces.','intEr','I') ;

The string itself equals SysteM.RuNtiME.intEropseRvIces.mARshALAsATtrIbuTE. The odd casing is irrelevant because [type] is used, which looks up a type by the given name without regard to casing. The value that [type] returns equals System.Runtime.InteropServices.MarshalAsAttribute.

The third, and last, type of obfuscation in this sample is string concatenation. By splitting the string at different positions, the same name can be written in different ways, as can be seen below.

("VaRIA"+"BLE"+":Rb"+"h0")
("Va"+"RI"+"ABL"+"E:Rbh"+"0")

Removing backticks
Since back ticks have no proper use within the script, the string replace function in Python3 can be used to remove them all.

powershellPath = 'powershellSample.txt'
powershellFile = open(powershellPath,'r')
powershellContent = powershellFile.readlines()
for line in powershellContent:
    line = line.replace("`", "")
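The loop above only changes the local variable line. A minimal sketch that keeps the cleaned lines and counts the removals is given below. The function name strip_backticks is hypothetical, and the blanket removal assumes, as stated for this sample, that no back tick carries meaning.

```python
def strip_backticks(lines):
    """Return the lines without back ticks, plus the number that was removed."""
    cleaned = []
    removed = 0
    for line in lines:
        # Count first, since the back ticks are gone after the replacement
        removed += line.count("`")
        cleaned.append(line.replace("`", ""))
    return cleaned, removed

cleaned, removed = strip_backticks(["function iN`VokE`-r`F`BuxmE`HAEmZbhI\n"])
print(cleaned[0], end="")  # function iNVokE-rFBuxmEHAEmZbhI
print(removed)             # 5
```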

Removing string formatting
The string literal (the first part) and the parts of the original string (the added variables) are always found in the same pattern. A few examples are given below.

&("{0}{2}{1}"-f'sE','-iTeM','T')
$7eq=  [tyPe]("{1}{0}" -f '32','INT')  ;
&("{1}{0}{2}"-f 'eT-','S','itEM')
$j1v  =[tyPe]("{0}{1}"-F 'Co','NveRt') ;

From these examples, multiple observations can be made:

  • Every format expression starts with an opening bracket, (, followed by a quotation mark
  • The curly brackets are used to decide the order of the string formatting
  • The format flag -f is used, as well as the capital cased version -F
  • Sometimes there is whitespace before the format flag
  • Sometimes there is whitespace after the format flag

Based on the observation above, the regex starts with quotes. Then, all numbers between curly brackets need to be matched for n-amount of times, since the length is unknown. This segment should also end with quotes. The following regex represents this case:

"((?:\{\d+\})+)"

The “(?:” and “)” indicate that the group shouldn’t be captured. The “\” is used to escape characters such as the curly brackets. The “\d” matches a single digit. To also match indices that consist of more than one digit, the “+” is used. The “+” after the closing bracket of the non-capturing group specifies that one or more occurrences of an index between curly brackets should be matched. The final part of the match (although it isn’t captured) is the quotation mark that ends the part which lists all indices.

The whole thing is enclosed by a pair of brackets, which indicate that this group should be captured. The indices, still enclosed between curly brackets, form the first capture group of the expression (located at index 0 of each result that Python3 returns). In a later step, all the numbers will be extracted from this capture group.
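As a quick check of this first part of the pattern, the sketch below runs it against one of the example lines. The constant name INDEX_LIST is only chosen for this illustration.

```python
import re

# First part of the pattern: a quoted run of "{number}" placeholders
INDEX_LIST = r'"((?:\{\d+\})+)"'

line = """$7eq=  [tyPe]("{1}{0}" -f '32','INT')  ;"""
matches = re.findall(INDEX_LIST, line)
print(matches)  # ['{1}{0}']

# A quoted string without placeholders is not matched
no_matches = re.findall(INDEX_LIST, '("VaRIA"+"BLE")')
print(no_matches)  # []
```

Because the pattern contains a single capture group, re.findall returns the captured indices without the surrounding quotation marks.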

After that, the format flag is encountered, with possible whitespace before and/or after it. The length of the whitespace is unknown. The “*” in regex is similar to the “+”, in the sense that it also repeats the previous statement. Whereas the “+” matches one or more occurrences, the “*” also matches zero occurrences.

To detect whitespace, the format flag (in both lower and upper case) and more whitespace, another regex is required. The regex is given below.

\s*-[fF]\s*

The “\s” statement matches whitespace. Together with the “*”, the whitespace can be of any length, even zero. Then the dash (-) is matched, together with any one of the characters between “[” and “]”. In this case, the character f or F is matched.
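The behaviour of this part can be verified in isolation. The sketch below, with the illustrative name FORMAT_FLAG, checks the four whitespace and casing variants that were observed in the sample, plus one flag that should not match.

```python
import re

# Optional whitespace, the format flag in either casing, optional whitespace
FORMAT_FLAG = r"\s*-[fF]\s*"

separators = ["-f", " -f ", "-F ", " -F"]
results = [bool(re.fullmatch(FORMAT_FLAG, s)) for s in separators]
print(results)  # [True, True, True, True]

# A different flag is rejected
other_flag = bool(re.fullmatch(FORMAT_FLAG, " -e "))
print(other_flag)  # False
```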

The last part of the regex matches the characters that are put between single quotes. The quote to open and close can be matched based upon their literal characters. The characters in between can be matched using the dot (.), which is used to match any character excluding line endings.

((?:'.*?',?)+)

The individual strings are not captured (using “(?:” and “)”), but the complete set is, because the “+” that matches one or more parts sits inside the capturing brackets. The “.*?” matches any character zero or more times, but as few times as possible. This is the second capture group of this regex, which resides at index 1.
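The sketch below applies this last part to the tail of one of the example lines. The name STRING_LIST is illustrative. The second step, which strips the quotes from each part, uses the same expression as the script later in this article.

```python
import re

# One or more single quoted strings, each optionally followed by a comma
STRING_LIST = r"((?:'.*?',?)+)"

text = "'sE','-iTeM','T')"
match = re.search(STRING_LIST, text)
print(match.group(1))  # 'sE','-iTeM','T'

# Split the captured list into its individual parts, without the quotes
parts = re.findall(r"'([^']+?)'", match.group(1))
print(parts)  # ['sE', '-iTeM', 'T']
```

Note that the closing bracket is not part of the match, since the pattern stops as soon as no further quoted string follows.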

The complete regex is given below.

"((?:\{\d+\})+)"\s*-[fF]\s*((?:'.*?',?)+)

Lastly, the indices and strings need to be extracted from the matches of the above mentioned regex. The two capture groups have the following output:

'{7}{2}{5}{10}{13}{1}{4}{9}{14}{12}{0}{3}{11}{6}{8}'
'mAR','opse','yst','sh','R','eM.RuNti','T','S','E','v','ME.','ALAsATtrIbu','ces.','intEr','I'
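To see both capture groups for several lines at once, the complete regex can be run against the four examples that were given earlier. This is a minimal sketch, and the name FORMAT_PATTERN is only used for this illustration.

```python
import re

FORMAT_PATTERN = r'''"((?:\{\d+\})+)"\s*-[fF]\s*((?:'.*?',?)+)'''

examples = [
    """&("{0}{2}{1}"-f'sE','-iTeM','T')""",
    """$7eq=  [tyPe]("{1}{0}" -f '32','INT')  ;""",
    """&("{1}{0}{2}"-f 'eT-','S','itEM')""",
    """$j1v  =[tyPe]("{0}{1}"-F 'Co','NveRt') ;""",
]

# Each result is a list with one tuple: (indices, strings)
all_matches = [re.findall(FORMAT_PATTERN, example) for example in examples]
for found in all_matches:
    print(found)
```

Every example yields exactly one tuple, regardless of the casing of the flag or the whitespace around it.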

To find all indices without brackets, one must match one or more numbers between curly brackets, as can be seen in the regex below.

{(\d+)}

One or more digits are captured, whilst ignoring the curly brackets. The list of indices that is returned by the regular expressions package of Python3 (referred to as re by using import re in the given Python3 code), is still seen as a list of strings. To convert the indices to integers, one can use the map function, which is also used in the code below.

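import re

for line in powershellContent:
    #Capture group 0 holds the indices, capture group 1 holds the quoted strings
    matchedLine = re.findall(r'''"((?:\{\d+\})+)"\s*-[fF]\s*((?:'.*?',?)+)''', line)
    if len(matchedLine) > 0:
        for match in matchedLine:
            #Convert the indices from strings to integers
            indices = list(map(int, re.findall(r"{(\d+)}", match[0])))
            #Extract the strings without their surrounding quotes
            strings = re.findall(r"'([^']+?)'", match[1])
            #Assemble the strings in the order that the indices dictate
            result = "".join([strings[i] for i in indices])
            #Replace the indices with the result and remove the now unused strings
            line = line.replace(match[0], result, 1)
            line = line.replace(match[1], "", 1)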
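An alternative way to perform the same replacement is sketched below. It is not the approach of the script in this article: instead of two calls to replace, it uses re.sub with a callback, so the indices, the format flag and the strings are swapped for the result in one step. The names FORMAT_PATTERN, rebuild and deobfuscate_formatting are hypothetical.

```python
import re

FORMAT_PATTERN = re.compile(r'''"((?:\{\d+\})+)"\s*-[fF]\s*((?:'.*?',?)+)''')

def rebuild(match):
    """Turn one matched format expression into a plain quoted string."""
    indices = [int(i) for i in re.findall(r"\{(\d+)\}", match.group(1))]
    strings = re.findall(r"'([^']+?)'", match.group(2))
    return '"' + "".join(strings[i] for i in indices) + '"'

def deobfuscate_formatting(line):
    # The callback is invoked once for every format expression in the line
    return FORMAT_PATTERN.sub(rebuild, line)

cleaned = deobfuscate_formatting("""$7eq=  [tyPe]("{1}{0}" -f '32','INT')  ;""")
print(cleaned)  # $7eq=  [tyPe]("INT32")  ;
```

Since the whole expression is part of the match, the format flag disappears together with it, and parameters elsewhere in the line are left alone.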

Simply replacing the -f and -F characters in every line is possible, but will corrupt some other names in the code, as can be seen in the example below.

[IntPtr]${nE`WTH`Unkr`ef} = &("{0}{6}{1}{4}{5}{3}{2}" -f 'G','Remo','s','Addres','te','Proc','et-') -RemoteProcHandle ${R`eMOt`EPR`O`Ch`ANdLe} -RemoteDllHandle ${IM`PORt`dllHa`NDLE} -FunctionNamePtr ${pRoce`D`URen`A`MEpTr} -LoadByOrdinal ${lOaDb`yo`Rd`I`NAL}

The -FunctionNamePtr parameter also starts with -F. To avoid corrupting it, one can again use a regular expression. The format flag should be removed, but only if it is not directly followed by a letter, digit or underscore. Whitespace or a bracket after the flag is not a problem.

To match the format flag, the previously created regex (-[fF]) is used. Then, the character after the match is checked, using the look ahead statement in regex. A look ahead starts with “?=” and, unlike a capture group, it only checks what follows without making it part of the match. Anything but the characters “a” through “z”, “A” through “Z”, “0” through “9” and “_” is matched using “[^\w]”. The term “^” negates the term that is used behind it. The term “\w” matches letters, digits and underscores. In this case, the look ahead therefore matches everything that is not a letter, digit or underscore. The complete regex is given below.

(-[fF])(?=[^\w])

To remove the format flag from the sample, one replaces the matches of this regex with an empty string.

            #Remove "-f" and "-F" only where no letter, digit or underscore follows
            #A plain line.replace would also strip the "-F" from "-FunctionNamePtr"
            line = re.sub(r"(-[fF])(?=[^\w])", "", line)
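The difference between a plain replacement and the look ahead can be shown on a shortened version of the line above, in the state it has after the string formatting has been resolved. The sketch is illustrative, and the variable names naive and careful are not part of the script.

```python
import re

line = '&("Get-RemoteProcAddress" -f ) -FunctionNamePtr ${pRoceDUReNAMEpTr}'

# Plain replacement: every "-f" and "-F" is removed, including parameter prefixes
naive = line.replace("-f", "").replace("-F", "")
# Look ahead: the flag is only removed when no letter, digit or underscore follows
careful = re.sub(r"-[fF](?=[^\w])", "", line)

print(naive)
print(careful)
```

The first output contains the corrupted name unctionNamePtr, whereas the second output keeps -FunctionNamePtr intact and only loses the stray format flag.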

Removing concatenated strings
A string that is created by adding multiple strings always has the same layout:

  • It starts with an opening bracket
  • It closes with a closing bracket
  • Each of the strings is enclosed between quotation marks
  • The operator to add two strings to each other is the plus (“+”) sign

At first, a check can be done to see if the character before the quotation mark is an opening bracket. The look behind statement can be used for this. The look behind statement is written as “(?<=” and “)”.

(?<=\()\"

Note that the opening bracket and the quotation mark are escaped, hence the backslashes.

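The characters after the quotation mark are checked with a look ahead statement. One or more characters that are not a closing bracket should be followed by a plus sign, after which one or more characters that are not a closing bracket should lead up to the closing bracket. This ensures that at least one plus sign is present between the opening and the closing bracket.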

(?=[^\)]+\+[^\)]+\))

Then, anything that is not a curly bracket, hyphen or closing bracket should be matched, as many times as it occurs, with a minimum of one occurrence. This is the data that is wanted, but it's not put in a separate capture group (hence the "?:"), as the strings are already in the correct order.

(?:[^\{\}\-\)])+

The quote after the string should be followed with a closing bracket to indicate the end of the variable.

\"(?=\))

The complete regex is given below.

(?<=\()\"(?=[^\)]+\+[^\)]+\))(?:[^\{\}\-\)])+\"(?=\))
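A short test of the complete regex is sketched below, using a shortened line from the sample. The name CONCAT_PATTERN is only chosen for this illustration. Since the pattern contains no capture groups, re.findall returns the full matches.

```python
import re

CONCAT_PATTERN = r'''(?<=\()\"(?=[^\)]+\+[^\)]+\))(?:[^\{\}\-\)])+\"(?=\))'''

line = '("V"+"Ari"+"Ab"+"LE:cF84") ([TYPe]("system.RUnTiME") );'
found = re.findall(CONCAT_PATTERN, line)
print(found)  # ['"V"+"Ari"+"Ab"+"LE:cF84"']
```

The second quoted string in the line is skipped, because there is no plus sign between its opening and closing bracket.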

A full match looks like this:

"V"+"Ari"+"Ab"+"LE:cF84"

In order to correctly replace the strings with the concatenated string, all quotes, plus signs and spaces should be removed. After that, a quotation mark should be put in front of and behind the value, since it remains a string. Once the value is written as a single string, it can be found with a plain search and thus refactored. The variables can also be read properly after the concatenation has been completed.

            varDeclaration = re.findall(r'''(?<=\()\"(?=[^\)]+\+[^\)]+\))(?:[^\{\}\-\)])+\"(?=\))''', line)
            for string in varDeclaration:
                #Strip the quotes, plus signs and spaces to join the parts
                variable = string.replace("\"", "")
                variable = variable.replace("+", "")
                variable = variable.replace(" ", "")
                #Wrap the joined value in quotation marks again
                variable = "\"" + variable + "\""
                #Replace this concatenation with its own joined value
                line = line.replace(string, variable)
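When a line contains more than one concatenation, each match has to be replaced by its own joined value. One way to guarantee this is sketched below, using re.sub with a callback. This is an alternative to the replace based approach, and the names CONCAT_PATTERN, join_parts and deobfuscate_concatenation are hypothetical.

```python
import re

CONCAT_PATTERN = re.compile(
    r'''(?<=\()\"(?=[^\)]+\+[^\)]+\))(?:[^\{\}\-\)])+\"(?=\))'''
)

def join_parts(match):
    """Collapse "a"+"b"+"c" into the single quoted string "abc"."""
    joined = match.group(0).replace('"', "").replace("+", "").replace(" ", "")
    return '"' + joined + '"'

def deobfuscate_concatenation(line):
    # The callback receives every concatenation separately
    return CONCAT_PATTERN.sub(join_parts, line)

line = '("VaRIA"+"BLE"+":Rb"+"h0")  ("VARI"+"ABLe:6"+"3"+"Y")'
joined_line = deobfuscate_concatenation(line)
print(joined_line)  # ("VaRIABLE:Rbh0")  ("VARIABLe:63Y")
```

As in the text above, plus signs and spaces are removed from the whole match, which assumes that the string parts themselves do not contain them.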

Lastly, all lines should be saved in a new file.

with open('deobfuscatedSample.txt', 'w') as f:
    f.write(output)

The complete Python3 script, with comments, is given below.

#Created by Max 'Libra' Kersten (@LibraAnalysis)
 
import re
 
#Define information regarding the original script's location
powershellPath = 'powershellSample.txt'
#Read all lines of the original script, the file is closed automatically afterwards
with open(powershellPath, 'r') as powershellFile:
    powershellContent = powershellFile.readlines()
 
#The variable which contains all deobfuscated lines
output = ''
#The variable which keeps track of the amount of string formats that have been replaced
formatCount = 0
#The variable which keeps track of the amount of variables that have been replaced
variableCount = 0
#The variable which keeps track of the amount of removed back ticks
backtickCount = 0
 
#Loop through the file, line by line
for line in powershellContent:
    backtickCount += line.count("`")
    #Replace the back tick with nothing to remove the needless back ticks
    line = line.replace("`", "")
    #Match the string formatting
    matchedLine = re.findall(r'''"((?:\{\d+\})+)"\s*-[fF]\s*((?:'.*?',?)+)''', line)
    #If one or more matches have been found, continue. Otherwise skip the replacement part
    if len(matchedLine) > 0:
        #Each match in each line is broken down into two parts: the indices part ("{0}{2}{1}") and the strings ("var", "ble", "ia")
        for match in matchedLine:
            #Convert all indices to integers within a list
            indices = list(map(int, re.findall(r"{(\d+)}", match[0])))
            #All strings are saved in a list
            strings = re.findall(r"'([^']+?)'", match[1])
            #The result is the correctly formatted string
            result = "".join([strings[i] for i in indices])
            #The current line is altered based on the found match, with which it is replaced
            line = line.replace(match[0], result, 1)
            line = line.replace(match[1], "", 1)
            #Remove "-f" and "-F" with a regex, so that "-f[something]" is left intact
            line = re.sub(r"(-[fF])(?=[^\w])", "", line)
            #Find all concatenated strings between brackets
            varDeclaration = re.findall(r'''(?<=\()\"(?=[^\)]+\+[^\)]+\))(?:[^\{\}\-\)])+\"(?=\))''', line)
            #For each match, the parts are joined into a single string
            for string in varDeclaration:
                variable = string.replace("\"", "")
                variable = variable.replace("+", "")
                variable = variable.replace(" ", "")
                variable = "\"" + variable + "\""
                variableCount += 1
                #Replace this concatenation with its own joined value
                line = line.replace(string, variable)
            formatCount += 1
    #When all matches are done, add the altered line to the output
    output += line
#When all lines are checked, write the output variable to a file
with open('deobfuscatedSample.txt', 'w') as f:
    f.write(output)
print("Amount of removed back ticks:")
print(backtickCount)
print("Amount of formatted strings that have been deobfuscated and concatenated:")
print(formatCount)
print("Amount of variables that have been concatenated:")
print(variableCount)
print("Total amount of modifications:")
print((backtickCount + formatCount + variableCount))

To know how many values have been altered, the count variables are used. The output of the script is given below.

Amount of removed back ticks:
8634
Amount of formatted strings that have been deobfuscated and concatenated:
1963
Amount of variables that have been concatenated:
51
Total amount of modifications:
10648

The first part of the sample previously looked like this:

$I7KUHX  =[tYpe]("{7}{2}{5}{10}{13}{1}{4}{9}{14}{12}{0}{3}{11}{6}{8}"-f 'mAR','opse','yst','sh','R','eM.RuNti','T','S','E','v','ME.','ALAsATtrIbu','ces.','intEr','I') ;
&("{0}{2}{1}"-f'sE','-iTeM','T')  
("V"+"Ari"+"Ab"+"LE:cF84") ([TYPe]("{2}{0}{1}{7}{9}{6}{10}{3}{4}{5}{8}" -F'yste','m.RU','s','es','.un','ManaGeDty','rV','nTiME.inTEroP','pe','se','iC')  );
$7eq=  [tyPe]("{1}{0}" -f '32','INT')  ;
&("{0}{1}" -f's','ET') tIAfhC  ([tyPE]("{0}{1}" -F'bO','ol')  )  ;
&("{0}{1}{2}"-f 's','ET','-VARIaBLE') 
kM5l ( [tYPE]("{0}{1}{2}"-F 'U','I','Nt32')  )  ;
$XD1h =[TYpE]("{1}{0}{2}"-f'NVE','BItco','rtEr');
&("{2}{1}{0}"-f'tem','ET-I','s') 
("VaRIA"+"BLE"+":Rb"+"h0")  ( [tYPE]("{1}{8}{6}{4}{2}{5}{9}{11}{10}{7}{12}{0}{3}" -F 'S','S','r','s','EM.','EFLecT','t','DEraCC','Ys','iOn.em','BlYbUIl','It.ASSEm','e'))  ; 
$eGj7  =  [tyPe]("{0}{1}{2}" -F 'aPPDOma','i','N');
&("{1}{0}{2}"-f 'eT-','S','itEM') 
VAriablE:tg58U ( [TYpE]("{8}{5}{4}{7}{3}{0}{2}{6}{1}" -F'n','gcOnvENtIoNS','.c','o','.REFLEC','sTeM','ALLin','ti','sy') );  
&("{0}{1}" -f 'S','eT-iTEm')  
variablE:urYi12 ( [tYPE]("{2}{3}{0}{1}" -F 'I','RONmENt','eN','V')) ;  
$9hRwNy  =  [tYpE]("{1}{0}"-f'R','uIntpt') ;
&("{0}{1}{2}" -f'SeT-i','te','m') 
("VARI"+"ABLe:6"+"3"+"Y")  ( [tyPe]("{1}{0}" -f'h','MAT') ) ;  
$MlHiT=[typE]("{5}{6}{4}{1}{2}{3}{0}"-F 'HAl','OpSe','R','vIcEs.mArs','R','syStEm.RunT','Ime.iNte');
&("{1}{0}" -f 'T','SE') 
T2NGf  ( [type]("{0}{2}{1}" -F 'IN','PTR','t')) ;
$j1v  =[tyPe]("{0}{1}"-F 'Co','NveRt') ; 
function iN`VokE`-r`F`BuxmE`HAEmZbhI

After the automated deobfuscation, it is readable:

$I7KUHX  =[tYpe]("SysteM.RuNtiME.intEropseRvIces.mARshALAsATtrIbuTE" ) ;   
&("sET-iTeM")  
("VAriAbLE:cF84") ([TYPe]("system.RUnTiME.inTEroPserViCes.unManaGeDtype" )  );  
$7eq=  [tyPe]("INT32"  )  ;    
&("sET" ) 
tIAfhC  ([tyPE]("bOol" )  )  ; 
&("sET-VARIaBLE" ) 
kM5l ( [tYPE]("UINt32" )  )  ;    
$XD1h =[TYpE]("BItcoNVErtEr");    
&("sET-Item") 
("VaRIABLE:Rbh0")  ( [tYPE]("SYstEM.rEFLecTiOn.emIt.ASSEmBlYbUIlDEraCCeSs"  ))  ; 
$eGj7  =  [tyPe]("aPPDOmaiN"  );
&("SeT-itEM" ) 
VAriablE:tg58U ( [TYpE]("sysTeM.REFLECtion.cALLingcOnvENtIoNS" ) );  
&("SeT-iTEm"  )  
variablE:urYi12 ( [tYPE]("eNVIRONmENt"  )) ;  
$9hRwNy  =  [tYpE]("uIntptR") ;
&("SeT-item" ) 
("VARIABLe:63Y")  ( [tyPe]("MATh" ) ) ;  
$MlHiT=[typE]("syStEm.RunTIme.iNteROpSeRvIcEs.mArsHAl" );
&("SET"  ) 
T2NGf  ( [type]("INtPTR"  )) ;
$j1v  =[tyPe]("CoNveRt" ) ; 
function iNVokE-rFBuxmEHAEmZbhI

This allows the analyst to analyse and refactor the script without the pesky obfuscation.


To contact me, you can e-mail me at [info][at][maxkersten][dot][nl], send me a PM on Reddit or DM me on Twitter @LibraAnalysis.