How to deobfuscate Blackhole JavaScript


This article describes my investigation of the JavaScript obfuscation currently used by the Blackhole Exploit Kit. My intent was to write some small scripts that deobfuscate the JavaScript automatically, without using JavaScript itself. Using JavaScript with a little manual help would be easy: you only have to find the point where the "eval" function is called and replace it with an alert or document.write call. But I wanted to automate the deobfuscation fully, so that a mail scanner or a proxy server can use it.

In an earlier article I investigated the LinkedIn spam and showed an example of a landing page, assure_numb_engineers.php, that I downloaded from the server. Thanks to the daily spam and some websites that list the active servers, I could get my hands on some more landing pages. Most of them looked like the example below.

In the next few paragraphs I will explain how the encoding and decoding work and develop a solution that decodes the JavaScript of a certain class of pages without deciphering the obfuscated script itself.

What the pages look like

This is a modified example that shows the general layout of the PHP page delivered by the landing server.

1 <html>
2 <head>
3 <title>
4 </title>
5 </head>
6 <body>
7 <div dqa="asd">
8 </div>
9 <script>
10 dd="div";
11 asd=function(){
12 a=a.replace(/[^012a-z3-9]/g,"");
13 };
14 ss=String.fromCharCode
15 </script>
16 <div
17 41=".." 38=".." 63=".." 50=".." 73=".." 84=".." 26=".." 66=".." 77=".." 81=".."
18 56=".." 75=".." 55=".." 17=".." 5=".." 70=".." 58=".." 21=".." 1=".." 91=".." 76=".."
19 57=".." 19=".." 31=".." 34=".." 45=".." 22=".." 15=".." 52=".." 9=".." 88=".." 27=".."
20 10=".." 90=".." 28=".." 65=".." 30=".." 53=".." 67=".." 62=".." 3=".." 54=".." 61=".."
21 39=".." 43=".." 89=".." 36=".." 85=".." 42=".." 32=".." 78=".." 0=".." 24=".." 47=".."
22 44=".." 23=".." 92=".." 29=".." 87=".." 60=".." 59=".." 49=".." 83=".." 6=".." 46=".."
23 74=".." 80=".." 51=".." 40=".." 4=".." 20=".." 13=".." 71=".." 69=".." 8=".." 33=".."
24 72=".." 11=".." 82=".." 68=".." 14=".." 35=".." 16=".." 2=".." 64=".." 25=".." 18=".."
25 12=".." 48=".." 37=".." 86=".." 79=".." 7="..">
26 </div>

27 <script>
28 if(020==0x10)a=document.getElementsByTagName(dd)[1];
29 s="";
30 for(i=0;;i++){
31 if(window.document)r=a.getAttribute(i);
32 if(r){s=s+r;}else break;
33 }
34 a=s;
35 asd();
36 s="";
37 for(i=0;i<a.length;i+=2){
38 s+=ss(parseInt(a.substr(i,2),31));
39 }
40 c=s;
41 e=window["ev"+"a"+"l"];
42 try{("321".substr+"zxc")();}catch(gdsgdsg){e(c);}
43 </script>

44 </body>
45 </html>

I replaced the long strings with two dots and inserted some line breaks, line numbers and color markings for better readability. Even with this formatting you can see that the obfuscation is a bit lousy. Better for me. Other landing pages are a little trickier and harder to decipher.

The data is stored in the attributes of a tag

Let's start with the div tag in line 16. The tag has nearly 100 attributes, each named with a number, and every attribute is assigned a very long string. The browser has no use for them, but attributes like these are an easy place to store data. The numbers appear in no particular order, which does not matter because the script reads the attributes by number. The attribute values look like this:

0="!3n3l3s3u3p^343l112i3f_3o3a3c3h26/383n38363n(1u3u3p383l@3m3c3i3h1r#131h1f1o1f..

The code looks a bit complicated to crack. Strip off the JavaScript and it would be a nice task for a codebreaker. But if the script can do it, we can do it too.

The strings contain the data for an encoded script which is passed to the eval-function in line 42. How is it decoded?

Cleaning the string from garbage

Let's start at the beginning of the script. In line 12 you see a regular expression, replace(/[^012a-z3-9]/g,""), which deletes all characters except '0' to '9' and 'a' to 'z' from a string. Guess which! When we apply this to the attribute values of the div tag, they look much prettier.

3n3l3s3u3p343l112i3f3o3a3c3h26383n38363n1u3u3p383l3m3c3i3h1r131h1f1o1f

The for loop starting at line 30 joins all attributes into one long string in numeric order (0, 1, 2, …) and stops at the first missing number. After that, the call in line 35 deletes all the unneeded characters from the resulting string.
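As an illustration only, the same cleanup can be sketched in C. The function name strip_garbage is made up for this example and is not part of the attached tools; it keeps exactly the characters that the regular expression in line 12 keeps.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy only '0'-'9' and 'a'-'z' from src to dst.
 * dst must have room for strlen(src) + 1 bytes.
 * Returns the length of the cleaned string. */
size_t strip_garbage(const char *src, char *dst)
{
    static const char keep[] = "0123456789abcdefghijklmnopqrstuvwxyz";
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    for (; *src != '\0'; src++) {
        if (strchr(keep, *src) != NULL)   /* character is a valid digit */
            dst[n++] = *src;
    }
    dst[n] = '\0';
    return n;
}
```

Applied to the sample attribute value from the previous section, it yields the cleaned string shown above.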

Decoding of the data to the script

The next for-loop starting in line 37 decodes the string.

Unobfuscated, the loop looks like this:


for ( i=0; i < a.length; i+=2 ) {
s+=String.fromCharCode(parseInt(a.substr(i,2),31));
}

This loop walks through the string from start to end, takes two characters at a time, interprets them as an integer in radix 31 and appends the character with that code to a new string.
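A minimal C sketch of the same loop, under the assumption that the input has already been cleaned. The names digit_value and decode_pairs are hypothetical. Unlike parseInt, which parses whatever valid prefix it finds, this sketch simply stops at the first pair that is not valid in the given radix.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Value of one digit character ('0'-'9', 'a'-'z'), or -1 if invalid. */
static int digit_value(char c)
{
    static const char digits[] = "0123456789abcdefghijklmnopqrstuvwxyz";
    const char *p = (c != '\0') ? strchr(digits, c) : NULL;
    return p ? (int)(p - digits) : -1;
}

/* Hypothetical helper: decode pairs of digits from src in the given radix
 * into dst (room for strlen(src) / 2 + 1 bytes is required).
 * Stops at the first invalid pair or at a value above 255.
 * Returns the number of characters written. */
size_t decode_pairs(const char *src, int radix, char *dst)
{
    size_t n = 0;

    assert(radix >= 2 && radix <= 36);
    while (src[0] != '\0' && src[1] != '\0') {
        int hi = digit_value(src[0]);
        int lo = digit_value(src[1]);
        if (hi < 0 || lo < 0 || hi >= radix || lo >= radix)
            break;
        int code = hi * radix + lo;
        if (code > 255)
            break;
        dst[n++] = (char)code;
        src += 2;
    }
    dst[n] = '\0';
    return n;
}
```

With radix 31, the first pairs of the cleaned sample string, 3n 3l 3s, decode to the codes 116, 114 and 121, which spell try.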

How could we do this without JavaScript?

First step: Extracting the data

First we must extract the n=".." strings from the page. A combination of grep and sed can do that. sort -n puts the strings into numeric order, a second sed strips the attribute names and quotes, and tr deletes the unneeded characters.

These thoughts lead us to the following script.

#!/bin/sh

# Blackhole decode Part 1

sed 's# \([0-9]\{1,2\}="[^"]*"\)#\n\1\n#g;' \
| grep '^[0-9]*=' \
| sort -n \
| sed 's#^[0-9]*="##; s#"$##;' \
| tr -dc '0-9a-z' \
| tr -d '\n'

Second step: decoding the data

The second step needs a bit more thinking. We need a script or program that does the job of parseInt. But wait a minute: what if the radix changes every time the landing page is requested? parseInt accepts a radix from 2 to 36. Because two digits are paired into one character and must cover the character codes 0 to 255, only the values from 16 (16 × 16 = 256) to 36 are useful. We could easily try all 21 possibilities and select the right one manually, but the solution should not require manual intervention. What can we do?

If the string is long enough we can assume that the highest digit of the radix (like the '9' in the decimal system) occurs at least once. The radix is then the value of this digit plus one. With command-line tools we could find the digit by splitting the string into one digit per line, sorting the result uniquely and looking at the last line. This would be a pretty good guess.
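The same guess takes only a few lines of C. This is a sketch with a hypothetical function name, guess_radix; it returns the value of the highest digit found plus one and assumes, as above, that the highest digit really occurs in the input.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: guess the parseInt radix as (highest digit + 1).
 * Characters outside '0'-'9' and 'a'-'z' are ignored.
 * Returns 0 if the input contains no digit at all. */
int guess_radix(const char *s)
{
    static const char digits[] = "0123456789abcdefghijklmnopqrstuvwxyz";
    int highest = -1;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        const char *p = strchr(digits, *s);
        if (p != NULL && (int)(p - digits) > highest)
            highest = (int)(p - digits);
    }
    return highest + 1;
}
```

For the cleaned sample string above the highest digit is u (value 30), so the guess is 31, which matches the radix in line 38 of the example page.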

I decided against the command-line approach. It would be slow and resource-consuming. To gain speed I wrote a C program to guess the radix. The program should verify the result against … what?

There are some possible tests.

  1. The lowest printable character would be the blank if there were no line breaks. My examples contain no characters below the blank, but this could be adjusted: the line feed (0x0a) would be encoded as 0a for every radix greater than 10.
  2. The braces in JavaScript are balanced as long as they are not used inside strings. Every opening '(' or '{' must be closed by ')' or '}', so the counts of opening and closing braces should be equal.
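Both tests can also be run on text that has already been decoded with a candidate radix. The following C sketch does that. The function name plausibility and the exact rule for the first test (no character below the blank except the line feed) are assumptions made for this illustration, and the bit values only mirror the return codes of the program further down.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: apply the plausibility tests to decoded text.
 * Returns a bit mask:
 *   2 = a character below ' ' other than line feed occurs
 *   4 = the counts of '(' and ')' differ
 *   8 = the counts of '{' and '}' differ */
int plausibility(const char *text)
{
    long round = 0, curly = 0;
    int result = 0;

    assert(text != NULL);
    for (; *text != '\0'; text++) {
        unsigned char c = (unsigned char)*text;
        if (c < ' ' && c != '\n')
            result |= 2;
        if (c == '(') round++;
        if (c == ')') round--;
        if (c == '{') curly++;
        if (c == '}') curly--;
    }
    if (round != 0) result |= 4;
    if (curly != 0) result |= 8;
    return result;
}
```

A result of 0 means that the candidate radix passed both tests. It does not prove that the radix is right, and braces inside string literals can still upset the counts.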

This leads to the following program to guess the radix.

/* (c) 2012 by Thomas Arend, 2012/10/25

 * Purpose: Guess the radix for parseInt from the input
 * Assumption: Highest digit is used at least once
 * Input: parseInt coded file
 * Output: possible radix for decoding with parseInt
 * Return codes (added up when several tests fail):
 *  0 everything is fine
 *  1 input was not tidy
 *  2 with radix r the blank is not the lowest character
 *  4 the counts of ( and ) differ
 *  8 the counts of { and } differ
 *  

 * The Toolkit Blackhole codes a Java-Script
 * in the attribute values of a tag.
 * Every two characters are interpreted as an integer and 
 * parsed with parseInt and fromCharCode into an new character
 * The radix for parseInt is obfuscated in the calling script.
 * Because we don't want to reassemble the obfuscated script
 * we have to guess the radix from the input.
 *
 * We assume that the highest digit is used in the input.
 * 
 * In a large JavaScript it is very unlikely that the highest digit
 * is never used. The useful range for the radix is 16 to 36.
 *
 * $Id: $
 * $Log:$
 */
#include <stdlib.h>
#include <stdio.h>

#define MAXRADIX 36
#define MAXCHAR 256

char validdigits [MAXRADIX+1] = "0123456789abcdefghijklmnopqrstuvwxyz";

long usedchar [MAXCHAR] = { 0 } ;
long statistic[MAXRADIX][MAXRADIX] = { 0 };

int validdigit (int digit ) {

  if ('0' <= digit && digit <= '9')
    return 1 ;
  else if ('a' <= digit && digit <= 'z' )
    return 1;
  else
    if ( digit == 10 )
      return 1;
    else
      return 0;
  
}

int digitindex (int digit ) {

  if ('0' <= digit && digit <= '9')
    return (digit - '0');
  else if ('a' <= digit && digit <= 'z' )
    return (digit -'a' + 10);
  else
    return 255;

}

// Check if the input file consisted only of 0-9, a-z
  
int check_tidy_charset () {
  
  int isdirty = 0;
  int dirty = 0;
  
  for ( dirty = 1 ; dirty < MAXCHAR ; dirty++) {
    if ( !validdigit(dirty) && usedchar[dirty] ) {
      isdirty++;
    }
  }
  
  if (isdirty > 0) {
    printf ("Dirty characters %d\n", isdirty ) ;
    return 1;
  }
  else
    return 0;
  
}  

// If the code contains blanks then the ' '
// should be the lowest character.

int blank_check ( int radix ) {
  
  int found = 0;
  int i = 0, j = 0;
  
  found = 0;
  for ( i = 0; i < MAXRADIX && !found ; i++ ) {
    for ( j = 0; j < MAXRADIX && !found ; j++ ) {
      found = statistic[i][j] > 0; 
    }
  }
  
  if ((i-1)*radix + j-1 != ' ') {
    printf ("Blank check failed at [%d,%d] = %ld\n", i-1 , j-1 , statistic[i-1][j-1]) ;
    return 2;
  }
  else
    return 0;
  
}


// The characters ( and ) should have equal counts.

int partentheses_check ( int radix  ) {
  

  if (statistic['(' / radix]['(' % radix] != statistic[')' / radix][')' % radix]) {
    
     printf ("Parentheses '()' check failed with %ld,%ld\n",
	     statistic['(' / radix]['(' % radix] , 
	     statistic[')' / radix][')' % radix] ) ;
     return 4;
  }
  else
    return 0;
  
}

// The characters { and } should have equal counts.

int curly_brace_check ( int radix  ) {
  

  if (statistic['{' / radix]['{' % radix] != statistic['}' / radix]['}' % radix]) {
    
    printf ("Curly brace '{}' check failed with %ld,%ld\n",
	    statistic['{' / radix]['{' % radix] , 
	    statistic['}' / radix]['}' % radix] ) ;
    return 8;
     
  }
  else
    return 0;
  
}

int main ( int argc, char *argv[ ])
{

  int figure = 0;
  int previous  = 0;
  int paired  = 0;
  int radix = 0;

  int dirty = 0;
  int isdirty = 0;
  int found = 0;
  
  int i = 0, j = 0;
  int error = 0;
   
  // Count all characters
  
  paired = 0;
  while (( figure = getchar()) != EOF ) {
    usedchar[figure]++;
    if (paired) {
      i = digitindex(previous);
      j = digitindex(figure);
      if (i < 255 && j < 255) { statistic[i][j]++;}
      paired = 0;
    }
    else {
      paired = 1;
      previous = figure;
    }  
  }

  // Seek highest character
  
  for ( figure = 255 ; ( figure > 0) && (usedchar[figure] == 0); figure-- ) {}

  // Print radix 

  radix = digitindex (figure) + 1;
  printf ( "%d\n" , radix );
  
  // Check input and guess 
  
  error += check_tidy_charset();
  error += blank_check(radix);
  error += partentheses_check(radix);
  error += curly_brace_check(radix);

  return error;
  
}

Listing: piRadix

Decoding the data

I wrote a second C program, piDecode, to decode the data.

/* (c) 2012 by Thomas Arend, 2012/10/25
 *
 *
 * Purpose: Decode parseInt encoded input file
 * Parameter: radix for parseInt
 * Input: parseInt encoded file
 * Output: decoded file
 *

 * The Toolkit Blackhole codes a Java-Script
 * in the attribute values of a tag.
 * Every two characters are interpreted as an integer and 
 * parsed with parseInt and fromCharCode into an new character
 * 
 * The radix can be guessed with the program piRadix
 
 * $Id: $
 * $Log:$
 */
#include <stdlib.h>
#include <stdio.h>

int digittoint ( int digit ){

  if ('0' <= digit && digit <= '9')
    return (digit - '0');
  else if ('a' <= digit && digit <= 'z' )
    return (digit -'a' + 10);
  else
    return 255;

}

int char2toint ( int z1, int z2, int radix) {

  return (digittoint(z1) * radix + digittoint(z2)) ;

}
int main ( int argc, char *argv[ ])
{

  int character1 = 0;
  int character2 = 0 ;
  int code  = 0;
  int radix = 16 ;

  if (argc > 1 ) {
    radix = atoi(argv[1]);
  }
  else {
    radix = 16;
  }

  while (( character1 = getchar()) != EOF ) {

    if ( (character2 = getchar()) == EOF ) break;

    code = char2toint ( character1, character2, radix );

    if ( code < 256 ) {
      putchar(code);
    } else {
        putchar(code >> 8);
        putchar(code & 255);   /* low byte */
    }

  }
  printf ( "\n" ) ;
  return 0;

}

Listing: piDecode

This program is pretty simple. It could use some improvements and error checks, but it works when the input is clean.

The "blackhole" script part 2

This leads to the final version of the blackhole script. It has to be called twice: first without a parameter to guess the radix, then with the guessed radix as parameter to decode the input.

#!/bin/sh

# Blackhole decode Part 2

if [ -z "$1" ]
then 
  CMD="piRadix"
else
  CMD="piDecode"
fi  

sed 's# \([0-9]\{1,2\}="[^"]*"\)#\n\1\n#g;' \
| grep '^[0-9]*=' \
| sort -n \
| sed 's#^[0-9]*="##; s#"$##;' \
| tr -dc '0-9a-z' \
| tr -d '\n' \
| $CMD $1

Listing: blackhole.sh

Example:

thomas@x1:~> blackhole.sh <term_covering.php
30
thomas@x1:~> blackhole.sh 30 <term_covering.php
try{var PluginDetect={version:"0.7.8",name:"PluginDetect", …

The relaying pages use some nearly identical obfuscations. They have to be handled differently, but the two C programs still do their work once the data is extracted. Here is an example from a relaying page.

v=window;
try{dsfsd++}
catch(wEGWEGWEg){
try{(v+v)()}
catch(fsebgreber){
try{v["document"]["body"]="123"}
catch(gds){m=123;if((alert+"").indexOf("native")!==-1)ev=window["e"+"val"];
}
}
n=["53","45","4m","23","2f","26",..,..,..,];
h=2;
s="";
if(m)for(i=0;i-105!=0;i++){
k=i;
if(window["document"])s+=String["fro"+"mCharCode"](parseInt(n[i],23));
}
try{febwnrth--}
catch(bterste){
alert(s);
}
}

Modifying the blackhole shell script to detect this encoding and extract the data should not be a big task.
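As a hedged sketch of how the decoding part could look for this variant: the array elements are separate strings, so each one can be handed to the standard C function strtol, which accepts a base from 2 to 36 like parseInt. The function name decode_array is hypothetical, and the sketch assumes that the array literal uses plain ASCII double quotes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: decode every double-quoted token in an array literal
 * such as ["53","45","4m"] with the given radix, one character per token.
 * Tokens that are empty, not fully numeric or above 255 are skipped.
 * dst needs room for one byte per token plus the terminator.
 * Returns the number of characters written. */
size_t decode_array(const char *src, int radix, char *dst)
{
    size_t n = 0;

    assert(radix >= 2 && radix <= 36);
    const char *start = strchr(src, '"');
    while (start != NULL) {
        const char *stop = strchr(start + 1, '"');
        if (stop == NULL)
            break;                       /* unterminated token: give up */
        char *end;
        long code = strtol(start + 1, &end, radix);
        if (end == stop && end != start + 1 && code >= 0 && code <= 255)
            dst[n++] = (char)code;       /* token was a clean number */
        start = strchr(stop + 1, '"');
    }
    dst[n] = '\0';
    return n;
}
```

For the six elements visible in the listing and radix 23, this yields var1=4.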

Attached you will find the Blackhole-decode source code in a ZIP archive.
Use at your own risk. Oops, I forgot the license info. I will add it tomorrow. It will be GPL.

Good night!

Update 2012-10-26: This morning I found a post from 18 October 2012 on Lab69.com, "Blackhole v2 Deobfuscation from Ruby Perspective", which gives credit to Hooked on Mnemonics and the post "Deobfuscating BlackHole V2 HTML Pages with Python". I worked independently of these approaches. The main difference is that I try to guess (or calculate) the radix for the parseInt function from the input.

My programs read the file character by character. Their programs load the whole file into a string. With their approach it should be easy to calculate the radix by scanning the string for the highest digit before decoding it.
