The vast majority of websites are built on top of existing platforms such as Wordpress , Joomla and Drupal .

Sometimes, because of the use of an older version of the platform (Drupal for example) or because of certain plugins used (or insecure code), these sites can become vulnerable to attacks and might eventually be compromised.

In the post-attack phase, there may be malicious files present, and those need to be cleaned up.

This post will focus on determining provenance of PHP files, and performing static analysis on PHP code to assess if it presents a risk.

The code in malicious files may contain function calls like this eval/*random string*/(arguments), which is valid PHP syntax, but many of these atypical code fragments are present especially in obfuscated code so using regexes to find function calls can be complicated and prone to error.

A regex approach is unlikely to be aware of the code structure and might return false matches.


A PHP grammar 1 is used to generate a parser that is then used to parse the code.

We then store the parse trees in a specialized XML datastore. We can then query the parse trees using XQuery and XPath to determine certain characteristics and create a report to help in narrowing down potentially malicious files.

We also use an SQLite database that will store checksums of known PHP files. This will allow us to speed up the entire process and obtain provenance information.


Initially, the XML library Saxon was used, but since it's only a library, and it does not offer indexing or any features that would allow fast searches on the XML trees.

This SO thread points out that Saxon is not a database.

Probably you should consider using XML databases like Sedna, eXist? As you mentioned Saxon is processor, not a storage. [..] I don't think you can optimize that, and no, Saxon is not a database.

In this article, the point is made that Saxon is not designed to index the XML data:

It should be noted that Saxon is not a database product. Its raw material is XML held in unparsed form in filestore, or sent over the wire. This means that Saxon does not have the luxury of maintaining persistent indexes or collecting statistical data for use by its optimizer; it has to take the data as it comes

Another reason for not using an XPath/Xquery library is that it would require sequentially loading trees and querying them since these structures can take up a lot of memory. In addition, a datastore would also allow to refine and re-run the queries, without incurring the cost of parsing the code again.

So, while there are many XML libraries for Java, each with different features, an XML store was needed here.

Because of these facts, Saxon was replaced with BaseX. This gives the ability to both process XML, run queries on it, and index it for fast retrieval.

Some examples of querying the AST

Since the parse trees are stored in BaseX, they can be visualized and queried. Here's how one file in the Joomla codebase looks like:

Here are some ways to extract specific information from this parse tree, for example:

  • getting variable initializations //variableInitializer//*[matches(text(),"^\$")
  • getting varialbes that are referred to throughout the code //keyedVariable//*[matches(text(),"^\$")
  • getting all function calls //functionCall//identifier.
  • getting all function variable syntax calls //functionCall//functionCallName//chainBase//keyedVariable .

Building a checksum database

In order to skip files that are known to be safe, we can build a database of checksums. SQLite is a good fit for storing these.

Fortunately, the version control system, Git, stores checksums for all files and each of their versions, and the major platforms (Wordpress/Joomla/Drupal) all have their main repositories or mirrors of their repos versioned in Git.

There's still the problem that on downstream (packages for different linux distributions), the codebase packaged by the distribution maintainers, for example Debian's Wordpress repository, might have patches applied to it that didn't make their way back upstream.

In this case, we can include the downstream repository when we build the checksum database.

In summary, the SQLite checksum database will store, in one table, checksums for all versions of each file found in the selected Git repositories 2

Parse tree heuristics, real break-ins and PHP payload characteristics

This is a short snippet 3 of malicious PHP code that was found on a server. Most variable names look random. The eval function is used and there's a decoding process that takes place at the end. The base64_decode function is also used to decode the data that's stored in the $mafdk variable. For brevity, the full strings were skipped in the right-handside of the assignment for the $mafdk variable. In this particular case , the string concatenation spanned many lines such that the largest part of this file was actually just string data. This is another hint that can be used.

$uwvcvujcdx = 9738;
function iuxwhbyf($dafokjfp, $voclsaqdh){
    $uzfcu = '';
    for($i=0; $i < strlen($dafokjfp); $i++){
        $uzfcu .= isset($voclsaqdh[$dafokjfp[$i]]) ? $voclsaqdh[$dafokjfp[$i]] : $dafokjfp[$i];
$whbwesyeu="base" . "64_decode";return $whbwesyeu($uzfcu);}
$mafdk = '9lIdXj57e391uzjJL05J3zaPeJLZCwnjTwx8ixQE9lIdXj57e391uzaPeY5ILvuPLvNvH4'.
$dylov = Array('1'=>'o', '0'=>'m', '3'=>'X', '2'=>'q', '5'=>'9', '4'=>'C',[..]);
eval/*jyyiwm*/(iuxwhbyf($mafdk, $dylov));?>

As part of this project, we analyze the string literals being used, and compute how much of the data stored in strings is base64-encoded or how much of it is encoded as hex-strings.

Another check we perform is related to the function names and function call counts for each parse tree. Here is a snippet of that:

let $partial := 
	element root {
	for $doc in db:list("xtrees")
	where matches($doc,"^unknown/.*\.php$")
	(:~ all function call nodes in the document :)
	let $anodes := db:open("xtrees",$doc)//functionCall//identifier
	(:~  variable function calls :)
	let $fvar   := db:open("xtrees",$doc)//functionCall//functionCallName//chainBase//keyedVariable
	(:~ names of the function calls :)
	let $tnodes := $anodes//text()
	(:~ distinct values thereof :)
	let $dnodes := distinct-values($tnodes)
	for $func in $dnodes
	let $cnt := count($anodes//[text()=$func])
	group by $doc

	(: variable syntax function call score :)
	let $fvarscore := 
	  if(count($anodes) = 0)
	  then 0
	  else (count($fvar) div count($anodes))

	let $elem := element file {
	  attribute path { $doc },
	  attribute fvarscore { $fvarscore },
	  $func ! (
			 retain implicit iterator in the $x variable for usage inside
			 this scope (to avoid ambiguity)
		  let $x := .
			element function {
			  attribute name  { $x } ,
			  attribute count {
				count(filter($anodes//[text()=$x], function($y) {$y = true()} )) 
	return $elem


Obfuscators are used to conceal the logic carried by a program's source code. But in most cases, what obfuscators leave behind is a rigid structure. This structure is the cue.

Since we're interested in structure, if we get an XML skeleton and discard the actual text nodes, we're left with just the tree structure minus the content.

If we do that for a program's parse tree once, we can reuse that information for lookups. So if we've already computed the structure for program A, then we can easily answer if program A and B have the same structure 4. Such a metric of similarity can be made more flexible, allowing the two programs to differ to some extent.

Syntax tree fingerprinting (aka MAST, or Merkleized abstract syntax trees)

This transformation will build a MAST(Merkelized abstract syntax tree) 5. We'll use this later on in order to identify matching portions of logic in different programs.

This is a summary of the bottom-up construction of a merkle tree: start from leafs, assign the leafs a hash based on their node name. Then, progressively, for each higher level, take a node, concatenate its name with the previously computed hashes for all of its children, compute a hash of that, and then assign the hash to the node. Continue the entire process up to, and including the root node.

The main use-case 6 here is identifying files that have the same structure. However, because we now have checksums of sub-trees (not just the root or top-level checksum for the entire AST), we can also check for similar code-blocks.

The diagram on the left (right above) describes how the original tree looks like, while diagram on the right explains how the Merkle tree looks.

Here is an XQuery implementation of the algorithm described above.

declare function local:mast($node as node(), $d) {
	case element()
		  prevent a stack overflow
		  (the trees we operate on usually have 
		   max depth 500)
	  if($d le 3000)
	  then (
		(:~ compute all the children recursively :)
		let $newc := (
		  for $c in $node/node()
		  return local:mast($c, $d + 1)
		(:~ get the children's hashes :)
		let $nc := string-join($newc ! @h,'')
		(:~ get the current node name :)
		let $nn := node-name($node)
		(:~ concatenate current node name with the children hashes :)
		let $msg := $nn || $nc
			compute a hash for the entire subtree rooted 
			in the current node
		let $subtree_hash := crypto:hmac($msg,'hardcodedkey','md5','hex') 
		return element {$nn} {
		  attribute h {$subtree_hash},
	case text() return ()
	default return ()

(:~ usage example :)

Here's how the regular AST looks like compared to the MAST. Both trees are stored in BaseX (and the MAST was computed with the code above). The MAST was computed by ignoring the leaf nodes, and adding the checksum attribute to each node. Each node also carries a start and end for the start line number and the end line number, for the parse rule associated with that node. Using the start/end attributes we can refer back to the part of the source code that matched one of the checksums.

Other static analysis tools

Facebook has developed pfff which is a static analysis toolset for code analysis, visualization and source transformations.

As part of pfff, sgrep stands out as an augmented version of grep that understands function definitions, function calls (among other things).

The spatch utility also looks very promising. It could be used to wrap certain unsafe calls or to allow insertion of logic into existing code. For example, if encountering calls like eval() or like $functionName(), you could replace them with your own function with your own custom code.

You can also do more complex queries such as finding all classes which extend some parent class, and which contain a function which calls another function.


We've looked at some ways of doing static analysis on PHP code in the aftermath of a server break-in. We've seen how a specialized datastore can greatly aid in storing hierarchical structures such as parse trees, then we've looked at a way of fingerprinting code.

The code is available on github under MIT license.



For comparison, this is the official grammar for PHP 5.6.26


debsums works the same way, by checking if the installed files originating from distribution packages have the expected checksums


This post describes an intrusion and how it was handled.


Although outside the context of this project, this could also be repurposed and be used to assess code redundancy in a large codebase.


While working on this project, I came across a paper that describes the same process of indexing code, storing tree structures and creating hash trees to find duplicate/similar code for the purpose of refactoring.

Apart from describing the entire process, they also focus on the choice of the hash function (in section 4.2) and even propose a structurally-sensitive hash function (a hash function that takes into consideration the structure of the tree, and takes into account the equivalence of certain operations).


These trees could also be used to track web page structure, but that's outside the scope of this post.