Lab #7: Antivirus with YARA

Due: Friday, Dec 2 11:59PM

NOTE: This is an optional extra credit assignment! It is worth 5 points of extra credit towards your grade.

YARA will appear as one of the topics you can pick from on the final, so you may be interested in looking over this assignment nonetheless.

Lab structure

In this lab you will be using YARA to write some rules to detect malware. In particular, you will write some rules to detect the linux/x64/meterpreter_reverse_http payload for Metasploit that you’ve used in some of the earlier labs.

Before you get started, you should check out the appendix on YARA rules for a crash course on using YARA. Throughout this lab, you can use the following page as your primary reference for how to write YARA rules:

https://yara.readthedocs.io/en/stable/writingrules.html

You may also find some of the other references in the appendices useful for figuring out what your rules should look like.

What to submit

At the end of this assignment, you should submit a PDF document with your YARA rules for each problem.

Grading

This assignment is worth 5 total points of extra credit. Points will be awarded based on completion – as long as you make a good-faith effort to complete each problem, you should get full points.


Setting up

To get started, we will generate some toy “malware samples” using Metasploit’s linux/x64/meterpreter_reverse_http payload. We will spend the rest of the lab writing YARA rules for this payload.

Generating samples with msfvenom

We’ve been using malware in some form or another since Lab 3. Remember msfvenom[1]? Every time we’ve used Metasploit to perform remote code execution, it has generated malicious code (in some form or another) and executed it on the target. Do you remember ever seeing Metasploit print a log like this in Lab 3 or Lab 6?

msf > run
...
[!] This exploit may require manual cleanup of '/tmp/EvkPk' on the target

meterpreter> 

That shows up because Metasploit uploaded its malicious payload to (in this case) /tmp/EvkPk. In the real world, we would collect these samples from the exploited host and analyze them. For convenience’s sake, though, in this lab we’ll artificially generate our own samples using msfvenom (since that’s what Metasploit is using under the hood anyway). First, run the following command:

# Create a directory to store samples in your home directory
mkdir -p ~/samples

Now run the following command 2-3 times:

msfvenom --arch x64 \
  --platform linux \
  --format elf \
  --payload linux/x64/meterpreter_reverse_http \
  LHOST=$(hostname) LPORT=4444 \
  -o $(mktemp -up ~/samples)

Every time you run this command, it will create a new, randomly-named copy of the msfvenom payload in your ~/samples/ directory.
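
The random file names come from the -o $(mktemp -up ~/samples) part of the command. Here is a minimal sketch of that behavior on its own, assuming GNU mktemp (the version typically shipped on Linux). The scratch directory demo_dir is a hypothetical stand-in for ~/samples:

```shell
# Use a scratch directory in place of ~/samples for this demonstration.
demo_dir=$(mktemp -d)

# -u prints a unique name without creating the file;
# -p places that name inside the given directory.
name1=$(mktemp -u -p "$demo_dir")
name2=$(mktemp -u -p "$demo_dir")

echo "$name1"
echo "$name2"
```

Each call prints a different path inside the directory, and no file exists at that path until msfvenom writes the payload there.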

We’re going to write a few different rules to detect these samples.


Problem 1: strings-based rule

Ready? If you haven’t already, I recommend reading the intro to YARA rules in the appendix before starting.

The strings command searches for all of the text strings that it can find in a file. Run

strings -n 30 ~/samples/my_sample | sort | uniq

(Replace my_sample with an actual sample from your directory.) strings -n 30 ... extracts all strings that are at least 30 characters long from the malicious binary, while ... | sort | uniq sorts those strings and removes duplicates.

Choose at least four of the strings that stand out as unique. Then create a file, e.g. myrules.yar, with a YARA rule to detect them. Your YARA rule should look something like the following (for example):

rule Msf_Linux_MeterpreterReverseHttp_strings {
  meta:
    description = "linux/x64/meterpreter_reverse_http - strings"
    author = "Your Name Here"

  strings:
    // Put the strings that you extracted here!
    $s1 = "string 1"
    $s2 = "string 2"
    $s3 = "string 3"
    /* ... */

  condition:
    // Add a condition using these strings to ensure that malware
    // samples get correctly identified.
    false
}

See the section on testing your rules to see how to check that your rules work correctly.


Problem 2: function-based rules

By default, the payloads generated by msfvenom have debug symbols in them[2]. This makes it fairly straightforward to identify functions from the original source code of the malware. Try running the following command (with my_sample replaced by the name of an actual msfvenom payload sample):

objdump -j .text -d ~/samples/my_sample | less

less will run a terminal pager that makes it easier to look through the output of objdump. You can use your arrow keys or the “page up” / “page down” keys on your keyboard (if you have them) to scroll through the output. You can also press the / key to search for a term, and q to exit.

You should see some output like this:

0000000000007610 <get_protocol_family>:
    7610:       81 ff 00 20 00 00       cmp    $0x2000,%edi
    7616:       89 f8                   mov    %edi,%eax
    7618:       0f 84 39 01 00 00       je     7757 <get_protocol_family+0x147>
    ...

0000000000007784 <strip_trailing_dot>:
    7784:       48 8b 57 10             mov    0x10(%rdi),%rdx
    7788:       48 85 d2                test   %rdx,%rdx
    778b:       74 24                   je     77b1 <strip_trailing_dot+0x2d>
    ...

This output is telling us that the function get_protocol_family starts at byte 0x7610 in the binary, and its definition runs until byte 0x7784 (when the strip_trailing_dot function starts). It also tells us the assembly code instructions in those functions along with their hex representation.

Pick out at least two functions, and write a YARA rule that identifies binaries with those functions. You should use the hex representations of their assembly code in your rule.

NOTE: you don’t want to have to copy the entire hex representation of the functions by hand! I’ve added a little trick in the appendix (“extracting the bytes of a function”) that you can use to make this process much faster.

Your YARA rule should look something like this:

rule Msf_Linux_MeterpreterReverseHttp_funcs {
  meta:
    description = "linux/x64/meterpreter_reverse_http - functions"
    author = "Your Name Here"

  strings:
    // Put the functions that you extracted here!
    $f1 = { 81 FF 00 20 ... }
    $f2 = { 48 8B 57 10 ... }
    /* ... */

  condition:
    // Add a condition using these strings to ensure that malware
    // samples get correctly identified.
    false
}

Problem 3: syscall-based rules

NOTE: you don’t have to write your own rule for this one, but you should still follow along with the steps. It’s mostly meant to give you an opportunity to look at some other ways you can analyze malware. There will be a YARA rule at the end that you should add alongside your other rules.

We’re going to try one final type of rule, which will look at the system calls that a program makes. Sometimes malware authors like to be sneaky and try all kinds of tricks to make their malware harder to analyze. One of those tricks is to compress and encrypt the malware. When the malware runs, the malware decrypts and decompresses itself in memory before executing the bulk of the malicious payload.

From an attacker’s point of view, obfuscating the malicious payload makes analysis a little more difficult. And by performing everything in memory, they ensure that the non-obfuscated payload never gets stored on disk (where a forensic investigator would be able to recover the malware). There are other benefits, too: since the majority of the payload is encrypted, it’s very hard to write a good YARA rule for that portion of the payload. And the compression reduces the payload size and (potentially) raises fewer alarms.

For our final rule, we will try to detect a malware sample that uses some of these tricks. For this we’ll use a little program called tardis that I’ve thrown together[3]:

# tardis will compress and encrypt the contents of my_sample
tardis ~/samples/my_sample ~/samples/packed

# Give execute permissions for the newly-created sample
chmod a+x ~/samples/packed

If you run objdump on ~/samples/packed, you’ll notice that you can’t see the functions that are defined for the binary anymore.

Still, there are other ways we can inspect what it does. strace will run this sample and tell us what syscalls it performs:

$ strace -b execve ~/samples/packed
execve("./out", ["./out"], 0x7ffe57fc7180 /* 36 vars */) = 0
brk(NULL)                               = 0x5555564e3000
...
mmap(NULL, 1069056, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4677cc4000
munmap(0x7f4677dc9000, 720896)          = 0
memfd_create("a", MFD_CLOEXEC)          = 3
write(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0u\216\0\0\0\0\0\0"..., 1068640) = 1068640
execveat(3, "", ["./out"], 0x5555564e5160 /* 0 vars */, AT_EMPTY_PATH
strace: Process 161899 detached

The salient lines are the lines where the malware calls memfd_create, write, and execveat. In particular, memfd_create and execveat work in tandem to allow the malware to create a region of memory where it can store its decrypted, decompressed payload, and then execute that payload.

Our last YARA rule will try to find programs that perform these two syscalls. Try running the following command:

gdb -x /usr/local/share/cs3710/script.gdb ~/samples/packed

This command uses the GNU Debugger to execute the sample and trace its execution. script.gdb automates the process for you so that you can see the relevant parts of the output, which should look similar to the following:

...

Catchpoint 1 (syscall 'memfd_create' [319])
Catchpoint 2 (syscall 322)

Catchpoint 1 (call to syscall memfd_create), 0x00007ffff7ededaf in ?? ()
Dump of assembler code from 0x7ffff7eded9b to 0x7ffff7ededc3:
   0x00007ffff7eded9b:  00 48 8d        add    %cl,-0x73(%rax)
   0x00007ffff7eded9e:  3d 8f dc 0d 00  cmp    $0xddc8f,%eax
   0x00007ffff7ededa3:  be 01 00 00 00  mov    $0x1,%esi
   0x00007ffff7ededa8:  b8 3f 01 00 00  mov    $0x13f,%eax
   0x00007ffff7ededad:  0f 05   syscall
=> 0x00007ffff7ededaf:  48 89 c7        mov    %rax,%rdi
   0x00007ffff7ededb2:  48 63 c7        movslq %edi,%rax
   0x00007ffff7ededb5:  48 39 f8        cmp    %rdi,%rax
   0x00007ffff7ededb8:  0f 85 b3 0b 00 00       jne    0x7ffff7edf971
   0x00007ffff7ededbe:  89 bc 24 38 03 00 00    mov    %edi,0x338(%rsp)
End of assembler dump.

Catchpoint 2 (call to syscall 322), 0x00007ffff7edf2a1 in ?? ()
Dump of assembler code from 0x7ffff7edf28d to 0x7ffff7edf2b5:
   0x00007ffff7edf28d:  48 8d 35 bd f4 0d 00    lea    0xdf4bd(%rip),%rsi
   0x00007ffff7edf294:  b8 42 01 00 00  mov    $0x142,%eax
   0x00007ffff7edf299:  41 b8 00 10 00 00       mov    $0x1000,%r8d
   0x00007ffff7edf29f:  0f 05   syscall
=> 0x00007ffff7edf2a1:  0f 0b   ud2
   0x00007ffff7edf2a3:  48 c1 e3 20     shl    $0x20,%rbx
   0x00007ffff7edf2a7:  48 83 cb 02     or     $0x2,%rbx
   0x00007ffff7edf2ab:  49 89 df        mov    %rbx,%r15
   0x00007ffff7edf2ae:  4c 89 bc 24 00 01 00 00 mov    %r15,0x100(%rsp)
End of assembler dump.

The two areas to focus on are the places where the memfd_create and execveat syscalls are executed: the mov $0x13f,%eax and mov $0x142,%eax instructions, each followed shortly afterwards by a syscall instruction. We can write YARA rules for the assembly used to perform the syscalls, similar to what we did in Problem 2.

At a minimum, our YARA rules should contain the part where we move the number identifying the syscall (0x13f for memfd_create, 0x142 for execveat) into the %eax register, and then perform the syscall instruction. We can use the assembly printed by GDB above to create the following YARA rule:

rule Memfdcreate_Execveat_Syscalls {
  meta:
    description = "program uses the memfd_create and execveat syscalls"
    author = "Your Name Here"

  strings:
    // memfd_create syscall:
    //   mov $0x13f,%eax
    //   syscall
    $s1 = { b8 3f 01 00 00 0f 05 }

    // execveat syscall:
    //   mov $0x142,%eax
    //   mov $0x1000,%r8d
    //   syscall
    $s2 = { b8 42 01 00 00 41 b8 00 10 00 00 0f 05 }

  condition:
    /* 
     * The uint32(0) == ... condition is another way of checking
     * whether a file is an ELF file. It's equivalent to the
     * "$elf at 0" rule in the appendix.
     */
    uint32(0) == 0x464C457F
    and all of ($s*)
}

You should add this rule to your rules file, alongside the rules from Problems 1 and 2.
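
To see where the bytes in $s1 come from, it can help to build the pattern by hand. The sketch below assumes the simplest case, where mov $NR,%eax is immediately followed by syscall: B8 is the opcode for moving a 32-bit immediate into %eax, the syscall number follows in little-endian byte order, and 0F 05 is the syscall instruction. The helper name syscall_pattern is hypothetical.

```shell
# Print a YARA hex string for `mov $NR,%eax; syscall`.
# The immediate is emitted least-significant byte first (little-endian).
syscall_pattern() {
  nr=$1
  printf '{ B8 %02X %02X %02X %02X 0F 05 }\n' \
    $(( nr & 0xff )) \
    $(( (nr >> 8) & 0xff )) \
    $(( (nr >> 16) & 0xff )) \
    $(( (nr >> 24) & 0xff ))
}

syscall_pattern 319   # memfd_create (0x13f) -> { B8 3F 01 00 00 0F 05 }
syscall_pattern 322   # execveat (0x142)     -> { B8 42 01 00 00 0F 05 }
```

YARA hex strings are case-insensitive, so the first line of output is the same pattern as $s1. In the packed sample, the execveat call has a mov $0x1000,%r8d instruction between the mov and the syscall, which is why $s2 in the rule above includes those extra bytes.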


Last steps: check your rules

Ensure that you have at least 2 or 3 different malware samples in your ~/samples directory. To verify that your rules for Problems 1-3 worked correctly, run them against the generated samples as follows:

yara myrules.yar ~/samples

If your rules work correctly, you should see some output like the following:

Msf_Linux_MeterpreterReverseHttp_strings ./tmp.rZ2v9Lg3GH
Msf_Linux_MeterpreterReverseHttp_funcs ./tmp.rZ2v9Lg3GH

This indicates that both of your rules matched against the samples that you generated.
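
With several samples the output gets longer, so it can be handy to count how many files each rule matched. The following is a rough sketch that works on saved YARA output. The file matches.txt and the second sample name are made up for illustration; in practice you would create the file with something like yara myrules.yar ~/samples > matches.txt.

```shell
# Stand-in for saved YARA output: one "RuleName path" pair per line.
cat > matches.txt <<'EOF'
Msf_Linux_MeterpreterReverseHttp_strings ./tmp.rZ2v9Lg3GH
Msf_Linux_MeterpreterReverseHttp_funcs ./tmp.rZ2v9Lg3GH
Msf_Linux_MeterpreterReverseHttp_strings ./tmp.Qx81kLmZp0
Msf_Linux_MeterpreterReverseHttp_funcs ./tmp.Qx81kLmZp0
EOF

# Count matches per rule name (the first field of each line).
summary=$(awk '{ n[$1]++ } END { for (r in n) printf "%s %d\n", r, n[r] }' matches.txt | sort)
echo "$summary"
```

If every rule's count equals the number of samples it is supposed to detect, none of them missed a sample.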

As a sanity check, you should also try running your rules against some other files on your machine. If your rules are correct, they shouldn’t flag any files that aren’t in your ~/samples directory. I would recommend trying the following:

yara -r myrules.yar /usr/bin

# NOTE: this one will take a while. If it's gone for 15+ minutes
# and hasn't printed any matches, you can call it good and
# Ctrl + C out of the command.
yara -r myrules.yar /usr/lib

What to submit

Once you’ve verified that your rules work correctly, submit a document with your rules for each problem. If your rules flagged any additional files beyond the malicious files in ~/samples/, you should also indicate that in your submission.

Hints

To make your rules a little faster while performing your checks, you might want to add a couple of conditions that filter out files that aren’t ELF files. You could add the following condition to each of your rules:

condition:
  uint32(0) == 0x464C457F
  and /* ... additional conditions here ... */

or equivalently,

strings:
  $elf = "\x7fELF"
  /* ... */

condition:
  $elf at 0
  and /* ... additional conditions here ... */
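
If it isn't obvious why these two checks are equivalent: uint32(0) reads the first four bytes of the file as a little-endian integer, so the bytes 7F 45 4C 46 come out in reverse order. A quick sketch using od (the variable names are just for illustration):

```shell
# Dump the four ELF magic bytes one at a time, in file order.
set -- $(printf '\177ELF' | od -An -tx1)
first_bytes="$1 $2 $3 $4"

# Reverse them to get the value a little-endian 32-bit read produces.
magic_le="0x$4$3$2$1"

echo "bytes in file: $first_bytes"
echo "as uint32:     $magic_le"
```

The second line prints 0x464c457f, which is the constant used in the uint32(0) condition.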

In addition, you might want to skip files that are above a certain size. You can use the filesize condition for that – e.g., you could filter out all files that are larger than 100MB.
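
As a rough sketch of how the two filters combine, the snippet below writes a small standalone rule file. The rule name SmallElfOnly and the file name elf_size_filter.yar are hypothetical; in your own rules you would put the same two checks in front of your string conditions.

```shell
# Write a rule that only matches ELF files smaller than 100MB.
cat > elf_size_filter.yar <<'EOF'
rule SmallElfOnly {
  condition:
    // Cheap checks first: skip large files, then require the ELF magic.
    filesize < 100MB and uint32(0) == 0x464C457F
}
EOF
```

You could then try it with yara -r elf_size_filter.yar /usr/bin to see which files pass the filter.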


Appendix

Intro to YARA rules

YARA bills itself as “the pattern matching Swiss knife for malware researchers”. It is indeed a flexible tool. It allows you to define a list of patterns (“rules”) matching the behavior of various known malware families, run those rules over some data, and report any data that matched those rules.

What makes YARA special is that it’s designed to be usable in many different contexts. You can write YARA rules to match malware files, but you can also write YARA rules that match patterns in network traffic or RAM. And it’s really fast – you can match against thousands of YARA rules with fairly low overhead.

Here’s an example of a YARA rule:

rule ElfRule {
  meta:
    description = "File Magic - ELF file"

  strings:
    // Equivalent: $magic = "\x7fELF"
    $magic = { 7F 45 4C 46 }

  condition:
    $magic at 0
}

All this rule does is check whether a file begins with the four bytes [0x7F, 0x45, 0x4C, 0x46]. This is a “magic number” used to identify a file as an ELF file, which is the standard format for binary executables used by Linux[4].

You should try running this rule[5] on your machine:

# Create a file, `rules.yar`, and copy-and-paste the rule
# shown above into the file.
nano rules.yar

# The first command runs the rules on all files in /usr/bin, and the
# second runs them on all files in /etc. The -r flag makes YARA
# recursively search subdirectories.
#
# Note: need sudo in the second command because not all files in
# /etc are readable by everyone
yara -r rules.yar /usr/bin
sudo yara -r rules.yar /etc

You will find that the first command prints a ton of results, while the second prints none (or almost none). That’s because the /usr/bin directory contains many of the programs you use on a Linux system (the majority of which are formatted as ELF files). Meanwhile, /etc contains files related to system configuration, which are overwhelmingly not ELF files.

Every time YARA finds a file that matches a rule, it prints the file and the rule that it matched.

Let’s look at one more example:

rule IsShellScript {
  meta:
    description = "Is a shell script"

  strings:
    $s1 = /#!\/bin\/(sh|dash|bash|fish|zsh)/
    $s2 = /#!\/usr\/bin\/(sh|dash|bash|fish|zsh)/

  condition:
    filesize < 1MB and (($s1 at 0) or ($s2 at 0))
}

If you’ve never seen a regular expression before this can be a little disorienting. What this rule does is check whether a file is smaller than a megabyte. If it is, it then checks whether that file starts with one of the following strings:

  • #!/bin/sh, #!/bin/dash, #!/bin/bash, #!/bin/fish, #!/bin/zsh
  • #!/usr/bin/sh, #!/usr/bin/dash, #!/usr/bin/bash, #!/usr/bin/fish, #!/usr/bin/zsh

Files that begin with these sequences are shell scripts that run a sequence of Linux commands, e.g.

#!/bin/bash
# A very short shell script that just deletes all of the files
# in the /tmp ("temporary") directory
#
# (Don't actually run this, this is just an example)

echo "[$(date -R)] Starting cleanup..."
rm -vrf /tmp/*
echo "[$(date -R)] Cleanup finished"
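
To get a feel for which first lines the two regular expressions in IsShellScript accept, here is a rough approximation using grep -E. This is a sketch rather than YARA itself: the pattern variable merges $s1 and $s2 into a single expression, the ^ anchor stands in for the "at 0" condition, and the helper name matches_rule is made up.

```shell
# One extended regex covering both $s1 (/bin/...) and $s2 (/usr/bin/...).
pattern='^#!(/usr)?/bin/(sh|dash|bash|fish|zsh)'

# Report whether a given first line would be accepted by the pattern.
matches_rule() {
  if printf '%s\n' "$1" | grep -Eq "$pattern"; then
    echo "match"
  else
    echo "no match"
  fi
}

matches_rule '#!/bin/bash'          # match
matches_rule '#!/usr/bin/zsh'       # match
matches_rule '#!/usr/bin/env bash'  # no match
matches_rule '#!/usr/bin/python3'   # no match
```

Note the third case: scripts that start with #!/usr/bin/env bash are shell scripts too, but neither regular expression in the rule covers them.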

If you add this rule to rules.yar and run it over /usr/bin again, you’ll see that the IsShellScript rule is now picking up a lot of files that it didn’t see before:

yara rules.yar /usr/bin | grep IsShellScript

# Might print out something like:
#
# IsShellScript /usr/bin/fgrep
# IsShellScript /usr/bin/msf-exe2vbs
# IsShellScript /usr/bin/xfce4-popup-windowmenu
# IsShellScript /usr/bin/msf-msf_irb_shell
# ...

Extracting the bytes of a function

Here’s an example where I extract the assembly code of the function eio_init from a payload generated by msfvenom. objdump shows me the following information about this function:

0000000000007b9c <eio_init>:
    7b9c:       48 83 ec 08             sub    $0x8,%rsp
    7ba0:       48 89 3d 91 0f 2d 00    mov    %rdi,0x2d0f91(%rip)        # 2d8b38 <eio_want_poll_cb>
    7ba7:       48 8d 3d 22 c7 2d 00    lea    0x2dc722(%rip),%rdi        # 2e42d0 <eio_pool+0x170>
    7bae:       48 89 35 7b 0f 2d 00    mov    %rsi,0x2d0f7b(%rip)        # 2d8b30 <eio_done_poll_cb>
    7bb5:       31 f6                   xor    %esi,%esi

    # skipping a bunch of lines...

    7c5c:       48 c7 05 5d c6 2d 00    movq   $0x0,0x2dc65d(%rip)        # 2e42c4 <eio_pool+0x164>
    7c63:       00 00 00 00 
    7c67:       c7 05 5b c6 2d 00 00    movl   $0x0,0x2dc65b(%rip)        # 2e42cc <eio_pool+0x16c>
    7c6e:       00 00 00 
    7c71:       5a                      pop    %rdx
    7c72:       c3                      ret

0000000000007c73 <timers_reschedule>:
    7c73:       8b 8f bc 01 00 00       mov    0x1bc(%rdi),%ecx
    7c79:       31 c0                   xor    %eax,%eax
    ...

This function starts at byte 0x7b9c and ends at byte 0x7c73. Therefore I start by setting the following two variables in my terminal:

START=$((0x7b9c))
END=$((0x7c73))

Now I use the program xxd to produce a hex dump of the file. Then I pipe this output to the fold and tr programs so that it’s formatted in a way that will make it easy to use in my YARA rule:

# fold -w 2 groups characters into groups of 2
# tr '\n' ' ' replaces all newlines with spaces
xxd -u -p -s $START -l $(($END-$START)) my_sample \
  | fold -w 2 | tr '\n' ' '

(You can check out man xxd to see what each of the flags to xxd means, if you’re curious.) In my case when I ran this command, I got the following:

$ START=$((0x7b9c))
$ END=$((0x7c73))
$ xxd -u -p -s $START -l $(($END-$START)) tmp.V35epF7Hkd \
    | fold -w 2 | tr '\n' ' '
48 83 EC 08 ... (many bytes later) ... 00 00 5A C3
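
If xxd isn't available on your machine, od can do roughly the same job. The sketch below runs on a tiny made-up file rather than a real payload, so that you can check the offset arithmetic by eye. The file contents and the offsets 0x4 and 0xa are invented for the demonstration; for a real sample you would use the addresses from objdump, as above.

```shell
# Stand-in "binary": 4 filler bytes, 6 "function" bytes, 2 filler bytes.
demo_bin=$(mktemp)
printf '\000\000\000\000\110\203\354\010\132\303\000\000' > "$demo_bin"

START=$((0x4))          # offset of the first byte of the function
END=$((0xa))            # offset of the first byte after the function
LEN=$((END - START))    # number of bytes to extract

# -j skips to the start offset, -N limits the length, -tx1 prints hex bytes.
hex=$(od -An -v -tx1 -j "$START" -N "$LEN" "$demo_bin")

# Unquoted expansion collapses od's padding and newlines into single spaces.
func_bytes=$(echo $hex | tr 'a-f' 'A-F')
echo "$func_bytes"
```

This prints 48 83 EC 08 5A C3, which is the six-byte "function" in the middle of the file, ready to paste between the braces of a YARA hex string.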

Here’s an example of a YARA rule that checks for the existence of this function in a binary. Note that for your actual YARA rule for Problem 2, you should check for the existence of at least two different functions.

rule Is_Msf_Payload_Function {
  meta:
    description = "linux/x64/meterpreter_reverse_http - functions"

  strings:
    // ELF header
    $elf = "\x7fELF"

    // eio_init
    $f1 = {
      48 83 EC 08 48 89 3D 91 0F 2D 00 48 8D 3D 22
      C7 2D 00 48 89 35 7B 0F 2D 00 31 F6 E8 48 54
      /* a few lines later... */
      5D C6 2D 00 00 00 00 00 C7 05 5B C6 2D 00 00
      00 00 00 5A C3 
    }

  condition:
    // You don't strictly need to check whether the start of the
    // file begins with the ELF file header, but in practice it
    // can be a good idea for efficiency's sake
    $elf at 0 and $f1
}

Additional references

Your main reference for how to write YARA rules should probably be the YARA documentation:

https://yara.readthedocs.io/en/stable/writingrules.html

If you want to see some real-world examples of YARA rules, here are some GitHub repositories you can check out:

  • awesome-yara: this repository has a list of many different companies and projects using YARA, including links to YARA rulesets.

  • Yara-Rules/rules: this is a massive repository with rules for many different malware families.

  • Neo23x0/signature-base: another repository of many different YARA rules.


  1. Some people might be pedantic and claim that msfvenom is a payload generation tool rather than a malware generation tool. My personal definition of malware is broad enough that I would define an msfvenom payload as malware, but to each their own. In any case, it’s the kind of tool that defenders want to create detections for. ↩︎

  2. In the real world, malware samples don’t typically have debugging symbols embedded in them. However, you can still use a reverse-engineering tool like Ghidra or Binary Ninja to identify individual functions and use the approach we’re taking in this problem. ↩︎

  3. I threw this together for a competition a little while back. tardis is what’s commonly referred to as an executable packer, although in this case it also self-encrypts. Incidentally, the memfd_create + execveat method it uses is fun (and “easy” enough that you can quickly write your own packer based on it if you’re under time constraints), but it isn’t that sophisticated, at least as far as these techniques go. You can find out more about similar methods by looking up reflective code loading (MITRE ATT&CK T1620; DS0009). ↩︎

  4. The Windows equivalent to ELF is the Portable Executable (PE) format. If you’re interested in obscure Linux features, the Linux kernel supports something called binfmt_misc which allows Linux to run other kinds of executable files (like PE). If you’ve ever launched a Windows program directly on a Linux machine through Wine or Proton, you may have used binfmt_misc. ↩︎

  5. It’s actually possible to compile these rules into a special binary format, so that YARA doesn’t have to preprocess them every time you run it. In the real world this can be a lot more efficient. To compile the rules, you would run yarac rules.yar rules.yrc, and then run YARA as yara -C rules.yrc /path/to/directory. ↩︎