Saturday, November 13, 2010

Linux Forensics: Pattern Matching wh Grep and Related Tools

Pattern matching is locating a given sequence within a pool of information. Everyone who has used Google knows in essence what this is and the importance of refining search terms to weed out unnecessary information from the vast sums available on the Internet. This analogy is applicable to forensic investigations involving digital evidence; it is desirable to avoid the clutter of unwanted information. The benefits of pattern matching are two in number: to increase productivity and the likelihood of finding desired information. A synopsis of regular expressions and an exploration of their importance and efficacy regarding those ends follows; their use is applied with tools common to most GNU/Linux systems. Ancillary topics include network forensic tools and scripting, the latter of which seeks to provide analogous functions between the tools discussed and competing forensic software.

Introducing Pattern Matching:

A term intertwined with pattern matching is 'regular expressions'. They're synonyms in essence, with the former denoting the action of locating a desired occurrence in a larger data set, and the latter denoting the language by which this is often accomplished.I II Another, potentially inaccurate, synonym for pattern matching and regular expressions is “grep.” The word “grep” goes back to Unix in which editors like ed which phrased search and print functions like g/re/p, wherein “re” is the desired search pattern, and it would print the result to standard output.III Forward closer to the present day, and grep is less used in such a specific context; it now means approximately to find a given pattern. Specifically, grep is one of many programs that use regular expressions (the language of pattern matching). Alternatively, it is oft used as a verb to connote this action. This paper will make liberal use of the word in the spirit of grep's use in colloquial English.

Over time regular expressions became diversified into a multitude of different camps, of which about a dozen are reasonably popular at the present day. Some are as followsIV:

It is important to note that since these were developed somewhat independently, one should not trust on the fact that regular expressions for one tool will work with another, unless said tool is explicit in stating the standard being used. For instance, FTK and EnCase, use syntax similar to Perl. Without such knowledge, one may assume a pattern in one (grep with BRE syntax) would apply to the other, and evidence may be passed over because of such an error.V Though set standards for regular expressions exist,VI VII derivations from a given standard to incorporate aspects from other standards or to add additional functionality may be present and the lack of such should not be assumed barring the explicit declaration of software providers that given tools conform strictly to set standards.

A simple example of such a difference between different regular expression standards would be the pattern [a-z]\{3\} using the Perl and POSIX BRE engines. The POSIX BRE engine would match a string like “abc”, while the Perl engine would match something like “b{3}” literally. This is one of many differences between the engines that are available—because of this, it can be helpful, at least initially, to focus primarily upon one style of regular expressions, adjusting them when necessary, rather than attempting to explore the nuances of each in turn.VIII

Perl-style syntax allows the search of non-printable characters.IX Secondly, support for Perl regex is widespread, probably more so than any other regex engine. The GNU grep utility discussed in a later section has a -P switch signifying Perl syntax for the regular expression, saving the frustration of dealing with an entirely new syntax. Also, transitions from Perl syntax to POSIX BRE is both less likely to be necessary and perhaps easier than the opposite. The preponderance of tools explored in a later section of this paper have shared support for the Perl syntax as well. In the effort to make this paper easier to understand, non-Perl syntax will be eschewed when possible.X XI

Keeping this in mind, consider for a moment the regular expression syntax of the most popular engine at the moment, Perl.XII Perl is a scripting language, similar to PHP, most commonly tied to server-side scripting, dynamic web page generation, and a close relationship with MySQL.XIII PHP uses Perl syntax. On many websites, data is entered by the customer and sent to the server. If this data is not in the appropriate form when said data reaches the server, PHP can alter said data via three sets of functions: the preg group, the ereg group, and the mb_ereg group.XIV Of these, only the preg group will be discussed,XV and it is not even necessary to know either scripting language to comprehend said languages' regex capacity.

The function preg_match('/cat/',$string) would search for the phrase “cat” within the string $string. The single-quotes embody the regular expression, and the forward slashes act as delimiters:

echo $res;

The result to the terminal would be “1.”

A slightly more complex expression might be cat|dog, where the expression matches either the phrase “cat” or “dog.” This is a very useful feature called “alternation,” the use of which will be shown later for searching for a number of different patterns at once.

Applied uses of regular expressions:

In a multitude of books available on the subject of regular expressions, as the book progresses further towards the conclusion, the example expressions seemingly continue to advance further and further in complexity. This is an example of a complex expression:XVI


The expression captures dates, times, and datetimes, including leap years. While this is a very comprehensive pattern and excellent intellectual exercise, the most useful and helpful regular expressions may be much less complex.XVII Additionally, the more complex the pattern, the more likely it is to fail, both on account of user error and the restrictiveness of the search pattern. Keeping this in mind, it is more useful to start with simplistic patterns and refine towards more restrictive ones than vice versa.

Knowing how to tweak regular expressions is more valuable than having a seemingly infallible set of regular expressions to fall back on; despite the advanced features of matching synonyms and fuzzed spelling in FTK, there are instances in which these fail and custom-made patterns are necessary.

What follows are examples of composed regular expressions and the application of several expressions in a forensic context.XVIII As well,this paper branches out to include specific instances of the utilization of regular expressions and pertinent information surrounding the use of grep in the context of Linux-based forensic investigation.XIX There will obviously be far fewer example regular expressions than could have been incorporated into such a paper, being as the number of expressions possibly relevant is limited only by the imagination. These were primarily withheld on account of a desire for a reasonably terse discussion about regular expressions in particular instances—books have been written on the subject which might serve to better elucidate readers of different expressions of pertinence. The references for this paper serve as an excellent guides specifically for regular expressions, as well as accompanying topics such as procuring forensic images with Linux, for any issue deemed by readers to be covered in insufficient detail.XX

Introducing Grep:-

The grep tool&#039;s usefulness comes from its ability to sift through data sets to match a pattern, making it well suited for forensic work.XXI Two common (not necessarily forensic) uses are as follows:

ps -e | grep “ge”

This prints all processes (ps lists processes to standard output) that have “ge” in the process name.XXII

cat /var/log/messages | grep "fail"

Prints the file /var/log/messages to standard output. This however is redirected with the &#039;|&#039; (pipe) as standard input to the grep program. Grep prints out the lines matching the pattern “fail.”

Grep can be a capable tool in an examiner&#039;s toolkit, especially if live analysis is desired on a Linux system. Since grep is very likely already present, it may as well be used.XXIII Exploring the implications of live analysis is beyond the scope of this paper, but note that using grep on a machine on which it already exists would likely alter little as opposed to the introduction of novel programs to a system.XXIV

Going back to the examples of grep&#039;s usage above, the pipe operator is frequently used; the pipe symbol signals the shell to direct the standard output of the first command and use such as the standard input of the second. Knowledge of standard streams/file descriptors is required to understand the full implications of this. Most of the requisite understanding of such can be gathered from online sources.XXV

Concerning file descriptors, grep&#039;s output is easily redirected to a file for later review.XXVI Frequently in examining a case, the output would be better read to a file. This is easily done, as shown:

grep “greed” ./* > file 2> err

The &#039;>&#039; symbol redirects this data to a file for subsequent examination. The &#039;2>&#039; directs error messages (e.g. “Warning: recursive directory loop”) to a different file. If you do not care about the errors at all, direct 2 to /dev/null. Many errors are helpful in discerning why a particular search is not working as expected, but it is possible as has been illustrated to separate error messages from ordinary output, both of which are, by default, written to the terminal.

Another terminal trick is as follows:

grep “greed” ./* &

After this, pressing enter will return the user to a command prompt. It is possible via such to run multiple searches at the same time (it is recommended to combine this with redirection to a file). Typing “fg” will bring this background job to the foreground once again. This assumes the use of Bash; for other shells consult the documentation for similar functionality.

Concerning the topic of the three major forms of grep--grep, egrep, and grep -P—the last will be and should be used most frequently. The reasons for this are several. First, grep by default uses POSIX BRE syntax, which varies significantly from grep -P in that special characters must be escaped. This ensures for more cross-compatibility between regular expressions composed on the Linux command line and tools such as FTK. Next both grep and egrep do not support searching for non-printable ASCII characters such as spaces via \x20. Lastly, the selection of the Perl syntax with grep allows for alternation, which is supported under egrep as well but avoids the cross-compatibility issues.

Building expressions:

The following illustrates some simple searches with grep using patterns that may be forensically pertinent. Worth mentioning is that it may be helpful to experiment with expressions as opposed to simply reading of them. In EnCase, you may utilize the keyword tester (available in the tab for keywords when you make a new keyword).XXVII The following examples shall be formatted for the grep utility bundled with many Linux distributions—downloadable for no cost from many websites.XXVIII For the most part these examples may even be done via the use of a Live distribution—a bootable cd/dvd. The Bash shell is assumed.XXIX

The following grep will capture all jpeg photos in the current directory:

grep -P “^\xFF\xD8\xFF” ./*

The -P switch tells the grep program to use Perl syntax, followed by the pattern of hexadecimal characters (using the anchor &#039;^&#039;, notably), and then the search path, which is all files in the current working directory. It is worth mentioning that something to this effect is done with forensic software that categorizes files via signature values—this is done via pattern matching as well.

Exif metadata in a forensic investigation may provide interesting and possibly crucial data pertinent to an investigation and serves a good example for something easily locatable with regular expressions. Typical attributes present in Exif metadata include camera make and model, date and time information, camera settings, picture thumbnail (oft utilized for display on a camera screen).

Some new, high-end camera models actually incorporate a feature called geolocation, which tags photos with information about the locality of the picture.

Exif metadata is typically distinguishable from a typical picture by ASCII text subsequent to the file header.XXX With jpeg files, a regular expression can be constructed to determine which files may contain Exif metadata and which don&#039;t:XXXI

grep -P "^.{6,30}Exif" ./

FTK and EnCase do not contain the capability to sort images based on this determinant.XXXII

Assuming a series of files are found pertinent to a given crime or circumstance, this may lend investigators the cause to search for and seize digital equipment not specified or justified in an initial search warrant.XXXIII

The following expression matches a large number of email addressesXXXIV:


A sample of grabbing an IP address with pertinent limitations: XXXV


An alternative form of this without the limitations of each octet ranging from 0-255, decimal, might be found in the following:


This would match IP addresses, but not have the added benefit of weeding out IP addresses like 400.600.800.900, which are impossible. Also, the &#039;\b&#039; word boundaries will not work if there is a larger string within which an apparent IP was found. E.g. 123.456. It will match on this; one solution would be to do something like this:

grep -rP “[^\d\.]\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}[^\d\.]” ./*

The following grep search uses the /dev/ entry and treats the entire device (in my case a partition on a USB disk) similarly to that of a single file. Thus, such could be utilized to comb through either deleted files or file slack:

sudo grep -abP "hiddendata\!" /dev/sdb3

The -b switch will print out a byte offset. In this case it&#039;s very useful to have being as its a whole partition to sort through.

This grep search was somewhat problematic on account of a possible bug with the -P switch, labeled as “experimental” in the man page of grep.XXXVI It serves as an example of the caution needed when testing expressions.

grep -P “baked(?!beans)” ./wordlist

To solve this issue, any of the following worked:XXXVII

grep -P “baked” ./wordlist | grep -v “bakedbeans”
grep -P &#039;baked(?!beans)&#039; ./wordlist
x=&#039;!beans&#039;; grep -P "baked(?${x})" ./wordlist

Keyword searches:

Regular expressions in any given case need to be flexibly adapted to fit the needs of the investigation at hand.XXXVIII An example keyword search might be as follows:XXXIX

grep -Pr "(torrent)|(h33t)|(tpb)|(thepiratebay)|(demonoid)|(mininova)|(waffles)|(what\.cd)" ./

This might be an example of a search conducted on an individual suspected of software piracy. The search terms, separated by alternations (the pipe &#039;|&#039; symbol), are names of common keywords pertaining to torrents, common file sharing tools notably not illegal in and of themselves, but commonly abused avenue for the sharing of illicit warez. The keywords can and should be adjusted pending the circumstances of the case.

The most will be said by far about this sort of pattern, as it is both powerful and flexible. The basic idea is to separate desired patterns in between alternations, so a match of any result will be seen. There is no feasible limit to the number of terms that may be searched for. In the effort to provide a means for the quicker development of searches using a large number of keywords, here is the source of a small php script designed to be run from the command line to hopefully facilitate the process:


$changed=str_replace(",","|", $data);
$changed="\"".$changed."\""." ";

$grepstring="grep -Prnb ";
if($argv[3] != NULL)
echo "\n".$g_query." > $argv[3]";
echo "\n".$g_query;

Where the basic syntax is “[scriptname.php] [inputfile] [searchlocation] [outputfile]”.

Consider this example:

php myscript.php input / outputfile.

This would run the script &#039;myscript.php&#039;, using &#039;input&#039; as the input file, searching through the directory &#039;/&#039;, and using output as the output file for redirection. The actual output of the script would be as follows:

grep -Pr "torrent|h33t|tpb|thepiratebay|demonoid|mininova|waffles|what\.cd" / > outputfile

For the input file, simply make a comma-separated file of the keywords. This sort of script is simple, and not perfect, but it works for reducing the workload on large or frequently used keyword-search type grepping. It should be mentioned that grep without the -P switch can do this with a newline-separated file, specified with the -f switch. The Perl syntax (-P switch) doesn&#039;t allow for this, however, necessitating the php script to shorten.XL This was tested on sets of input keywords as large as 1411 different alternations. Regarding the speed differences between a search of 1411 alternations and one with many fewer, the speed differences were 0.0086250 seconds per alternation for a search with some ten alternations, and .004123317 seconds per alternation with a search with 1411 alternations.XLI While speed concerns are not a primary aspect of this paper, these preliminary benchmarks seem to indicate grep&#039;s efficiency in handling large numbers of alternations.XLII The script could also append to a log-file of grep expressions.

It is not ideal, much could be added and changed.XLIII More special characters could be escaped in the same way that the &#039;.&#039; character is already.

Grep and Packet Sniffing:

Grep with redirection (recall the discussion of standard streams) can be useful for several applications. The first reason, that has already been mentioned, is that the &#039;>&#039; character can be appended to a grep command to write the output to a file. The &#039;>>&#039; character can be used to append a later search onto the end of an earlier one.

Another viable use of grep could be to combine packet sniffing with a grep of the data. The command tcpdump is a tool also commonly found by default with Linux systems—no additional software is typically required—and this tool lets a user elevated privileges (typically) put an interface into promiscuous mode, looking at all of the traffic as opposed to only the traffic that is destined for the host.

Detailed information on tcpdump can be found on the man pages. Here is an example that will sniff payload data and write the the data to a file (called “data”):

sudo tcpdump -vvv -s0 -wdata

The &#039;-vvv&#039; switch controls the level of verbosity. On a typical DSL line running at 1.5 Mbps, the traffic generated by even a very short session of sniffing can often reach many thousands of packets (by very short I mean a few seconds). After dumping an adequate amount of traffic, Ctrl-C will stop the sniffing and return you to the prompt. Grep can then be used for searching through the captured data, as follows:

grep -aP "" ./data

The -a switch here is used to tell grep to treat the file as text, and print out matching lines. Data captured in this way is frequently marked as binary data, in which case grep will not print out matching lines by default. The pattern might be used in an instance wherein a person has been suspected of plotting a crime (likely murder or an analogous crime in this case). This is quite rudimentary and only should be used as an example; a real case should account for permutations and synonyms of search keywords.

The method has significant limitations, the primary one being that tcpdump merely dumps data, it does not have built-in functionality to decode data. Examining tcpdump&#039;s output will reveal data passed from source to destination and vice versa without any concern for whether or not such a format is in human-readable form.

Probably the most desired traffic is going to be web traffic—oftentimes traffic is left essentially out in the open for easy sniffing, often even with somewhat sensitive information being passed.XLIV The headers can reveal whether or not traffic destined from a given destination will be privy to easy observation via the use of tcpdump or not. Take the following two examples:

Encoded data:

HTTP/1.1 200 OK..Cache-Control: private, max-age=0..Date: Fri, 19 Feb 2010 05:12:16 GMT..Expires: -1..Content-Type: text/html; charset=UTF-8..Set-Cookie: SS=Q0=bmlnZw; path=/search..Server: gws..Transfer-Encoding: chunked..Content-Encoding: gzip

This was generated with a client that attempted a google search. Google gzips traffic, so searching for plaintext keywords in a grep will be fruitless for the payload of HTTP packets. Presumably this is done to save bandwidth from unnecessary traffic. Contrast such with the output of this header:


HTTP/1.1 200 OK..Date: Fri, 19 Feb 2010 05:11:39 GMT..Server: Apache/2.2.10 (Fedora)..Last-Modified: Thu, 18 Feb 2010 14:12:47 GMT..ETag: "4c07c-213-47fe08fb135c0"..Accept-Ranges: bytes..Content-Length: 531..Connection: close..Content-Type: text/html;

This would be an example of traffic to a site that does not employ gzipped encoding. The use of tcpdump with such a site would suffice.

It is important to note that gzipped encoding is not synonymous with encryption—tcpdump simply lacks the capability of dumping traffic in a form other than that which is passed along the wire.
If decoding traffic is necessary, tshark, the command-line counterpart to Wireshark, is a viable alternative. The following form of the command dumps fully decoded packets to the file “data2”:XLV

sudo tshark -V -s0 > data2

Wherein searches would be performed against the &#039;data2&#039; file. To users familiar with grep this can be significantly more effective than using Wireshark to accomplish the same thing.

To reiterate the point concerning gzipped encoding, tcpdump suffices when circumstances do not require dumping the full contents of packets. When full packets are required—e.g. to rebuild what a suspect was basically presented with at a given page—tshark is a much better choice.XLVI Tshark is also preferable to tcpdump for grepping network traffic for aforementioned reasons. Though neither will pick up encrypted traffic, tshark is able to decompress encoded traffic, allowing the use of grep.

In instances where web traffic is desired, often the desired output will be located in a section “Line-based text data: text/html,” so using grep is not necessarily mandatory, but the -b switch with a quick grep search may be helpful in locating which section of the file deserves examination. Another method to cull data would be to specify a capture filter, such as “-f “port 80””.XLVII

It&#039;s left as an open question as to the specific instances wherein network forensics may come into play. Often, since warrants are served on crimes long since committed, it&#039;s likely that an investigator wouldn&#039;t need to sniff data off the wire whatsoever. It is a useful tool regardless, if not for the average investigator, then for the systems and network administrators.XLVIII

The Find command:

The find program can be used to search for specific types of files. The following searches for SQLite files (as identified with their common extension):

find /home/ -name “*.sqlite”

SQLite files can often contain forensically pertinent information; one notable mention is that Firefox stores a treasure trove of information in SQLite databases. Some of this information includes downloads, form history, bookmarks and browsing history. By default this is stored under the .mozilla folder in the user&#039;s home directory. The dot signifies that it&#039;s hidden, it won&#039;t show up to a ls unless the the -a switch is applied to ls when looking at a home directory through a live shell.XLIX

The following is a more complex example of find&#039;s capabilities.

find /home/ -type f -mtime -1 -name "*.exe"

In turn, the switches dictate to print those files (type f) modified (the &#039;m&#039; in &#039;mtime&#039;) up to a day ago (-1) whose name ends in “.exe”. On a side note, files with .exe extensions are a rarity on Linux filesystems, and may even be a cause for suspicion in some instances. They are, however, becoming more common with the popularity of wine.L

Here is another advanced form of find:

find . -name "*.png" -exec grep -lPa "^\x89\x50\x4E\x47\x0D\0A\x1A\x0A" {} \;

This time find is working on finding files in the current working directory with the apparent extension of “.png” and grep is testing to see if the files have .png file signatures.LI LII

One of the main advantages to using find is the easy of searching through additional levels of data such as file names.LIII The following command finds files with an apparent extension of .jpg in the /home directory:

find /home/ -name "*.jpg"

This search is recursive by default. Notice that in this case, the “.” symbol should be taken literally and not as a regex token for any character.

Find can separate who owns what (by owner or group):

find ./ -user root
find ./ -group root

Print results with a stipulation of time (in this case, &#039;-mmin&#039; means anything modified less than thirty minutes ago):

find /var -mmin -30

Finds files with permissions set to 007 (does not match 657, for instance)LIV:

find ./ -xdev -type f -perm 007

This finds files which are r-w-x for world (the other bits do not matter):LV

find ./ -xdev -type f -perm -007

Finally, find works well with xargs:

find /home/toor -name "*.txt" | xargs grep -i "john doe" 2> /dev/null

The &#039;2>&#039; directs stream 2 (stderr) to /dev/null.LVI

Database and Directory Service Text searches:

The two examples that follow will be searching through a directory service (openLDAP) and a MySQL database; these are two specific examples of an almost infinite amount of permutations of specific circumstances that dictate investigating certain things. For example, a case involving suspected child pornography would have a definite emphasis on multimedia-based searches. Investigating a cracker would involve keywords surrounding such a subculture, and an investigation into piracy would involve searches tailored for such. These examples serve as a guide for how to treat cases with unique circumstances—patterns come secondary to knowing the ins and outs of how these services function.

Forensics of this sort are broadly classified as “database forensics,” and deserve a significant amount of dedication to fully appreciate what such a term entails. Books have been written on this topic, rightfully so. This paper is merely the tip of the iceberg about what may be said concerning database forensics—those wishing for more may consult the cited references.

Grep and find together can uncover a significant amount about a database or directory service. This becomes increasingly helpful with an increase in the amount of data. MySQL will be discussed first.

MySQL has differing storage engines that determine whether or not the following searches will even work. Differing formats will require differing searches. The one format that will be considered is the MyISAM storage engine and the .MYD and .MYI files. This format is purely chosen as a suitable example of finding data related to a MySQL database. Other engines and files types may be as follows, depending on the circumstances: .MRG (MERGE), .ibd (indexes and data for InnoDB), .CSM and .CSV (comma-separated), and .ARZ and .ARM (ARCHIVE).LVII

A simple command such as the following will suffice to track down most locations of pertinence to finding database files:

sudo find / -name “*.MYD”

It may be necessary to run this as a privileged user. Locations in which MySQL database files are held are frequently under the ownership of “mysql,” “mysql” group, and as such will not be accessible to non-privileged users. If this concept seems foreign, find information pertaining to file vs. directory permissions.LVIII Alternatively, one could alter the permissions of files and directories recursively:

chmod -R o+r,g+r,a+r ./dir

This would not be ideal, as it would alter finds based on file/directory permissions. Better to run all searches on restricted directories as a super-user.

There are three main types of files related to a single table in a database: .frm, .MYD, and .MYI. Typically these files are prefixed with the name of the table, such as table1.frm. The main value (to the human eye) of .frm files is that they list the column names of a table. .MYI files are indexes and do not allow for ease of grepping data therein (mostly non-ASCII characters). .MYD files are the main table files; they contain all the data held in a given table.

MySQL tables are frequently built using batch mode scripts that take administrator input in creating the table and the mysql program reads it in as if it were typed on the command line. One possible avenue for tracking these scripts down (there is no definitive trait of their filename or extension, though .sql would probably be something to try if possible) is to search for data likely to be present in such a file:

grep -Pri "CREATE (TABLE|DATABASE)" /home/

If table backups are desired, something to the effect of this would suffice:

grep -Pr "\-\- MySQL dump" /home/

This captures the typical output of the utility mysqldump, a common tool to dump a batch script for the backup of a database.

These aforementioned searches basically allow for a determination of whether or not a database exists, and if this is so, recovering perhaps some of the data. More complex applications might be recovering log files of transactions to recover and/or reverse altered/deleted fields.LIX This is beyond the scope of this paper as simple search tools cannot provide the sort of functionality by which to do this.

Directory Services:

Directory services are not the usual suspects for a forensic investigation, but given their sparse mention in the literature of the craft, it is useful to discuss such here to serve as an example of pattern matching for an unusual target.LX The directory service employed herein is openLDAP with slapd; other services differ but overall the commonalities should outweigh these.LXI

Assuming nothing is known about a directory service beyond the fact that it exists (perhaps not even that) on seized server media, likely the most fruitful search would be to use find to segregate any files with an extension of .ldif. LDAP Data Interchange Format (LDIF) files are commonly used as a form by which to load new entries into a directory via the use of a tool such as ldapadd. With openldap the configuration files are typically held (in Debian) under /etc/ldap/. The databases themselves are stored in a binary format elsewhere, under /var/lib/ldap. Files and logs stored in these locations are about as readable as a binary executable with strings of code interlaced with ASCII text.

Grep makes short work of locating specific entries within files once these files are discovered. Barring prohibitive file/directory permissions locating a known entry is no more difficult than including a keyword as a pattern, such as the following:LXII

grep "ou=xyz,dc=site,dc=com" ./input.ldif

A grep of the distinguished name typically works, as the distinguished name is written often in plain-text in files associated with ldap services.

grep -r “dc=site,dc=com” /var/lib/

And the following will capture files with a particular pattern and copy matches to a particular destination:

sudo grep -lr "dc=home,dc=com" /var/lib/ldap/ | xargs sudo cp -t /home/user/Desktop LXIII

sudo find /etc/ -name "*.ldif" | xargs sudo cp -t /home/user/Desktop/ldiffiles/

OpenLDAP and LDAP (and MySQL for that matter) are not commonly employed by the average user; there is not much documentation available for directory service forensics—until such a time wherein they are more commonly used and encountered in forensic investigations, directory service forensics is mostly a novelty; however, these principles are applicable to other, more viable, forensic applications.

Other Instances of Grep:

Though grep as a word is primarily denotational of the program, it has as well come to connote the general actions of finding information. Grep is as much a noun as it is a verb; additionally, many related programs have adopted naming conventions which are amalgamations of this word and that adhere to the original program&#039;s spirit.

A few noteworthy programs are as follows:LXIV

ngrep: network grep, searches network trafficLXV
sgrep: searches for structured patterns using region expressions
pcregrep: grep that uses PCRELXVI
ext3grep: grep-like program designed to assist recovering data from EXT3 filesystems
agrep: this program stands for “approximate-grep,” and allows for a number of errors in the search pattern (fuzzy spelling)
beagle: provides indexing featuresLXVII

Foremost: Carving and Sorting

Until this point, grep has been used to sort through files allocated on a disk. Deleted or otherwise unallocated files have been neglected. Foremost is a tool that allows files of numerous sorts to automatically and effortlessly be exported from a dd disk image into another folder for easy viewing, separated by file signatures. Foremost&#039;s invocation in its simplest form is seen in the following:

foremost -i image.dd -o image.dd.folder

After processing has completed, changing directory into the image.dd.folder will show folders separating files by file type. If the standard signatures are insufficient for a particular sort of file, additional ones may be added in the /etc/foremost.conf file. Help is displayed in typical command-line fashion, with the -h switch.LXVIII

Simple Forensics Scripts:

Scripting common searches into an executable file provides an easy method for quickly processing media in a controllable fashion. In Bash on Linux (as well as other shells of course) typed commands can then be strung together an ran in analogous fashion to a program, wherein each line is essentially equivalent to a typed line on the terminal.

The following code is an example that accomplishes some basic forensics tasks.LXIX


echo "Example forensic script. Copies .png and .jpg files to specified directory. Verifies file signatures. Location to be searched passed at command argument 1."

read pause

echo "working"

# find files with a .png extension and see if they contain a png file signature.
find $1 -name "*.png" -exec grep -Pl "^\x89\x50\x4e\x47" &#039;{}&#039; \; > ./picslist
# do the same to apparent jpg files. Append matches to the file picslist
find $1 \( -name "*.jpg" -o -name "*.jpeg" \) -exec grep -Pl "^\xFF\xD8\xFF" &#039;{}&#039; \; >> ./picslist

# grep for patterns in the location specified by $1 (command argument 1), output results to a file.
grep -Pr "1337.haX0Rz||murder" $1 > keyword_results

# find files modified within 10 daysand write results to a file.
find $1 -mtime -10 > modified_file_list

echo "complete"
# process results for display in browser via php script "sort.php"
php sort.php > test
firefox test
#optionally, remove temporary files
#rm picslist
#rm test

The accompanying sort.php file:


//picslist has list of all picture paths.
$file = "./picslist";
$handle = fopen($file, &#039;r&#039;);
//$data = fread($handle, filesize($file));
$file2 = "./modified_file_list";
$handle2 = fopen($file2, &#039;r&#039;);

echo "<html>";
//process each path and print link to picture
while (fgets($handle) !== FALSE)
$data = fgets($handle);

echo "<a href=\"".$data."\"> Path: ".$data."</a><br></br>";

echo "<img src=\"".$data."\" height=\"100\" width=\"100\"></img><br></br>";

echo "<h1>Modified file listing:</h1><br></br>";
//this code does as above but with links to each of the modified files
while (fgets($handle2) !== FALSE)
echo "<a href=\"".$data2."\"> Path: ".$data2."</a><br></br>";

echo "</html>";


The power of scripting comes from automating anything that would typically be done by hand otherwise. Another easily automated task might be commands to make a forensic image:


dd if=$1 | split -d -b 700m - image.

cat image.* >> $2

This would take a specified device, image it in 700 MB chunks (unnecessary but helpful for burning to discs), and then concatenates the chunks into a single full image.LXX

The following would be a brief continuation of the former, making an image, hashing the result for verification, and mounting the resultant image to a folder and doing a grep search on it:


dd if=$1 | split -d -b 700m - image.

cat image.* >> $2

#dd if=$1 of=$2
cat image.* | md5sum > $2.split.md5
md5sum $2 > $2.md5

mkdir mounted

sudo mount $2 -o loop -oro ./mounted

grep -Prl "warez|piracy|torrents?" ./mounted/ > $2.grep.result

foremost -i $2 -o $2.output

nautilus $2.output

The $1 and $2 signify command arguments. After doing a chmod on the script file to allow its execution, typing in something to the effect of ./ command1 command2 runs the file “” in the current directory, with the “command1” and “command2” passed to the $1 and $2 in the script respectively.

This creates a raw dd image comparable to FTK Imager&#039;s raw image (the two resultant images can be verified to be the same). The image is mounted and then grep is used to search through the mounted image. Foremost runs after the grep search, and the folder is opened via nautilus for each viewing. Expansions/revisions upon this can and should be added per case requirements. Should circumstances necessitate compression, this can be accomplished with the likes of gzip, bzip2, tar, or similar utilities. Using gzip is as simple as “gzip [image name],” whereafter the image will be named [image name].gz when possible.LXXI

Linux as a forensics platform:

Hopefully by this point it has been shown that many aspects of forensic investigation can be done via the use of a no cost operating system, including imaging, file carving and exportation, keyword searching, and sorting by file types.LXXII Many versions of Linux can be had for no monetary cost, and the freedom to tweak and adjust aspects as needed are of a significant benefit especially in forensic investigations involving unique circumstances.LXXIII Proprietary firms that make and distribute forensic software are swayed principally by monetary concerns can conceivably leave investigators out to pasture if the latter&#039;s needs are not matched by the goals of the former. The power and control over open-source tools allows for modifications and advancements beyond the concerns of closed-source software.LXXIV Brian Carrier also argued that open source tools more effectively meet the criteria for forensic evidences&#039; admissibility per the “Daubert test.”LXXV

Linux however may have higher barriers to entry than does Windows, in which case it must be determined whether or not the costs of a windows system (and the accompanying Windows forensics tools) balance or are outweighed to the benefits of using Linux. This entry barrier is solely on a per-user basis given the preponderance of investigators primarily dealing with Windows.LXXVI

One of the criticisms of Linux involves mounting drives as read only. On face value, this can be accomplished easily with something such as the following:

mount -oro /dev/sdb3 /media/imaged/

However, a process called journal recovery with certain file-systems such as Ext3/4 and others may change the evidence. There is an option &#039;noload&#039; or &#039;loop&#039; that supposedly corrects this issue, but given the ease by which one may neglect to include it, and the ever-present concern of some unforeseen circumstance that might cause the kernel to write to the drive, it is prudent to use a hardware write-blocker.LXXVII LXXVIII

Another issue involves auto-mounted devices, such as USB drives and such. Typically, when these are plugged into most Linux systems, they are mounted without asking the user. Doing this with evidentiary media is a poor forensic practice in most circumstances. As mentioned, the best by far is to use a hardware write-blocker, but disabling processes that automount should work as well.LXXIX LXXX

An issue specific to grep is the lack of support for Unicode-16 and U-32, which shall become an increasingly large obstacle in proportion to the frequency of such encountered in investigations.LXXXI

There are some other criticisms to using Linux for investigations: Linux can&#039;t see the last sector on a device with an odd number of sectors.LXXXII But probably the most salient criticism of using Linux as the primary forensic medium for most is the higher barrier to entry given that you must learn a good deal of commands and how to navigate via a console instead of GUI-based tools. This is no longer fully convincing with tools such as Autopsy coming onto the market; though Autopsy lacks the flair of EnCase and FTK, it does many of the same things.LXXXIII The detriments need to be fully explored by any investigator desiring a transition from Windows to Linux forensics tools—one need be mindful that any different operating system will present different problems.

Despite these criticisms, benefits of Linux abound. The first is a greater familiarity with a different tool set. Linux is especially prevalent on high-end systems. Four-fifths of the world&#039;s supercomputers run LinuxLXXXIV, and live forensic analysis on one of these would be the worst possible time to acquaint one&#039;s self with the basics of grep. Knowing at least basic knowledge of Linux lends to a greater degree of competency with non-Windows OS, including Mac OS X, Solaris, and others.LXXXV LXXXVI LXXXVII

It would be remiss to neglect the costs of Linux versus proprietary alternatives. Due primarily to the relative easy by which forensics investigations may be conducted with a license of either EnCase or FTK, coupled with a demand for forensics investigators that would be otherwise estranged from the field in lieu of such a productLXXXVIII, a premium has been (perhaps rightfully so) charged for the use of their products in the form of hefty licensing fees. Though this paper only serves as a sliver of the material needed to match the intricacies of competing products, if a community effort were to materialize around forensically-oriented concerns, it is definitely conceivable that EnCase and FTK would have a competitor selling software at an extremely attractive price.LXXXIX Need has brought forth such software as GIMP (free alternative to Photoshop), OpenOffice (alternative to Microsoft Office), and thousands of others; analogous forensic software is less a fantasy than a probable future.XC

Lastly, it has to be asked whether the field of forensics is benefited by a pair of relatively monopolistic businesses. Though this is enough to ensure healthy competition to further improve one product over said product&#039;s competition, any enthusiastic programmer wishing to contribute to the effort is denied the opportunity to do so by the very nature of proprietary code. The arrangement at present primarily benefits the producers and not the users of forensic software.

Ultimately, the decision over which of these two competitors is better is left up to the reader&#039;s discretion. In the future, hopefully GNU/Linux will become more a competitor to Windows as a platform of computer forensic investigation. Regardless of whether or not Linux gains significant market share in forensic software, additional option will increase the pressure to optimize software with additional features to benefit end-users.

Grep command glossary:XCI XCII

grep : program that prints lines matching a pattern. Equivalent to grep -G, for basic regular expressions (i.e. BRE)
Egrep : &#039;extended grep&#039;, equivalent to grep -E
grep -P : grep using Perl syntax. Most uses of grep in this paper use grep -P.
-r : recursively search through folders.
-i : case insensitivity
-f : obtain patterns from a specified file (one per line)
-v : select non-matching lines (rarely used in this paper)
-c: output file and &#039;count&#039; the number of occurrences of the pattern
-a : treat all files as text. Use this to find data that may be hidden in binary files
-l : print name of file. Stops after first match.
-m [#]: stop reading a file after a certain number of matches
-n : Prints out the line number that matches the pattern
-A [#]: print # lines after a match
-B [#]: print # lines before a match
--exclude-dir=[DIRPATH]: exclude a directory. Useful for avoiding recursive loops.
-w : print all lines containing pattern as a word (the pattern &#039;eye&#039; would match &#039;eye&#039; but not &#039;eyelid&#039;)

Notes on Regex Symbols and Glossary: XCIII

Global: this term refers to an option by which multiple matches can be found in a given string/file. The tools mentioned in this guide are global by default. The opposite of this would stop after the first match.

Case sensitivity: determines whether or not a pattern such as &#039;google&#039; is matched in the data "gOOgle" or "GOOGLE" or not. With grep, the -i switch can enable case insensitivity, in which case the aforementioned example would match.

Extended: This is somewhat an ambiguous term. It can refer to ERE, extended regular expressions, as in POSIX ERE, or more generally, to a feature that ignores white space in the searched data.

Dotall: this determines whether the wildcard &#039;.&#039; will match newlines or not.

Multiline: most often pertinent in the scripting languages&#039; utilization of regular expressions, this determines the functionality of the anchors ^ and $, whether they are matched only by the start of the string and its end, or whether newlines will cause said anchors to match the start and end of each respective line.

Character classes:

. : matches any character
\w : matches any word character
\W : negation of \w
\d : matches any digit
\D : negation of \d
\s : matches a whitespace character
\S : negation of \S, any non-whitespace character

Character sets:
[\WxZ] : braces act as an OR statement, in which anything inside may occur for a single character. In said example, either \W, x, or z may be matched. May also be a range, such as [a-z], or a set of ranges, [a-z0-4]
[^abc] : matches a character that is not a, b, or c.

Special characters:

\t : tab
\r : carriage return
\n : new line/line break
\xAB : hex character (e.g. \x20 for a space, \x0A for a new line)

Characters which typically need to be escaped for literal match:

\, ., +, *, ?, ^, $, [, ], |, {, }, /, &#039;, #, (, )


\b : matches a word boundary, typically white space before and after words, or the start of a line
\B : negation of \b
^ : matches the start of a string*
$ : matches the end of a string*

*: discussed in this paper what precisely this entails


abc(?=afas): Lookahead. This would look for "afas" after the pattern "abc." "abc" would not be included in the result.

abc(?!afas): Negated lookahead. E.g. if afas is directly after abc, discard the result. XCIV

(?<=afas)abc: Lookbehind. Does the same as lookahead but looks before a given pattern. An example of this would be "afasabc". The lookaround pattern is not included.

(?<!afas)abc: Negated lookbehind. If &#039;afas&#039; precedes &#039;abc,&#039; discard the result.


? : makes the preceding character optional. Works on any token.
* : matches zero or more of the preceding token.
*?: matches zero or more. "Lazy" match, matches as few characters as possible
+ : Matches 1 or more of preceding token. Greedy, will match as much as possible.
+? : Matches zero or more. Alternative form of *?.
{3} : match preceding token exactly three times.
{10,12} : match preceding token 10-12 times.
{3-7}? : Match preceding 3-7 times. Lazy match, will match as little as possible.


(cat) : groups tokens together in a capture group.
(?:cat): groups tokens together, no capture group.

Capturing groups are a way of storing matched substrings that can be referenced later. These are mostly useful for scripting (e.g. with sed and other applications)—less so for searching a hard drive.XCV


| : the &#039;pipe&#039; character. Allows for the matching of groups. cat|dog matches &#039;cat&#039; or &#039;dog&#039; literally. To apply this within a larger expression, quotes may be used to separate groups. To match &#039;catog&#039; or &#039;cadog&#039;, the pattern ca(t|d)og would suffice.






V Regular expressions composed for tool should never be carted over to another without significant testing. Taking an EnCase regular expression keyword search of any significant complexity and using it on grep with POSIX BRE syntax would be disastrous. There may not even be a warning, and special characters would likely be taken literally. Evidence loss would be a likely consequence.


VII The PCRE manual, available via &#039;man pcre&#039;

VIII Another simple example of the various forms of regex is seen regarding delimitation. Expressions are often shown as “[abc]{3}”, [abc]{3}, and “/[abc]{3}/”, each of which may be correct or incorrect given the specific tool in use, even among a given standard (i.e. Perl regex). Because this paper principally deals with a few programs, many of these nuances are ameliorated, but their presence deserves mention.

IX This is not intended to imply that it is the only regex format to do so.

X Forensic article on Perl syntax in forensics:

XI Any notable exceptions, such as grep without the -P switch, use patterns that should be comparable with Perl, PCRE, and other formats.


XIII PHP has a number of functions that allow access and use of MySQL. PHP and MySQL do not by necessity need to be used together, and the use of one does not imply the other.


XV The reason for discussing PHP in lieu of Perl is solely due to authorial preference, due to the fact that though PHP uses PCRE, this is designed to mimic Perl syntax anyway.


XVII Most searches will probably be for keywords. Even assuming alternation, these are relatively simple to construct.

XVIII There will obviously be far fewer example regular expressions than could have been incorporated into such a paper, being as the number of expressions possibly relevant is limited only by the imagination. These were primarily withheld on account of a desire for a reasonably terse discussion about regular expressions in particular instances—books have been written on the subject which might serve to better elucidate readers of different expressions of pertinence.

XIX Henceforth I will mostly refer to GNU/Linux as Linux solely to conserve space and due to habit. See Free as in Freedom for a better understanding of this distinction.

XX Readers unfamiliar with grep and/or regex should see the glossary of terms and the synopsis of the grep manual at their representative sections.

XXI Grep doesn&#039;t search through free space/slack space unless you specify the /dev entry. Doing this is admittedly messy. For a tool that helps with this see:

XXII This search would be useless except on a live machine or for testing purposes. To stick to a consistent format, this will assume that searches are being done on a live system, as opposed to an imaged system.

XXIII Live analysis is beyond this paper&#039;s scope and should not be attempted without a full understanding of the risks involved.

XXIV Note that it would alter it somewhat. For instance, a history entry would be added to the .bash_history file of a live machine for each typed line in the Bash shell

XXV is one such helpful source


XXVII If one cannot afford EnCase, similar testing may be done with FTK or online at . Also mentionable is JGSoft products, especially Regex Buddy, which is extremely useful for developing regular expressions.

XXVIII See for one very popular distribution.

XXIX If these examples do not work, consider trying a downloadable live cd of Ubuntu Linux, on which these have been thoroughly tested. I am not familiar with the differences between Bash and other shells in depth.

XXX This is derived from experience and is not necessarily mandatory.

XXXI Due to the fact that I was unable to tease a clear answer concerning the precise location of EXIF metadata online, the broad range of 6-30 characters preceding its occurrence should suffice.

XXXII This feature was added to FTK 3.0 when “expand compound files” was checked in the preprocessing selection. I am still unaware of any such feature for EnCase.

XXXIII The reason being for this is that Exif metadata can be used to track down specific information tagged about the picture, such as the make and model of the camera. Frequently, these things are listed in readable format upon dumping the contents of a Exif-tagged image; utilities can parse out the less visible aspects, such as geolocation and timestamps.



XXXVI Fully, what the man page says is as follows : “-P, --perl-regexp experimental and grep -P may warn of unimplemented features. ”

XXXVII I&#039;m quite sure this is a bug. Perhaps one reason for labeling -P as “experimental” in the man pages.

XXXVIII A regular expression for credit cards would have little or nothing to do with crimes such as media piracy. Additionally, any investigator doing explicit searches for material not related to the warrant that justified a given seizure of assets may end up jeopardizing the admissibility of any evidence found therein. So if the reasons why serious investigators should know how to construct at least basic regular expression searches has not already become plainly evident, perhaps now it has. Searching for things not dictated by a search warrant endangers evidence admissibility.

XXXIX The quotes are actually optional for this expression.

XL This script is most helpful over repeated investigations. It can save a good deal of typing, depending upon the number of keywords, and since keywords are saved to a file, they&#039;re reusable.

XLI This was done on a small test file, and would likely change a good deal reflecting this variable. Again, this example is purely illustrative.

XLII The alternations used to get to such large amounts (i.e. 1400+) were repeated eventually. The best test would be to use all unique alternations, as grep may somehow parse out repeated (identical) alternations, but I don&#039;t see this reflected anywhere in the documentation.

XLIII If of value, my script can be used with or without attribution, and altered by anyone in any way

XLIV This continues to be a serious problem for innocent end-users, as well as potentially a huge boon for investigators, though the former is much more likely to realize it than the latter on account of a general lack of specific interest and procedures in network forensics, while sniffing is realized to the fullest by criminal elements hoping to find low-hanging fruit, network traffic transmitted in clear-text.

XLV An intriguing issue with this is that oftentimes long lines of source code (such as is expected in pages that do not separate lines frequently, but rather mash them together so as to partially obfuscate the reading of the source) are “[truncated]” under the section “Line-based text data:”. This issue appears to not easily be resolvable, and should be considered in cases where full fidelity source code is desired in sniffing.

XLVI Wireshark too is also excellent, and is perhaps even easier to deal with for users without an intimate knowledge of the console (it is much more popular, probably due primarily to the GUI). The primary reason why tshark is being discussed at the expense of the other is that in many circumstances a GUI will not be available; these are often eschewed for their unreliability on servers. Tshark will be the only alternative as Wireshark requires a GUI to operate. Installing additional programs on a live system is almost universally unacceptable in the typical forensic context. Assuming that it is permissible, wireshark may be installed by the command “sudo apt-get install wireshark” (on Ubuntu and Ubuntu based systems) or “yum install wireshark” (on Red Hat-based systems). These commands should resolve any dependencies as well. If in doubt install Wireshark on the non-subject system and sniff traffic via a hub.

XLVII Capture filters are undoubtedly one of the most important features of a sniffer. The example presented captures all traffic, but as such, the resultant file can quickly reach huge proportions. Capture filters help to disregard non-necessary data from being written to the output file.

XLVIII In-house forensics experts might frequently come into gathering evidence. The following details one such instance where they may be needed:

XLIX Hidden folders and other small facts are thrown in throughout this paper; though extraneous to the paper&#039;s primary focus, these things become crucial to the minority of readers I imagine could be ignorant of them.

L Wine being a software program allowing one to run Windows programs on Linux.

LI gives helpful information concerning the find command

LII The file signature for PNG images is taken from here:

LIII Commonly referred to as metadata, or data of data. A simple example of this would be to hash a file and change the file&#039;s name and verify it once again. The data is unchanged, and the hash remains the same. The metadata has been changed.

LIVInvolves the use of -xdev, useful for not crossing mount points. Beyond the scope of this paper, more on this is available here:

LV A better explaination of this:

LVI Much more could be said about this. Do a man on xargs to start. There are numerous resources available online as well.

LVII Taken from “Managing Tables and Indexes Part 1”



Fowler also wrote a book entitled SQL Server Forensic Analysis that deals with database forensics in depth

LX The typical directory service can be thought of as a white pages; specifically, directory services are utilized in instances where reading is given precedence to writing. Directory services are optimized databases of sorts for reading over writing. Many applications of directory services involve user accounts and information associated with such. Directory services employ the use of LDAP, also associated with single sign-on capabilities that allow a user to access disparate aspects (e.g. different areas that require authentication) of a set of systems without having to authenticate to each of them in turn.

LXI More specifically, the commonalities of LDAP syntax should carry over into other programs. File locations and other aspect of differing programs will be completely different.

LXII This exact search is exemplary only. Barring a very big ldif file, this search wouldn&#039;t be an effective use of one&#039;s time.

LXIII Good introduction to the xargs command.

LXIV These were taken from and is likely not a comprehensive list.

LXV This paper has demonstrated features roughly equivalent to this program.

LXVI Given that the -P switch is built into grep, the use of this tool was strategically neglected from the larger portion of this paper by reason of the greater popularity of the standard grep program.


LXVIII More information concerning foremost can be found via man foremost or at

LXIX As this is a forensics paper first, and a software development paper second, I fully expect my code to be unoptimized.

LXX I got the splitting of images in part from this article:


No comments:

Post a Comment

Do comment If you liked it...