Associative Arrays, Procedures, Strings and the http package

  1. When an associative array is passed to a procedure, it must use call by name, not call by value.

    A procedure using call by name must use the upvar command:

    Syntax: upvar ?level? varName localName
    Maps a variable from a higher level scope to the current scope.
    ?level? Optionally, the level at which this variable exists. May be a number up from local, or a #number down from global.
    varName The name of the variable in the other scope
    localName The name of the variable in the local scope

    The variables at a higher level in the procedure stack can be accessed with the upvar command.

    A simple example of a command using call by name is incr which can be implemented in Tcl like this:

    
    proc incr {varName} {
        upvar 1 $varName local
        set local [expr {$local + 1}]
    }
    set i 1
    incr i
    

    You can retrieve a list of an associative array's indices with the array names command.

    Syntax: array names arrayName ?pattern?
    Returns a list of the indices used in the array arrayName
    arrayName The name of the array.
    pattern If this option is present, array names will return only the indices that match the pattern. Otherwise, array names returns all the array indices.
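
    As a minimal sketch of how upvar and array names fit together, the procedure below totals the values in an array that is passed by name. The name sumArray, its optional pattern argument, and the price array are assumptions made up for this illustration.

```tcl
# Hypothetical helper: total the values of the array whose name is passed in.
proc sumArray {arrayName {pattern *}} {
  # Link the caller's array to the local name "arr"
  upvar 1 $arrayName arr
  set total 0
  # Visit only the indices that match the glob pattern
  foreach index [array names arr $pattern] {
    set total [expr {$total + $arr($index)}]
  }
  return $total
}

array set price {apple 3 apricot 5 banana 2}
puts [sumArray price]      ;# all indices: prints 10
puts [sumArray price a*]   ;# indices starting with "a": prints 8
```

    The procedure only reads the array, but because arr is linked to the caller's variable, a set on arr($index) would change the caller's array as well.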

    Write a procedure that will accept the name of an associative array and a value to add to each value in that array. This procedure does not need to return anything, since it will modify the array in place.

    Solution

  2. Associative arrays are useful for gathering related information.

    We can build a histogram of a data set with an array that has one index for each value being counted. The array element at that index holds the count for that value.

    We've been using various techniques to analyze this message log file.

    To characterize attacks and activity, we need histograms of what activity happens.

    The guts of a program to read the log file and report the number of messages from each activity looks like this:

    
    set if [open "messages.1" r]
    set data [read $if]
    close $if
    
    foreach line [split $data \n] {
      # Get the name or name[pid] value from the line
      set name [lindex $line 4]
      # Separate the name and [pid] sections 
      set name [lindex [split $name {[}] 0]
      incr count($name) 
    }
    
    parray count
    

    Generates output like this

    
    count()        = 1
    count(dhcpd:)  = 1838
    count(last)    = 75
    count(named)   = 16919
    count(ntpd)    = 46
    count(sshd)    = 9389
    count(syslogd) = 1
    

    This code is simple enough to run in the global scope as a short script.

    To do serious reports, however, we want to histogram based on time and day, only report one action when the log file has multiple lines (as shown below), and perhaps do other task specific actions (record what user accounts are used for ssh attacks, etc.)

    
    Dec 20 05:26:04 bastion sshd[31716]: Invalid user test from 82.6.138.65
    Dec 20 05:26:04 bastion sshd[31716]: error: Could not get shadow information for NOUSER
    Dec 20 05:26:04 bastion sshd[31716]: Failed password for invalid user test from 82.6.138.65 port 55422 ssh2
    

    In that case, it starts to make sense to put the processing into a procedure, and pass the name of the array to that procedure.

    Rework the previous example to put the histogram generating code in a procedure.

    The procedure should look a bit like this:

    
    proc processLine {arrayName line} {
      ...
    }
    

    Solution

  3. One step to making a report from the message log better is to report the number of events, rather than the number of lines in the file. As you can see in the log extract above, some events generate multiple lines.

    Events that invoke a new copy of a daemon show the process ID (pid) of the new process in square brackets after the name of the application, for example: sshd[12345].

    We can split the name and pid into a list with the split command like this:

    
      set cmd [lindex $line 4]
      set cmdList [split $cmd {][}]
    

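    For a cmd value of sshd[1234], the code above creates a list like this (with the trailing colon found in the log file, the last element is : rather than an empty string):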

    
    sshd 1234 {}
    

    The list elements can be assigned to separate variables with the lassign command like this:

    
      lassign $cmdList name pid
    

    If the cmd did not have a pid associated with it (the dhcpd: daemon does not fork a separate process to handle DHCP requests), then the pid variable will contain an empty string.
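
    A short sketch of both cases, using field values taken from the log extracts above. The procedure name splitCmd is an assumption for this illustration, and lassign requires Tcl 8.5 or newer.

```tcl
# Hypothetical helper: split "name[pid]:" into a two-element list {name pid}.
proc splitCmd {cmd} {
  # Split on either square bracket, then keep the first two pieces
  lassign [split $cmd {][}] name pid
  return [list $name $pid]
}

puts [splitCmd {sshd[31716]:}]   ;# prints: sshd 31716
puts [splitCmd dhcpd:]           ;# prints: dhcpd: {}
```

    The second result shows why count(dhcpd:) keeps its colon: with no brackets to split on, the whole field becomes the name and pid is left empty.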

    If the pid variable is not blank, we can use that value to determine whether this is the first time we've seen an event.

    The simple way to solve this problem is to create a list of pids we've seen. If lsearch returns a -1 when we search that list for a pid, we know this is the first message for that pid, so we process the data.

    The list of PIDs will need to be persistent. It can't be a local variable in the procedure.

    We could pass a state variable to the processLine procedure, just as we are passing the count array, but for a variable like this it sometimes makes more sense to use a global variable instead.

    Another trick with a global State variable is that we may save different types of data in it. It becomes important to be certain that one dataset doesn't overwrite another. Using longer and more descriptive indices for the State variable helps prevent this.

    One last trick for tracking the PIDs. The lsearch command will become offended if we ask it to search a non-existent list. We need to initialize the PID lists to empty strings.

    We initialize the list with an array set command like this:

    
    array set State {pidList,sshd {} pidList,named {} pidList,ntpd {}}
    

    This works because we know (I looked) which commands in the message log append pid information to the name of the application. We can generalize this with the info exists command which we will cover later.
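
    The first-time check described above might be sketched like this. The procedure name isNewPid is an assumption for this illustration, and the sketch assumes it is only called with a non-blank pid and a name whose list was initialized by the array set command.

```tcl
array set State {pidList,sshd {} pidList,named {} pidList,ntpd {}}

# Hypothetical helper: return 1 the first time a name/pid pair is seen, else 0.
proc isNewPid {name pid} {
  global State
  # lsearch returns -1 when the pid is not yet in the list
  if {[lsearch -exact $State(pidList,$name) $pid] >= 0} {
    return 0
  }
  # Remember this pid so later lines from the same process are skipped
  lappend State(pidList,$name) $pid
  return 1
}

puts [isNewPid sshd 31716]   ;# prints 1
puts [isNewPid sshd 31716]   ;# prints 0
```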

    Rework the previous exercise to:

    1. Use a global array named State
    2. Initialize the State array
    3. Maintain a list of PIDs in State(pidList,*) (where * is the name of an application)
    4. Only update the count for distinct PIDs or if there is no PID.

    The output from this exercise should look like this:

    
    count()        = 1
    count(dhcpd:)  = 1838
    count(last)    = 75
    count(named)   = 338
    count(ntpd)    = 1
    count(sshd)    = 4081
    count(syslogd) = 1
    

    Solution

  4. It would be amusing to know the login ids that are being used in ssh attacks.

    The user name can be found in lines like this:

    
    Dec 20 09:11:27 bastion sshd[28930]: Invalid user admin from 196.28.53.12
    Dec 20 09:11:27 bastion sshd[28930]: Failed password for invalid user admin from 196.28.53.12 port 52084 ssh2
    

    We can find the position of the list element user with an lsearch, and only process the line further if that position is greater than 8. This will skip the Invalid user entries (where user is element 6) while catching the Failed password entries (where user is element 9).
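
    A minimal sketch of that position test, run against the two sample lines above. The procedure name failedUser is an assumption for this illustration.

```tcl
# Hypothetical helper: return the user name from a "Failed password" line,
# or an empty string for any other line.
proc failedUser {line} {
  set pos [lsearch -exact $line user]
  if {$pos > 8} {
    # The user name is the element right after the word "user"
    return [lindex $line [expr {$pos + 1}]]
  }
  return ""
}

set invalid {Dec 20 09:11:27 bastion sshd[28930]: Invalid user admin from 196.28.53.12}
set failed {Dec 20 09:11:27 bastion sshd[28930]: Failed password for invalid user admin from 196.28.53.12 port 52084 ssh2}

puts "invalid: '[failedUser $invalid]'"   ;# prints: invalid: ''
puts "failed: '[failedUser $failed]'"     ;# prints: failed: 'admin'
```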

    We can invoke a new procedure to count user names when we've got an ssh message with code like this:

    
      if {$name eq "sshd"} {
        processSSHLine count $line
      }
    

    Modify the previous exercise to invoke the processSSHLine procedure from the processLine procedure. In the processSSHLine procedure, the code should check to see if this is a Failed password message, and if so, find the user name that failed, and update a count with an index like ssh-id,$userName.

    The parray command will run much longer now, and will end with lines like this:

    
    count(ssh-id,yoko)          = 1
    count(ssh-id,zabbix)        = 1
    count(ssh-id,zxin)          = 2
    count(ssh-id,zxin10)        = 4
    count(sshd)                 = 4081
    count(syslogd)              = 1
    

    Note that you can use upvar to pass an array from one procedure to another.

    Solution

  5. The results of the previous exercise would be more interesting if they were sorted by count, instead of alphabetically by name. This can be done with the lsort command (which we haven't covered yet).

    Another technique for presenting histogram data is to "bin" the data: report all values between 1 and 10 in a group, all the values between 11 and 20 in a group, etc.

    This can be done with a for loop, a foreach loop, an if statement, the expr command, and the array names command.

    Using split and lassign will make the results prettier.
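
    As a small sketch of that last point, an index such as ssh-id,admin can be split on the comma so that only the user name is printed. The sample counts are copied from the output shown in this section, and lsort is used here only to make the order of the output predictable.

```tcl
array set count {ssh-id,admin 89 ssh-id,oracle 72 sshd 4081}

set report {}
# Only look at the indices that hold user names
foreach index [lsort [array names count ssh-id,*]] {
  # "ssh-id,admin" becomes tag=ssh-id, user=admin
  lassign [split $index ,] tag user
  lappend report [list $user $count($index)]
  puts "  $user $count($index)"
}
```

    This prints one indented line per user name (admin 89, then oracle 72) and leaves the plain sshd count out of the report.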

    Modify the previous code by replacing the parray command with nested loops that display the binned counts of the names used to attack the system.

    The results might look like this:

    
    41 -> 50
      postgres 42
      backup 46
    51 -> 60
      info 51
    61 -> 70
    71 -> 80
      oracle 72
      guest 71
    81 -> 90
      admin 89
    91 -> 100
    

    Solution

  6. Tcl comes with a package for communicating with HTTP servers.

    An application can load this package with the package require command:

    
    package require http
    

    The most often used commands in the http package are the http::geturl command, to perform an HTTP transaction with the server, and the http::data command, to access the page returned by an http::geturl command.

    Syntax: http::geturl URL ?key value ...?
    Performs an HTTP transaction. May be a GET, POST or HEAD. Returns an identifier to be used to access data regarding this transaction.
    -query queryText Performs a POST operation
    -validate boolean Performs a HEAD operation
    -channel channelName Copy URL contents to a channel (open file) instead of saving the contents in memory.
    -headers headerText Adds extra headers to the transaction request
    -method methodName Forced request type - allows PUT and DELETE methods to be used

    Other commands include:

    ::http::config ?options? Configure proxy port, if not default.
    ::http::geturl url ?options? Retrieves the data from a URL. Supports GET, POST and HEAD.
    ::http::formatQuery list Formats a query for an HTTP POST command.
    ::http::status token Retrieve the status of the transaction identified by token.
    ::http::data token Retrieve the data returned by the transaction identified by token.
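
    The commands above might be combined as in the sketch below. The procedure name fetchPage and the form field names questionID and answer are assumptions for this illustration. Only http::formatQuery is actually run here, since it needs no network connection.

```tcl
package require http

# Encode name/value pairs for the body of a POST request
set query [::http::formatQuery questionID 21 answer 2]
puts $query   ;# prints: questionID=21&answer=2

# Hypothetical helper, not part of the http package:
# fetch a page and return its body, or raise an error.
proc fetchPage {url} {
  set token [::http::geturl $url]
  set status [::http::status $token]
  set page [::http::data $token]
  # Release the state held for this transaction
  ::http::cleanup $token
  if {$status ne "ok"} {
    error "transfer failed: $status"
  }
  return $page
}
```

    Calling http::cleanup once the data has been copied out keeps a long-running script from accumulating one state array per request.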

    Write a short Tcl script that will retrieve a Tcl question from http://www.cwflynt.com:10085/tcl/question and print the html page.

    Solution

  7. Parsing files as regular and well behaved as log files is easy with the list commands. An HTML page is not regular and is much less well behaved.

    The string commands work better for parsing an HTML page.

    Here are a few of the commonly used string commands:

    Comparison

    string compare str1 str2 Returns -1, 0 or 1 if str1 <, = or > str2.
    string match pattern str Returns 1 if str matches the glob-style pattern, otherwise 0.

    Position

    string first str1 str2 Returns the index of the first occurrence of str1 in str2, or -1 if no str1 in str2.
    string last str1 str2 Returns the index of the last occurrence of str1 in str2, or -1 if no str1 in str2.

    Character Length & Substrings

    string length str Returns the length of str
    string range str start end Returns the characters in str from index start through index end, inclusive
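
    A few of these commands applied to a short string, as a sketch. The sample text and the variable names are made up for this illustration.

```tcl
set text "<b>bold</b> and <i>italic</i>"

# Locate the opening and closing italic tags
set start [string first "<i>" $text]
set stop [string first "</i>" $text]

# Extract from the opening tag through the end of the 4-character closing tag
set italic [string range $text $start [expr {$stop + 3}]]
puts $italic                       ;# prints: <i>italic</i>

puts [string length $text]         ;# prints: 29
puts [string match "<b>*" $text]   ;# prints: 1
puts [string compare abc abd]      ;# prints: -1
```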

    A page returned by the previous exercise looks like this:

    
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
       "http://www.w3.org/TR/html4/loose.dtd">
    <html>
            <head> 
              <meta http-equiv="Content-Type" content="text/html;charset=utf-8"> <title>TclP</title>    </head> 
            <body> <CENTER>Tcl Proficiency Questions</CENTER>
        <form action="answer" method=post>
    <input type=hidden name=questionID value=21><table border=1>
    <TR><TD><PRE><CODE>The Tcl open command</CODE></PRE><TR><TD><input type=submit name=answer value="0">
    <TD>Opens files for read access<TR><TD><input type=submit name=answer value="1">
    <TD>Opens files for write access<TR><TD><input type=submit name=answer value="2">
    <TD>Open a pipe to other programs<TR><TD><input type=submit name=answer value="3">
    <TD>Opens a device that accepts stream I/O<TR><TD><input type=submit name=answer value="4">
    <TD>All of the above</TABLE><CENTER>Tcl Proficiency Questions</CENTER>    </BODY></HTML>
    

    The important part of this page is the table. We can use the string first command to find the beginning and end of the table (denoted by <table border=1> and </TABLE>)

    Once we know the character locations of the start and end of the table, we can use the string range command to strip the page down to just the table.

    Modify the previous exercise to display only the table contents.

    Solution

  8. A row in a table starts with a Table Row marker: <TR>. Each element across a row is delimited with a Table Data marker: <TD>

    The first step in parsing this table is to split it into table rows. We can put each row into a list.

    We can use a for loop to step through the string looking for <TR markers until there are no more markers in the string. The for loop would look like this:

    
    for {set tr1 [string first "<TR" $table]} {$tr1 >= 0} \
        {set tr1 [string first "<TR" $table]} {
      ...
    }
    

    This initializes the tr1 variable to the location of the first <TR marker, runs as long as tr1 is greater than or equal to 0, and resets tr1 after each pass through the loop.

    Inside the for loop, we will remove a row of data from the table variable, until there is no <TR marker left in the table string.

    Inside the loop, the tr1 variable will have the start location of the <TR> tag. We can remove that portion of the data in the table variable with a string range command like this:

    
        set close $tr1
        incr close 4
        set table [string range $table $close end]
    

    We can then use the string first command to find the next <TR tag and the string range command to extract that portion of the string and lappend it to a list.

    Finally, we use the string range command to remove this portion of the data from the table variable.
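
    Put together, the steps above might be sketched as follows. The sample table string and the variable names rows and tr2 are assumptions for this illustration. The real table comes from the page fetched in the earlier exercise.

```tcl
# Hypothetical sample data standing in for the table cut from the page
set table "<table border=1>\n<TR><TD>Question<TR><TD>first answer<TR><TD>second answer"
set rows {}

for {set tr1 [string first "<TR" $table]} {$tr1 >= 0} \
    {set tr1 [string first "<TR" $table]} {
  # Drop everything up to and including this 4-character <TR> tag
  set table [string range $table [expr {$tr1 + 4}] end]
  # The row runs to the next <TR marker, or to the end of the string
  set tr2 [string first "<TR" $table]
  if {$tr2 < 0} {
    lappend rows $table
    set table ""
  } else {
    lappend rows [string range $table 0 [expr {$tr2 - 1}]]
    # Remove the row just saved; the next pass starts at the next <TR
    set table [string range $table $tr2 end]
  }
}

puts $rows
```

    Because string first is case sensitive, the lowercase <table border=1> tag is not mistaken for a row marker. A page that used lowercase <tr> tags would need a different search string.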

    Modify the previous code to extract the table rows and make each row a list element.

    Print the list, which should resemble this:

    
    {<TD><PRE><CODE>What will this code do:
    set lst1 {a b c d e f g h i}
    set lst2 {1 2 3 4 5 6 7 8 9}
    foreach c [split "beef" {} ] {
      append str [lindex $lst2 [lsearch $lst1 $c]]
    }
    puts $str
    </CODE></PRE>} {<TD><input type=submit name=answer value="0">
    <TD>Display 2556} {<TD><input type=submit name=answer value="1">
    <TD>Display beef} {<TD><input type=submit name=answer value="2">
    <TD>Display 6552} {<TD><input type=submit name=answer value="3">
    <TD>Display 1234}
    

    Solution

  9. We can remove the leading HTML tags from a string with a for loop similar to that used to find the <TR tags.

    Write that procedure and invoke it as the lines are appended to the rows list.

    Note that you'll need to use the string trim command to clean up any newlines or spaces that might be between tags.

    Syntax: string trim string ?chars?
    Syntax: string trimright string ?chars?
    Syntax: string trimleft string ?chars?
    The string trim command trims characters that match one of a set of characters from a string.
    trim Returns the original string, after trimming all characters that match chars from the right and left end.
    trimright Returns the original string, after trimming all characters that match chars from the right end.
    trimleft Returns the original string, after trimming all characters that match chars from the left end.
    string The string from which to remove characters.
    chars A string of characters to remove from the original string. Defaults to the whitespace characters: CR, NL, SPACE, TAB
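
    A brief sketch of the three trim commands. The sample strings are made up for this illustration, and the trailing | in each puts only marks where the result ends.

```tcl
# A table cell surrounded by stray whitespace
set raw "  \n<TD>Display beef\n\t "

set clean [string trim $raw]
puts "$clean|"                        ;# prints: <TD>Display beef|

# With an explicit chars argument, only those characters are removed
puts "[string trimright xxhixx x]|"   ;# prints: xxhi|
puts "[string trimleft xxhixx x]|"    ;# prints: hixx|
```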

    Solution

  10. Add a stripTrailingHTML procedure to the previous code and call it after the stripLeadingHTML procedure.

    Modify the output to be cleaner using a foreach loop. The output from reading and parsing one of these pages should resemble:

    
    A set of square brackets will
    ---
       Group a set of values and allow all substitutions to occur
       Group a set of values, and only allow square bracket substitution
       Group a set of values and disable all special character processing
       Evaluate a set of values like a Tcl command
    

Copyright Clif Flynt 2009