狭路相逢勇者胜: 2007

Sunday, December 30, 2007

Punctuation (标点符号)

` back quote
' single quote
& ampersand
* asterisk, star
% percent sign
: colon
/ (forward) slash
\ backslash
" double quote
. dot, period(句号)
, comma
{} curly brackets
[] square brackets
( an open parenthesis
) a corresponding closed parenthesis
() parentheses
<> angle brackets
- hyphen
| vertical bar
^ caret
~ tilde

delimiter
singular and plural 单复数

Saturday, December 29, 2007

Perl

Practical Extraction and Report Language

Windows ActivePerl from ActiveState

Llama Book "Learning Perl, Fourth Edition, by Randal L. Schwartz, Tom Phoenix, and brian d foy".

In Perl, comments run from a pound sign (#) to the end of the line. (There are no "block comments" in Perl.)

#!/usr/bin/perl
On Unix systems, if the first two characters on the first line of a text file are "#!", then what follows is the name of the program that executes the rest of the file.

On non-Unix systems, it's traditional (and even useful) to make the first line say #!perl.

The perl interpreter compiles and then runs your program in one user step.

    $ perl my_program

When you run your program, Perl's internal compiler first runs through your entire source, turning it into internal bytecode, which is an internal data structure representing the program. Perl's bytecode engine takes over and runs the bytecode.

Perl uses a number or a string as a scalar interchangeably.

All Numbers Have the Same Format Internally: Perl computes with double-precision floating-point values.

exponentiation operator **

7.25e45 # 7.25 times 10 to the 45th power
2**3 # two to the third power

Literal strings come in two different flavors: single-quoted string literals and double-quoted string literals.

String values can be concatenated with the . operator
"hello" . ' ' . "world" # same as 'hello world'

A special string operator is the string repetition operator, consisting of the single lowercase letter x.
"fred" x 3 # is "fredfredfred"
5 x 4 # is really "5" x 4, which is "5555"

Perl automatically converts between numbers and strings as needed. How does it know whether a number or a string is needed? It all depends on the operator being used on the scalar value. If a numeric value is given when a string value is needed (say, for string concatenation), the numeric value expands into whatever string would have been printed for that number.

"Z" . 5 * 7 # same as "Z" . 35, or "Z35"

To run your program with warnings turned on, use the -w option on the command line:

    $ perl -w my_program

Or, if you always want warnings, you may request them on the #! line:

    #!/usr/bin/perl -w

Scalar variable names begin with a dollar sign followed by what we'll call a
Perl identifier: a letter or underscore, and then possibly more letters, or digits,
or underscores. Scalar variables in Perl are always referenced with the leading $.

You can give print a series of values, separated by commas:

print "The answer is ", 6 * 7, ".\n";

When a string literal is double-quoted, it is subject to variable interpolation besides being checked for backslash escapes.

Perl provides a delimiter for the variable name in a manner similar to the shell. Enclose the name of the variable in a pair of curly braces. Or you can end that part of the string and start another part of the string with a concatenation operator:

Table 2-3. Numeric and string comparison operators
Comparison	Numeric	String
Equal	= =	eq
Not equal	!=	ne
Less than	<	lt
Greater than	>	gt
Less than or equal to	<=	le
Greater than or equal to	>=	ge

----------

----------

 @lines = ; # read standard input in list context

When the input is coming from a file, this will read the rest of the file. But how can there be an end-of-file when the input comes from the keyboard? On Unix and similar systems, including Linux and Mac OS X, you'll normally type a Ctrl-D to indicate to the system that there's no more input. The special character is never seen by Perl, though it may be echoed to the screen. On DOS/Windows systems, use Ctrl-Z instead. You'll need to check the documentation for your system or ask your local expert if it's different from these.

Invoke a subroutine from within any expression by using the subroutine name (with the ampersand). Whatever calculation is last performed in a subroutine is automatically also the return value. So be careful when adding additional code to a subroutine since the last expression evaluated will be the return value. "The last expression evaluated" really means the last expression evaluated, rather than the last line of text. The parameter list must be stored into some array variable for the subroutine to use it, and Perl uses the array @_ for this purpose. The @_ variable is private to the subroutine. if there's a global value in @_, it is saved before the subroutine is invoked and restored to its previous value upon return from the subroutine.

By default, all variables in Perl are global variables. But you can create private variables called lexical variables at any time with the my operator.

my($m, $n) = @_; # Name the subroutine parameters

Nearly every subroutine will start with a line much like that one, naming its parameters.

"name" of the array in a scalar context to find out the number of array elements.

The my operator
The scope of a lexical variable's name is limited to the smallest enclosing block or file.

using my( ) with parentheses, remember that without the parentheses, my only declares a single lexical variable.

my $fred, $barney; # WRONG! Fails to declare
$barney my($fred, $barney); # declares both

Any new variable will start out empty: undef for scalars or the empty list for arrays.

use strict;
The use strict pragma tells Perl's internal compiler that it should enforce some good programming rules for the rest of this block or source file. It's better to use it from the start when it's needed. Perl will insist that you declare every new variable, usually done with my. Most people recommend that programs that are longer than a screenful of text generally need use strict. We encourage you to include use strict in your programs as often as possible.

if the subroutine has the same name as a Perl built-in, you must use the ampersand to call it. With an ampersand, you're sure to call the subroutine; without it, you can get the subroutine only if there's no built-in with the same name. Until you know the names of all of Perl's built-in functions, always use the ampersand on function calls

The least you can return is nothing at all. A return with no arguments will return undef in a scalar context or an empty list in a list context. This can be useful for an error return from a subroutine, signalling to the caller that a more meaningful return value is unavailable.

Perl has a shortcut for

while (defined($_ = <>))
{
print "I saw $_";
}

it won't read a line into $_ by default. It works only if there's nothing but the line-input operator in the conditional of a while loop; If you put anything else into the conditional expression, this shortcut won't apply.

foreach (<>) { print "I saw $_"; }
it has to read all of the input before the loop can start running since foreach needs a list context to work on. So we prefer while expression.

Input from the Diamond Operator < . >

Since the diamond operator is generally used to process all of the input, it's typically a mistake to use it in more than one place in your program. Remember, the diamond operator reads the input, but the input itself is generally found in $_ (by default).

The diamond operator will go to the next file automatically, much like what you'd expect from cat or another standard utility.

while (defined($line = < . >)) { chomp($line); print "It was $line that I saw!\n"; }

The diamond operator isn't looking at the invocation arguments. it works from the @ARGV array. This array is a special array preset by the Perl interpreter as the list of the invocation arguments. In other words, this is like any other array, but when your program starts, @ARGV is already stuffed full of the list of invocation arguments.

The print operator takes a list of values and sends each item (as a string, of course) to standard output in turn, one after another. It doesn't add any extra characters before, after, or in between the items.

Perl is interpolating an array into a string, so it puts spaces between the elements.

Generally, if your strings contain newlines, you'll simply want to print them:

    print @array;

But if they don't contain newlines, you'll generally want to add one at the end:

    print "@array\n";

This is the problem with the optional parentheses; sometimes, we humans forget where the parentheses belong. When there are no parentheses, print is a list operator, printing all of the items in the following list, which is what you'd expect. But when the first thing after print is a open parenthesis, print is a function call, and it will print only what's found inside the parentheses. Since that line had parentheses, it's the same to Perl as if you'd said this:

    ( print(2+3) ) * 4;  # Oops!

To print a number, generally use %g, which automatically chooses floating-point, integer, or even exponential notation as needed.

printf "%g %g %g\n", 5/2, 51/17, 51 ** 17; # 2.5 3 1.0683e+29

In Perl, printf is most often used for columnar data since most formats accept a field width. If the data won't fit, the field will generally be expanded as needed. A negative field width is left-justified (in any of these conversions). The %f conversion (floating-point) rounds off its output as needed and lets you request a certain number of digits after the decimal point. To print a real percent sign, use %%, which is special in that it uses no element from the list.

A filehandle is the name in a Perl program for an I/O connection between your Perl process and the outside world. That is, it's the name of a connection and not necessarily the name of a file. Once again, as with labels, the recommendation from Larry is that you use all uppercase letters in the name of your filehandle. Perl uses six special filehandle names for its own purposes: STDIN, STDOUT, STDERR, DATA, ARGV, and ARGVOUT.

Every program that runs on Unix (and many other modern operating systems) has an exit status, telling if it was successful. Traditionally, it is zero for success and a nonzero value for failure.

The die function prints out the message you give it (to the standard error stream, where such messages should go) and ensures that your program exits with a nonzero exit status.

  if ( ! open LOG, ">>logfile") {
die "Cannot create logfile: $!";
}

In general, when the system refuses to do something we've requested (like opening a file), $! will give you a reason (perhaps "permission denied" or "file not found," in this case). This human-readable complaint message will be available in Perl's special variable $!. But if you use die to indicate an error that is not the failure of a system request, don't include $! since it will generally hold an unrelated message left over from something Perl did internally. It will hold a useful value only immediately after a failed system request. A successful request won't leave anything useful there.

If you don't want the line number and file revealed, make sure that the dying words have a newline on the end. That is, another way you could use die is with a trailing newline on the message:

    if (@ARGV < style="font-weight: bold; color: rgb(204, 0, 0);">\n";
}

As a rule of thumb, put the newline on messages that indicate a usage error and leave it off when the error might be something you want to track down during debugging.

You should always check the return value of open since the rest of the program is relying upon its success.

The warn function works as die does, except for that last step because it doesn't quit the program. But it adds the program name and line number if needed, and it prints the message to standard error as die would.

that default filehandle may be changed with the select operator. Also by default, the output to each filehandle is buffered. Setting the special $| variable to 1 will set the selected filehandle (the one selected at the time the variable is modified) to flush the buffer after each output operation.

If you want to ensure the logfile gets its entries immediately, in case you might be reading the log to monitor progress of your long-running program, you could use something like this:

    select LOG;
$| = 1;  # don't keep LOG entries sitting in the buffer
select STDOUT;
# ...time passes, babies learn to walk, tectonic plates shift, and then . . .
print LOG "This gets written to the LOG at once!\n";

If one of the three system filehandles STDIN, STDOUT, or STDERRfails to reopen, Perl kindly restores the original one. how you could send error messages to a file rather than to your program's standard error stream

   # Send errors to my private error log
if ( ! open STDERR, ">>/home/barney/.error_log") {
die "Can't open error log for append: $!";
}

Hashes

the keys must be unique.

$hash{$some_key}

This is similar to what we used for array access, but here we use curly braces instead of square brackets around the subscript (key).

Accessing outside the hash gives undef.

To refer to the entire hash, use the percent sign ("%") as a prefix.

a hash may be converted into a list and back again. Assigning to a hash is a list-context assignment, where the list is made of key/value pairs. The value of the hash (in a list context) is a list of key/value pairs. The pairs won't necessarily be in the same order as the original list. The order is jumbled because Perl keeps the key/value pairs in an order that's convenient for Perl so it can look up any item quickly.

It's rare to do so, but a hash may be copied using this syntax:

    %new_hash = %old_hash;

In the Perl grammar, any time that you need a comma ( , ), you can use the big arrow (=>) instead.

The keys function yields a list of all the keys in a hash, and the values function gives the corresponding values. If there are no elements to the hash, then either function returns an empty list:

    my %hash = ("a" => 1, "b" => 2, "c" => 3);
my @k = keys %hash;
my @v = values %hash;

To see if a key exists in the hash, (whether someone has a library card), use the exists function,

The delete function removes the given key (and its corresponding value) from the hash.

$ grep 'flint.*stone' chapter*.txt

Don't confuse regular expressions with shell filename-matching patterns, called globs. A typical glob is what you use when you type *.pm to the Unix shell to match all filenames that end in .pm. The previous example uses a glob of chapter*.txt. (You may have noticed that you had to quote the pattern to prevent the shell from treating it like a glob.)

To match a pattern (regular expression) against the contents of $_, put the pattern between a pair of forward slashes (/) as we do here:

    $_ = "yabba dabba doo";
if (/abba/) {
print "It matched!\n";
}

All of the usual backslash escapes that you can put into double-quoted strings are available in patterns, so you could use the pattern /coke\tsprite/ to match the eleven characters of coke, a tab, and sprite.

The dot always matches exactly one character (except the newline). a backslash in front of any metacharacter makes it nonspecial. So, the pattern /3\.14159/ doesn't have a wildcard character. The star (*) means to match the preceding item zero or more times. .* will match any character, any number of times. The plus (+) means to match the preceding item one or more times. The question mark (?), which means that the preceding item is optional in that it may occur once or not at all.

Quantifiers: 1) * 2) + 3) ? All three of these quantifiers must follow something since they tell how many times the previous item may repeat.

grouping: /(fred)+/

A character class, a list of possible characters inside square brackets ([ ]), matches any single character from within the class. A caret ("^") at the start of the character class negates it. the first hyphen in /HAL-[0-9]+/ doesn't need a backslash because hyphens aren't special outside a character class.

Some character classes appear so frequently that they have shortcuts. For example, the character class for any digit, [0-9], may be abbreviated as \d. The \s shortcut is good for whitespace. It's the same as [\f\t\n\r ], which is a class containing the five whitespace characters: form-feed, tab, newline, carriage return, and the space character.

\d \w \s and the opposite \D \W \S
Any of these shortcuts will work in place of a character class (standing on their own in a pattern) or inside the square brackets of a larger character class. Another compound character class is [\d\D], which means any digit or any non-digit, which means any character at all. This is a common way to match any character, even a newline as opposed to ., which matches any character except a newline.

The shortcut is that if you choose the forward slash as the delimiter, you may omit the initial m. This is a shortcut for the m// (pattern match) operator. Since Perl folks love to avoid typing extra characters, you'll see most pattern matches written using slashes, as in /fred/.

Choose a delimiter that doesn't appear in your pattern. It's common to use curly braces as the delimiter.

Optional Modifiers /s /i /x. Those flags, may be appended as a group right after the ending delimiter of a regular expression to change its behavior from the default.

Anchors: ^ $ \b \B

The caret anchor (^) marks the beginning of the string, and the dollar sign ($) marks the end. you can use both of these anchors to ensure that the pattern matches an entire string. /^\s*$/, which matches a blank line. But this "blank" line may include some whitespace characters, like tabs and spaces, which are invisible. The word-boundary anchor, \b, matches at either end of a word. This is similar to the feature often called "match whole words only" in a word processor's search command. The nonword-boundary anchor is \B.

Matching against $_ is merely the default; the binding operator (=~) tells Perl to match the pattern on the right against the string on the left, instead of matching against $_.

The match variables

$1 $2 $3 ...
parentheses also trigger the regular expression engine's memory. The memory holds the part of the string matched by the part of the pattern inside parentheses. If there are more than one pair of parentheses, there will be more than one memory. Each regular expression memory holds part of the original string. Since these variables hold strings, they are scalar variables; in Perl, they have names like $1 and $2.

An unsuccessful match leaves the previous memories intact, but a successful one resets them all. This correctly implies that you shouldn't use these match variables unless the match succeeded; otherwise, you could be seeing a memory from some previous pattern. This is another reason a pattern match is almost always found in the conditional expression of an if or while:

$&, $`, and $'.
Whatever came before the matched section is in $`, and whatever was after it is in $'. They have the same scope as the numbered match variables. Generally, that means they'll stay around until the next successful pattern match. The price is that once you use any one of these automatic match variables anywhere in your entire program, other regular expressions will run a little more slowly. Instead, they'll use a workaround. For example, if the only one you need is $&, put parentheses around the whole pattern and use $1 instead.

quantifiers
* {0,}
+ {1,}
? {0,1}

Substitution m///
It's true if a substitution was successful; otherwise it's false.

@fields = split /separator/, $string;

The split operator drags the pattern through a string and returns a list of fields (substrings) that were separated by the separators.

You could even have an empty field if there were two delimiters together. Leading empty fields are always returned, but trailing empty fields are discarded.

The default for split is to break up $_ on whitespace:

    my @fields = split;  # like split /\s+/, $_;

my $result = join $glue, @pieces;

there will be one fewer piece of glue than the number of items in the list.

This means there may be no glue at all if the list doesn't have at least two elements. The first argument to join is the glue, which may be any string. The remaining arguments are a list of pieces. join puts the glue string between the pieces and returns the resulting string.

Though split and join work well together, don't forget that the first argument to join is always a string, not a pattern.

When a pattern match (m//) is used in a list context, the return value is a list of the memory variables created in the match or an empty list if the match failed.

Greedy quantifier
Regular expression engines do a lot of backtracking like that, trying every different way of fitting the pattern to the string until one of them succeeds or until none of them has.

non-greedy /fred.+?Barney/. Prefers to match as few times as possible, rather than as many as possible.

Since the non-greedy form of the plus was +? and the non-greedy form of
the star was *?, you've probably realized the other two quantifiers
look similar. The non-greedy form of any curly-brace quantifier looks the same,
but with a question mark after the closing brace, like {5,10}? or
{8,

The variable $^I. By default it's undef, and everything is
normal. But when it's set to some string, it makes the diamond operator
(<>) more magical than usual.

 $^I = ".bak";

Let's say it's time for the diamond to open our file
fred03.dat. It opens it like before but renames it, calling it
fred03.dat.bak.We have the same file open, but
it has a different name on the disk. Next, the diamond creates a new file and
gives it the name fred03.dat. That's okay because we weren't using that
name anymore. Now the diamond selects the new file as the default for output, so
anything we print will go into that file. The while loop will read a line from the old file,
update that, and print it out to the new file. This program can update hundreds
of files in a few seconds on a typical machine. Pretty powerful, huh?

Some folks use a tilde (~) as the value for $^I since that
resembles what emacs does for backup files.
Another possible value for $^I is the empty string. This enables
in-place editing but doesn't save the original data in a backup file.

non-capturing parentheses, and we write them with a special sequence. We add a
question mark and a colon after the opening parenthesis, (?:), and that tells Perl we only use
these parentheses for grouping.


chapter 10

The idiomatic way of opening a file in Perl looks like this:

open CHAPTER, $filename
or die "Can't open '$filename': $!";


Since using angle brackets means both filehandle reading and globbing, how does
Perl decide which of the two operators to use? Well, a filehandle has to be a
Perl identifier. If the item between the angle brackets is strictly a Perl
identifier, it'll be a filehandle read; otherwise, it'll be a globbing
operation.

A directory handle looks and acts like a filehandle. You open it (with
opendir instead of open), you read from it (with
readdir instead of readline), and you close it (with
closedir instead of close). Instead of reading the contents of a file, you're reading the names of files (and other things) in a directory as in
this example:

  my $dir_to_process = "/etc";
  opendir DH, $dir_to_process or die "Cannot open $dir_to_process: $!";
  foreach $file (readdir DH) {
    print "one file in $dir_to_process is $file\n";
  }
  closedir DH;

Like filehandles, directory handles are automatically closed at the end of the program or if the directory handle is reopened onto another directory.

Since unlink takes a list, and the glob function returns a list, we can combine the two to delete many files at once:

    unlink glob "*.o";

Here's a little-known Unix fact. You can have a file that you can't read, write, or execute, maybe you don't even own the file, that is, it's somebody else's file altogether but you can still delete the file. That's because the permission to unlink a file doesn't depend upon the permission bits on the file; it's the permission bits on the directory that contains the file that matter.

Those first two lines inside the loop can be combined (and often are):

    (my $newfile = $file) =~ s/\.old$/.new/;

This works to declare $newfile, copy its initial value from $file, and modify $newfile with the substitution.

Unix model of files and directories
Each file is stored in a numbered inode, which we can think of as a particular piece of disk real estate. One file might be stored in inode 613, and another in inode 7033. A directory is a special kind of file, maintained by the system. Essentially, it is a table of filenames and their inode numbers. Along with the other things in the directory, there are two special directory entries. One is . (called "dot"), which is the name of that directory; and the other is .. ("dot-dot"), which is the directory one step higher in the hierarchy. How can the system tell that a particular inode is available? Each inode holds a number called its link count. The link count for a directory should always be at least two: its listing in its parent directory and its listing in itself.

There's another rule about the links in directory listings: the inode numbers in a given directory listing refer to inodes on that same mounted volume.

Another restriction on links is they can't make new names for directories because the directories are arranged in a hierarchy.

So, links can't be added to directories, and they can't cross from one mounted volume to another.

A symbolic link (also called a soft link to distinguish it from the true or hard links that we've been talking about up to now) is a special entry in a directory that tells the system to look elsewhere.

A symbolic link can freely cross mounted filesystems or provide a new name for a directory unlike a hard link. A symbolic link can point to any filename, one in this directory or in another oneeven to a file that doesn't exist. But that means a soft link can't keep data from being lost as a hard link can since the symlink doesn't contribute to the link count.

The rmdir operator fails for non-empty directories. As a first pass, you can attempt to delete the contents of the directory with unlink, then try to remove what should now be an empty directory. For example, suppose we need a place to write many temporary files during the execution of a program:

    my $temp_dir = "/tmp/scratch_$$";       # based on process ID; see the text
   mkdir $temp_dir, 0700 or die "cannot create $temp_dir: $!";
   ...
   # use $temp_dir as location of all temporary files
   ...
   unlink glob "$temp_dir/* $temp_dir/.*"; # delete contents of $temp_dir
   rmdir $temp_dir;                        # delete now-empty directory

The initial temporary directory name includes the current process ID, which is unique for every running process and is accessed with the $$ variable (similar to the shell).

Using the program's name as well as the process ID is common, so if the program is called quarry, the directory would be something like /tmp/quarry_$$.

Monday, December 24, 2007

汽车日常维护

1. Engine oil
启动，打开发动机转1分钟，抽出engine oil dipstick (量油计)，正常情况应该是oil level适中，油色清亮。若机油呈暗褐色，说明早该换机油了，若呈粘稠状，亦表示保养很差，甚至表示该车发动机有大问题。 oil level过低，可能有leak。一般3000 ~ 5000 miles左右换一次．

2. Brake Fluid

3. Washer Fluid

4. Engine Coolant
这个咚咚很重要，发动机就靠这个散热，养成看仪表盘的习惯哦．

5. Power Steering Fluid
动力转向油．如何检查？1) 发动引擎发动机到正常工作温度（Hot and Cold的中间）2)发动机引擎空转时左右打方向盘数次，然后关闭引擎　3)　打开HOOD，check the fluid level on the dipstick.

6. Transmission Fluid
一般情况下不需要更换．

跑长途应该注意什么？

检查胎压，各种fluid，还有备coolant, engine oil (optional) etc...

Tachometer　发动机转速表
Dimmer　调光器
Duct　管道
Bass 低音
Treble　高音
N(Neutral) 齿轮的空档

Monday, December 10, 2007

NP Problems

1. Hitting set
http://en.wikipedia.org/wiki/Hitting_set

2. Dominating set
http://en.wikipedia.org/wiki/Dominating_set_problem

3.

Saturday, December 8, 2007

Hashing

Hashing can also be used to compress data!

Friday, December 7, 2007

Spinlock

spin lock: busy waiting
semaphore: sleep and wake-up
spin lock 适用于切换快的场合, 否则会浪费大量cpu时间

a spinlock is a lock where the thread simply waits in a loop ("spins") repeatedly checking until the lock becomes available. As the thread remains active but isn't performing a useful task, the use of such a lock is a kind of busy waiting. Once acquired, spinlocks will usually be held until they are explicitly released, although in some implementations they may be automatically released if the thread blocks (aka "goes to sleep").

Spinlocks are efficient if threads are only likely to be blocked for a short period of time, as they avoid overhead from operating system process re-scheduling or context switching. For this reason, spinlocks are often used inside operating system kernels. However, spinlocks become wasteful if held for longer, both preventing other threads from running and requiring re-scheduling. The longer you hold the lock, the greater the risk that you will be interrupted by the O/S scheduler while holding it. If this happens, other threads will be left spinning on the lock, despite the fact that you are not making progress towards releasing it. This is especially true on a single-processor system, where each waiting thread of the same priority is likely to waste its full quantum spinning until the thread that holds the lock is finally re-scheduled.

Implementing spinlocks is difficult, because one must take account of the possibility of simultaneous access to the lock to prevent race conditions. Generally this is only possible with special assembly language instructions, such as atomic test-and-set operations, and cannot be implemented from high level languages like C.[1] On architectures without such operations, or if high-level language implementation is required, a non-atomic locking algorithm may be used, e.g. Peterson's algorithm. But note that such an implementation may require more memory than a spinlock, be slower to allow progress after unlocking, and may not be implementable in a high-level language if out-of-order execution is in use.

Thursday, December 6, 2007

NP and Computational Intractability

P is the set of yes/no problems that can be solved in polynomial time. Intuitively, P is the set of problems that can be solved quickly.
NP is the set of yes/no problems with the following property: If the answer is yes, then there is a proof of this fact that can be checked in polynomial time.
NP-Complete problem is 1) in NP and 2) every problem in NP can be reduced to it.
NP-hard problems may be of any type: decision problems, search problems, optimization problems. A problem H is NP-hard if and only if there is an NP-complete problem L that is polynomial time Turing-reducible to H, L <= H

NP-Complete Problems

Proving further problems NP-Complete only requires a reduction from a single problem already known to be NP-Complete. Given a new problem X, here is the basic strategy for proving it is NP-Complete.

Prove that X is in NP
Choose a problem Y that is known to be NP-complete
Consider an arbitrary instance Sy of problem Y, and show how to construct, in polynomial time, an instance Sx of problem X, such that Sy and Sx have the same answer.

3SAT <= Independent Set <= Vertex Cover <= Set Cover Packing Problems
You are given a collection of objects, and you want to choose at least k of them; there are some conflicts among the objects, preventing you from choosing certain objects simultaneously.

Independent Set
Set Packing

Covering Problems
A natural contrast to packing problems, you are given a collection of objects, and you want to choose at most k of the objects that achieves a certain goal.

Vertex Cover
Set Cover

Partitioning Problems
Divide up a collection of objects into subsets so that each object appears in exactly one of the subsets.

Graph coloring, is at work whenever you are seeking to partition objects in the presence of conflicts, and conflicting objects are NOT allowed to go into the same set.

Sequencing Problems
Searching over the set of all permutations of a collection of objects.

Hamiltonian Cycle
Hamiltonian Path
Traveling Salesman

Numerical Problems

Subset Sum, the special case of the Knapsack Problem. Try reducing from Subset sum whenever one has a problem with weighted objects and the goal is to select objects conditioned on a constraint on the total weight of the objects selected.

Constraint Satisfaction Problems

3-SAT. a useful starting point for reductions where none of the previous categories seem to fit naturally onto the problem being considered. There are two distinct ways to view an instance of 3-SAT, 1) a search over all the assignments to the variables, subject to the constraint that all clauses must be satisfied, and b) as a search over ways to choose a single term from each clause, subject to the constraint that one must not choose conflicting terms from different clauses.

Tuesday, November 27, 2007

Measuring Program Execution Time

Two basic mechanisms to record the passage of time:

Interval Counting, which is based on a low frequency time that periodically interrupts the procesor. The operating system uses the time to record the cumulative time used by each process and maintains counts of the amount of user time and the amount of system time used by each process. There are libraries functions for this time (linux) and clock (time.h, ANSI C). We can compute the total time between two different points in a program execution by making two calls to times and computing the difference of the return values. Disadvantage: The interval accounting scheme makes no attempt to resolve time more finely than the time interval. Interval counting is only useful for measuring relatively long computations-- 100,000,000 clock cycles or more and is too coarse-grained to use for any measurement having duration of less than 100 ms.
Cycle Counters, which is based on a counter that is incremented every clock cycle. The timer is a special register that gets incremented every single clock cycle. Special machine instructions (rdtsc for "read time stamp counter") can be used to read the value of the counter. Cycle counters provide a very precise tool for measuring the time that elapses between two different points in the execution of a program. The disadvantage is that they do not keep track of which process uses those cycles or whether the procesor is operating in kernel or user mode. Context switching and cache operations cause extreme variation in execution time. To overcome this, K-best measurement scheme is proposed but it is reasonably robust for measuring durations shorter than the timer interval.

Process Scheduling and Timer Interrupts: computers have an external timer that periodically generates an interrupt signal to the processor. The spacing between these interrupt signals is called the interval time. When a timer interrupt occurs, the operating system scheduler can choose to either resume the currently executing process or to switch to a different process. This interval time must be set short enough to ensure that the processor will switch between tasks often enough to provide the illusion of performing many tasks simutaneously. Typical timer intervals range between 1 and 10 milliseconds.

Kernel operation (in kernel mode) such as handling page faults, input, or output is considered part of each regular process rather than a separate process. When the scheduler switches from process A to process B, it must enter kernel mode to save the state of process A (still considered part of process A) and to restore the state of proces B (considered part of proces B).

Looking into the future:

Process-specific cycle timing. All that required is to store the count as part of the process' state.
Variable Rate Clocks. Power consumption is directly proportional to the clock rate,and to reduce power consumption, future systems will vary the clock rate.

Sunday, November 25, 2007

Garbage Collection

A garbage collector is a dynamic storage allocator that automatically frees allocated blocks that are no longer needed by the program. Such blocks are known as garbage and hence the term garbage collector. The process of automatically reclaiming heap storage is known as garbage collection.

A garbage collector views memory as a directed reachability graph. The nodes of the graph are partitioned into a set of root nodes and a set of heap nodes. Each heap node corresponds to an allocated block in the heap. Root nodes correspond to locations not in the heap that contain pointers into the heap. Those locations can be registers, variables on the stack, or global variables in the read-write data area of virtual memory. The role of a garbage collector is to maintain some representation of the reachability graph and periodically reclaim the unreachable nodes by freeing them and returning them to the free list.

Mark & Sweep algorithm (by McCarthy): A Mark&Sweep Garbage Collector consists of a mark phase, which marks all reachable and allocated descendants of the root nodes, followed by a sweep phase, which frees each unmarked allocated block by iterating over each block in the heap and freeing any unmarked allocated blocks that it encounters.

Wednesday, November 21, 2007

Dynamic memory allocation

Computer Systems

10.9.5 Implementation issues

A practical allocator that strikes a better balance between throughput and utilization must consider the following issues:

Free block organization: how do we keep track of free blocks?
Placement: How do we choose an appropriate free block in which to place a newly allocated block?
Splitting: After we place a newly allocated block in some free block, what do we do with the remainder of the free block?
Coalescing: What do we do with a block that has just been freed?

10.9.6 Implicit Free Lists

Any practical allocator needs some data structure that allows it to distinguish block boundaries and to distinguish between allocated and free blocks. Most allocators embed this information in the block themselves. An example is as follow.

A block consists of a one-word (4 bytes) header, the payload, and possibly some additional padding. The header encodes the block size as well as whether the block is allocated or free. If we impose a double-word alignment constraint, then the block size is always a multiple of eight and the three low-order bits of the block size are always zero.

This organization is called implicit free list because the free blocks are linked implicitly by the size fields in the headers. The allocator can indirectly traverse the entire set of free blocks by traversing all of the blocks in the heap. We also need some kind of specially marked end block. The advantage of an implicit free list is simplicity. A significant disadvantage is the cost of any operation, such as placing allocated blocks, requires a search of the free list will be linear in the total number of allocated and free blocks in the heap.

10.9.7 Placing Allocated Blocks

Three common placement policies are: first fit, next fit and best fit. All of them have advantages and disadvantages.

10.9.8 Splitting Free Blocks

If the fit is not good, then the allocator will usually opt to split the free block into two parts. The first part becomes the allocated block, and the remainder becomes a new free block.

10.9.9 Getting Additional Heap Memory

If the allocator is unable to find a fit for the requested block, it will ask the kernel for additional memory, either by calling the mmap or sbrk functions.

10.9.10 Coalescing Free Blocks

There are free blocks adjacent to the newly freed block, which is known as false fragmentation. Coalescing comes in to rescue this phenomenon by merge adjacent free blocks. When to perform such operations??? Immediate coalescing Vs Defer coalescing. The former could introduce a form of thrashing, where a block is repeatedly coalesced and then split soon thereafter.. Fast allocators often opt for some form of deferred coalescing.

10.9.11 Coalescing with Boundary Tags

Coalescing the next free block is straightforward and efficient. How would we coalesce the previous block? Knuth developed boundary tags, which allows for constant-time coalescing of the previous block. The idea is to add a footer (the boundary tag) at the end of each block, where the footer is a replica of the header. If each block includes such a footer, then the allocator can determine the starting location and status of the previous block by inspecting its footer, which is always one word away from the start of the current block.

10.9.14 Segregated Free Lists (A popular choice with production-quality allocators such as GNU malloc)

Segregated storage: maintain multiple free lists, where each list holds blocks that are roughly the same size. Two basic approaches: simple segregated storage and segregated fits.

Simple segregated storage: the free list for each size class contains same-sized blocks, each the size of the largest element of the size class.

Segregated Fits: Each list contains potentially different-sized blocks whose sizes are members of the size class.

Saturday, November 17, 2007

Link Analysis Algorithms: HITS

PageRank

Thursday, November 15, 2007

Concurent Programming with Threads

Each thread associated with a process has its own stack, registers, condition codes, stack pointer, program counter and thread ID (TID). A thread context switch is faster than a process context switch.

Reaping Terminated Threads

pthread_join (pthread_t tid, void **thread_return);
The pthread_join function blocks until thread tid terminates and then reaps any memory resources held by the terminated thread.

Joinable Vs Detached Thread

A thread is either joinable or detached. A joinable thread can be reaped and killed by other threads. Its memory resources are NOT freed until it is reaped by another thread. A detached thread cannot be reaped or killed by other threads. Its memory resources are freed automatically by the system when it terminates. By default, threads are created joinable. Threads can detach themselves by call pthread_detach(pthread_self()).

Protecting Shared Variables with Semaphores

A semaphore, s, is a global variable with a nonnegative values can only be manipulated by two special operations, called P (to test) and V (increment):

P(s): while (s <= 0) ; s-- ;
V(s): s++;

A binary semaphore is often called mutex, which provides mutual exclusive access to the shared data.

Signaling with Semaphores

A classic example is producer-consumer interaction, which is common in real systems. For the producer-consumer interaction, we typically need two semaphores, called full and empty.

Producer:

P(empty);
write buffer;
V(full)

Consumer:

P(full)
read buffer;
V(empty)

-------------------------------------------------------------
Synchronizing Threads with Mutex and Condition Variables

Mutex is used to protect sharing and like a binary semaphore. And condition variables are used for signaling. To understand condition variables, we will take pthread as an example.

pthread_cond_init(pthread_cond_t * cond, pthread_condattr_t *attr);
pthread_cond_wait(pthread_cond_t * cond, pthread_mutex_t *mutex);
pthread_cond_signal(pthread_cond_t * cond);

Pthreads require that a mutex variable be associated with the condition variable cond, and that a call to pthread_cond_wait must always be protected by that mutex:

pthread_mutex_lock(&mutex);
pthread_cond_wait(&cond, &mutex);
pthread_mutex_unlock(&mutex);

some thread later indicates that the condition associated with cond has become true by making a call pthread_cond_signal(&cond), which wakes up exactly one of the waiting threads.

Reentrant Functions: DO NOT reference any shared data when they are called by multiple threads and thus they require no synchronization operations.

Other synchronization errors

Races: A race occurs when the correctness of a program depends on one thread reaching point x in its control flow before another thread reaches point y. A golden rule is that threaded programs must work correctly for any feasible trajectory in the progress graph.

Deadlock: a collection of threads are blocked, waiting for a condition that will never be true.

Wednesday, November 14, 2007

TCP/IP

TCP Protocol runs only in the end systems and not in the intermediate network elements, the intermediate network elements do not maintain TCP connection state. A TCP connection is also always point-to-point, between a single sender and a single receiver.

Three-way handshake connection-establishment procedure: the client first sends a special TCP segment, the server responds with a second special TCP segment, and finally the client responds again with a third special segment. The first two segments carry no payload, that is, no application data; the third of these segments may carry a payload.

TCP segment structures: 1) source port # 2) dest port # 3) sequence number 4) acknowledge number 5) receive window 6)checksum 7)Sign bits: ack urg, psh,rst,syn, fin8) header length 9)options 10)data 11)urgent data pointer

Sequence numbers: TCP views data as an unstructured, but ordered, stream of bytes. Sequence number for a segment is therefore the byte-stream number of the first byte in the segment.

Acknowledge numbers: The acknowledgment number that Host A puts in its segment is the sequence number of the next byte Host A is expecting from Host B. Because TCP only acknowledges bytes up to the first missing byte in the stream, TCP is said to provide cumulative acknowledgments.

TCP Congestion Control: TCP congestion-control mechanism has each side of a connection keep track of an additional variable, the congestion window, which imposes a constraint on the rate at which a TCP sender can send traffic into the network. 1) additive-increase, multiplicative-decrease 2) Slow start, the TCP sender begins by transmitting at a slow rate(hence the term slow start), but increases its sending rate exponentially. 3) reaction to timeout events. After a timeout event, a TCP sender enters a slow-start phase. TCP manages a variable called Threshold, which determines the window size at which slow start will end and congestion avoidance will begin. The canceling of the slow-start phase after a triple duplicate ACK is called fast recovery.

Priority Queue

Sunday, November 11, 2007

Graph Algorithms

Strongly connected components

String Algorithms

1. anagram (a word or phrase made by transposing the letters of another word or phrase)

Question: Find all the anagrams in a dictionary
Solution: 1) Sign each word in the dictionary so that words in the same anagram class have the same signature, and then 2) bring together words with equal signatures. A signature scheme based on sorting is as follow: order the letters within the word alphabetically. To bring together, we can sort the words in the order of their signatures.

2. Palindrome ( a word, verse, or sentence (as *Able was I ere I saw Elba*) or a number (as 1881) that reads the same backward or forward)

3. String rotation without extra memory, used in text editor (move lines)

For example, abcdefg -> efgabcd. Step 1: Reverse the substring separately. abcdefg->dcbagfe Step 2: Reverse the whole string: dcbagfe->efgabcd

Saturday, November 10, 2007

Trie or Prefix Tree

Thursday, November 8, 2007

Sorting in Linear time

1. Counting Sort (input consists of integers in a small range)
2. Radix Sort
3. Bucket Sort (input is uniformly distributed)

Any comparison sort algorithm requires nlgn comparisons in the worse case.

Counting Sort (Running time: theta(n))

Counting sort assume that each of the n input elements is an integer in the range 0 to k, for some integer k. The basic idea of counting sort is to determine, for each input element x, the number of elements less than x. This information can be used to place element x directly into its position in the output array. We need to do a slight modification to handle the situation in which several elements have the same value. In practice, we usually use counting sort when we have k = O(n), the running time is theta(n).

Counting sort is stable: numbers with the same value appear in the output array in the same order as they do in the input array, which is very important since counting sort is often used as a subroutine in radix sort.

Counting sort requires two other arrays: the array B[1...n] holds the sorted output, and the array C[0...k] provides temporary working storage.

The algorithm itself is not complicated, watch out boundary errors when you try to implement it.

Radix Sort

Radix-Sort(A, d)
1. for i <-- 1 to d
2. do use a stable sort to sort array A on digit i

Radix sort can be used to sort records of information that are keyed by multiple fields.

Given n d-digit numbers in which each digit can take on up to k possible values, Radix-Sort correctly sorts these numbers in theta(d(n+k)) time. When d is constant and k = O(n), radix sort runs in linear time.

Bucket Sort

Bucket sort, or bin sort, is a sorting algorithm that works by partitioning an array into a finite number of buckets. Each bucket is then sorted individually, either using a different sorting algorithm, or by recursively applying the bucket sorting algorithm. Bucket sort is a generalization of pigeonhole sort.

WinCVS

http://www.devdaily.com/wincvs/HowToUseWinCVS/

http://ikon.as/wincvs-howto/#startingnew

TightVNC

An enhanced version of VNC (an abbreviation for Virtual Network Computing) is a great client/server software package allowing remote network access to graphical desktops.

TightVNC can be used to perform remote control and administration tasks in Windows, Unix and mixed network environments.

http://www.tightvnc.com/

Dead Lock

Network Flow

Terms:

A Flow in graph G must satisify 1) capacity conditions. For each e belong to G, 0<=f(e)<=Ce. 2) conservation conditions. For each node v other than s and t, the flow value over all edges entering node v equals the sum of flow values over all edges leaving node v.
The capacity of a cut, denoted as c(A, B), is the sum of the capacities of all edges out of A.

Properties:

The residual graph Gf has at most twice as many edges as G.

The value of every flow is upper-bounded by the capacity of every cut. Let f be any s-t flow, and (A, B) any s-t cut, Then the value of flow v(f) <= c(A, B) , which is the capacity of a cut (A, B).

Max-Flow Min-Cut Theorem

In every flow network, there is a maximum flow f and a minimum cut (A, B) so that v(f) = c(A,B). There can be many minimum-capacity cuts in a graph G. All edges out of A are completed saturated with flow, while all edges into A are completely unused.

Ford-Fulkerson Algorithm: this algorithm will terminate when there is no s-t path in the residual graph. The algorithm is as follows.

Initially f(e)=0 for all e in G
While there is an s-t path in the residual graph Gf
...Let P be a simple s-t path in Gf
...f'=augment(f,P) // the running time for this helper function is O(m)
...Update f to f'
...Update the residual graph Gf to be Gf'
Endwhile
Return f

Helper function
augment(f,P)
..Let b = bottleneck(P,f)
..For each edge (u,v) in P
....If e=(u,v) is a forward edge then then
......increase f(e) in G by b
....Else ((u,v) is a backward edge, and let e=(v,u))
.....decrease f(e) in G by b
....Endif
..Endfor
..Return ()

Edmonds-Karp Algorithm

Tuesday, November 6, 2007

LaTeX

A very power formatter, you should not miss it.

Good tutorial on the website:

http://www.maths.tcd.ie/~dwilkins/LaTeXPrimer/

Monday, November 5, 2007

Associative array: Hash Table vs BST

An associative array (also map, mapping, hash, dictionary, finite map, lookup table, and in query-processing an index or index file) is an abstract data type composed of a collection of keys and a collection of values, where each key is associated with one value. The operation of finding the value associated with a key is called a lookup or indexing, and this is the most important operation supported by an associative array. The relationship between a key and its value is sometimes called a mapping or binding.

Associative arrays are usually used when lookup is the most frequent operation. For this reason, implementations are usually designed to allow speedy lookup, at the expense of slower insertion and a larger storage footprint than other data structures.

Representations: Hash Table vs Binary Search Tree (self-balancing)

There are two main efficient data structures used to represent associative arrays, the hash table and the self-balancing binary search tree. Skip lists are also an alternative, though relatively new and not as widely used. Relative advantages and disadvantages include:

Hash tables have faster average lookup and insertion time (O(1)), while some kinds of binary search tree have faster worst-case lookup and insertion time (O(log n) instead of O(n)). Hash tables have seen extensive use in real time systems, but trees can be useful in high-security real time systems where untrusted users may deliberately supply information that triggers worst-case performance in a hash table, although careful design can remove that issue. Hash tables shine in very large arrays, where O(1) performance is important. Skip lists have worst-case operation time of O(n), but average-case of O(log n), with much less insertion and deletion overhead than balanced binary trees.

Hash tables can have more compact storage for small value types, especially when the values are bits.

There are simple persistent versions of balanced binary trees, which are especially prominent in functional languages.

Building a hash table requires a reasonable hash function for the key type, which can be difficult to write well, while balanced binary trees and skip lists only require a total ordering on the keys. On the other hand, with hash tables the data may be cyclically or partially ordered without any problems.

Balanced binary trees and skip lists preserve ordering — allowing one to efficiently iterate over the keys in order or to efficiently locate an association whose key is nearest to a given value. Hash tables do not preserve ordering and therefore cannot perform these operations as efficiently.

Balanced binary trees can be easily adapted to efficiently assign a single value to a large ordered range of keys, or to count the number of keys in an ordered range.

http://en.wikipedia.org/wiki/Associative_array

Sunday, November 4, 2007

Splay Tree

A splay tree is a self-balancing binary search tree. A splay tree guarantees that any M consecutive tree operations starting from an empty tree take at most O(MlogN) time. The net effect is: there is no bad input sequences.

Good performance for a splay tree depends on the fact that it is self-balancing, and indeed self optimizing, in that frequently accessed nodes will move nearer to the root where they can be accessed more quickly. This is an advantage for nearly all practical applications, and is particularly useful for implementing caches; however it is important to note that for uniform access, a splay tree's performance will be considerably (although not asymptotically) worse than a somewhat balanced simple binary search tree.

The basic idea of the splay tree is that after a node is accessed, it is pushed to the root by a series of avl tree rotations. The intuition behind this: when a node is accessed, it is likely to be accessed again in the near future in many real applications. The side effect is a balancing tree to some extent.

The splay tree does NOT require the maintenance of height or balance information, thus saving space and simplifying the code to some extent. Splay trees are much simpler to program than AVL trees.

Assume X be a node on the access path at which we are rotating. X has both a parent (P) and a grandparent (G).
Splay Operations: 1) zig-zag case, where X is a right child and P is a left child (or vice versa). we perform a double rotation, exactly like an AVL double rotation. 2) zig-zig case. X and P are both left children. 3) zig-case. X's only parent is the root.

http://en.wikipedia.org/wiki/Splay_tree

Saturday, November 3, 2007

avl Tree: balanced binary search tree

An avl (Adelson-Velskii and Landis) tree is a binary search tree with a balance condition, the height of the left and right subtrees can differ by at most 1. The depth of the tree is O(logn).

After an insertion, only nodes that are on the path from the insertion point to the root might have their balance altered because only those nodes have their subtrees altered.

Possible violation might occur in four cases:

1. An insertion into the left subtree of the left child
2. ...................................right........................................
3. .................................... left ........................ right .....
4. .................................. right ....................... right .....

1. and 4. could be fixed by single rotate

2. and 3. could be fixed by double rotate

Rotation operations only need a few pointer changes. The node will need to record the height info within it.

Deletion in avl Tree is somewhat more complicated than insertion. And lazy-deletion is probably the best strategy.

Reference: Data structures and Algorithm Analysis in C++ Page 136

B-Trees

a B-tree is a tree data structure that keeps data sorted and allows searches, insertions, and deletions in logarithmic amortized time. It is most commonly used in databases and filesystems.

A B-tree is kept balanced by requiring that all leaf nodes are at the same depth. This depth will increase slowly as elements are added to the tree, but an increase in the overall depth is infrequent, and results in all leaf nodes being one more hop further removed from the root.

B-trees have substantial advantages over alternative implementations when node access times far exceed access times within nodes. This usually occurs when most nodes are in secondary storage such as hard drives. By maximizing the number of child nodes within each internal node, the height of the tree decreases, balancing occurs less often, and efficiency increases. Usually this value is set such that each node takes up a full disk block or an analogous size in secondary storage. While 2-3 B-trees might be useful in main memory, and are certainly easier to explain, if the node sizes are tuned to the size of a disk block, the result might be a 129-513 B-tree.

Search: Search is performed in the typical manner, analogous to that in a binary search tree. Starting at the root, the tree is traversed top to bottom, choosing the child pointer whose separation values are on either side of the value that is being searched. Binary search is typically (but not necessarily) used within nodes to find the separation values and child tree of interest.

---------------------------------------
2-3 Tree

* Every non-leaf node has 2 or 3 children
* All data is kept in the leaves
* All leaves are at the same level (the bottom level)
* All data is kept in sorted order
* Every non-leaf node will contain 1 or 2 fields.

These contain one or two fields which indicate the range of values in its subtrees. If a node has two children, it will have one field; if the node has three children, it will have two fields. Each non-leaf node will contain a value in field 1 which is greater than the largest item in its left sub-tree, but less than or equal to the smallest item in its right sub-tree (or center sub-tree, if it has three children). If that node has three children, field 2 contains a value which is greater than the largest value in the center sub-tree, but less than or equal to the smallest item in its right sub-tree. The purpose of these values is to direct a search function to the correct sub-tree and eventually to the correct data node.

The biggest drawback with the 2-3 tree is that it requires more storage space than the normal binary search tree.
---------------------------------------
2-3-4 Trees

A 2-3-4 tree is a B-tree of order 4.

2-3-4 trees are relatively difficult to implement in most programming languages because of the large number of special cases involved in operations on the tree. Red-black trees (see below) are simpler to implement, so they tend to be used instead.
---------------------------------------

http://en.wikipedia.org/wiki/B-tree
http://en.wikipedia.org/wiki/2-3_tree

Red-Black Tree

A binary search tree is a red-black tree if:

1. Every node is either red or black
2. Every leaf (nil) is black
3. If a node is red, then both its children are black
4. Every simple path from a node to a descendant leaf contains the same number of black nodes

Black-height of a node x, bh(x), is the number of black nodes on any path from x to a leaf, not counting x. A red-black tree with n internal nodes has height at most 2lg(n+1).

Thursday, November 1, 2007

Suffix Array

Suffix array: an array of pointers to characters. The pointers in the array point to every suffix in the string, hence the name "suffix array". Suffix array can be used to efficiently search a large text.

Applications:

1. Find the longest duplicated substring of characters in it.
The duplicated substring mush appear in two different suffixes. We sort the suffix array to bring together equal suffix. and then scan through this array comparing adjacent elements to find the longest repeated string. The running time is O(nlogn) due to sorting.

1.1. Given a new input string, How would you search a suffix array to find the longest match in the stored text?
1.2. Given two input texts, find the longest string that occurs in both.

2. Locate every occurrence of a substring within the string.
Finding every occurrence of the substring is equivalent to finding every suffix that begins with the substring. Thanks to the lexicographical ordering, these suffixes will be grouped together in the suffix array, and can be found efficiently with a binary search. If implemented straightforwardly, this binary search takes

O (m log n)

time, where

m

is the length of the substring. To avoid redoing comparisons, extra data structures giving information about the longest common prefixes (LCPs) of suffixes are constructed, giving

O (m + log n)

search time.

http://en.wikipedia.org/wiki/Suffix_array

Wednesday, October 31, 2007

Suffix Tree

The suffix tree for a string

S

is a tree whose edges are labeled with strings, and such that each suffix of

S

corresponds to exactly one path from the tree's root to a leaf. It is thus a radix tree for the suffixes of

S

.

Constructing such a tree for the string

S

takes time and space linear in the length of

S

.

Applications:
1. Check if a string

P

of length

m

is a substring in

O (m)

time

Subsequence, Substring, Suffix, Prefix

Subsequence is a generalization of substring, suffix and prefix. Finding the longest string which is equal to a subsequence of two or more strings is known as the longest common subsequence problem.

Substring of a string is a prefix of a suffix of the string, and equivalently a suffix of a prefix.

Lexicographical order(Lexicographic), dictionary order, alphabetic order
one may assume that all words have the same length, by adding blank spaces at the end, and considering the blank space as a special character which comes before any other letter in the alphabet. This also allows ordering of phrases.

References:
http://en.wikipedia.org/wiki/Suffix_%28computer_science%29
http://en.wikipedia.org/wiki/Suffix_tree
http://en.wikipedia.org/wiki/Lexicographical_order