Here is a generative grammar for arithmetic expressions taken from the Dragon Book p. 198:

E → E + T
E → E - T
E → T
T → T * F
T → T / F
T → F
F → ( E )
F → number

We can match its output with a Ruby 1.9 regular expression:

/^(?<E>(?<T>(?<F>([1-9]\d*|0)|\(\g<E>\))((\*|\/)\g<T>)*)((\+|\-)\g<E>)*)$/

For numbers, the regular expression matches non-negative integers.

The generative grammar is written to produce a rightmost derivation, but the regular expression must be written as if it were producing a leftmost derivation:

$ irb
irb(main):001:0> /(?<S>(\g<S>\+)*\d+)/
SyntaxError: (irb):1: never ending recursion: /(?<S>(\g<S>\+)*\d+)/
	from /usr/local/bin/irb:12:in `<main>'
irb(main):002:0> /(?<S>\d+(\+\g<S>)*)/
=> /(?<S>\d+(\+\g<S>)*)/
irb(main):003:0> /(?<S>\d+(\+\g<S>)*)/.match("1+2+3")
=> #<MatchData "1+2+3" S:"1+2+3">

A formal regular expression is built from characters and three operations: concatenation, alternation, and the Kleene star. A formal regular expression can’t distinguish strings with balanced parentheses from those with unbalanced parentheses. Also it can’t match the set

{ aⁿbⁿ | n ≥ 1 }

where aⁿ  is n occurrences of a. This follows from the equivalence of formal regular expressions and deterministic finite automata, a result shown in a 1956 paper by S.C. Kleene.

Some languages have regular expression extensions which make the above set matchable. In perl 5.8.8 we can do the following:

$ perl -e ' $re = qr/a(??{$re})*b/; print "matches\n" if "ab" =~ /^($re)$/ '
matches
$ perl -e ' $re = qr/a(??{$re})*b/; print "matches\n" if "aaaabbbb" =~ /^($re)$/ '
matches
$ perl -e ' $re = qr/a(??{$re})*b/; print "matches\n" if "aaaab" =~ /^($re)$/ '
$

In Ruby 1.9 we can do this:

$ irb
irb(main):001:0> "ab".match(/^(?<c>a(\g<c>)*b)$/)
=> #<MatchData "ab" c:"ab">
irb(main):002:0> "aaaabbbb".match(/^(?<c>a(\g<c>)*b)$/)
=> #<MatchData "aaaabbbb" c:"aaaabbbb">
irb(main):003:0> "aaaab".match(/^(?<c>a(\g<c>)*b)$/)
=> nil

This feature isn’t available in Ruby 1.8, even though the Oniguruma regexp engine added it in 2003.

Python allows portions of the regular expression to be named and referenced, but as of 2.5.1 this cannot be done recursively:

$ python
Python 2.5.1 (r251:54863, Feb  6 2009, 19:02:12)
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> re.match("^(?P<c>a(?P=c)*b)$", "ab")
<_sre.SRE_Match object at 0x787e0>
>>> re.match("^(?P<c>a(?P=c)*b)$", "aaaabbbb")
>>> re.match("^(?P<c>ab)(?P=c)*$", "ababab")
<_sre.SRE_Match object at 0x788a0>

Python 3.1.1 also appears to lack recursion.

C++ has class privacy, not object privacy:

$ cat private.cc
#include <stdio.h>

class A {
private:
  void privy() { printf("hello\n"); }
public:
  void pubby(A *a) { a->privy(); }
};

int
main(int argc, char **argv) {
  A *a1 = new A;
  A *a2 = new A;
  a2->pubby(a1);
}
$ g++ private.cc
$ ./a.out
hello

Ruby has object privacy, not class privacy:

$ cat private.rb
class A
  private
  def privy
    puts "hello"
  end

  public
  def pubby(other)
    other.privy
  end
end

a1 = A.new
a2 = A.new
a2.pubby(a1)

$ ruby private.rb
private.rb:9:in `pubby': private method `privy' called for #<A:0x3e664c> (NoMethodError)
	from private.rb:15:in `<main>'

The GUI utility for poking special characters into a document is called Character Palette on Mac. On Windows it is called Character Map.

Mac Character Palette is better. It supports characters outside the Basic Multilingual Plane. Windows Character Map apparently uses UCS-2 internally. Even within the BMP, the fonts that Mac has installed by default cover more Unicode points. Mac will display all the glyphs for a selected Unicode point in the Font Variation drop down.

Windows, in contrast, makes the user choose a font, and then shows the supported points for the font and their glyphs. If the point isn’t supported, the user must try a different font. Each time the font is changed, the range of points displayed is reset to 0000-00A9. Thus looking through the fonts is tedious, especially if the user doesn’t know the Unicode point of the desired character.

Character Palette makes browsing easier by displaying the Unicode point numbers in the margins. It also organizes the Unicode characters into named sections, and displays the section names. Character Map will also do this if the value in the Group By box is set to Unicode Subrange, in which case the section names appear in a separate window.

Windows Character Map displays just 10 x 16 characters at a time. Character Palette is resizable, but can easily display 12 x 16 characters.

Mac can keep frequently used characters in a list of favorites.

character_palette

The Mac Character Palette is an always-on-top window, which I find irritating. When inserting characters, they go into the window with focus, but it is easy to forget which window has focus when the Character Palette is covering it up.

Windows Character Map does not insert characters directly into other applications. Instead it uses the system clipboard, adding an extra keystroke to the process. However, several characters can be selected and added to the clipboard together.

character_map

Versions used: Mac OS X 10.5 and Windows Vista

Cuneiform Ruby

July 3, 2009

I’m guessing that you don’t have a cuneiform font installed. Install one now.

Cuneiform characters were officially assigned Unicode points in 2006. None of these points are in the Basic Multilingual Plane. A lot of software which claims to handle Unicode cannot handle points outside of the BMP. For example, before version 1.5 java assumed all characters could fit into 16 bits. So cuneiform is a good test of the system.

Ruby 1.9 extends the \u notation for entering Unicode points in strings:

bash$ irb
irb(main):001:0> puts "\u{1204C}"
𒁌

Better yet, use cuneiform as a variable

bash$ cat cuneiform.rb
# encoding: utf-8
𒀻     = 13
puts 𒀻

bash$ ruby cuneiform.rb
13

I had better luck with vi than emacs editing the file. There were display issues. Beacuse 𒀻 is quite wide for a single character and Terminal is fixed width, the cuneiform character was overlapping the characters that follow it.

bash$ cat cuneiform.rb
# encoding: utf-8

class 𒀽
  def puts
    puts "I am cuneiform: \u{1203D}"
  end
end

c = 𒀽.new
c.puts
bash$ ruby cuneiform.rb
cuneiform.rb:3: class/module name must be CONSTANT

So this tells us that Ruby considers the character to be lower case. Ruby doesn’t appear to recognize any non-ASCII capital letters in source code:

bash$ cat greek.rb
#encoding: utf-8
class Λ
  def initialize(λ)
    @λ = λ
  end
  def apply(*a)
    @λ.call(*a)
  end
end

λ = Λ.new(lambda { |x,y| x+y })
puts λ.apply(3,7)

bash$ ruby greek.rb
greek.rb:2: class/module name must be CONSTANT

Ruby is aware of capitals in strings:

irb(main):010:0> s = "\u039B"; puts s; /\p{Upper}/.match(s)
Λ
=> #<MatchData "Λ">
irb(main):011:0> s = "\u03BB"; puts s; /\p{Upper}/.match(s)
λ
=> nil
irb(main):012:0> s = "\u{12040}"; puts s; /\p{Upper}/.match(s)
𒁀
=> nil

We can use Unicode in symbols:

# encoding: utf-8
alias :λ :lambda
add = λ { |x,y| x + y }
puts add.call(3,11)

Kindle 2 Thoughts

June 28, 2009

I like this device. I’d like to get rid of my four shelves of books and rely on this thing. Unfortunately, we’re not quite there yet, since many of the books aren’t available as Kindle editions. The replacement cost would be high, though if I paid for my Kindle books as I needed them, it might not be too bad. Many of my books I probably won’t use again.

Some areas for improvement in order of declining importance:

availability and quality of Kindle books:
Lots of books still not available. Some Kindle books have quality issues. Some lack a table of contents with links, and left and right on the 5-way controller, which is supposed to move by chapter or section, doesn’t always work. I see books with bits like “procrast- ination”, which suggest some sort of problem in the conversion to Kindle process. Downloading a sample of the book first or reading reviews can help avoid these bad books, but it might not be a big deal since the bad books are usually the dollar books.

navigation:
Should be able to execute commands in the menu with one or two keystrokes. Droppping the menu and navigating to the desired action with the 5 way can easily be 5 or more keystrokes (counting the 5 way as a key). Also, sometimes a sentence or paragraph is broken by the page break, and it is desirable to quickly see the whole. I would like to be able to reposition the page by one location with a single stroke. Best I can do now is use the goto location in the menu. There is some weirdness I don’t understand about the goto location, because sometimes the location provided doesn’t become the beginning of the page. Are not all locations positionable at the top of the page? Also, if you page backward, sometimes it takes more page forwards to return to the original position, which is disorienting.

screen refresh:
Too slow, affects navigation, both when paging and moving the cursor. Is this a limitation of the screen technology or the speed of the processor?

symbols
Using the symbol palette that pops up when you hit the SYM key stinks. First of all, there aren’t enough symbols. When making a note on a mathematical text, I would like to have Greek letters. Also, when using the 5 way, I find it too easy to push in instead of push left, which results in the wrong symbol selected. Not sure what the solution is. Have every symbol in the table numbered, so it can be selected with one or two keystrokes? Or add symbols to the number keys like a regular keyboard (plenty of space above the keyboard)?

internationalization:
There are some foreign language texts, and accented Latin displays correctly, but no foreign language dictionaries. Ideally, I would like to set two dictionaries simultaneously when reading a foreign text, to have for example a Spanish dictionary and a Spanish-to-English dictionary.

pdf:
Sent a PDF to amazon to have it converted and downloaded. It was a challenging document, 3M of images scans of a 1937 master’s thesis. Amazon chopped it up into pages. One page would be the best guess of the text in the image, and the next would be the image. Document size on the kindle didn’t increase much, and the process took about 5-10 minutes and cost $0.75. Although the text was ugly and had lots of errors and was not completely necessary because the image quality was readable, I see the value of including it for search. Unfortunately, I didn’t change the file name when I attached it in the email, and ended up with 344541425.pdf as the document name on my home screen. Could not find out how to change it. Not even hooking up my laptop via USB and editing the file name.

web, mp3:
The browser could be improved. The external speakers make crackle noise. I haven’t tried plugging in headphones.

other:
One can image other improvements: color screen, backlit screen. These may add more cost than value. I hanker for a stylus like the old palm pilot, for taking notes, or even doing the exercises in a technical work, so I don’t have to have an extra notebook and pen. Seems like there is room for increasing the screen size without increasing the overall size of the device: about an inch is wasted horizontally.

gcc and the mac

June 20, 2009

Re-acquainting myself with gcc. Since the man page is extensive, I summarize some useful options.

-c : compile and assemble, don’t link
-S : compile, don’t assemble
-E : preprocess, don’t compile
-o FILE : create target named FILE
-v : print version number. Output commands used during compilation
-Wall : highest warning level
-w : turn off warning
-g : add debugging info to object (for gdb)
-p : add profiling info to object (for prof)
-pg : add profiling info to object (for gprof)
-Q : output compiler stats for each function
-O,-O1 : optimize
-O2 : optimize more
-O3 : highest optimization level
-fast : optimize for speed
-Os : optimize for size
-D NAME : for preprocessor, set macro NAME to 1
-D NAME=VALUE: set macro NAME to VALUE
-U NAME : undefine macro NAME
-I DIR : add DIR to include search path
-L IDR : add DIR to library search path
-M : output make dependencies
-l LIB : search libLIB for symbol definitions (linker)

Some oddities peculiar to the mac that I’m not used to. Here’s how to create a shared object:

gcc -c stack.c
gcc -dynamiclib -o libstack.dylib -dylib stack.o

Also, the mac version of gcc appends an underscore to symbols

bash $ cat main.c
#include <stdio.h>

void
hello_world() {
  printf("hello world\n");
}

int
main(int argc, char **argv) {
  hello_world();
}

bash $ gcc main.c
bash $ ./a.out
hello world
bash $ nm a.out
0000200c D _NXArgc
00002008 D _NXArgv
00002000 D ___progname
00001fb4 t __dyld_func_lookup
00001000 A __mh_execute_header
00002004 D _environ
         U _exit
00001fc2 T _hello_world
00001fe3 T _main
         U _puts
00002010 d dyld__mach_header
00001fa0 t dyld_stub_binding_helper
00001f60 T start

Shortcut Keys

June 13, 2009

Note to self: these shortcut keys are useful. Practice using them.

mac

I set the dock to hide itself. I set the number of spaces to two, so that I am never searching for a window. If I don’t see it, it’s in the other space. Also, both arrow keys take me to the other space, so I don’t need to remember where I am. Also just learned about pbcopy and pbpaste.

cmd-del        move selection to trash
cmd-rarrow   move between spaces (click and hold on window to carry it with you)
cmd-larrow    move between spaces
cmd-E            eject
cmd-W           close window
cmd-Q           quit app
cmd-+           adjust magnification
cmd--
F11                 expose (at least that's where I want it)
cmd-M            minimize window
cmd-TAB         change app

emacs

Don’t exit emacs to prowl the filesystem or run a command. Wasteful.

M-!              execute external cmd
M-x shell    shell buffer
C-c C-c       send C-c (shell mode)
C-c C-d       send C-d (shell mode)
C-c C-z       send C-z (shell mode)
C-x C-b      display buffer list
M-x sort-lines
M-x reverse-region
^                  visit parent directory (dired)
d                  mark for delete (dired)
i                   display contents of subdir at bottom of same buffer (dired)
o                  open file in different window (C-o doesn't change focus)
m                 mark (dired)
u                  unmark (dired)
x                  expunge files marked for delete (dired)
#                  mark auto-save files for deletion (dired)
~                  mark emacs backup files for deletion (dired)
D                 delete marked files (dired)
R                  rename marked files (multiple files are moved into dir)
C                  copy marked files
+                 make a subdir (dired)
$                  hide subdir (dired)
M-$              hide all subdirs (dired)
M-x find-name-dired
M-x find-grep-dired
M-x find-grep
M-x wdired-change-to-wdired-mode   rename and move files by editing the dired buffer

ssh, screen

Remember to ssh user@host instead of ssh host followed by su user. Use ssh -X to launch X windows, and then emacs can be in its own window. Also, you pay for that up front laziness: how to set up ssh keys.

Basic screen procedure: run screen, run the comand, then detach with C-a d. Reattach with screen -r. If multiple screen sesssions, specify pid and terminal. Screen is unkind to emacs users because of the C-a binding but that is all you need. Stop using nohup.

firefox

installed “It’s All Text!” and reinstalled Firebug and YSlow. For the record the other addons I find useful: Live HTTP Headers, Web Developer, google and del.icio.us toobars.

websites

got a github account. Pastie at gist.github.com

also look into

  • running svn commands from within emacs
  • keep some standard dot files .bashrc .irbrc .emacs at github
  • install these emacs files
  • color highlighting in emacs not working on linux

Io has a normally distributed random number generator:

bash $ io
Io 20090105
Io> Random gaussian
==> 1.0989883665373978
Io> Random gaussian(100,10)
==> 88.2743995098800269

The underlying source code uses the Box-Muller transform:

double RandomGen_gaussian(RandomGen *self, double m, double s)
{
		// http://www.taygeta.com/random/gaussian.html
	double x1, x2, w, y1, y2;

	do {
		x1 = 2.0 * RandomGen_randomDouble(self) - 1.0;
		x2 = 2.0 * RandomGen_randomDouble(self) - 1.0;
		w = x1 * x1 + x2 * x2;
	} while ( w >= 1.0 );

	w = sqrt( (-2.0 * log( w ) ) / w );
	y1 = x1 * w;
	y2 = x2 * w;

		// The following code resulted in a lot of nans being returned. The
		// following code *should* also be slower.
		/*
	double x1 = 2.0 * RandomGen_randomDouble(self) - 1.0;
	double x2 = 2.0 * RandomGen_randomDouble(self) - 1.0;
	double y1 = sqrt( - 2.0 * log(x1) ) * cos( M_PI_2 * x2 );
		*/

	return ( m + y1 * s );
}

Factor also has a normally distributed generator:

bash $ factor
( scratchpad ) USE: random
( scratchpad ) 0 1 normal-random-float

--- Data stack:
-1.651647700885478

The interesting story here is the Factor source is algorithmically identical to what is commented out in the Io source:

: normal-random-float ( mean sigma -- n )
    0.0 1.0 uniform-random-float
    0.0 1.0 uniform-random-float
    [ 2 pi * * cos ]
    [ 1.0 swap - log -2.0 * sqrt ]
    bi* * * + ;

So a normally distributed random variable is easy to simulate, but what about the other distributions? It occurred to me that the inverse of the cumulative distribution function (qv quantile function) of an RV will convert a standard uniform random number into a sample from the RV.

Since most languages don’t have cumulative distribution functions, much less quantile functions, this converts the problem into an exercise in numerical analysis. I wondered if this was the technique used in the R source code. It wasn’t so for either rgamma or rnorm. rnorm underlyingly calls this routine, which consists of almost 300 lines of C:

/*
 *  REFERENCE
 *
 *    Ahrens, J.H. and Dieter, U.
 *    Extensions of Forsythe's method for random sampling from
 *    the normal distribution.
 *    Math. Comput. 27, 927-937.
 *
 *    The definitions of the constants a[k], d[k], t[k] and
 *    h[k] are according to the abovementioned article
 */
double norm_rand(void)
{

The paper mentioned in the comment doesn’t appear to be available online, but it is summarized here. It goes back to an idea von Neumann had in the 1950s. The technique still involves multiple uniform distribution samples, which are apparently cheaper than evaluating a quantile function. It uses far fewer samples on average than the 30 I used in rnorm.

Here is a relevant Knuth quote:

It is conceivable that someday somebody will invent a random number generator that produces one of these other random quantities directly, instead of getting it indirectly via the uniform distribution. But no direct methods have as yet proved to be practical, except for the “random bit” generator described in Section 3.2.2.

The Art of Computer Programming: Seminumerical Algorithms, p. 119