Random Geekiness

March 10, 2008

Using Hadoop with Kids

I got a hit to my blog referred through the Google search "how to use hadoop with kfs."  Google's suggested spelling correction, "kids" in place of "kfs," is funny.  Anyone have kids playing with Hadoop?

(This highlights a problem with purely statistical approaches: "kids" appears more frequently in documents than "kfs," because it's a more common word; but documents that have "kfs" and "hadoop" are more likely what you want.)
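Here is a minimal sketch of that point in Python. The counts are invented purely for illustration (they are not real search data), and the two helper functions are hypothetical names, not anything a search engine actually exposes. It only shows how ranking by raw frequency and ranking by co-occurrence with the rest of the query can disagree:

```python
# Toy counts, invented purely for illustration (not real search data).
# doc_freq[w]: number of documents containing the word w.
doc_freq = {"kids": 1_000_000, "kfs": 2_000}
# co_freq[(w, c)]: number of documents containing both w and c.
co_freq = {("kids", "hadoop"): 5, ("kfs", "hadoop"): 400}


def best_by_frequency(candidates):
    """Pick the candidate that appears in the most documents overall."""
    return max(candidates, key=lambda w: doc_freq[w])


def best_by_context(candidates, context):
    """Pick the candidate that co-occurs most often with the context word.

    The count of the context word is the same for every candidate, so
    comparing co-occurrence counts is the same as comparing
    P(candidate | context).
    """
    return max(candidates, key=lambda w: co_freq.get((w, context), 0))


candidates = ["kids", "kfs"]
print(best_by_frequency(candidates))          # kids
print(best_by_context(candidates, "hadoop"))  # kfs
```

With these made-up numbers, the frequency-only ranking prefers "kids," while conditioning on "hadoop" prefers "kfs."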


June 08, 2007

Sentence does not parse

When you work around a bunch of linguists, suddenly language becomes a playground.  Last night, someone at the bar said to me, "I've never not been here on a Thursday night."  Rather than following my gut reaction and correcting his grammar, I noticed that something didn't seem right about that utterance.  I've thought about it a bit today and I came up with two possible interpretations:

  1. "I come here every Thursday."
  2. "I have only visited this bar on Thursdays, not on other nights."

In (1), the speaker is trying to express that he frequents the bar every Thursday night.  In (2), the speaker is saying that every time he comes to the bar, it is a Thursday.  From the context of the conversation, it was clear to me that he meant (2).

While pondering this question last night over booze, my initial reaction was to figure out a logical representation of the sentences.  If x is a day, then:

  1. ∀(x) [ Thursday(x) ⇒ IamHere(x) ]
  2. ∀(x) [ IamHere(x) ⇒ Thursday(x) ]
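To see that the two formulas really do come apart, here is a rough sketch in Python that checks each one against a made-up attendance record. Everything in it is an assumption for illustration: the three-week window, the names `reading_1` and `reading_2`, and the two hypothetical bar-goers. It treats IamHere as depending on the day x.

```python
from itertools import product

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
# Domain of x: every day in a hypothetical three-week window,
# represented as (week_number, weekday_name) pairs.
days = list(product(range(3), WEEKDAYS))


def is_thursday(day):
    return day[1] == "Thu"


def reading_1(here):
    """For all x: Thursday(x) implies IamHere(x)."""
    return all(day in here for day in days if is_thursday(day))


def reading_2(here):
    """For all x: IamHere(x) implies Thursday(x)."""
    return all(is_thursday(day) for day in days if day in here)


# Hypothetical regular: shows up every Thursday, plus one Saturday.
regular = {(0, "Thu"), (1, "Thu"), (2, "Thu"), (1, "Sat")}
# Hypothetical occasional visitor: came once, and it was a Thursday.
occasional = {(1, "Thu")}

print(reading_1(regular), reading_2(regular))        # True False
print(reading_1(occasional), reading_2(occasional))  # False True
```

The regular satisfies only the first formula and the occasional visitor only the second, so neither reading implies the other.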

Nice, but it still doesn't solve the problem of figuring out which one he meant.  Translating English words like "never" and "always" into logic can be very difficult.  After talking to a bunch of linguistics PhDs here at Powerset, I've come to the conclusion that there's actually some subtlety in the original formulation itself:

  1. "I'm never not here on a Thursday."
  2. "I've never not been here on Thursdays."

Subtle differences in tense and number seem to suggest different senses for these extremely similar sentences.  Because of the context of the conversation, I had a big advantage in selecting the right interpretation. 

Sentences like this, which break down my own "advanced parsing engine," make me appreciate the technology in Powerset even more.  In everyday conversation, we understand sentences without having to consider their structure.  Ambiguity occurs, but we humans have an incredible ability to use context to interpret sentences.  But we do all of this instinctively: when we have to look at the syntax of a sentence and ask why we make the interpretation we do, it's really, really difficult.  Encoding all of those "natural" rules requires more than a naive understanding of language.

Any guesses about how many numbers I got at the bar last night? =)