
Z-Filters: How to Listen to a Hundred Million Voices

Z-Filters is a technique for listening to what millions of people are talking about.

If a hundred million people were to talk about a hundred million different things, making sense of it would be a hopeless task, but that’s not the way society operates. The number of things people are talking about at any given moment is much, much smaller: thousands, not millions.

[Figure: bubble-view.png — the bubble view of currently surging words, discussed near the end of these notes.]

Thousands of subjects at one time are still far too many for a human to make sense of, but making sense of the data torrent is still possible because, while thousands of subjects may exist at once, on average each persists for quite a while. This means that the rate at which new subjects arrive turns out to be quite manageable, more like several a minute. This is where Z-Filters come in. If you can identify the new subjects quickly enough, as they first pop up, it becomes possible not only to know what they all are, which is interesting in itself, but to watch them develop in the context in which they begin and evolve. You can organize the chaos of countless utterances into meaningful clumps in real time and actually observe what a huge mass of people are thinking and talking about at a fine grain.

Z-Filters can do this with the Twitter fire hose almost in real time—the update cycle is faster than a person can type a Tweet or forward a re-Tweet—and at a fine enough grain that you can watch a new Twitter subject, which may have appeared in only a dozen or so Tweets, as it evolves and ramifies into diverse new subjects.

There are some computational tricks involved, but just as important is defining precisely what the goal is, i.e., what it means to listen to so many voices.  With the insight that you only have to discover what’s new, the problem becomes much less hopeless than it sounds.

These notes describe applying Z-Filters to the Twitter fire hose, but Z-Filters is not inherently Twitter-specific. It is intended as a way to listen to any huge cacophony of voices, whether in real time or in exploring our digitally recorded past.

Twitter

Twitter started out as a micro-blogging platform, i.e., a way for individuals to address groups of readers who explicitly follow them. But almost from the beginning people have wanted to go beyond that and use it more like broadcast radio to address the world at large. From the opposite side, the public would also like to be able to treat Twitter like a radio and browse through the stations looking for interesting content.

The trouble is that Twitter is like millions of radio stations with no program guide. Sure, it works like a broadcast medium for Donald Trump and Beyoncé, but only because they have so many followers that their Tweets themselves constitute conventional news. This allows other Tweeters and the public at large to find out about them, often in part by way of a detour through conventional media. The Tweets of the less famous can likewise sometimes function like broadcasts if they are noticed and multiplied by enough re-Tweeters, but most Twitter subjects live out their existence unheard beyond a small bubble of followers. Twitter’s potential as a broadcast medium is limited by the fact that most genuine news isn’t Tweeted by mega-famous people—it’s Tweeted by near-unknown bystanders or participants, and thus never escapes its bubble.

What we’d like to do is become aware of each new subject that a significant number of people are Tweeting about as it appears, or within a few Tweets. By “significant” we mean tens or hundreds, not millions.

Clearly, identifying subjects and news from the fire hose requires machine help, but the standard techniques used for analyzing Twitter—natural language processing (NLP) and frequency analysis—have some limitations when applied to the seemingly simple problem of quickly noticing what’s new.

That is not to say that NLP is not extremely powerful at some aspects of analysis such as identifying information related to subjects that are of a priori interest. With even a vague idea of what one might be listening for, NLP techniques can effectively sniff out the faintest whiff of related Tweets and then work backward and laterally to sharpen their perception.
What NLP techniques are less efficient at is quickly identifying subjects and news that nobody is looking for. They can do it, but it’s a heavyweight operation requiring a great deal of processing power as well as a lot of accumulated fine-grained knowledge about how language is used.

A Problem

As hinted above, there is a basic conceptual problem with the goal of finding out what millions of people are talking about: even if we could identify all of the subjects that are active at a given moment, and cluster the Tweets around them, what good would it be? There are probably 10,000 or so subjects current at any given moment. To put this in scale, the average adult only knows between 12,000 and 20,000 words—what insight would that giant glob of information provide?

Punting on that problem in favor of identifying only the new subjects radically shrinks the computational task and, at the same time, delivers the information in a much more useful form. Seeing all new subjects means that eventually you see all subjects, and you see them when they are most useful. Moreover, the arrival rate for new subjects turns out to be quite small, making the result much more manageable for human users.
Unfortunately, it sounds at first like a harder problem—after all, how do you know a subject is new if you haven’t already identified all the old subjects? Actually, you can.
Before explaining, let’s take a quick look at how Tweets are analyzed today.

A video relating to this project can be seen here in Google Drive. It has much the same material as these notes but shows the program at work.

How Twitter is Analyzed Today

Hundreds of millions of Tweeters produce something like 5000 Tweets per second, night and day—half a billion Tweets every 24 hours. It is no wonder that they call it “the fire hose.”

Much of the fire hose consists of disconnected and meaningless utterances, but we’re interested only in the substantial fraction that relates to recognizable, coherent subjects. Suddenly, a joke will be re-Tweeted hundreds of times, kicking off succeeding waves of related jokes and comments over a period of minutes or hours. Rumors abound: Adele dies about once a week in the Twittersphere, and each of her deaths is followed by waves of denial and confirmation. Commentary follows every surprise ending of a TV show, spectacular sports play or hit song at a concert. Intriguingly, genuine news often breaks hours earlier on Twitter than in legitimate news services, although this is rarely discovered until long after the news item has appeared in conventional media.

What Exactly Is a Subject?

Supreme Court Justice Potter Stewart said of obscenity: “I shall not today attempt further to define the kinds of material I understand to be embraced within that shorthand description; and perhaps I could never succeed in intelligibly doing so. But I know it when I see it.”

Subjects are like that—hard to define, but you know them when you see them. In principle, two people can talk about the same subject without using any of the same words, but that’s rare. In practice, clusters of shared words turn out to be a very good proxy for the abstraction of a subject.

That clusters of shared words are a good proxy makes sense. The overwhelming majority of words are extremely rarely used (see Zipf’s Law). If a few words that are normally rare are suddenly a little less rare, and you find them popping up together in the same Tweets, it’s unlikely to be a coincidence. For our purposes, think of subjects as little rare-word clouds. Famous names and highly recognizable words actually matter very little. When it comes to identifying news, “Bieber” and “Adele” are of little more value than background words like “the” and “a.” What actually identifies meaningful subjects tends to be the rarely used words. An obscure street name and the words “egg” and “house” popping up together anomalously tell you that something is up long before the name “Bieber” shows a significant blip in usage.

Before looking at how it works, let’s first consider the strengths and weaknesses of NLP and relative frequency analysis (trending), and where they run into limits in identifying new subjects in a text stream such as Twitter.

Natural Language Processing (NLP)

NLP techniques are strongest in identifying and classifying Tweets relating to subjects that we know exist a priori. If you had to sum up what it’s about in one word, it would be “recognition.” This doesn’t mean just grepping for keywords, although regular expression matching is an important tool. In a typical NLP approach, the Tweet stream would be filtered using regular expressions and other tools to search for potentially interesting content, after which semantic analysis can be applied to the reduced result. With the semantics of the Tweets understood, the stream can be iteratively reprocessed to identify and classify still more related content that may not even use the keywords that flagged the Tweets for the first cut.
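As a toy illustration of that first cut (not any particular production pipeline; the keywords and the "text" field are hypothetical), the keyword pre-filter might look something like this:

```python
import re

# Hypothetical first-cut filter: keep only Tweets whose text matches a
# hand-built pattern of keywords of a priori interest. Heavier semantic
# analysis would then run only on this much smaller stream.
INTERESTING = re.compile(r"\b(earthquake|evacuation|recall)\b", re.IGNORECASE)

def first_cut(tweets):
    """Yield only the Tweets that mention one of the keywords of interest."""
    for tweet in tweets:
        if INTERESTING.search(tweet["text"]):
            yield tweet
```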

It isn’t just about keywords. With enough processing power, time and storage, you can take this basic approach to another level, because you can accumulate a body of data on word associations, phrases, etc. over time, which lets you compare usage at the moment to the usage that is statistically expected. Sudden anomalies in the typical pattern of word association can be treated as potential subjects to be analyzed further. Moreover, each Tweet comes packaged with a variable but extensive glob of metadata. For instance, seemingly minor blips in the association of clusters of words, which may not rise above the level of noise across the entire fire hose, may be startling when the surplus occurrences are clustered around some particular geographical location. Many of the other data points can be considered as well, including networks of followers, individual Tweeting histories, the origin of the Tweet, etc.

The most obvious limitation of this approach is that it takes tremendous processing power to be effective. It also takes considerable time to develop the historical background data for a given data source, and a great deal of human ingenuity and expertise. A more fundamental limitation may be granularity: the size of the task grows as the view gets finer, making it difficult to identify subjects at speeds comparable to the time it takes to compose a Tweet.

Word Trending: Relative Frequency Analysis

“Now, exactly how many cattle are required for a stampede, Earl? Is it three or more? Is there a minimum to ‘pede?’” – Valentine McKee

Word trending based on frequency analysis is a simple way to get a glimpse of what the crowd is talking about. If a change in the frequency of usage for a word is positive and large, the word is identified as “trending.”

Trending is quick and easy, but it’s a crude tool. First of all, individual words aren’t subjects; they aren’t even a good proxy. Secondly, the most interesting news may not involve a very large number of Tweets—is it really a trend if it’s just a few dozen Tweets out of a relentless 5000/second stream?

Nevertheless, one can imagine it might be possible to use sets of trending words as a starting point for an NLP-based approach to trying to figure out what millions of people are talking about.

But there are snags. The shorter the interval between frequency computations, the finer the resolution for seeing high-frequency spikes in usage (short-lived or abrupt surges), but the worse for detecting low-frequency but possibly large surges. You could compare frequencies calculated at longer intervals too, but you must beware that meaningful frequency changes occur against a background of regular frequency changes that follow the sun, the work week, the cycle of holidays, etc., so your reference interval can’t be so long ago that it’s at a significantly different time of day. For a fine-grained view of subjects that signify a new or evolving event, you would need to do frequency calculations almost continuously.

For at least two reasons, the cycle time between frequency analyses must be very small. First, for a fine-grained view of new and evolving subjects, you would need to do frequency calculations very fast—ideally in something like the time it takes people to respond to a Tweet. Secondly, subjects are clusters of words with highly correlated usage in the Tweets, and the cost of correlation grows quickly with the number of words and Tweets. The fire hose is 300,000 Tweets a minute—you can’t afford to let them pile up for long. The computations are all simple, but you can only do them so fast.
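For contrast with what follows, here is a bare-bones sketch of this kind of interval-to-interval frequency comparison (my own illustration; the thresholds are arbitrary):

```python
from collections import Counter

def trending_words(prev_tweets, curr_tweets, min_count=50, min_ratio=3.0):
    """Naive trending: words appearing at least `min_count` times in the
    current interval and at least `min_ratio` times more often than in
    the previous interval. Each interval is a list of Tweet texts."""
    prev = Counter(w for t in prev_tweets for w in t.lower().split())
    curr = Counter(w for t in curr_tweets for w in t.lower().split())
    return [w for w, c in curr.items()
            if c >= min_count and c / (prev[w] + 1) >= min_ratio]
```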

A Different Way to Look At It

The key insight is that a word’s frequency isn’t what matters; the better indicator is that a word’s frequency is increasing rapidly. The two are such closely related concepts that you can miss the key difference: a word’s unusually high relative frequency of usage can persist for any amount of time, but its frequency can only increase rapidly for a very limited time. In practice, a surge in word frequency is usually relatively short and follows an S-curve, with a period of rapid increase both preceded and followed by longer periods of slower growth.

This insight is powerful because it allows you to shrink the problem by orders of magnitude. Say we choose a definition of busy that gives us 10,000 unusually busy words at any one time, and they stay abnormally busy for an average of a day. This means that words qualify as busy at a rate of only about 6.9 new words per minute, or roughly one every 8.6 seconds. Consider that if words age out of the rapid-growth period in, say, an hour, in this hypothetical example there would only be around 417 words to consider at one time.
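The back-of-envelope arithmetic for that hypothetical example is easy to check:

```python
# Hypothetical numbers from the example above.
busy_words = 10_000            # unusually busy words at any one time
busy_lifetime_min = 24 * 60    # each stays abnormally busy for about a day

new_per_minute = busy_words / busy_lifetime_min   # ~6.9 new busy words per minute
seconds_between = 60 / new_per_minute             # ~8.6 seconds between arrivals
in_growth_phase = new_per_minute * 60             # ~417 words if growth lasts an hour
```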

Balancing this ruthless automatic pruning of the current set of busy words is the empirical observation that significant subjects invariably bring in a broader commentary, with new relevant words that tend to keep Tweets on an important surging subject visible long after the words that originally got the subject noticed have peaked or plateaued in frequency and aged out. As long as people keep saying new things about a subject, the subject will tend to stay visible.

The shorter the compute cycle, the smaller the number of Tweets among which the busy words must be correlated and the sharper the granularity of observation. If you can do the computation faster than the typical time it takes a person to Tweet a response, you’re into the domain of real time; indeed, you could even respond to or otherwise interact with the Tweeters.

At this point, the reader might reasonably object that it sounds as if we’ve made the problem even harder—now we have to track not only word frequencies but also the rate at which word frequencies are changing over time. But that is not the case. You can detect changes in word frequency many times faster than you can compute the word frequencies themselves, and you can do it without tracking the previous frequencies, or indeed, without frequent computation of individual word frequencies.

The Trick

Computing the set of words that are on the leading edge of a surge is the interesting part of this process. The rest of the processing involved is very standard—the I/O, JSON parsing,  correlations, clustering, etc.—so we won’t talk much about those details here.

The trick to quickly computing the set of newly surging word frequencies is to not count words. Word counting for frequency analysis is laborious and memory-intensive because everything you do or store is multiplied by the cardinality of the universe of words (which is something like 20 million depending on what you call a word).

Instead of tracking individual words, we use a data structure that partitions the universe of words into pseudo-random subsets, or equivalence classes, each of which conflates many thousands of words. It is not individual words in the Tweet stream that are counted, but instances in each of this limited number of equivalence classes.  We do this three times, with a different pseudo-random partitioning for each, and from the evolving statistics of the counters in the resulting data structure, we can infer the words that are momentarily abnormally busy without ever actually counting individual words.

More on how this works below; the important thing is that by dealing with equivalence classes of words, instead of the words themselves, the number of data points we must deal with is proportionately smaller.

Both updating the counts and extracting the current moment’s newly surging words from this data structure are fast. Even with the full fire hose, the first factor that limits the speed of the core heuristic is waiting for sufficiently many fresh Tweets to produce a meaningful change in the current picture. For the decahose, the update period is a few seconds. For the full fire hose, the update period is sub-second.
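A minimal sketch of the counting side, in Python, might look like the following. This is illustrative only: the hash functions and counter sizes are stand-ins, aging out old Tweets is omitted, and the actual steps are spelled out in the next section.

```python
import hashlib

M = 4096          # counters (equivalence classes) per path
NUM_PATHS = 3     # three independent pseudo-random partitionings of the vocabulary

def triple_key(word):
    """Map a word to three pseudo-random indices in [0, M), one per path."""
    return tuple(
        int(hashlib.sha1(f"{path}:{word}".encode()).hexdigest(), 16) % M
        for path in range(NUM_PATHS)
    )

# One counter array per path. Classes are counted, never individual words.
counters = [[0] * M for _ in range(NUM_PATHS)]

def add_tweet(words):
    """Fold one Tweet's words into the per-class counters.
    (Aging out the words of Tweets that leave the N-Tweet window,
    i.e. decrementing the same counters, is omitted here.)"""
    for word in words:
        for path, index in enumerate(triple_key(word)):
            counters[path][index] += 1
```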

More Detail

The following outlines the principle, starting at the point after the incoming Tweet JSON has been parsed and the words extracted.

  • The number of equivalence classes, M,  is chosen.
  • Each word will be represented by a key consisting of three pseudo-random integers in the range [0, M).
  • Processing will occur on three parallel paths, each of which is centered around an array of M counters.
  • A compute cycle consists of N Tweets.

For each Tweet in the stream

  • Each incoming word in the Tweet is mapped to a triple key.
    • Each element of the triple for a word is computed by a different hash function applied to the word.
    • The first time each word is encountered, the word/triple pair is saved in a database.
    • The existence of the word is also noted in a global Bloom filter so that you can instantly determine whether a word has been seen before.
  • Each of the three positions in a key corresponds to a different processing path.
  • For each of the three paths, the counter indexed by the value of the corresponding position in the triple (modulo the size of the counter array) is incremented.
    • This is done for every word in the N Tweets in the cycle.
    • As words from new Tweets are processed, words from old Tweets are aged out, so the data structure represents only the last N Tweets.
  • At the end of each cycle, each of the three sets of counter values is analyzed to discover the set of indices of counters with anomalously high counts.
    • Because the assignment of words to counters is pseudo-random and independent, the probability that a counter value occurred by random chance is easily computed as a Z-score (see any statistics book for a definition).
    • A high Z is chosen as a cutoff for what we regard as surging.
  • The Cartesian product of the three sets of exceptionally high indices is computed, resulting in a large number of triple keys.
    • This product will contain the triples that correspond to the busy words.
    • The product will also contain a great many spurious triples, most of which correspond to nothing, but these are easy to filter out using the same Bloom filter that is used to detect novel words (a sketch of this read-out step follows the list).
    • The valid triples are used to look up the words in a conventional database—there are only tens or hundreds, so the lookup takes a split second.
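
Pulling the read-out side together, here is a simplified sketch. It is not the original implementation: the Z-score baseline is taken against the other counters in the same array, and a plain dict stands in for the Bloom filter plus database that map surviving triples back to words.

```python
from itertools import product
from statistics import mean, stdev

def hot_indices(counts, z_cutoff):
    """Counter positions that are anomalously high this cycle, measured
    as a Z-score against the other counters in the same array."""
    mu, sigma = mean(counts), stdev(counts)
    return {i for i, c in enumerate(counts)
            if sigma > 0 and (c - mu) / sigma >= z_cutoff}

def surging_words(counter_arrays, word_for_triple, z_cutoff=6.0):
    """counter_arrays: the three per-path counter arrays for this cycle.
    word_for_triple: maps an (i, j, k) triple back to the word that
    produced it (a dict here; a Bloom filter plus database in the text).
    Returns the words inferred to be surging, without any individual
    word ever having been counted."""
    hot = [hot_indices(counts, z_cutoff) for counts in counter_arrays]
    candidates = product(hot[0], hot[1], hot[2])   # Cartesian product of hot indices
    return [word_for_triple[t] for t in candidates
            if t in word_for_triple]               # drop the spurious triples
```

Most triples in the product correspond to no word at all, so the lookup filter discards them immediately; a word survives spuriously only if it happens to hash into hot counters on all three paths at once.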

Like magic, we’ve reconstituted the set of anomalously busy words probabilistically, without actually counting any particular word! The computation actually speeds up as the Tweet rate grows, because an interval consists of a fixed number of Tweets, not a fixed span of wall-clock time.

By setting a cutoff Z-score sufficiently high, the number of counters that are considered significant can be throttled down to a level that suits our human idea of signal versus noise. If you set Z too low there will be a lot of false positives and marginal subjects that come and go in just a few seconds. If you set Z too high, it will take an undue amount of activity to make a subject visible. Adjust to taste.

Objection!

If you’ve been following along, you may object that this scheme won’t work because word frequencies are extremely variable (Zipf’s Law again), and you would be correct. What actually happens is that the incoming word stream is split into F frequency classes representing equal numbers of word occurrences (not distinct words) in each class. The words within each class have approximately equal frequency over a long time interval. Each frequency class is processed separately as described above, in parallel, and the results are merged at the end. If you have F frequency classes (say, six or seven), you will have 3F total processing paths.

Splitting the stream is cheap and requires only an occasional global frequency calculation to keep the frequency data fresh. The relative frequencies are used to build a set of Bloom filters that split the stream F ways. New words that come along default to the lowest frequency class until the next frequency computation assigns them to a class. The frequency filters are computed offline.
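A sketch of that split (plain sets stand in here for the Bloom filters, and the class boundaries are chosen naively from an offline word count):

```python
def build_class_filters(word_counts, num_classes=6):
    """Partition the vocabulary into classes holding roughly equal numbers
    of word *occurrences* (not distinct words), commonest words first.
    Plain sets stand in for the Bloom filters described above."""
    total = sum(word_counts.values())
    target = total / num_classes
    filters, current, running = [], set(), 0
    for word, count in sorted(word_counts.items(), key=lambda kv: -kv[1]):
        current.add(word)
        running += count
        if running >= target and len(filters) < num_classes - 1:
            filters.append(current)
            current, running = set(), 0
    filters.append(current)
    return filters

def frequency_class(word, filters):
    """Route a word to its class; unseen words default to the rarest class."""
    for i, members in enumerate(filters):
        if word in members:
            return i
    return len(filters) - 1
```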

The process of identifying the words is surprisingly accurate. Because the word streams are structured in advance to be of approximately the same frequency, it is extremely unlikely that a counter will exceed an appropriately chosen Z cutoff by chance (that’s what a Z-score means!)

When a spurious word is occasionally generated, it is usually harmless, because the chances are overwhelming that a randomly chosen word will not appear in any of an interval’s Tweets. While there are tens of millions of words, only a few thousand distinct words are used in any few seconds’ worth of Tweets, so in most cases a spurious word will appear in few or none of the current N Tweets.

Identifying the Subjects and Clumping the Tweets

Processing intervals are typically only ten thousand or so Tweets, which can be stored in memory in just a few megabytes. Therefore you can easily keep the most recent several intervals of Tweets in memory for fast analysis.

With the current busy words for an interval in hand:

  • All Tweets for the interval that do not contain any busy words are set aside.
  • The busy words are correlated by their co-appearance in the current interval’s  Tweets.
    • Busy words constitute a minute percentage of all words, so the appearance of even one busy word in a Tweet is significant.
    • The appearance of two or more busy words in the same Tweet makes it almost certain that the words are associated with the same subject.
  • The Tweets containing busy words are clustered according to the appearance of subsets of the busy words, using a standard clustering algorithm (a simple sketch follows this list).
  • Optionally, Tweets from the preceding interval or intervals can also be scanned for words that had not yet made the grade but subsequently did. This is valuable because a subject may exist as only a single Tweet, or a few Tweets, for multiple cycles before it can be identified.
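
As a stand-in for the standard clustering algorithm mentioned above (the original's choice of algorithm isn't specified here), a simple connected-components version of the clumping step might look like this:

```python
from collections import defaultdict

def clump_tweets(tweets, busy_words):
    """Group an interval's Tweets into clumps of co-occurring busy words:
    two busy words are treated as the same subject if some Tweet in the
    interval contains both. Tweets with no busy words are set aside."""
    busy = set(busy_words)
    parent = {w: w for w in busy}

    def find(w):                               # union-find with path compression
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    hits_per_tweet = []
    for text in tweets:
        hits = busy & set(text.lower().split())
        hits_per_tweet.append(hits)
        pairs = list(hits)
        for a, b in zip(pairs, pairs[1:]):     # union all busy words in this Tweet
            parent[find(a)] = find(b)

    clumps = defaultdict(list)
    for text, hits in zip(tweets, hits_per_tweet):
        if hits:
            clumps[find(next(iter(hits)))].append(text)
    return list(clumps.values())
```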

Tweet clumps often have overlapping sets of busy words because “a subject” is often actually several related subjects. The GUI provides a way to generate new randomized clusterings on demand. What one finds is that the overlapping clusterings show different aspects of an ongoing conversation.

In practice, this works extremely well. Over the course of a football game or similar event, the high points can easily be recognized in the Twitter stream, sprinkled among many other subjects.

Speed

Speed is an interesting issue because, over reasonable volumes, the algorithm is primarily limited by waiting for input, not by the core processing. There have to be enough fresh Tweets to make a meaningful update.

The demo, processing a single decahose, runs at about 10 seconds per tick, which is about 5000 Tweets. Increasing the input rate drops the processing time linearly until the machine can no longer keep up with all the work of decompressing, parsing the Tweets from the original JSON, running the Web server and GUI for the output, periodic global word-frequency calculations, hosting the database, etc. Running as fast as it can, the demo updates every second or two.

Note that the demo is not distributed, yet is capable of processing up to about a half a firehose worth of Tweets from raw JSON.

A more industrial-strength implementation would offload most of this processing to separate machines. The core algorithm running on its own machine could easily handle many times the full fire hose volume. Distributing the core algorithm on multiple machines could make it faster still. An implementation for a high-volume text feed, for instance, all Facebook postings, or all Google searches, or all of Gmail is feasible.

A Major Event

One notable example of subject-discovery in the two-week sample data set used in the demo relates to the death of the singer Whitney Houston. A video of the GUI running at this time can be seen here. The algorithm picked up on the event within about a dozen tweets—barely time for re-Tweeting to have gotten under way.

Significantly, neither “Whitney” nor “Houston” was among the early words that triggered notice of the new subject. The two names are moderately common words that had, until then, normally appeared a few dozen times a day, together and/or separately. The words that first surged beyond the possibility of random chance were the names of her publicist, who was one of the first to Tweet the news. Other early surging words included “bath” and similar words associated with the early rumors. This is the expected behavior: because famous names appear all the time, it takes a large number of occurrences to constitute an unambiguous surge, while surges in the usage of unusual names and words are much more likely to signify something new and interesting.

[Figure: main-screen.png — the main screen view.]

The two illustrations show different views of this sad but interesting moment in Twitter history. Seconds earlier, the new talk was (and had been for many minutes) about a Lion King special, a South American band, and a number of other ephemeral subjects. You can see in the screenshot just above the minute when news of the singer’s death explodes. The graph covers about six minutes of the feed.

Bubbles: The bubble view (at the beginning of this article) shows the current 140 seconds of busy words.

  • The lava-lamp on the right represents many words relating to Whitney Houston but also words from other subjects that have not disappeared.
  • The intensity of color indicates a word’s frequency class.
  • The position from right to left indicates how long it has been since a given word was detected surging. All the way on the right indicates that a word is surging in the latest interval. If a word somewhere to the left suddenly surges again, it will pop back over to the far right.
  • The size of the bubble indicates the degree of abnormality of usage, adjusted for frequency class. (Seven frequency classes were used in this run.)
  • The text output shows sample Tweets for each subject, ordered by how long the subject has been surging (as opposed to merely existing). Tweets about the singer, though overwhelmingly common, are not yet at the top in this display because the event just happened, so they have only been surging for at most about three minutes.

Main Screen: This view shows an elapsed time view of the last six minutes at about the same time.  The moment where the lines erupt is only about six seconds after the first Tweets about the singer’s death. (This is not evident in the output, but was verified by a utility that scans the Tweet stream by brute force.)

  • The X axis represents time, covering about six minutes.
  • The Y axis represents the nominal Z-score.
  • The lines represent the surging of individual words over time in terms of the nominal Z score. This is not a word’s frequency of usage, but the improbability of the first derivative of that frequency occurring by random chance. Interpret it as the steepness of increase.

The earliest Tweets about the event showed the significant words being Foster, publicist, and Kristin because even a dozen or so instances of these words within a few seconds represents a huge increase in usage.

What’s It Good For?

Simple though it is, the Z-Filters approach makes it possible to discover what is happening globally across the Twittersphere at a fine grain, opening up a more broadcast-like mode of communication within Twitter. Being able to watch the new topics scroll by allows users to drop in on anything that looks interesting as it happens.

News that takes hours to show up in conventional media often shows up on Twitter in seconds or minutes. This suggests applications in entertainment and the financial industry.

It’s not just for real time. The same problems that limit the applicability of frequency analysis in real time also limit its applicability to long-term historical analysis. Even with historical data, you have to be able to process at least as fast as real time if you don’t want to fall farther and farther behind forever. The combination of fine grain and fast processing would make it a uniquely powerful research tool for analyzing public sentiment in huge data sets over time, for instance, the entire fire hose since 2006.

Because it can handle such large data flows, the tool could be used to get a sense of what entire populations are thinking both in ordinary times and in moments of crisis or mass disaster. By feeding the communications of millions into such a system, it is possible to monitor in real time the complex ways in which a population is reacting to an ongoing situation without violating individual privacy.
