Z-Filters is a technique for listening to what millions of people are talking about.
If a hundred million people were to talk about a hundred million different things, making sense of it would be a hopeless task, but that’s not the way society operates. The number of things people are talking about at any given moment is much, much smaller— thousands, not millions—and by limiting results to subjects that are newly emerging or newly active, a comprehensive view of “what’s new” can be seen at approximately the data-flow of the Times Square news ticker.
Twitter is the data source we’re talking about here, but the idea isn’t Twitter specific. The same problems are inherent no matter how you eavesdrop on a population. Thousands of subjects at one time is far too much for a human to make sense of, but making sense of the data torrent is still possible because, while thousands of active subjects exist at any given time, the rate at which truly new subjects arrive turns out to be quite manageable. Depending on how you define ‘subject’, it’s only a few to a few tens per minute. It’s worth noting that newly emerging subjects are usually quite small—a handful of Tweets is often enough to define a new subject.
This is where Z-Filters comes in. By restricting the view of the flow strictly to new subjects, it becomes possible to see every one, which is interesting in itself, but it also becomes possible to watch new subjects develop because each significant change to a subject makes it “new” again. A new subject that doesn’t evolve quickly ceases to be seen by the heuristic and fades from view, as it should, because any subject of genuine interest tends to be added to continually. The paired processes of identifying the new and automatically winnowing out the stale organize the chaos of countless utterances into a series of meaningful clumps that allow a human to comprehend what a huge mass of people are thinking and talking about in real time.
Z-Filters can process the Twitter fire hose almost in real time: when processing the deca-hose (1/10 of the full fire hose), the update cycle is faster than a person can type a Tweet or forward a re-Tweet. Counterintuitively, the update period for the full fire hose is even shorter.
The heuristic definition of a subject is fine-grained enough to catch subjects that have appeared in only a dozen or so Tweets, and follow them as they evolve and ramify into diverse new subjects in real time.
This requires some computational tricks, but equally important is carefully defining what it means to listen to millions of voices. The key insight is that if you restrict the view to a keyhole that shows only subjects that are new, the problem is reduced to human scale. In this way, you stretch out the view so that if you looked long enough you would see every subject, but at the leading edge you only see what is new or fresh.
These notes describe applying Z-Filters to the Twitter fire hose, but the principle is not inherently Twitter-specific. It can be applied to any huge cacophony of voices, whether in real time or in exploring our digitally recorded past. Because the heuristic can in principle consume data much faster than Twitter produces it, historical or social research can be performed on large time-spans that would be almost impossible to handle with analytics that run more slowly than real-time.
Twitter started out as a micro-blogging platform, i.e., a way for individuals to address groups of readers who explicitly follow them. But almost from the beginning people have wanted to go beyond that and use it more like broadcast radio to address the world at large. From the opposite side, the public would also like to be able to treat Twitter like a radio and browse through the stations looking for interesting content.
The trouble is that Twitter is like millions of radio stations with no program guide. Sure, it works like a broadcast medium for Donald Trump and Beyonce, but only because they have so many followers that the Tweets actually constitute conventional news. This allows other Tweeters and the public at large to find out about them. The Tweets of the less famous can occasionally bubble into public view if they are noticed and multiplied by enough re-Tweeters, but most Twitter subjects live out their existence unheard beyond a small bubble of followers. For this reason, Twitter’s potential as a broadcast medium tends to be limited by the fact that most genuine news isn’t Tweeted by mega-famous people—it’s Tweeted by near-unknown bystanders or participants, and thus never escapes its bubble.
The standard techniques used for analyzing Twitter—natural language processing (NLP) and frequency analysis—have some limitations when applied to the seemingly simple problem of noticing what’s new quickly.
As hinted above, there is a basic conceptual problem with the goal of finding out what millions of people are talking about: even if we could identify all of the subjects that are active at a given moment, and cluster the Tweets around them, what good would it be? Depending upon how you define “subject” there are probably at least 10,000 or so current at any given moment. To put this in scale, the average adult only knows between 12,000 and 20,000 words—what insight would that giant glob of information provide?
A video relating to this project can be seen here in Google Drive. It has much the same material as these notes but shows the program at work.
How Twitter is Analyzed Today
Hundreds of millions of Tweeters produce something like 5000 Tweets per second, night and day—half a billion Tweets every 24 hours. It is no wonder that they call it “the fire hose.”
One way to think about the analytic techniques conventionally applied to this data is that they fall into two categories: general-purpose analysis that gathers data to inform the public, and special-purpose analysis used by businesses, political campaigns, and the like. They have very different goals. The latter tends to be more elaborately developed, in part because paying customers know, at least in general terms, what they are interested in. This kind of analysis tends to rely heavily on natural language processing (NLP).
General-purpose analysis aimed at informing the public asks the very broad question of “what’s going on.” This tends to be less useful for at least two reasons. The first is that the problem is diffuse: what, exactly, does “what’s going on” mean when you’re talking about millions of people? The second is that, at first glance, any comprehensive answer would seem to be overwhelmingly huge; at any one time there must be many thousands of topics people are Tweeting about.
Hashtags are a third path which we aren’t really concerned with here. Hashtags let users apply simple metadata to Tweets to make it possible to find Tweets on a single subject even though they originate from many users. This is useful, but both the originators and the public need to have a prior understanding of the tags. To find breaking news effectively, you need to instantly identify subjects without this kind of explicitly created metadata.
What Exactly Is a Subject?
Supreme Court Justice Potter Stewart said of obscenity: “I shall not today attempt further to define the kinds of material I understand to be embraced within that shorthand description; and perhaps I could never succeed in intelligibly doing so. But I know it when I see it.”
Subjects are like that—hard to define, but you know them when you see them. But even without a good definition we can do a pretty good job of spotting them based on groupings of a certain category of words.
In principle, two people can talk about a subject without using any of the same words, but that is rare. In practice, clusters of shared words turn out to be a pretty good proxy for the abstraction of a subject. This makes sense because the overwhelming majority of words are extremely rarely used (see Zipf’s Law), and if you ignore the super-common words, you find that subjects you recognize as such tend to share sets of unusual words. Most words have frequencies of less than one in hundreds of thousands or even millions. If a handful of words that should show up at most once a day in the entire fire hose suddenly start popping up together in the same Tweets in a small time interval, it’s almost never a coincidence.
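The arithmetic behind “almost never a coincidence” is easy to check. Here is a back-of-envelope Poisson estimate using hypothetical numbers (a word that normally appears about once per day across roughly 500 million daily Tweets, seen a dozen times in a single 10,000-Tweet interval):

```python
from math import exp, factorial

# Hypothetical numbers: a word that normally appears about once per day
# across ~500M daily Tweets, suddenly seen 12 times in one 10,000-Tweet
# interval.
p = 1 / 500_000_000            # per-Tweet background probability of the word
lam = 10_000 * p               # expected occurrences in one interval
# P(12 or more occurrences by chance); terms past k=40 are negligible.
poisson_tail = sum(lam**k * exp(-lam) / factorial(k) for k in range(12, 40))
print(poisson_tail)
```

The tail probability comes out on the order of 10⁻⁶⁵, so even a single such word surging is overwhelming evidence of a real subject, let alone several surging together.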
Therefore, for our purposes, we can think of subjects as little rare-word clouds. Counter-intuitively, famous names and highly recognizable words actually matter very little when it comes to identifying news. “Bieber” and “Adele” are of little more value than background words like “the” and “a” because they appear in millions of Tweets every day. But when an obscure street name plus the words “egg” and “house” pop up together in a dozen Tweets, it signifies that something is up long before the name “Bieber” shows a significant blip in usage.
Before looking at how Z-Filters works, let’s look first at the strengths and weaknesses of NLP and relative frequency analysis (trending) and where they run into limits in identifying new subjects in a text stream such as Twitter.
Natural Language Processing (NLP)
NLP techniques are strongest in identifying and classifying Tweets relating to subjects that we know exist a priori. If you had to sum up what NLP is about in one word, it would be “recognition.” This doesn’t mean simply grepping for keywords, although regular expression matching is an important tool. In a typical NLP approach, the Tweet stream would be filtered using regular expressions and other tools to search for potentially interesting content, after which semantic analysis can be applied to the reduced result. With the semantics of the Tweets understood, the stream can be iteratively reprocessed to identify and classify still more related content that may not even use the keywords that flagged the Tweets for the first cut.
It isn’t just about keywords. With enough processing power, time and storage, you can take this basic approach to another level, because you can accumulate a body of data on word associations, phrases, etc. over time, which lets you compare usage at the moment to the usage that is statistically expected. Sudden anomalies in the typical pattern of word association can be treated as potential subjects to be analyzed further. Moreover, each Tweet comes packaged with a variable but extensive glob of metadata. For instance, seemingly minor blips in the association of clusters of words, which may not rise above the level of noise across the entire firehose may be startling when the surplus occurrences are clustered around some particular geographical location. Many of the other data points can be considered as well, including networks of followers, individual Tweeting histories, the origin of the Tweet, etc.
The most obvious limitation of this approach is that to be effective takes tremendous processing power. It also takes considerable time to develop the historical background data to make it effective for a given data source. It also takes a great deal of human ingenuity and expertise. An even more fundamental limitation may be in granularity. The size of the task increases with the granularity of the view, making it difficult and costly to identify subjects at speeds fast enough to expose new subjects within seconds.
Word Trending: Relative Frequency Analysis
“Now, exactly how many cattle are required for a stampede, Earl? Is it three or more? Is there a minimum to ‘pede?’” – Valentine McKee
Word trending based on frequency analysis is a simple way to get a glimpse of what the crowd is talking about. If a change in the frequency of usage for a word is large and increasing, the word is identified as “trending.”
Trending is quick and easy, but it’s a crude tool. First of all, individual words aren’t subjects or even good proxies for subjects. Secondly, the most interesting news may not involve a very large number of Tweets—is it really a trend if it’s just a few dozen Tweets out of a relentless 5000/second stream?
Nevertheless, one can imagine it might be possible to use sets of trending words as a starting point for an NLP-based approach to trying to figure out what millions of people are talking about.
But there are snags. The shorter the interval between frequency computations, the finer the resolution for seeing high-frequency spikes in usage (short lived or abrupt surges) but the worse for detecting low-frequency, but possibly large surges. You could compare frequencies calculated at longer intervals too, but you must beware that meaningful frequency changes occur against a background of regular frequency changes that follow the sun, the work week, the cycle of holidays, etc. so your reference interval can’t be so long ago that it spans different parts of the day. For a fine-grained view of subjects that signify a new or evolving event, you would need to do frequency calculations almost continuously.
For at least two reasons, the cycle time between frequency analyses must be very small. First, for a fine-grained view of new and evolving subjects, you would need to do frequency calculations very frequently—ideally in something like the time it takes people to respond to a Tweet. Secondly, if we think of subjects as clusters of unusual words with highly correlated usage among a set of Tweets, computing the subjects requires correlation, and the cost of correlation grows quickly with the number of words and Tweets. The fire hose is 300,000 Tweets a minute—you can’t afford to let them pile up for very long if you have to do a correlation operation on them. The computations are all simple, but there’s a practical limit to how fast you can do them.
A Different Way to Look At It
The key insight is that a word’s frequency isn’t actually important. A better indicator is that a word’s frequency is increasing rapidly. The two concepts are so closely related that you can miss a key difference: a word’s unusually high relative frequency of usage can persist for any amount of time, even permanently, but its frequency can only increase rapidly for a very limited time. In practice, the growth period for a surge in word frequency is usually relatively short and follows an S-curve, with a period of rapid increase both preceded and followed by longer periods of slower growth or a static level of popularity.
Considering only the leading edge of a surge is powerful because it allows you to shrink the size of the problem by orders of magnitude. Say we choose a definition of busy that gives us 10,000 unusually busy words at any one time, and say the words stay abnormally busy for an average of a day. Ten thousand seems like a big number, but over a day it’s only 6.9 newly busy words per minute, which is one every 8.6 seconds! Consider also that if words age out of the rapid growth period in an average of, say, one hour, in this hypothetical example there would be only about 415 newly abnormally busy words to consider at any one time.
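These back-of-envelope numbers are easy to reproduce. The 10,000 busy words and the one-day and one-hour lifetimes are the hypothetical parameters from the paragraph above, not measured values:

```python
# Hypothetical steady-state parameters from the text.
busy_words = 10_000        # unusually busy words at any one time
minutes_per_day = 24 * 60

new_per_minute = busy_words / minutes_per_day       # newly busy words per minute
seconds_between_arrivals = 60 / new_per_minute      # gap between new busy words
in_growth_phase = new_per_minute * 60               # words still in their first hour

print(new_per_minute, seconds_between_arrivals, in_growth_phase)
```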
Balancing this ruthless automatic pruning of the current set of busy words is the empirical observation that significant subjects invariably bring in a broader commentary, with new relevant words that tend to keep a set of Tweets on an important surging subject visible long after the words that originally got the subject noticed have peaked in frequency and aged out. As long as people keep saying new things about a subject, the subject will tend to stay visible.
The shorter the compute cycle, the smaller the number of Tweets among which the busy words must be correlated and the sharper the granularity of observation. If you can do the computation faster than the typical time it takes a person to Tweet a response, you’re into the domain of real time; indeed, you could even respond to or otherwise interact with Tweeters as events unfold.
At this point, the reader might reasonably object that it sounds as if we’ve made the problem even harder: now we have to track not only word frequencies but also the rate at which each word’s frequency is changing over time. However, that is not the case. Surprisingly, it is possible to detect changes in word frequency many times faster than you can compute the word frequencies themselves. Moreover, you can do this without tracking the previous frequencies of individual words. In fact, while you do need to know the background frequencies of words over timespans of, say, an hour or two, you can spot changes in frequency in the range of a few seconds without any finer-grained word counting.
The Trick: Finding the Busy Words
Rapidly computing the set of words that are on the leading edge of a surge in usage is the interesting part of this process. The rest of the processing involved is very standard—the I/O, JSON parsing, correlations, clustering, etc.—so we won’t talk much about those details here.
The trick to quickly computing the set of words that are newly surging in frequency of usage is to not count words. Word counting for frequency analysis is laborious and memory-intensive because everything you do or store is multiplied by the cardinality of the universe of words (which is something like 20 million depending on what you call a word).
Instead of tracking individual words, we use a data structure that partitions the universe of words into pseudo-random subsets, or equivalence classes, each of which conflates many thousands of words. The words in a class are chosen deterministically, but they appear to be randomly selected. It is not individual words in the Tweet stream that are counted, but instances of any word in each of this limited number of equivalence classes. We do this three times, each time with a different pseudo-random partitioning of the words, and from the evolving statistics of the counters in the resulting data structure, we can infer the words that are momentarily abnormally busy without ever actually counting individual words.
More on how this works below; the important thing is that by dealing with a fixed number M of equivalence classes instead of the words themselves, the number of data points we must track drops from the size of the vocabulary to M. As each class conflates many thousands of words, we’ve shrunk the problem enormously.
Dealing with these equivalence classes makes the process extremely fast. Even with the full fire hose, the limiting factor of the core heuristic is waiting for sufficiently many fresh Tweets to produce a meaningful change in the current picture. For the deca-hose, the update period is a few seconds. For the full fire hose, the update period is a second or less. At current Twitter volumes, you’d have to process historical data in order to consume Tweets fast enough to be limited by the speed of the core heuristic.
The following outlines the principle, starting at the point after the incoming Tweet JSON has been parsed and the words extracted.
- The number of equivalence classes, M, is chosen.
- Each word will be represented by a key consisting of three pseudo-random integers in the range [0, M).
- Processing proceeds on three parallel paths, each of which is centered around an array of M counters.
- A compute cycle consists of N Tweets.
For each Tweet in the stream
- Each incoming word in the Tweet is mapped to a three-part key.
- Each of the three elements of a key is computed by a different hash function applied to the word.
- The first time a given word is encountered, the word/triple pair is saved in a database or lookup table.
- The existence of the word is also noted in global Bloom filter so that you can quickly determine whether a word has been seen before and avoid the cost of repeatedly generating the same key.
- Each of the three positions in a key corresponds to a different processing path.
- For each of the three paths, the counter indexed by the value of the corresponding position in the tuple (modulo the size of the counter array) is incremented.
- This is done for every word in the N Tweets in the cycle.
- As words from new Tweets are processed, words from old Tweets are aged out, so the data structure represents only the last N Tweets.
- At the end of each cycle, each of the three sets of counters values is analyzed to discover the set of indices of counters with anomalously high counts.
- Because the assignment of words to counters is pseudo-random and independent, how improbable a given counter value is under random chance is easily assessed as a Z-score (see any statistics book for a definition).
- A high Z is chosen as a cutoff for what we regard as surging.
- The Cartesian product of the three sets of exceptionally high indexes is computed with each element of the result being a triple-key.
- This product will contain the triples that correspond to the busy words.
- The product will also contain a great many more spurious triples, most of which correspond to nothing because the key-space is many times larger than the total number of words that exist. These spurious keys are easy to filter out using the same Bloom filter that is used to detect novel words.
- The remaining valid triples are used to look up the corresponding words in a conventional database—there are only tens or hundreds, so the lookup takes a split second.
Like magic, we’ve reconstituted the set of anomalously busy words probabilistically without actually counting any particular word! Up to a very high rate, the computation actually speeds up as the Tweet arrival rate grows because an interval consists of a fixed number of Tweets, not a wall-clock time.
By adjusting the cutoff Z-score, the number of counters that are considered significant can be throttled down to a level that suits our human idea of signal versus noise. If you set Z too low, there will be more false positives and more marginal subjects that come and go in just a few seconds. If you set Z too high, you’ll sharpen the view of subjects, but it will take more Tweet activity to make a subject visible. Adjust to taste.
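The outline above can be sketched compactly in code. The following is a minimal, single-frequency-class illustration, not the production implementation: the SHA-256-based hash functions, the values of M and the Z cutoff, and the demo data are all arbitrary choices, and a plain Python dict stands in for both the Bloom filter and the triple-key database.

```python
import hashlib
from itertools import product
from statistics import mean, pstdev

M = 1000          # counters per processing path
PATHS = 3
Z_CUTOFF = 6.0    # "surging" threshold; adjust to taste

def key_for(word):
    """Map a word to a deterministic pseudo-random triple in [0, M)."""
    return tuple(
        int(hashlib.sha256(f"{p}:{word}".encode()).hexdigest(), 16) % M
        for p in range(PATHS)
    )

def busy_words(tweets):
    counters = [[0] * M for _ in range(PATHS)]
    key_db = {}   # triple-key -> word (stands in for the Bloom filter + DB)
    for tweet in tweets:
        for word in tweet.split():
            k = key_for(word)
            key_db[k] = word
            for path, idx in enumerate(k):
                counters[path][idx] += 1

    # Per path, collect counter indices with anomalously high counts.
    hot = []
    for row in counters:
        mu, sigma = mean(row), pstdev(row) or 1.0
        hot.append({i for i, c in enumerate(row) if (c - mu) / sigma > Z_CUTOFF})

    # Cartesian product of hot indices; keep only triples that name real words.
    return {key_db[k] for k in product(*hot) if k in key_db}

# Hypothetical demo: uniform background chatter plus one small surge.
tweets = [f"bg{i}x bg{i}y bg{i}z" for i in range(2000)]
tweets += ["kristin foster publicist"] * 60
surging = busy_words(tweets)
```

In the real system, key generation is cached behind the Bloom filter, the counters age out old Tweets continuously rather than being rebuilt per batch, and the whole pipeline is replicated once per frequency class.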
If you’ve been following along, you may object that this scheme won’t work because word frequencies are extremely variable (Zipf’s Law again), and you would be correct. I’ve over-simplified. What actually happens is that the incoming word stream is split into F frequency classes, each representing an approximately equal number of word occurrences (not distinct words). The words within each class have approximately equal frequency over a long time interval. Each frequency class is processed separately as described above, in parallel, and the results are merged at the end. With F frequency classes (say, six or seven) you have 3F total processing paths. This multiplies the amount of work by 3F, but the pieces are independent and can execute in parallel on a multi-core machine, so the added processing time is small.
Splitting the stream is cheap and requires only an occasional global frequency calculation to keep the frequency data fresh. The relative frequencies are used to build a set of F Bloom filters that split the stream F ways. New words default to the lowest frequency class until the next frequency computation assigns them to a class. The frequency filters are computed offline, and thus do not contribute significantly to the work that must be done serially.
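That splitting step might be sketched as follows, with made-up counts standing in for the periodic global frequency calculation and plain sets standing in for the F Bloom filters:

```python
def build_frequency_classes(long_run_counts, F):
    """Partition words into F classes, most to least frequent, so that each
    class covers roughly an equal number of total word occurrences."""
    total = sum(long_run_counts.values())
    classes = [set() for _ in range(F)]
    acc, i = 0, 0
    for word, count in sorted(long_run_counts.items(), key=lambda kv: -kv[1]):
        classes[i].add(word)
        acc += count
        if acc >= total * (i + 1) / F and i < F - 1:
            i += 1
    return classes

def class_of(word, classes):
    """Route a word to its frequency class; words never seen by a global
    frequency calculation default to the rarest class."""
    for i, members in enumerate(classes):
        if word in members:
            return i
    return len(classes) - 1

# Made-up long-run counts, purely for illustration.
counts = {"the": 500, "and": 300, "game": 40, "rain": 30, "publicist": 2}
classes = build_frequency_classes(counts, F=3)
```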
The process of identifying the words is probabilistic but surprisingly accurate. Because the words in each stream are chosen to be of approximately the same background frequency, it is extremely unlikely that a counter will exceed an appropriately chosen Z cutoff by chance, which is precisely what the Z-score quantifies.
Note that spurious words are occasionally generated, but this is usually harmless because the chances are high that a randomly chosen word will not appear in any of an interval’s Tweets: while there are tens of millions of words, only a few thousand distinct words will be used in any few seconds’ worth of Tweets. And if a spurious word does appear, so what? A subject is typically characterized by several obscure words that are suddenly not so obscure; a spurious additional word or two matters little.
Identifying the Subjects and Clumping the Tweets
Processing intervals are typically only ten thousand or so Tweets, which can be stored in-memory in just a few megabytes. Therefore you can easily keep the most recent several intervals of Tweets in memory for fast analysis.
With the current busy words for an interval in hand:
- All Tweets for the interval that do not contain any busy words are set aside.
- The busy words are correlated by their co-appearance in the current interval’s Tweets.
- Busy words constitute a minute percentage of all words, so the appearance of even one busy word in a Tweet is significant.
- The appearance of two or more busy words in the same Tweet makes it almost certain that those words are associated with the same subject.
- The Tweets containing busy words are clustered according to the appearance of subsets of the busy words using a standard clustering algorithm.
- Optionally, Tweets from the preceding interval or intervals can also be scanned for words that had not yet made the grade but subsequently did. This is valuable for finding the origin of a subject, because a subject may exist as only a single Tweet, or a few Tweets, for multiple cycles before it can be identified.
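One minimal way to realize the clumping above is union-find over co-appearing busy words. (The real system uses a standard clustering algorithm; this, and the tiny demo interval, are illustrative stand-ins.)

```python
def clump(tweets, busy):
    """Group Tweets into subject clumps via the busy words they share."""
    parent = {w: w for w in busy}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]    # path halving
            w = parent[w]
        return w

    def union(a, b):
        parent[find(a)] = find(b)

    tagged = []
    for t in tweets:
        hits = busy & set(t.split())
        if hits:                             # Tweets with no busy word are set aside
            first, *rest = hits
            for w in rest:                   # co-appearance links busy words
                union(first, w)
            tagged.append((t, hits))

    clumps = {}
    for t, hits in tagged:
        root = find(next(iter(hits)))
        clumps.setdefault(root, []).append(t)
    return list(clumps.values())

# Hypothetical interval: two overlapping Tweets on one subject, one Tweet on
# another, and one with no busy words at all.
tweets = ["kristin foster news", "foster publicist sad",
          "lionking special tonight", "plain background chatter"]
groups = clump(tweets, {"kristin", "foster", "publicist", "lionking"})
```

Note that union-find merges any transitively linked busy words into a single clump; the randomized clustering used by the GUI can instead keep overlapping aspects of a conversation distinct.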
Tweet clumps often have overlapping sets of busy words because what a person would judge to be “a subject” is several overlapping subjects as seen by the heuristics. What one typically finds is that the overlapping clusterings show different aspects of an ongoing conversation. Note that the clustering heuristic is also probabilistic. The GUI takes advantage of this by allowing you to instantly generate new randomized clusterings on demand. With a well-tuned setup, the clusterings tend to be pretty similar.
In practice, this works extremely well. Over the course of a football game or other mass event, the high points can easily be recognized in the Twitter stream sprinkled among many other subjects.
Speed is an interesting issue because, over reasonable volumes, the algorithm is primarily limited by waiting for input, not by the core processing. There have to be enough fresh Tweets to make a meaningful update.
The demo, processing a single deca-hose, runs in about 10 seconds per tick, which is about 5000 Tweets. Increasing the input rate drops the processing time linearly until the machine can no longer keep up with all the work of decompressing, parsing the Tweets from the original JSON, running the Web server and GUI for the output, periodic global word frequency calculations, hosting the database, etc. Running as fast as it can, the demo updates every second or two.
Note that the demo is not distributed, yet is capable of processing up to about a half a firehose worth of Tweets from raw JSON.
A more industrial-strength implementation would offload most of this processing to separate machines. The core algorithm running on its own machine could easily handle many times the full fire hose volume, and distributing the core algorithm on multiple machines could make it faster still.
One of the most interesting possible uses for all that speed is processing historical Tweets for sociological or historical analysis. In fact, analysis of the information flow during the events of the Arab Spring of 2010-2011 was an early motivating idea. This would make it possible to see the step-by-step evolution of the mass dynamics.
A Major Event
One notable example of subject-discovery in the two-week sample data set used in the demo relates to the death of the singer Whitney Houston. A video of the GUI running at this time can be seen here. The algorithm picked up on the event within about a dozen tweets—barely time for re-Tweeting to have gotten under way.
Significantly, neither “Whitney” nor “Houston” was among the early words that stimulated notice of the Tweet subject. The two names are moderately common words that normally appeared (until then) a few dozen times a day, together and/or separately. The words that first surged beyond the possibility of random chance were the names of her publicist, who was one of the first to Tweet the news. Other early subject words included “bath” and similar words associated with the early rumors. This is the expected behavior: famous names appear all the time, so it takes a large number of occurrences to constitute an unambiguous surge, whereas surges in the usage of unusual names and words are much more likely to signify something new and interesting than are minor wiggles in the frequency of common ones.
The two illustrations show different views of this sad but interesting moment in Twitter history. Seconds earlier, the new talk was (and had been for many minutes) about a Lion King special, a South American band, and a number of other ephemeral subjects. You can see in the screenshot just above the minute when news of the singer’s death explodes. The graph covers about six minutes of the feed.
Bubbles: The bubble view (at the beginning of this article) shows the current 140 seconds of busy words.
- The lava-lamp on the right represents many words relating to Whitney Houston but also words from other subjects that have not disappeared.
- The intensity of color indicates a word’s frequency class.
- The position from right to left indicates how long it has been since a given word was detected surging. All the way on the right indicates that a word is surging in the latest interval. If a word somewhere to the left suddenly surges again, it will pop back over to the far right.
- The size of the bubble indicates the degree of abnormality of usage adjusted for frequency class. (seven frequency classes were used in this run).
- The text output shows sample tweets for each subject ordered by how long the subject has been surging (as opposed to merely existing.) Tweets about the singer, though overwhelmingly common, are not yet at the top in this display because the event just happened, so they have only been surging at most about three minutes.
Main Screen: This view shows an elapsed time view of the last six minutes at about the same time. The moment where the lines erupt is only about six seconds after the first Tweets about the singer’s death. (This is not evident in the output, but was verified by a utility that scans the Tweet stream by brute force.)
- The X axis represents time, covering about six minutes.
- The Y axis represents the nominal Z-score.
- The lines represent the surging of individual words over time in terms of the nominal Z score. This is not a word’s frequency of usage, but the improbability of the first derivative of that frequency occurring by random chance. Interpret it as the steepness of increase.
The earliest Tweets about the event showed the significant words being Foster, publicist, and Kristin because even a dozen or so instances of these words within a few seconds represents a huge increase in usage.
What’s It Good For?
Simple though it is, the Z-Filters approach makes it possible to discover what is happening globally across the Twittersphere at a fine grain, opening a more broadcast-like mode of communication within Twitter. Being able to watch new topics scroll by allows users to drop in on anything that looks interesting as it happens.
News that takes hours to show up in conventional media often shows up on Twitter in seconds or minutes. This suggests applications in entertainment and the financial industry.
It’s not just for real-time. The same problems that limit the applicability of frequency analysis for real-time also limit its applicability to long-term historical analysis. Even with historical data, you have to be able to process at least as fast as real time if you don’t want to fall farther and farther behind forever. The combination of fine-grain and fast processing would make it a uniquely powerful research tool for analyzing public sentiment in huge data sets over time, for instance, the entire fire hose since 2006.
Because it can handle such large data flows, the tool could be used to get a sense of what entire populations are thinking both in ordinary times and in moments of crisis or mass disaster. By feeding the communications of millions into such a system, it is possible to monitor in real time the complex ways in which a population is reacting to an ongoing situation without violating individual privacy.