<There are NOT millions of Twitter users in China: Supporting @ooof’s result and refuting GWI’s conclusion>
The question of how many Chinese Twitter users there are made headlines a few months back when the market research company GlobalWebIndex published results from a survey which claimed that 35 million people in China used Twitter. Media outlets ran with the story of how there was a huge secret upswell in “free” netizens in China who climbed the Great Firewall to access blocked sites like Twitter, the seeming implication being that revolución! was just around the corner. Social/human rights progress may still indeed take place in China in the near future, but most smart social media watchers agree it won’t be because of Twitter: Chinese folks just aren’t on the service in the same numbers that they are on local social media sites like Sina Weibo, RenRen, and even upstart mobile apps like WeChat/Weixin. People (and even companies in advertisements) don’t pass around their Twitter handles with anything like the frequency with which they share their Weibo contact info.
Even if our eyes told us that Twitter seemed to have attracted an active but small group of activists in China (but not many others in the country), was there a possibility that we were all missing something? Was there really a secret group of Chinese Twitter users being overlooked? Fortunately, after this week, I hope we can finally dismiss GWI’s 35 million number once and for all. Inspired by an SCMP story detailing the findings of the Chinese Twitter user @ooof (h/t Steven Millward of Tech In Asia), who cleverly used data on the website Twiyia.com to conclude that roughly 18,000 people who posted a tweet in Chinese selected Beijing as their home timezone, this weekend I performed a similar test using publicly available tweets on Twitter and its API. According to the data I extracted, there are most likely tens of thousands of Twitter users in China, not millions as claimed by GWI, a result that confirms @ooof’s finding.[1a] The exact numbers @ooof and I come up with may differ, and only Twitter itself would be best able to reveal how many Chinese Twitter users there actually are, but our independent results are likely within an order of magnitude of the actual number of Twitter users in China, unlike GWI’s result, which is about 2,000 times greater than our calculations. The hard evidence backs up what our eyes are telling us.
If you’re interested in the technical details of how I performed this fairly rigorous (though certainly not at the level of an academic research paper) test, read on. (Apologies for the non-Weibo-related post; I hope it’s still relevant to those who read this blog.)
According to the publicly available search results data from Twitter, nearly 44,000 users posted a message that Twitter classified as a Chinese language tweet during the 24 hour period between 12:38 AM EST Thursday, Jan 3rd and 12:38 AM EST Friday, Jan 4th. I arrived at this finding by utilizing Twitter’s search by language feature which you can access via their advanced search tool or simply using the search term operator “lang:zh”. Switch it over to realtime searches (if you’re more familiar with the Twitter API, essentially changing the result_type from “mixed” to “recent”) and you have a Twitter stream of all recently posted Chinese tweets—or at least what Twitter guesses is Chinese.
Twitter, like other folks (for instance, Google Chrome, which can detect if a webpage you are visiting is in a foreign language and will ask if you’d like to translate it into your native language), uses an algorithm to guess which language a tweet should be classified as. The algorithm is not infallible, and I noticed that a small percentage of tweets on Chinese Twitter users’ streams were being classified as Japanese. For instance, take someone who posts primarily in Chinese, like Michael Anti. If you examine his Twitter stream via the REST API and look for the key “iso_language_code”, you’ll see that the large majority of his posts are labeled as “zh”, which is the code for “zhongwen,” i.e. Chinese (中文), but as of right now, 7 of his last 100 posts are marked as Japanese (80 are marked as Chinese and 11 as English).
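As a rough illustration of that kind of spot check, here is a minimal sketch that tallies the detected language codes across a batch of tweets. The sample tweets are made up, and the only field name taken from the post is iso_language_code; everything else (the helper name, the list-of-dicts shape) is an assumption for the sake of the example.

```python
from collections import Counter

# Made-up sample; real entries would come from parsed API results.
sample_tweets = [
    {"text": "今天天气不错", "iso_language_code": "zh"},
    {"text": "中国 日本", "iso_language_code": "ja"},
    {"text": "我也是", "iso_language_code": "zh"},
    {"text": "Good morning", "iso_language_code": "en"},
]


def language_breakdown(tweets):
    """Count tweets per detected language code, skipping tweets without the key."""
    return Counter(
        t["iso_language_code"] for t in tweets if "iso_language_code" in t
    )


breakdown = language_breakdown(sample_tweets)
total = sum(breakdown.values())
for code, count in breakdown.most_common():
    print(f"{code}: {count} of {total} ({count / total:.0%})")
```

Run against a real user's last 100 posts, the same tally would show how often a mostly-Chinese stream gets labeled as Japanese or English.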
Obviously, because of the overlap in Chinese characters and Japanese kanji, this is bound to happen for just about any computer-based analyzer.  I thought about just doing a search for a whole host of common Chinese characters that were less commonly used in Japanese in order to get a more “pure” and inclusive list of Chinese language tweets, for instance 是, 的, 好, 不, 我, 有, 小, 他, 也, 你, etc, but what actually gets returned is a messy mix of Japanese and Chinese posts (and not even all Chinese posts since some don’t include these words) and for it to be useful you’d then have to develop your own tool for separating out the Japanese posts. Thus, for my purposes—getting something like 80+ percent of all the Chinese tweets—Twitter’s internal classification of what is Chinese is good enough (I’ll verify this in a moment).
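For what it's worth, a home-grown separator along those lines could start from a simple heuristic: Japanese text almost always contains hiragana or katakana, while Chinese text does not. The sketch below is not something used in this investigation; it is a hypothetical illustration, and the function names and the character ranges chosen (the Hiragana, Katakana, and main CJK Unified Ideographs Unicode blocks) are my own assumptions about a reasonable first pass.

```python
def contains_kana(text):
    """True if the text has any character in the Hiragana or Katakana blocks."""
    return any("\u3040" <= ch <= "\u30ff" for ch in text)


def contains_han(text):
    """True if the text has any character in the main CJK Unified Ideographs block."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


def looks_chinese(text):
    """Rough guess: Han characters present and no kana at all.

    Japanese written only in kanji would be misjudged as Chinese, and a
    Chinese tweet quoting a kana word would be misjudged as Japanese.
    """
    return contains_han(text) and not contains_kana(text)


for sample in ["我是学生", "私は学生です", "hello"]:
    print(sample, looks_chinese(sample))
```

A heuristic like this would still need checking against hand-labeled tweets before trusting it over Twitter's own classification.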
Next was how to download these tweets that were marked as Chinese (the language—not as from China itself, that requires another step to be explained in a moment). Twitter has a wonderful API and a ton of developer documentation. If you have a question while creating a Twitter app, someone probably has already asked it and gotten a good answer. It’s a great community, but due to some very valid concerns (remember what-used-to-be the ever-so-common fail whale?…), there’s some fairly extreme rate limiting on accessing the search and timeline API. You can only hit Twitter’s server a certain number of times an hour before it cuts you off. Plus, I couldn’t figure out a way to have the REST search API return a list of all Chinese tweets without including a search term (I get the error “You must enter a query” when I drop the “q=”). This caused me to use the public search widget mentioned above, which according to Twitter matches what you’d get from the REST version anyway. The great thing about the search widget was that I didn’t experience a rate limit like I would have with the REST search API, allowing me to simply keep scrolling endlessly as long as I wished (until the browser crashed due to memory constraints). I put a paperweight on my keyboard’s page down button, had lunch, and came back to copy the many thousands of Tweets now in my browser.
How many tweets exactly? 193,940. These 193,940 tweets were all the original Chinese-language tweets (native retweets as well as, according to Twitter, messages detected as spam, were filtered out from this public search) posted between 12:38 AM EST Thursday, Jan 3rd and 12:38 AM EST Friday, Jan 4th and able to be found via the Twitter search API. Due to time limitations and a burning anxiety to get cracking, I only did a 24 hour period. If this were an academic paper or such, I would have captured a full week’s worth of tweets or possibly even more, but, well, I didn’t feel like waiting. According to @ooof’s graph, he used a whole month’s worth of tweets, which explains why his number of active users is more than mine.
An important note: these 193,940 tweets do not include every possible tweet that someone in China might have posted. Users who have made their tweets private obviously don’t have their posts show up in public search, nor did my method collect tweets from people posting in non-Chinese languages from China (thus, ex-pats in China, unless they write in Chinese, are not included in this data). But otherwise, it sure looks like everything: it even includes a Chinese-language tweet that I, a self-classified English-language user in an American timezone, sent to @ooof. But to more rigorously assess the public search’s performance, I again went back to Michael Anti’s timeline and looked at all the 14 original tweets he made during my observation period. Of the 14, I found 11 in my downloaded data (and 1 more as an old-school retweet by someone else). I checked the 3 missing tweets and they are all listed as Chinese, so perhaps Twitter classified them as spam or simply didn’t capture them in the search; regardless, 11 out of 14 isn’t bad for my purposes, and, if I wanted, I could check other users’ timelines to see how many of their tweets were included in my download and adjust my numbers accordingly to account for those missing tweets. However, the takeaway is that the tweets I downloaded are, if not absolutely everything, then fairly close, and though any calculations I make might be off by some percentage, it’s at least within the correct order of magnitude.
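To make the "adjust my numbers accordingly" idea concrete, here is a small sketch of the arithmetic. It uses the 11-of-14 spot check and the 193,940 tweet count from the post; the helper names are hypothetical, and a coverage rate estimated from a single user's timeline is far too small a sample to lean on, so treat the output as an illustration only.

```python
def coverage_rate(found, expected):
    """Share of a user's known tweets that showed up in the downloaded data."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    return found / expected


def adjust_for_coverage(observed_count, rate):
    """Scale an observed count up to compensate for tweets the search missed."""
    if not 0 < rate <= 1:
        raise ValueError("rate must be in (0, 1]")
    return observed_count / rate


rate = coverage_rate(found=11, expected=14)
adjusted_tweets = adjust_for_coverage(193_940, rate)

print(f"coverage: {rate:.1%}")
print(f"tweets after adjustment: {adjusted_tweets:,.0f}")
```

Note that this scales the tweet count, not the user count: a user only drops out of the data if every one of their tweets was missed, so the user total would need a smaller correction.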
With the set of all tweets from this 24-hour period in hand, it was trivial to extract all the unique usernames (some users posted multiple tweets during that time period), leaving us with 43,784 users who posted something in Chinese. We can then use Twitter’s GET statuses/user_timeline to look up a user’s timezone, language setting, self-described location, and geo-coordinates (here’s what mine looks like) and use a JSON parser to extract the information cleanly.
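The two steps above can be sketched roughly as follows. The input data is invented, the "username" key on each tweet is a placeholder of my own, and the profile field names (time_zone, lang, location, geo) reflect my understanding of the timeline JSON rather than anything quoted in this post, so check them against the API documentation before relying on them.

```python
import json


def unique_users(tweets):
    """Return each username once, keeping the order in which it first appears."""
    return list(dict.fromkeys(t["username"] for t in tweets))


def profile_fields(timeline_json):
    """Pull the profile fields of interest from the first status in a timeline."""
    statuses = json.loads(timeline_json)
    if not statuses:
        return None  # empty timeline, nothing to read
    status = statuses[0]
    user = status.get("user", {})
    return {
        "time_zone": user.get("time_zone"),
        "lang": user.get("lang"),
        "location": user.get("location"),
        "geo": status.get("geo"),
    }


# Made-up tweets, newest first; "a" posted twice but is listed once.
tweets = [{"username": "a"}, {"username": "b"}, {"username": "a"}, {"username": "c"}]
users = unique_users(tweets)

# Made-up, heavily trimmed timeline response.
raw_timeline = """
[{"text": "example", "geo": null,
  "user": {"screen_name": "example_user", "time_zone": "Beijing",
           "lang": "en", "location": "In your HEAD"}}]
"""
fields = profile_fields(raw_timeline)

print(users)
print(fields)
```

Using `.get()` throughout means a missing timezone simply comes back as `None`, which is how the "missing" category in the sample can be counted.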
Due to rate limiting, it’s not feasible to check all 43,784 users, so I took every 73rd user (ordered by when they most recently made a post) to come up with a sample of 608 users. 165 were missing any timezone classification (two of them because they had switched to private mode, thus taking away access to their timezone info), comprising 27% of the sample, and 110 were listed as located in Beijing’s timezone, 18% of the sample, numbers which largely mirror @ooof’s conclusion (see below table).
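The sampling step and the percentages above can be sketched as follows, along with what those shares imply when scaled back up to all 43,784 users. The toy user list and the helper names are hypothetical; the counts (608, 165, 110, 43,784) are the ones reported in this post.

```python
def systematic_sample(items, step, start=0):
    """Take every `step`-th item, beginning at index `start`."""
    if step < 1:
        raise ValueError("step must be at least 1")
    return items[start::step]


def scale_to_population(sample_count, sample_size, population):
    """Scale a count seen in the sample up to the full population."""
    return population * sample_count / sample_size


# Toy demonstration of the sampling step: 20 made-up users, every 5th one.
toy_users = [f"user{i}" for i in range(20)]
toy_sample = systematic_sample(toy_users, 5)
print(toy_sample)

# Figures reported in the post.
POPULATION = 43_784
SAMPLE_SIZE = 608
for label, count in [("no timezone", 165), ("Beijing", 110)]:
    share = count / SAMPLE_SIZE
    estimate = scale_to_population(count, SAMPLE_SIZE, POPULATION)
    print(f"{label}: {share:.0%} of sample, about {estimate:,.0f} users overall")
```

Taking every n-th user from a list ordered by most recent post is a systematic sample; it behaves like a random one as long as posting time has no periodic pattern that lines up with the step size.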
If I extrapolate out those percentages to my total population of 43,784 users, I get roughly 12,000 missing and 8,000 in Beijing. Of course, this 8,000 is the least it could be; as mentioned, it doesn’t include those who set their accounts to private, doesn’t include folks who may have their timezone mistakenly set elsewhere, doesn’t include users who didn’t post in that 24 hour period (these 7,921 might be considered hardcore daily Tweeters, and the figure certainly doesn’t include the roughly 20-40% of users who have never tweeted),[7a] and may miss out on any users whose tweets accidentally were marked as spam or were not captured in Twitter’s search API. All of those reasons explain why my number is likely an undercount of the total number of Chinese Twitter users, but as demonstrated previously, it likely isn’t off by a whole lot. The primary reason why my number is so much lower than @ooof’s is that his data collection period appears to have lasted for a month, and thus he captured the more casual Chinese Tweeter; otherwise, my percentages largely confirm his. Here’s the more detailed breakdown of which timezone users reported themselves as being in:
As for the other data I collected on this sample, location info was largely useless since it is user-specified. If folks decided to enter anything at all, it sometimes came in the form of fake locations like “In your HEAD” and “On your bed.” Of the 364 who did supply a location, 40 contained either “China” or 中国, and if I had time, I could sift through the rest and try and figure out if they might also be candidates to be China-based users.
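The location check described above amounts to a substring match, which can be sketched like this. The helper name and sample locations are invented, and matching "china" case-insensitively is my own assumption; the post only says the location contained either "China" or 中国.

```python
CHINA_MARKERS = ("china", "中国")


def mentions_china(location):
    """True if a self-described location contains "China" (any case) or 中国."""
    if not location:
        return False  # empty or missing location
    lowered = location.lower()
    return any(marker in lowered for marker in CHINA_MARKERS)


# Made-up examples of the kinds of strings users enter.
locations = ["Beijing, China", "中国上海", "In your HEAD", "", None, "Chinatown, NYC"]
matches = [loc for loc in locations if mentions_china(loc)]
print(matches)
```

The last example shows the limit of the approach: "Chinatown, NYC" matches too, so any hits would still need a manual pass, and locations like "Shanghai" with no country name are missed entirely.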
Finally, I looked at the primary language a user specified in their settings, which looks like it suffers from a much greater than expected number of English language users, likely due to Twitter defaulting to English. I’m not certain how Twitter chooses your initial language, whether it’s always English unless you manually set it, or if it takes the language of the browser or perhaps your IP address (which perhaps redirects you to a location/language-specific signup page), but this data is flawed. Regardless, here’s a pie chart of the percentage of languages specified in the 608 person sample in case you’re curious.
I can’t conclusively say whether there are 10,000 or 18,000 Twitter users in China, but based on the data I pulled and the method I used to analyze it (and without knowing more, probably a method quite similar to what @ooof used), I can say conclusively that there are NOT 35 million Twitter users in China. If there were indeed that many, you’d see it in the quantity of Chinese-language tweets. Looking at the Twitter stream, there just aren’t that many Chinese language tweets. However, despite the various limitations mentioned above in my data collection process (only one day, doesn’t include private accounts, doesn’t include non-Chinese language posts from China), the number of active Twitter users in China is almost definitely between 10,000 and 100,000, several orders of magnitude less than what GlobalWebIndex calculated from their social media in China survey.
[1a] UPDATE Jan 6, 2013: My metric for a Twitter user is different from GWI’s, which has a more expansive definition (h/t to Josh Ong of Tech in Asia for pointing this out). Because of time constraints, I was only able to look at users who made a post during a 24 hour period, potentially the most active of users. I believe @ooof used an entire month’s worth of data and culled his user list from there, which is why his number is larger than mine. GWI’s definition includes those who simply “Use or contribute to the service,” and thus my data would miss out on those who use it for reading tweets and posting occasionally. Even so, based on statistics supplied by Twitter, other sources, and even GWI itself, one could adjust for those who only “listen” or post occasionally, and still not even get close to a million (see footnote 7a below for more).
As for why it matters how many Chinese Twitter users there are, I believe Martin Johnson and the folks at GreatFire.org said it best:
GlobalWebIndex claims there are 70m Twitter users and 125m Facebook users (63.5 million active) in China. That means that at least 125 million netizens in china have either bought a VPN service, are using a proxy or have the technical know how to bypass the great firewall. This is highly unlikely. If it was true, there would be so much online China activism about important issues that it would be hard to ignore. This is what we hope happens in China’s future and that’s what we are fighting for but it certainly is not the reality now.
The last thing we want to see is people saying that Chinese netizens have free and open access to social media around the world. They don’t! They are prevented from looking at many foreign web sites and they are also prevented from accessing information on Chinese web sites! Chinese netizens, for example, are unable to search for “Xi Jinping”, the country’s next leader, on Sina Weibo, a leading Chinese microblog. The great firewall is not some myth, it’s a sad reality. Chinese censorship authorities will be delighted to see this news as it makes the rest of the world believe that censorship is not happening here.
 Version 1, which is apparently on its way to being mothballed in favor of 1.1 which will require authentication, so this link may not work in a couple months. ^
 Though based on what I’ve seen, Twitter’s algorithm, though serviceable, could definitely be improved. ^
 If someone knows what value to set q= to, by all means let me know on Twitter or via the contact form. Apparently if you have Firehose access, you don’t have to deal with rate limits. Also, if I’m reading things correctly, Twitter’s new streaming API supposedly lets developers hook into the public stream and just suck up tweets that match certain criteria with a much greater range than the simple search API that I relied on, which, as Twitter warns, is not exhaustive, supposedly with spam messages and the like being filtered out (a rather good side effect of having to use the search API rather than the streaming API). As I don’t have access to the former, which is apparently very hard to come by, and a lack of time in learning the second, I went with the quick-and-dirty approach in this investigation. If this were for a research paper or something where I needed much more precision, certainly, the streaming API would be the way to go, but as I mention later in the post, my method was for the most part good enough. Someone who has an extensive database of tweets like the folks at Sysomos claim could arrive at an even more precise number than we have. ^
 According to Twitter, this REST version of the search API is the exact same thing as what you’d get with the general search tool/widget: “The Search API (which also powers Twitter’s search widget) is an interface to this search engine.” ^
 I told you, not super scientific was I in this task, but this was by far the fastest way and didn’t sacrifice anything in the data collection. ^
 Native retweets are the ones where you just click the retweet button in Twitter and they appear instantly on your timeline with the other person’s profile photo. Old-school retweets, which are included in my set of downloaded tweets, are when you manually copy and paste a person’s tweet and append an RT in front. Excluding native retweets hopefully reduces the number of robot accounts which do nothing but aggressively retweet. ^
 My sample also had 3 users who selected Chongqing as their timezone. I grouped that into Beijing for the above pie chart, but broke it down in the table. ^
[7a] UPDATE Jan 6, 2013: The percentage of Twitter users across the entire service who have never posted fluctuates depending on the report and the date. Twitter reported in 2011 that 40% of users didn’t post. Two other reports from 2009 put the number at 21% and 40%. For its part, GWI reports that their survey shows 66% of their Chinese respondents who used Twitter hadn’t posted a tweet in the past month. Even if that’s the case and @ooof or I need to adjust our numbers upward to account for those who would go missing in our data, the number would still only rise to roughly 50,000 (18,000 * 100/(100-66) ≈ 52,941). In order to adjust 18,000 upwards to 35 million, roughly 99.95% of actual Chinese Twitter users would have to have not posted in the past month, a very unlikely occurrence which is out of line with not only worldwide data but also GWI’s own survey.
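The adjustment in this footnote is simple enough to check by hand, but here is a short sketch of both directions of the calculation: scaling observed posters up by an assumed share of non-posters, and working backwards to the non-poster share a claimed total would require. The function names are hypothetical; the inputs (18,000 observed, 66% non-posters, 35 million claimed) are the figures discussed above.

```python
def adjust_for_non_posters(observed_posters, non_poster_share):
    """Estimate total users if `non_poster_share` of them never show up as posters."""
    if not 0 <= non_poster_share < 1:
        raise ValueError("non_poster_share must be in [0, 1)")
    return observed_posters / (1 - non_poster_share)


def required_non_poster_share(observed_posters, claimed_total):
    """Share of users who would have to be silent for the claimed total to hold."""
    return 1 - observed_posters / claimed_total


adjusted = adjust_for_non_posters(18_000, 0.66)
required = required_non_poster_share(18_000, 35_000_000)

print(f"adjusted total at 66% non-posters: {adjusted:,.0f}")
print(f"non-poster share needed for 35 million: {required:.2%}")
```

Plugging in the other non-poster rates mentioned in this footnote (21% or 40%) gives even smaller adjusted totals, since fewer silent users means less upward correction.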
 So long as a user had even one tweet get listed in the Twitter search, they were included in my total of 43,784. If you wish to verify, check any user who made a Chinese post on Jan 3 and check to see if they are on this list. If not, do let me know. ^
 The only one where we differ greatly is Tokyo: his data concludes that under 1% reside there, while mine puts it at over 3%. This could simply be a matter of our samples or something else; otherwise, everything else matches fairly well. ^
 If you search for all the English-language posts on Twitter the same way I did for Chinese, you’d have to scroll for a very, very long time before you even go back through a single minute’s worth of tweets. ^