They live in Guangdong (well, many of them do at least):
Some background: Now that I finally got around to playing with Weibo’s API, I’ve been collecting (you might call it hoarding…) a lot of fun data. I’m currently engrossed in this dataset I’ve developed of anti-Japanese comments and I’ve been doing a lot of spatial analysis—all of which is only possible because Weibo neatly provides a wealth of detailed location data included with every post/comment. Whereas Twitter offers whatever location a user supplies (“In your head”; “Your mom’s house”) along with a time zone (geo-coordinates and detailed location info are only available on a tiny percentage of tweets), Weibo’s API neatly gives you every user’s province, city code, and chosen location. The options are selected, not filled-in, so the data is super clean and crisp (well, outside of people who lie about their location).
Thus, seeing as it might be helpful for my other projects to know where Weibo users are blogging from (or at least say they are), I conducted a data expedition, grabbing the latest 200 posts from Weibo every five minutes for one full week. After discarding repeat messages (Weibo’s API doesn’t guarantee the posts are the absolute most recent, though for the most part, the majority of the posts matched my download date-time), I came up with a sample of 283,109 unique users, 236,611 of whom live in mainland China and which I used to generate the map above and chart below (this whole exercise was basically an excuse to show off some of Google’s super easy-to-use Fusion tables and an unnecessary distraction to my thesis writing, sigh).
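For the curious, the de-duplication step can be sketched roughly as below. This is only an illustration, not the script I actually ran: the field names (`post_id`, `user_id`, `user`) and the province codes in the usage example are hypothetical stand-ins for whatever the API response contains.

```python
from typing import Iterable


def unique_users(polls: Iterable[list[dict]]) -> dict[str, dict]:
    """Collapse repeated polls into one profile record per user ID.

    Each poll is a list of post dicts. The keys "post_id", "user_id" and
    "user" are hypothetical names chosen for this sketch.
    """
    seen_posts: set[str] = set()
    users: dict[str, dict] = {}
    for batch in polls:
        for post in batch:
            if post["post_id"] in seen_posts:
                continue  # the same post came back in an earlier poll
            seen_posts.add(post["post_id"])
            # keep the first profile seen for each user
            users.setdefault(post["user_id"], post["user"])
    return users


# Upper bound on raw posts: 200 posts per poll, one poll every 5 minutes, 7 days.
polls_per_week = (60 // 5) * 24 * 7
max_posts = polls_per_week * 200
```

Calling `unique_users` on a list of polls returns a dict keyed by user ID, so its length is the number of unique users. The upper bound works out to 2,016 polls and 403,200 raw posts before repeats are discarded.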
The above map doesn’t differ too much from the information which others have already collected. For instance, Tech in Asia summarized a report last year that showed largely the same breakdown, albeit in less detail:
Obviously Guangdong is the primary source of all Weibo posts, but now we can put a number to it: roughly 23% of all Weibo posts come from there, which extrapolated across 400 million Weibo users (caveats about bots and inactive users of course), comes to over 90 million users.
However, is Sina’s Guangdong-centric userbase that totally out of line? For this, we can compare it against a number of Chinese statistics on Internet usage. First, let’s look at population: according to China’s 2011 Statistical Yearbook (which is our best source for now until China releases its 2010 census), Guangdong’s population as of 2010 year-end was 7.83% of China’s total population. Here’s a map of the other provinces which show the greatest differential between percentage of Sina users in my sample and their share of national population:
Guangdong and Beijing soak up most of the bright green, but out west, Tibetan netizens have adopted Weibo much more widely than their neighbors, with their estimated Weibo share roughly in line with their population. Guangdong leads the pack in raw percentage point difference, with each percentage point representing roughly 4 million more or fewer Sina Weibo users than expected, given the 400 million total assumed above (if you’re interested in relative difference, scroll to the end of the above chart). Shandong (CN-37 in the map; see the chart if you don’t know your Chinese province codes), which lags the pack at -3.76 percentage points between its estimated share of Sina users and its share of the national population, likely has 15 million fewer Weibo users than would be expected, which is surprising considering Shandong, as a whole, isn’t particularly poor, with a GDP roughly in the middle of the group.
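The arithmetic behind these numbers is simple enough to sketch. The 400 million total is the same round assumption used above (bots and inactive accounts included), and the function names are mine, made up for this illustration.

```python
TOTAL_WEIBO_USERS = 400_000_000  # round figure assumed in this post, not an official count


def share_gap_points(sample_share_pct: float, population_share_pct: float) -> float:
    """Percentage-point gap between share of sampled users and share of population."""
    return sample_share_pct - population_share_pct


def implied_user_gap(gap_points: float, total_users: int = TOTAL_WEIBO_USERS) -> float:
    """Convert a percentage-point gap into a head count under the assumed total."""
    return gap_points / 100 * total_users


# Guangdong: roughly 23% of the sample versus 7.83% of the national population.
guangdong_users = 23 / 100 * TOTAL_WEIBO_USERS   # about 92 million users
guangdong_gap = share_gap_points(23.0, 7.83)     # about 15.17 points

# Shandong: -3.76 points, which is where the "15 million fewer" figure comes from.
shandong_gap_users = implied_user_gap(-3.76)     # about -15 million
```

Under the 400 million assumption, one percentage point corresponds to 4 million users, so every number here scales directly with whatever total you believe.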
We can look at other statistics to see whether Shandong citizens’ reluctance to adopt Sina Weibo is driven by non-Weibo reasons. For instance, places where lots of folks don’t have Internet service might be a hindrance toward Weibo adoption (duh). For Internet access measures, CNNIC (China Internet Network Information Center) releases a bi-annual report on Internet usage in China. Their year-end report looks particularly good, with solid sample sizes and broad reach. I’ve included the past 3 years of CNNIC’s Internet user breakdown in my chart above (I manually generated the Dec 2010 Internet penetration number, which was not reported by CNNIC). From there, we can do the same check: which provinces have a percentage of Weibo users that is more or less than reported Internet users?
While most other folks fall roughly in line with the CNNIC Internet stats—Guangdong, Beijing, and Shanghai are the clear outliers on the positive side—Shandong is joined by Hebei as laggards in terms of Weibo adoption according to this measure.
One final provincial metric: the China City Statistical Yearbook also reports numbers of households with “international” Internet connections at the city/prefecture level (thanks to Zhang Haihui for cluing me in to the source). China Data doesn’t have the most recent editions, but if you search around online, you can dig up some nice pirated copies. The data looks a bit iffy (whoever is reporting data from Shanghai clearly has a very different idea of what an Internet household is, with numbers that are greater than their total population for several years running), but it’s the only official city-level breakdown of Internet usage that I’ve seen (please do let me know if you know of others). So after adjusting Shanghai, Chongqing, and Fujian’s numbers to match their CNNIC provincial shares (I thought about dropping them altogether, but that inflates everyone else’s numbers), you have column AE above, which doesn’t differ too much from what we’ve shown thus far, except that Liaoning’s rather high household Internet connection rate according to the City Yearbook makes it appear to be less connected to Weibo than expected.
Before we go on to the city-level breakdown, here’s a map of Weibo’s international users (I told you this was all just an excuse to show off some Google tools):
Takeaway? USA for the win, though the scant 48 people posting from India are notable. For a neighbor with a giant population and an avowed interest in engaging China, India doesn’t seem to have caught the Weibo bug. Then again, the Chinese population in India is tiny, roughly the same size as that in Italy or Laos. But (then then again) there are fewer Chinese in Brazil, and yet, Brazilians outnumber Indians 2:1 in my sample. But then then then again, these are relatively small numbers. But then then then then again, the sample size is huge, and if I were the kind of person who carried around stats tables, I might be able to tell you how super confident I am in these numbers.
Now for the city breakdown (click here to open it in a new window if you don’t want to scroll up and click on the third, fourth, or fifth sheets). Sheet 4, “CityStats,” reflects the 2011 China City Statistical Yearbook data for households with Internet that I mentioned previously. “WeiboByCity” is the location data I pulled from my sample, broken down by city. You’ll notice in the “WeiboByCity” data there are a number of rows with no city codes; those reflect users who apparently didn’t specify a city or merely chose a blank field when asked to provide their city (which contrasts with the province code, which all users were required to provide). Also note: the four municipalities—Beijing, Shanghai, Tianjin, and Chongqing—are different in that they are broken up by county and county-level districts, and thus when you look at the “CityStats” sheet those four aren’t broken down since they have no city/prefecture-level districts to report. You’ll see further down in the “WeiboByCity” sheet a category also devoted to people who literally chose “Other” as their province as well as a breakdown of non-mainland China and Taiwan data. For the following preliminary analysis’ sake, I dropped those who didn’t select a specific city code and assumed that those who didn’t would have been evenly split up amongst their fellow provincial citizens. Obviously, that’s something I’d have to more rigorously test before I use this level of data, but for now, it’s good enough to get going.
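The assumption about users with no city code can be sketched as follows. I am reading “evenly split up amongst their fellow provincial citizens” as “allocated in proportion to each city’s known count within the province”; that reading, the function name, and the numbers in the example are assumptions for illustration only.

```python
def allocate_unknown(city_counts: dict[str, int], unknown: int) -> dict[str, float]:
    """Spread a province's city-less users across its cities.

    Each city receives a slice of `unknown` proportional to its share of the
    users who did pick a city, so relative city shares stay unchanged.
    """
    known_total = sum(city_counts.values())
    if known_total == 0:
        # nothing to base a split on; leave the counts at zero
        return {city: 0.0 for city in city_counts}
    return {
        city: count + unknown * count / known_total
        for city, count in city_counts.items()
    }


# Hypothetical province: 30 users in city A, 10 in city B, 20 with no city code.
adjusted = allocate_unknown({"A": 30, "B": 10}, unknown=20)
```

Because the split is proportional, simply dropping the city-less users gives the same within-province city shares as allocating them, which is why dropping them is good enough for a first pass.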
Here’s a breakdown of my sample versus the Internet using-households by city code. 1 indicates that the user/household lives in the capital city of the province. As you can see, Weibo users apparently tend to cluster in capital cities (which here is a sort of proxy for urban) as opposed to standard Internet users. (Honestly, I have no good reason for presenting this as a pie graph except it looks moderately cool, or as cool as vanilla pie graphs can get.)
One last map. This is each city (note: Chinese “cities” here refers to an administrative region that is the primary sub-unit of provinces, also known as prefectures; cities are really big and contain lots of counties and towns; the dots on the map are somewhat deceiving in that most cities encompass a much larger area than depicted here) coded by the difference between the percentage they made up in my sample and the percentage of population they contain out of mainland China as a whole. Blue is greater than +.5 percentage points (meaning the city’s share of mainland Weibo users is higher than its share of the national population), red is below -.5 percentage points, and yellow means it falls in between. Not surprisingly, we have a cluster of blue in Guangdong and various other pockets throughout (warning, the map seems to be buggy on my version of Chrome; try zooming in further and then zooming out to fix any display glitches). The data here is slightly skewed because not every “city”-level district has population statistics reported in the City Yearbook, particularly those from autonomous regions. Thus, depending on how you think the missing data would’ve affected the true share of each reporting city’s population, the difference between their Weibo shares and population shares might change (though only fractionally; the City Yearbook data enumerates over 1.2 billion people, a difference of less than 100 million people from the 2010 NSB number; a big number, but probably wouldn’t mess up the shares too drastically depending on how the two differ in reporting). There are other metrics at hand to show something similar without having to deal with missing data, but I just wanted to show what was possible from this data.
Here’s the same map with the difference in percentage points divided over the population share in order to give you a sense of relative deviation from what we expect in Weibo numbers based on the city’s population. Large red points represent a percentage rate of Weibo users that is more than 90% less than the city’s share of national population (small red is between -75% and -90%). Conversely, large blue points represent a Weibo rate that is more than 90% greater than its population share (small blue is between +75% and +90%).
This new map isn’t drastically different, though now smaller cities are more fairly represented with those big blue and red points (representing extreme deviations). For example, while on the whole Guizhou citizens lag behind the rest of the country in terms of Weibo adoption, Guiyang’s 3 million citizens, who make up a shade under 20% of Guizhou’s population, are much more likely to use Weibo (I’ll confirm it statistically once I re-code everything properly in Stata). You wouldn’t be able to tell simply by looking at the raw percentage difference as displayed in the first map since Guiyang’s population isn’t massive, but this second map makes clear that Guiyang’s citizens use Weibo at a much higher rate than expected based on population (or household Internet connectivity; the City Yearbook number puts Guiyang’s share of the province’s Internet households at 20%, the same as its share of the province’s population).
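The relative measure behind this second map can be sketched as below. The thresholds come from the description above; how exact boundary values are treated, and the "other" label for everything between -75% and +75%, are my own assumptions, since I didn’t spell those out.

```python
def relative_deviation(weibo_share: float, population_share: float) -> float:
    """Percentage-point gap divided by population share (0.5 means 50% above expected)."""
    return (weibo_share - population_share) / population_share


def marker(rel: float) -> str:
    """Map a relative deviation to the marker style used on the second map."""
    if rel <= -0.90:
        return "large red"
    if rel <= -0.75:
        return "small red"
    if rel >= 0.90:
        return "large blue"
    if rel >= 0.75:
        return "small blue"
    return "other"  # buckets between -75% and +75% are not described in the text


# Hypothetical city holding 1% of the population but 2% of sampled Weibo users.
example = marker(relative_deviation(0.02, 0.01))
```

Dividing by the population share is what lets a small city with a modest raw gap show up as an extreme point, which is the Guiyang effect described above.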
Update Mar 3, 2013: A perhaps slightly easier way of thinking about what the above map actually represents is how many Weibo users per capita are in each city. I rescaled the stats to represent this in the below map. Suppose there are 400 million Weibo users (or, more accurately, accounts since some people have multiple accounts) in mainland China; rounding China’s population to 1.2 billion, that would mean there was one Weibo account for every three citizens in the country, or .33 Weibo users per capita. Thus, the below map has blue arrows representing cities that have over .6 Weibo accounts per capita (Shenzhen leads with over 6 Weibo accounts per citizen! EDIT 3/3/12: though now that I’m double-checking, the City Statistics show Shenzhen’s population to be 2,598,700, which is a rather severe undercount of what should be something like 10 million; by contrast, the other cities in Guangdong report populations more in line with what they should be, so this may be a blip of a reporting error); blue dots = .22 to .6 (.22 is 75th percentile); yellow dots = .06 to .22 (.22 is the median); red dots = .02 to .06 (.06 is 25th percentile); and anything less than 2 Weibo accounts per hundred citizens is represented with a red arrow.
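Here is a rough sketch of the per-capita rescaling and the marker buckets. The 400 million and 1.2 billion totals are the round numbers from the paragraph above; the 4% sample share used for Shenzhen is a hypothetical value picked only to show how much the population denominator matters, not a figure from my data.

```python
TOTAL_ACCOUNTS = 400_000_000      # assumed mainland Weibo accounts
CHINA_POPULATION = 1_200_000_000  # rounded national population


def accounts_per_capita(sample_share: float, city_population: int) -> float:
    """Estimated Weibo accounts per resident, scaling the sample share to all accounts."""
    return sample_share * TOTAL_ACCOUNTS / city_population


def bucket(per_capita: float) -> str:
    """Marker used on the per-capita map, following the cut-offs listed above."""
    if per_capita > 0.6:
        return "blue arrow"
    if per_capita >= 0.22:
        return "blue dot"
    if per_capita >= 0.06:
        return "yellow dot"
    if per_capita >= 0.02:
        return "red dot"
    return "red arrow"


national_rate = TOTAL_ACCOUNTS / CHINA_POPULATION  # one account per three people

# Hypothetical 4% sample share, first with the Yearbook population for Shenzhen,
# then with a population of roughly 10 million.
yearbook_rate = accounts_per_capita(0.04, 2_598_700)
adjusted_rate = accounts_per_capita(0.04, 10_000_000)
```

With the Yearbook population the hypothetical rate lands above 6 accounts per person; with 10 million residents it drops to 1.6, still a blue arrow but far less absurd.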
I ran a couple of simple cross-sectional regressions on various variables in the above chart, and thus far, after controlling for population of the city, population of the province, and provincial GDP, the number of households with Internet connections in city districts (as opposed to the city as a whole when both are included in the model) as reported by the City Yearbook is highly significant and the best predictor of how many users in the sample were from that city (I’ll report back once I do a more thorough examination):
Final summary data about my sample:
- Female/male: 56.57%/43.43% (markedly different than Weibo’s reporting of 50/50 split; the female heavy split was shown in both my week- and day-long samples; this 2012 GMI China infograph says the split is 52/48 female, while this DCCI one says 57% male; I imagine you could unearth dozens of these things, but I’m not a market research guy so I’m going to stop here.)
- Verified: 3.33%. A logistic regression with female as the independent variable is significant and shows females to be 63% less likely to be verified.
- Followers count: median of 167, mean of 4,623
- Friends count: median of 153, mean of 299
- Number of posts: median of 366, mean of 1,561 (takeaway from these three: most folks are casual users, but a few super-heavy duty users skew the mean)
- A large percentage of the users who posted in my sample had, unsurprisingly, recently signed up to Sina Weibo. Median number of days a user had had their account: 783; mean: 786 days (putting their sign-up date in January 2011; the “oldest” users in my sample signed up Aug 26, 2009). Newer accounts are more likely to be verified (real-name registration effect). Here’s a graph showing that:
- A whole heckuva lot of Weibo users post from their iPhones or iPads. (Yes, I can tell you which brand of phone Weibo users are posting with.)
- Over 20% of posts contained at least one emoji. 143 posts (.05%) contained one emoji, and nothing else.
Cautions and notes about data collection: As mentioned, I pulled the data from Sina using its public timeline API. As Sina warns, it’s not strictly real-time, but since I’m simply trying to collect user demographic info, that wasn’t too big of a deal (I haven’t thrown it into Stata yet, but after eyeballing the data, I’m fairly certain that the correlation between the date-time downloaded and the date each post was made would be super high). However, the way I sampled, by hitting Weibo every 5 minutes, would ensure that I oversampled from users who posted at off-peak hours, as proportionally, they would have a higher chance of being selected by me. Understandably, this was a major concern of mine since theoretically, my sample could have way too many people from overseas, but hopefully, by extending the sample period to a full week and by taking so many samples (roughly 1,700 unique/usable samples per hour), any over-representation would diminish to acceptable levels over time and eventually I would recover the actual mean (thanks to Prof. Landry for the guidance on this front). As a dry run, I did a 24-hour-long sample the week before, results for which I’ve reported above. As you can see, the results I pulled seem to line up with each other, though it does give me pause to see that overseas posts actually ticked up in my week-long sample. That may be because I’m overthinking this proportionality issue. Trust me, I thought way too long about how I would implement some true proportional sampling method that would take into account time, but with the tools available to me, a non-高级 (“advanced”) level Sina developer, this was the best I could do.
Furthermore, this is only one week’s worth of data capture. Thus, in order to generalize it out to Weibo users as a whole, I am assuming in this exercise that the composition of Weibo users hasn’t changed that much over the past 3 years. But if this is not the case, if, for instance, most Weibo users during the early days were around Sina’s headquarters in Shanghai and there are fewer of them signing up and posting today, then my sample would under-represent them as a group. Thus, a reminder that this sample is of recent active posters. Those who comment and don’t post to their own timeline are not captured, nor obviously are those who only use Weibo to read posts. However, if all provinces/cities have similar percentages of non-active users, then this sample can be generalized upward.
Finally, all this (my whole approach for sampling Weibo users) is moot if for some reason the Sina public timeline API didn’t randomly draw from its pool of public tweets. I don’t know the nitty-gritty behind how it works, but if it somehow favored folks who posted via, say, mobile phone, and more folks post from mobile phones in certain regions, then my time has been wasted (well, my computer’s time, mostly).
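To make the off-peak worry concrete, here is a toy model. It assumes each poll returns a fixed 200 posts no matter how busy the site is and that those 200 are drawn evenly from everything posted in that interval; the peak and off-peak volumes are invented numbers, and the real API may behave quite differently.

```python
POSTS_PER_POLL = 200  # each poll returned the latest 200 posts


def capture_probability(posts_in_interval: int) -> float:
    """Chance that one post from a polling interval lands in the 200 returned."""
    if posts_in_interval <= 0:
        return 0.0
    return min(1.0, POSTS_PER_POLL / posts_in_interval)


def true_vs_sample_share(volumes: dict[str, int]) -> dict[str, tuple[float, float]]:
    """For each time slot, return (share of all posts, share of the sample)."""
    total_posts = sum(volumes.values())
    if total_posts == 0:
        raise ValueError("need at least one post to compute shares")
    captured = {slot: min(n, POSTS_PER_POLL) for slot, n in volumes.items()}
    total_captured = sum(captured.values())
    return {
        slot: (volumes[slot] / total_posts, captured[slot] / total_captured)
        for slot in volumes
    }


# Invented volumes: 2,000 posts per interval at peak, 400 off-peak.
hypothetical = {"peak": 2000, "off_peak": 400}
shares = true_vs_sample_share(hypothetical)
```

In this toy case the off-peak slot accounts for one sixth of all posts but half of the sample, because an off-peak post has a 50% chance of being captured against 10% at peak. That is the distortion I was hoping a full week of polling would wash out.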