Showing posts tagged china

<List(s) of Chinese keywords for censorship testing and sensitive content collection>

Last week, a researcher during The Citizen Lab’s annual Connaught Summer Institute workshop raised an interesting problem. She wanted to test for censorship on a Chinese online service, and she had somewhat limited resources and time. What keywords should she use for her test?

In theory, this is a solved problem, given the numerous lists of censored and sensitive Chinese keywords available on the web, including those shared by this site. However, a given list may be too broad for one’s purposes, or may simply have too many keywords to use efficiently. And what if I only want to test the most sensitive of the keywords, e.g., Falun Gong, June 4, Xi Jinping, and so on? For those not experienced in Chinese or Internet censorship, winnowing down existing lists to something more usable can be a daunting task.

Thus, a few of us sat down at the workshop, collected 8 known Chinese keyword lists (see below), and aggregated them into a single, easily shareable and sortable file, which we’ve posted to Github. The CSV files contain not just the keywords but also other info such as translations and tags (though not for all of them; it’s an ongoing, open-source project to which you are welcome to contribute).

As of Aug 4, there are 8,087 sensitive keywords collected from 8 different lists. To get a sense of what data is included in these CSV files, you can view a spreadsheet of these 8,087 keywords sorted by the number of lists they appear on.
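For readers who want to try this on their own lists, here is a minimal sketch of the aggregation step: merge several keyword lists and sort the keywords by how many lists they appear on. The list names and sample keywords are placeholders chosen for illustration, and the function is not the script used to build the Github files.

```python
from collections import defaultdict


def aggregate(lists):
    """Return (keyword, [list names]) rows, most widely shared keywords first."""
    sources = defaultdict(set)
    for list_name, keywords in lists.items():
        for kw in keywords:
            kw = kw.strip()
            if kw:  # skip blank entries
                sources[kw].add(list_name)
    # Sort by number of lists (descending), then by keyword for a stable order.
    return sorted(
        ((kw, sorted(names)) for kw, names in sources.items()),
        key=lambda row: (-len(row[1]), row[0]),
    )


# Hypothetical sample data, not the real lists.
sample_lists = {
    "sina_uc": ["六四", "法轮功", "坦克"],
    "tom_skype": ["六四", "法轮功"],
    "line": ["六四", "网络封锁"],
}

ranked = aggregate(sample_lists)
for kw, names in ranked:
    print(kw, len(names), ",".join(names))
```

Keywords at the top of the output appear on the most lists, which is one rough way to pick a short, high-sensitivity test set when time is limited.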

Creator | Tested on/found from | # of keywords | Year | Method + source
--- | --- | --- | --- | ---
The Citizen Lab | Sina UC | 1,818 | 2013 | reverse engineered from the client; more analysis here; download link
The Citizen Lab | Tom-Skype | 2,574 | 2013 | reverse engineered from the client; more analysis here; download link
The Citizen Lab | LINE | 673 | 2014 | reverse engineered from the client; more analysis here; download link
Jason Q. Ng (Blocked on Weibo) | Sina Weibo | 839 | 2013 | running Wikipedia China article titles through Sina Weibo search; more analysis and book
Xia Chu | Great Firewall | 669 | 2014 | HTTP request scans of Wikipedia China articles to see if they’d trigger GFW block; more analysis here; download link (removed duplicates and keywords related to meta and user pages)
China Digital Times | Sina Weibo | 2,448 | ongoing | crowdsourced testing of suspected sensitive keywords on Sina Weibo; more analysis on CDT and in CDT’s Grass Mud Horse Lexicon e-book; download link
 | Wikipedia | 488 | 2013 | testing to see if Wikipedia pages are available in China; more info; download link
Google/ | Google/Great Firewall | 456 | 2012 | and reverse engineered the keywords Google was using to warn users of censorship while using their service in China; download link

To follow future changes to these lists, you can follow the Github repository. You are encouraged to adapt and update these lists as you see fit; however, please do credit the Github repo if you do. Hopefully this is helpful to researchers who are searching for sensitive content in Chinese or testing for network interference.

<64 Tiananmen-related words blocked today (June 4, 2014)>



Today, on the 25th anniversary of troops being ordered into Tiananmen Square to clear student demonstrators, I tested several hundred June 4-related keywords on Weibo. I used the same set of keywords that I tested last year at The Citizen Lab, and I found relatively similar levels of censorship this year. Also like last year, the enhanced restrictions on keyword searches were apparently implemented specifically for the anniversary: for instance, 坦克 (tank) and 六四 (6-4, i.e., June 4) were free to be searched as recently as May 11.

A short article over at WSJ provides some more context and lists the 64 keywords that I identified as being blocked from searching on Weibo today. Of the keywords I tested, nine were unblocked last year but have been added to the blacklist this year, including 八九 (89), 维多利亚公园 (Victoria Park, the site of the commemorative vigil in Hong Kong), and VIIV (Roman numerals for 6 and 4, i.e., June 4).
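The year-over-year comparison is just a set difference. Below is a small sketch, assuming each year’s test results have been reduced to a set of blocked keywords. The 2013 set is a placeholder rather than the real list, and the 2014 set includes only the three newly blocked keywords named above.

```python
def newly_blocked(last_year, this_year):
    """Keywords blocked this year that were not blocked last year."""
    return sorted(set(this_year) - set(last_year))


# Toy sets for illustration only; not the actual test results.
blocked_2013 = {"六四", "坦克"}
blocked_2014 = {"六四", "坦克", "八九", "维多利亚公园", "VIIV"}

added = newly_blocked(blocked_2013, blocked_2014)
print(added)
```

Swapping the arguments gives the reverse comparison: keywords that were blocked last year but are searchable this year.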

After the jump are the 64 keywords that are currently blocked from searching on Weibo (though some will no doubt be unblocked once this sensitive period passes):


<Comments and takeaways from Xia Chu’s “Complete GFW Rulebook for Wikipedia”>

Note: Update to “64-byte search string limitation indicates Weibo and GFW” section added Jan 11

If you are interested in Chinese Internet censorship, I highly recommend you flip through Xia Chu’s latest update to his* research project “Complete GFW Rulebook for Wikipedia.” A revision of a document originally released last October, it identifies a massive list of actual trigger words (which Xia calls “rules” because they are often attached to specific conditions) that cause a Chinese Internet user’s connection to specific sites like Wikipedia to be disrupted by the Great Firewall (GFW). Not only that, it also includes a list of over 3,600 websites that he has currently confirmed to be unreachable from within China due to the GFW. The conclusions in the paper don’t necessarily upend anything that we thought about the GFW, but if you want a peek behind the curtain of how the GFW works (big takeaway: IT’S REALLY HAPHAZARD), this is as close as we can currently get.

The methodology behind Xia’s testing is sound, and the project is among the most comprehensive attempts to document the Great Firewall’s blacklisted keywords, though Xia notes his debt to Jed Crandall et al.’s ConceptDoppler paper, and others, including arrested civil rights lawyer Xu Zhiyong, for inspiring him. The paper is mostly jargon-free, and the testing process used is transparent and not at all ultra-sophisticated (a compliment!); an amateur coder like myself could replicate everything that Xia has done in the paper. The paper is pretty self-explanatory and there’s not much commentary for me to add, but below are a few notes, including a description of a similar tool I’ve developed for identifying sensitive keywords in Chinese news articles, as well as some curious coincidences between how Sina Weibo and the GFW censor.
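To give a flavor of what a keyword-spotting tool for news articles might look like, here is a minimal sketch. It is not the tool described in the full post: the function name, the sample article, and the keyword list are all assumptions made for illustration. Because Chinese text has no spaces between words, plain substring matching is a reasonable starting point.

```python
def find_sensitive(text, keywords):
    """Return {keyword: occurrence count} for every keyword found in text."""
    hits = {}
    for kw in keywords:
        if not kw:
            continue  # an empty string would match everywhere
        count = text.count(kw)
        if count:
            hits[kw] = count
    return hits


# Hypothetical article text and keyword list.
article = "六四事件二十五周年，坦克照片再次被删除。六四相关搜索受限。"
watchlist = ["六四", "坦克", "法轮功"]

hits = find_sensitive(article, watchlist)
print(hits)
```

A real tool would also need to handle the conditional “rules” Xia describes, where a keyword triggers a block only in combination with other conditions. This sketch ignores those conditions entirely.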


<Looking back on 2013: Five Blocked on Weibo posts I particularly liked from last year>

2013 has personally been an incredibly fun year. I finished grad school, my book was published, and I started working for this neat research lab. Chinese Weibo users though, especially prominent ones, had a much rougher time, with increased harassment and censorship by authorities inducing an unfortunate chill on discussion of sensitive topics on the site. Here’s hoping the next year brings a relaxation of such policies: I couldn’t be happier if I had nothing to write about on this blog.

So before we move on to 2014, a look back at five Blocked on Weibo keywords and posts that I particularly enjoyed uncovering and writing about in the past year:

1) Jan 23: 宪法法院 (constitutional court) is blocked during the Southern Weekend censorship controversy.

2) Mar 9: Weibo censors delete post of masked Mao portrait criticizing Beijing air pollution.

3) Jun 4: “The Flower of Freedom” (自由花) is a Cantonese song written by Hong Kong lyricist Thomas Chow to commemorate the victims of the 1989 Tiananmen crackdown.


<The Chinese keywords on messaging app LINE’s “bad words” list and why they are “bad”>

For an updated summary of the Citizen Lab’s excellent research into censorship on LINE to which I contributed, see this Nov 21, 2013 post.

Back in May, Twitter user @hirakujira was poking around in the code for Lianwo, the Chinese version of the popular mobile chat app LINE, when he noticed a curious line: “<key>warning.badWords</key>” followed by a string that read in Chinese: “Your message contains sensitive words, please adjust and send again.” Hirakujira subsequently identified the application files which contain these so-called “bad words” and posted them. The Next Web and Tech in Asia reported that even though LINE (which is a Japanese spin-off of Naver, a Korean company) wasn’t yet actively censoring messages sent through its Chinese-branded app, the inclusion of such files indicated it had built such a capability into the program, a forward-thinking move for any foreign content provider/distributor that hopes to succeed in China.

What I hope to do in the next few days is to take a closer look at the first roughly 40 of the 150 words that Hirakujira posted, translating and explaining the significance of those words with respect to current Chinese politics. Some are quite obvious, but others are quite obscure. By examining the words, we may hope to get a sense of what LINE thinks is worth censoring in order to appease their Chinese regulators. So for the next few days, consider this site rebranded as “Blocked on LINE (maybe in the future).”

The twenty-one posts (links will be added as the posts go up):

  1. 浙江签单哥 / Zhejiang’s receipt-signing Brother
  2. 警察杜平 / Police Dupin; 宣恩杀人现场 / Xuanen murder scene
  3. 叶迎春内衣 / Ye Yingchun underwear; 叶迎春 / Ye Yingchun
  4. 孙国相拆迁 / Sun Guoxiang demolition
  5. 中央领导内幕 / Central leadership insider
  6. 盘锦开枪 / Panjin shot; 四学者建言 / Four scholars suggestions
  7. 盘锦二表哥姜伟华 / Panjin, Second Watch Brother: Jiang Weihua; 姜伟华名表 / Jiang Weihua namebrand watches; 江诗丹顿 表叔 / Vacheron Constantin uncle
  8. 只身挡坦克 / Tanks block alone
  9. 爆料不孝女 / Expose: unfilial daughter; 爆料朱熹后人 竟是政协委员 / Expose: Zhu Xi’s descendants, suddenly CPPCC committee members
  10. 人大附中择校费杨东平 / Renmin High School, school choice fees; Yang Dongping
  11. 奥数叫而不停 / Complaints about Math Olympiad have not ceased
  12. 帝都 实行宵禁 / Imperial Capital implements night curfew
  13. 11月5日至15日 出租车禁行 / Nov 5 to 15 rental cars banned; 表叔 陈应春 / Uncle Chen Yingchun
  14. 江泽民被控制 / Jiang Zemin has been controlled; 江系军委被撤 / Jiang withdraws from Military Commission
  15. 张蓓莉200万耳环 / Zhang Peili 2 million RMB earrings; 温家 戴梦得 / Wen [Jiabao] Diamond; 温家宝 27亿 / Wen Jiabao 2.7 billion [USD]; 影帝温家 / Actor Wen Jiabao; 温家 资产700亿 / Wen Jiabao assets 70 billion
  16. 网络封锁 / Internet blockade
  17. 维族 砍人 / Uyghurs stab people
  18. 和田 暴乱 / Hotan rebellion
  19. 万鄂湘亚视 / Wan Exiang, Asia Television Limited
  20. 李正源李刚 / Li Zhengyuan, Li Gang; 交警夏坤 / Traffic cop Xia Kun
  21. 64屠城 / June 4 massacre

雪山狮子旗 (snow lion flag / xuěshān shīzi qí) was the state and military flag of Tibet. It has six red bands representing the six original ancestors of Tibet, a rising sun over a mountain, the three-colored jewel of the Buddha, and a pair of snow lions, Tibet’s national emblem. Though it was the official flag of Tibet, it was rarely used before 1959, when China took control of the region.

Why it is blocked: After the failed Tibetan rebellion of 1959, the Dalai Lama went into exile. The snow lion flag soon came to represent the Tibetan independence movement and is now a well-known symbol of the Free Tibet movement. The flag is no longer recognized by China, as it is considered an affront to its sovereignty over Tibet.

This post comes from my book Blocked on Weibo, which contains nearly a hundred new entries documenting the kinds of keywords suppressed on social media in China. You can order online now at your favorite online store or pick it up at your local bookstore next week. Thanks again for supporting this blog!

<480 keywords blocked from searching on Weibo as of Jun 29, 2013>

During the past month, I’ve been working as a summer research fellow at The Citizen Lab in Toronto. It’s been great to not only have time to dedicate to updating this blog and pushing forward collaborative projects with researchers I’ve been fortunate enough to meet over the past two years, but also to pitch in with all the amazing work being done here at this one-of-a-kind lab. Among the projects I’ve been helping out with is one pertaining to the list of censorship and surveillance keywords in the Chinese chat clients TOM-Skype and Sina UC, which the team decrypted then analyzed in collaboration with Jed Crandall and Jeff Knockel at the University of New Mexico. 

Of course, my first desire was to take the keywords they extracted and to test them on Weibo. Below are 480 unique keywords which were blocked from searching on Sina Weibo as of June 29. I’ve written more about the other censorship games I’ve detected in this post over at The Citizen Lab’s blog. Among the things I discuss are the overlap of keywords between different Internet services in China as well as what drastic changes in the number of search results for keywords might mean.

A full spreadsheet of the data mentioned in the report can be viewed in this Google Fusion Table or downloaded in .csv format for further analysis by all you researchers reading along at home. I look forward to sharing other relevant work my colleagues and I get done at the Lab during the rest of the summer.

The removal of “June 4” from the list of blocked terms—an area of much ridicule for Weibo both in Western media and among Chinese netizens, many of whom evade the censors by using alternative coded slang to stand in for sensitive keywords—may be a sign that Weibo has become more comfortable trusting its human censors to manually delete sensitive posts quickly and effectively. They’re slowly moving away from the crutch of the keyword block, which while certainly effective at preventing the spread of sensitive information, is also at times overly broad and not responsive enough to more precise needs. … What is and is not off-limits has now become slightly harder to determine—another step in making censorship invisible and all-pervasive.
My article “Weibo Keyword Un-Blocking Is Not a Victory Against Censorship” on Tea Leaf Nation, cross-posted on The Atlantic website

In a fantastic interview over at Five Books, The Economist’s China correspondent Gady Epstein recommends five books you should read to understand what’s happening with the Internet in China: Guobin Yang’s The Power of the Internet in China; Rebecca MacKinnon’s Consent of the Networked; Anne-Marie Brady’s Marketing Dictatorship; Jonathan Spence’s Treason by the Book; and Johan Lagerkvist’s After the Internet, Before Democracy.

He talks about them in more depth in the article, but I just wanted to chime in briefly and agree wholeheartedly with the recommendations (I haven’t read Lagerkvist, but it sounds excellent). Of the four I have read, Yang’s and MacKinnon’s books have quickly become canonical studies of how citizens use the Internet and the challenges they face in China; Brady’s gives deep insight into the structures and motivations that drive the Communist Party’s media strategies; and Spence’s book is that rare history book that moves like a suspense novel (only a slight exaggeration; it really is a fun read).

If the site were called “Seven Books,” two more I’d add would be Yuezhi Zhao’s Communication in China and Daniela Stockmann’s just-released Media Commercialization and Authoritarian Rule in China. Chapter 6 in Zhao’s book, “Challenging Neoliberalism,” is worth the price of admission alone. It’s a fascinating case study and breakdown of the offline and online backlash to the secret New Western Hills meeting. If you want to grasp the contradictions inherent in everyday political life in China, this is a good place to start. I haven’t read Dani’s book yet, but by all accounts it is brilliant, and builds off her work on the market’s effect on journalism and the media in China (topics closely related to Ying Zhu’s book on CCTV, Two Billion Eyes, that I worked on). It’s got a hefty academic price tag, but now that I have access to a university library again, it’s first up in my queue.

Happy summer reading!

Wikipedia may be hesitating to switch to HTTPS-only because they fear they could be blocked completely in China. The fact that the censors have not fully blocked Gmail and Github, which have already switched to this HTTPS-only approach, speaks against this. On the other hand, the fact that Wikipedia has been fully blocked in the past shows that it’s a possibility. We argue that even if Wikipedia is blocked, that is better than the current, censored version. The reason that Wikipedia is better than for example Baidu Baike is that it’s not censored. By allowing the authorities to selectively censor articles, that whole argument is lost. Wikipedia should take a bold step clearly showing that they do not accept any level of censorship.
The good folks at make a forceful argument that Wikipedia should enable HTTPS by default in China in order to prevent the continued censorship of sensitive articles (more background | 中文).

<Censoring a commemoration: what June 4-related search terms are blocked on Weibo today>

As citizens in China and around the world commemorate the twenty-fourth anniversary of the June 4th incident in Beijing’s Tiananmen Square, Internet censorship in China around this sensitive date has become expected and almost routine. As Tech in Asia notes, the censorship this year likely won’t be as intense as it was during the twentieth anniversary (when hundreds of sites went down for so-called “Internet maintenance”), and websites are considering more sophisticated forms of filtering out June 4-related posts. Still, much overt censorship will take place on sites behind the Great Firewall, including seemingly trivial steps like preventing the candle emoticon from being inserted into Sina Weibo posts.

Another way the social media site Sina Weibo censors its site—alongside manual deletions by human censors of sensitive content—is by blocking the user from searching for specific keywords, and instead returning a message that says no results can be displayed. Though the blocking of keywords is a blunt tactic that often cuts off access to many legitimate posts—in addition to sometimes being ineffective as users switch to homophones or other code words—it is still widely employed on the site. Below are seventy-one keywords (along with brief translations and notes) that are currently blocked from searching on Sina Weibo.  

direct link

I performed this test by utilizing research by Jeffrey Knockel into words that trigger surveillance and censorship on Sina UC and Tom-Skype. I grabbed his list of known sensitive words related to June 4 on those chat clients and tested them on Sina Weibo on June 3, 12:00 PM EST. The notes and translations above were provided by The Citizen Lab (with additions and edits by me).

[cross-posted at The Citizen Lab]

快闪党 (flash mob / kuài shǎn dǎng) is a “public gathering of complete strangers, organized via the Internet or mobile phone, who perform a pointless act and then disperse again.” Though the concept has existed in the past, the modern version was popularized by former Harper’s editor Bill Wasik, who organized a series of gatherings throughout New York City in 2003. They were mostly social experiments or a sort of performance art, and soon spread across the globe. This is in contrast to a “smart mob,” which is more directed and typically has a goal, the dîner en blanc phenomenon for instance, where people dress in all white and gather at specified locations for a secret dinner.

Why it is blocked: Even though most flash mobs do nothing more harmful than show off a few Michael Jackson pelvic thrusts, Chinese authorities still fear the idea of large numbers of people organizing in public spaces, perhaps viewing it as training for future political gatherings (the distinction between a flash mob and a protest hinges on the intention, but execution-wise, they are quite similar: see for instance the 散步 / “take a walk” demonstrations). Flash mobs, though often harmless and playful, have caused disorder and even violence in other countries, a situation Chinese authorities no doubt are keen to avert. (Flash mob is currently blocked on Weibo.)

Blocked on Weibo: What Gets Suppressed on China’s Version of Twitter (And Why)

I don’t think I’ve mentioned it here on this blog yet, but I’m excited to announce that a book I wrote is coming out this summer. (Above is an advance reader’s copy that my publisher The New Press shared.) It’s basically a version of this blog, also aimed at giving general readers the context for why certain topics in China are sensitive. There are over 150 entries, about 100 of which are brand new; the others, which come from this blog, have been updated. You can pre-order online now at your favorite online store or you can pick it up at your local bookstore in August. As we get closer to the publication date, I’ll start posting entries from the book more regularly. Thanks to everyone for their support of this project over the past year: couldn’t have done it without you, Tumblr, and everyone else who follows this blog!

<Where do Weibo users live? City and provincial breakdown of various Chinese Internet statistics>

They live in Guangdong (well, many of them do at least):

Some background: Now that I finally got around to playing with Weibo’s API, I’ve been collecting (you might call it hoarding…) a lot of fun data. I’m currently engrossed in this dataset I’ve developed of anti-Japanese comments and I’ve been doing a lot of spatial analysis—all of which is only possible because Weibo neatly provides a wealth of detailed location data included with every post/comment. Whereas Twitter offers whatever location a user supplies (“In your head”; “Your mom’s house”) along with a time zone (geo-coordinates and detailed location info are only available on a tiny percentage of tweets), Weibo’s API neatly gives you every user’s province, city code, and chosen location. The options are selected, not filled-in, so the data is super clean and crisp (well, outside of people who lie about their location).

Thus, seeing as it might be helpful for my other projects to know where Weibo users are blogging from (or at least say they are), I conducted a data expedition, grabbing the latest 200 posts from Weibo every five minutes for one full week. After discarding repeat messages (Weibo’s API doesn’t guarantee the posts are the absolute most recent, though for the most part, the majority of the posts matched my download date-time), I came up with a sample of 283,109 unique users, 236,611 of whom live in mainland China and which I used to generate the map above and chart below (this whole exercise was basically an excuse to show off some of Google’s super easy-to-use Fusion tables and an unnecessary distraction to my thesis writing, sigh).
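For anyone curious how the tallying works, here is a rough sketch of the deduplication and counting step. The `user_id` and `province` field names are assumptions made for illustration and are not Weibo’s actual API field names; the actual download code is not shown. The sketch also works out the upper bound on posts collected under the sampling schedule described above.

```python
from collections import Counter

# Upper bound on posts collected: 200 posts every 5 minutes for 7 days.
MAX_POSTS = 200 * (60 // 5) * 24 * 7  # 403,200


def province_counts(batches):
    """Count unique users per province across all downloaded batches."""
    seen = {}  # user_id -> province, keeping the first location seen
    for batch in batches:
        for post in batch:
            seen.setdefault(post["user_id"], post["province"])
    return Counter(seen.values())


# Hypothetical sample: two batches with one repeated user.
batches = [
    [{"user_id": 1, "province": "Guangdong"}, {"user_id": 2, "province": "Beijing"}],
    [{"user_id": 1, "province": "Guangdong"}, {"user_id": 3, "province": "Guangdong"}],
]

counts = province_counts(batches)
print(counts.most_common())
```

Counting users rather than posts keeps a handful of very prolific accounts from skewing the provincial breakdown.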

direct link


Wen Jiabao (“温家宝”) unable to be posted on Weibo; error message returned

I’m not certain when this began, but as of right now, you can’t post any message on Weibo with Wen Jiabao’s name (“温家宝”). Doing so returns the following message (full size image):


Rough translation: Sorry, this content violates “Sina Weibo’s Community Administrative Rules” or other related regulatory policies, and we’re unable to execute the intended action. If you need assistance, please contact customer service.

FreeWeibo shows posts containing Wen Jiabao still being deleted today. Searches for Wen’s name have been blocked continuously for some time now (he was unblocked briefly during the Party Congress and for the ten days after), but being unable to post his name at all is another, more extreme step. Attempting to post “彭博社” (Bloomberg) also returns the same error message. By comparison, I checked several hundred other sensitive politicians’ names in the past week and no one else had this form of censorship. Can folks confirm that they are unable to post 温家宝 on their end as well?