Last week, a researcher during The Citizen Lab’s annual Connaught Summer Institute workshop raised an interesting problem. She wanted to test for censorship on a Chinese online service, and she had somewhat limited resources and time. What keywords should she use for her test?
In theory, this is a solved problem: numerous lists of censored and sensitive Chinese keywords are available on the web, including those shared by this site. However, a given list may be too broad for your purposes, or may simply contain too many keywords to use efficiently. And what if you only want to test the most sensitive keywords, e.g., Falun Gong, June 4, Xi Jinping, and so on? For those without experience in Chinese or Internet censorship, winnowing existing lists down to something more usable can be a daunting task.
Thus, a few of us sat down at the workshop, collected 8 known Chinese keyword lists (see below), and aggregated them into a single, easily shareable and sortable file, which we’ve posted to GitHub. The CSV files contain not just the keywords but also other information such as translations and tags (though not for every keyword; this is an ongoing open-source project, and you are welcome to contribute).
As of Aug 4, there are 8,087 sensitive keywords collected from 8 different lists. To get a sense of what data is included in these CSV files, you can view a spreadsheet of these 8,087 keywords sorted by the number of lists they appear on.
| Creator | Tested on/found from | # of keywords | Year | Method + source |
|---|---|---|---|---|
| The Citizen Lab | Sina UC | 1,818 | 2013 | reverse engineered from the client; more analysis here; download link |
| The Citizen Lab | Tom-Skype | 2,574 | 2013 | reverse engineered from the client; more analysis here; download link |
| The Citizen Lab | LINE | 673 | 2014 | reverse engineered from the client; more analysis here; download link |
| Jason Q. Ng (Blocked on Weibo) | Sina Weibo | 839 | 2013 | running Wikipedia China article titles through Sina Weibo search; more analysis and book |
| Xia Chu | Great Firewall | 669 | 2014 | HTTP request scans of Wikipedia China articles to see if they’d trigger GFW block; more analysis here; download link (removed duplicates and keywords related to meta and user pages) |
| China Digital Times | Sina Weibo | 2,448 | ongoing | crowdsourced testing of suspected sensitive keywords on Sina Weibo; more analysis on CDT and in CDT’s Grass Mud Horse Lexicon e-book; download link |
| GreatFire.org | Wikipedia | 488 | 2013 | testing to see if Wikipedia pages are available in China; more info; download link |
| Google/ATGFW.org | Google/Great Firewall | 456 | 2012 | ATGFW.org and GreatFire.org reverse engineered the keywords Google was using to warn users of censorship while using their service in China; download link |
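For readers who want to do something similar with their own lists, here is a minimal Python sketch of the general idea: merge several keyword lists, record which lists each keyword appears on, and sort by that count. It is not the script used to build the GitHub files. The function names, the list names (`list_a`, etc.), and the toy data are all hypothetical, and the toy memberships do not reflect which real lists contain which keywords.

```python
from collections import defaultdict


def aggregate(lists):
    """Map each keyword to the set of list names it appears on."""
    sources = defaultdict(set)
    for name, keywords in lists.items():
        for kw in keywords:
            kw = kw.strip()  # ignore stray whitespace around a keyword
            if kw:
                sources[kw].add(name)
    return sources


def rank_by_list_count(sources):
    """Sort keywords by number of lists (descending), then by keyword."""
    return sorted(sources.items(), key=lambda item: (-len(item[1]), item[0]))


def most_shared(sources, min_lists):
    """Keep only keywords that appear on at least `min_lists` lists."""
    return [kw for kw, names in rank_by_list_count(sources)
            if len(names) >= min_lists]


# Hypothetical toy data, for illustration only.
toy_lists = {
    "list_a": ["法轮功", "六四", "习近平"],
    "list_b": ["法轮功", "六四"],
    "list_c": ["法轮功", "草泥马"],
}

sources = aggregate(toy_lists)
ranked = rank_by_list_count(sources)
shortlist = most_shared(sources, min_lists=2)

for kw, names in ranked:
    print(kw, len(names), sorted(names))
```

Raising `min_lists` is one simple way to narrow a large merged list down to the keywords that several independent sources agree on, which may be a reasonable starting point when testing time is limited.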
To follow future changes to these lists, you can follow the GitHub repository. You are encouraged to adapt and update these lists as you see fit; however, please credit the GitHub repo if you do. Hopefully this is helpful to researchers who are searching for sensitive content in Chinese or testing for network interference.