Last week, a researcher at The Citizen Lab’s annual Connaught Summer Institute workshop raised an interesting problem. She wanted to test for censorship on a Chinese online service, but she had limited time and resources. What keywords should she use for her test?
In theory, this is a solved problem: numerous lists of censored and sensitive Chinese keywords are available on the web, including those shared by this site. However, a given list may be too broad for one’s purposes, or may simply contain too many keywords to use efficiently. And what if you only want to test the most sensitive keywords, e.g., Falun Gong, June 4, Xi Jinping, and so on? For those not experienced in Chinese or in Internet censorship, winnowing existing lists down to something more usable can be a daunting task.
Thus, a few of us sat down at the workshop, collected 8 known Chinese keyword lists (see below), and aggregated them into a single, easily shareable and sortable file, which we’ve posted to GitHub. The CSV files contain not just the keywords but also other information such as translations and tags (though not yet for all keywords; this is an ongoing open-source project, and you are welcome to contribute).
As of Aug 4, there are 8,087 sensitive keywords collected from 8 different lists. To get a sense of what data is included in these CSV files, you can view a spreadsheet of these 8,087 keywords sorted by the number of lists they appear on.
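For readers who want to build a similar ranking from their own lists, here is a minimal Python sketch of the aggregation idea: merge several keyword lists, count how many lists each keyword appears on, and sort by that count. The list names (list_a and so on), the function names, and the CSV column names are all hypothetical and chosen for illustration; they are not the actual layout of the files in the repository.

```python
import csv
import io
from collections import defaultdict


def aggregate_keyword_lists(lists):
    """Map each keyword to the sorted names of the source lists that contain it."""
    sources = defaultdict(set)
    for list_name, keywords in lists.items():
        for keyword in keywords:
            cleaned = keyword.strip()  # ignore stray whitespace around entries
            if cleaned:  # skip blank entries
                sources[cleaned].add(list_name)
    return {kw: sorted(names) for kw, names in sources.items()}


def rank_by_list_count(aggregated):
    """Return (keyword, list_count, source_lists) rows, most widely listed first."""
    rows = [(kw, len(names), names) for kw, names in aggregated.items()]
    # Sort by descending count, then by keyword so ties have a stable order.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


def to_csv(rows):
    """Serialize ranked rows to CSV text (column names are illustrative)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["keyword", "list_count", "source_lists"])
    for keyword, count, names in rows:
        writer.writerow([keyword, count, ";".join(names)])
    return buf.getvalue()


# Toy input: three made-up lists, using the Chinese terms for
# Falun Gong, June 4, and Xi Jinping as sample keywords.
example_lists = {
    "list_a": ["法轮功", "六四", "习近平"],
    "list_b": ["法轮功", "六四 ", ""],
    "list_c": ["法轮功"],
}

ranked = rank_by_list_count(aggregate_keyword_lists(example_lists))
print(to_csv(ranked))
```

A researcher short on time could then keep only the rows whose count exceeds some threshold, on the assumption that keywords appearing on many independent lists are more likely to be broadly sensitive.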
To follow future changes to these lists, you can watch the GitHub repository. You are encouraged to adapt and update these lists as you see fit; if you do, please credit the GitHub repo. Hopefully this is helpful to researchers who are searching for sensitive content in Chinese or testing for network interference.