<List(s) of Chinese keywords for censorship testing and sensitive content collection>

Last week, a researcher during The Citizen Lab’s annual Connaught Summer Institute workshop raised an interesting problem. She wanted to test for censorship on a Chinese online service, and she had somewhat limited resources and time. What keywords should she use for her test?

In theory, this is a solved problem: numerous lists of censored and sensitive Chinese keywords are available on the web, including those shared by this site. However, a given list may be too broad for one’s purposes, or may simply have too many keywords to use efficiently. And what if I only want to test the most sensitive keywords, e.g., Falun Gong, June 4, Xi Jinping, and so on? For those not experienced in Chinese or Internet censorship, winnowing existing lists down to something more usable can be a daunting task.

Thus, a few of us sat down at the workshop, collected 8 known Chinese keyword lists (see below), and aggregated them into a single, easily shareable and sortable file, which we’ve posted to Github. The CSV files contain not just the keywords but also other information such as translations and tags (though not for every keyword; this is an ongoing open-source project to which you are welcome to contribute).

As of Aug 4, there are 8,087 sensitive keywords collected from 8 different lists. To get a sense of what data is included in these CSV files, you can view a spreadsheet of these 8,087 keywords sorted by the number of lists they appear on.
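The aggregation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code behind the Github repository: the function names, the list names, and the toy keyword membership are all invented for the example.

```python
from collections import defaultdict


def aggregate(lists):
    """Map each keyword to the sorted names of the lists it appears on."""
    sources = defaultdict(set)
    for list_name, keywords in lists.items():
        for keyword in keywords:
            keyword = keyword.strip()
            if keyword:  # skip blank entries
                sources[keyword].add(list_name)
    return {kw: sorted(names) for kw, names in sources.items()}


def rank_by_list_count(aggregated):
    """Keywords that appear on the most lists come first; ties sort by keyword."""
    return sorted(aggregated.items(), key=lambda item: (-len(item[1]), item[0]))


# Toy data: list names and membership are hypothetical, for illustration only.
toy_lists = {
    "list_a": ["六四", "法轮功", "坦克"],
    "list_b": ["六四", "法轮功"],
    "list_c": ["六四"],
}

ranked = rank_by_list_count(aggregate(toy_lists))
for keyword, names in ranked:
    print(keyword, len(names), ",".join(names))
```

Sorting by the number of lists a keyword appears on is one rough way to pick a short, high-priority test set: a keyword that shows up on many independently collected lists is more likely to be sensitive across services.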

Creator | Tested on/found from | # of keywords | Year | Method + source
The Citizen Lab | Sina UC | 1,818 | 2013 | reverse engineered from the client; more analysis here; download link
The Citizen Lab | Tom-Skype | 2,574 | 2013 | reverse engineered from the client; more analysis here; download link
The Citizen Lab | LINE | 673 | 2014 | reverse engineered from the client; more analysis here; download link
Jason Q. Ng (Blocked on Weibo) | Sina Weibo | 839 | 2013 | running Chinese Wikipedia article titles through Sina Weibo search; more analysis and book
Xia Chu | Great Firewall | 669 | 2014 | HTTP request scans of Chinese Wikipedia articles to see if they’d trigger a GFW block; more analysis here; download link (removed duplicates and keywords related to meta and user pages)
China Digital Times | Sina Weibo | 2,448 | ongoing | crowdsourced testing of suspected sensitive keywords on Sina Weibo; more analysis on CDT and in CDT’s Grass Mud Horse Lexicon e-book; download link
GreatFire.org | Wikipedia | 488 | 2013 | testing to see if Wikipedia pages are available in China; more info; download link
Google/ATGFW.org | Google/Great Firewall | 456 | 2012 | ATGFW.org and GreatFire.org reverse engineered the keywords Google was using to warn users of censorship while using its service in China; download link

To keep up with future changes to these lists, you can follow the Github repository. You are encouraged to adapt and update these lists as you see fit; please credit the Github repo if you do. Hopefully this is helpful to researchers who are searching for sensitive content in Chinese or testing for network interference.



<64 Tiananmen-related words blocked today (June 4, 2014)>

image

     (photo credit: CND.org?)

Today, on the 25th anniversary of troops being ordered into Tiananmen Square to clear student demonstrators, I tested several hundred June 4-related keywords on Weibo. I used the same set of keywords that I tested last year at The Citizen Lab, and I found relatively similar levels of censorship this year. Also like last year, the enhanced restrictions on keyword searches were apparently implemented specifically for the anniversary: for instance, 坦克 (tank) and 六四 (6-4, i.e., June 4) were free to be searched as recently as May 11.

A short article over at WSJ provides more context and lists the 64 keywords that I identified as being blocked from searching on Weibo today. Of the keywords I tested, nine that were unblocked last year have been added to the blacklist this year, including 八九 (89), 维多利亚公园 (Victoria Park, the site of the commemorative vigil in Hong Kong), and VIIV (Roman numerals for June 4).
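Because the same set of keywords was tested in both years, the year-over-year comparison reduces to set arithmetic. The following is a minimal sketch, not the script actually used; the function name is made up and the toy inputs are placeholders rather than the real test results.

```python
def compare_years(blocked_last_year, blocked_this_year):
    """Compare two rounds of testing run over the same keyword set."""
    last, this = set(blocked_last_year), set(blocked_this_year)
    return {
        "newly_blocked": this - last,  # searchable last year, blocked now
        "unblocked": last - this,      # blocked last year, searchable now
        "still_blocked": last & this,  # blocked in both rounds
    }


# Toy inputs for illustration only; these are not the actual test results.
last_year = ["六四", "坦克"]
this_year = ["六四", "坦克", "八九", "VIIV"]

diff = compare_years(last_year, this_year)
print(len(diff["newly_blocked"]), "newly blocked:", sorted(diff["newly_blocked"]))
```

The comparison is only meaningful for keywords tested in both rounds; a keyword missing from one round's input would otherwise show up as a spurious change.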

After the jump are the 64 keywords that are currently blocked from searching on Weibo (though some will no doubt be unblocked once this sensitive period passes):

Read More



<The Chinese keywords on messaging app LINE’s “bad words” list and why they are “bad”>

For an updated summary of the Citizen Lab’s excellent research into censorship on LINE to which I contributed, see this Nov 21, 2013 post.

Back in May, Twitter user @hirakujira was poking around in the code for Lianwo, the Chinese version of the popular mobile chat app LINE, when he noticed a curious line: “<key>warning.badWords</key>” followed by a string that read in Chinese: “Your message contains sensitive words, please adjust and send again.” Hirakujira subsequently identified the application files which contain these so-called “bad words” and posted them. The Next Web and Tech in Asia reported that even though LINE (which is a Japanese spin-off of Naver, a Korean company) wasn’t yet actively censoring messages sent through its Chinese-branded app, the inclusion of such files indicated that the capability had been built into the program, a forward-thinking move for any foreign content provider/distributor that hopes to succeed in China.

What I hope to do in the next few days is take a closer look at roughly the first 40 of the 150 words that Hirakujira posted, translating them and explaining their significance with respect to current Chinese politics. Some are quite obvious, but others are quite obscure. By examining the words, we may get a sense of what LINE thinks is worth censoring in order to appease its Chinese regulators. So for the next few days, consider this site rebranded as “Blocked on LINE (maybe in the future).”

The twenty-one posts (links will be added as the posts go up):

  1. 浙江签单哥 / Zhejiang’s receipt-signing Brother
  2. 警察杜平 / Police Dupin; 宣恩杀人现场 / Xuanen murder scene
  3. 叶迎春内衣 / Ye Yingchun underwear; 叶迎春 / Ye Yingchun
  4. 孙国相拆迁 / Sun Guoxiang demolition
  5. 中央领导内幕 / Central leadership insider
  6. 盘锦开枪 / Panjin shot; 四学者建言 / Four scholars suggestions
  7. 盘锦二表哥姜伟华 / Panjin, Second Watch Brother: Jiang Weihua; 姜伟华名表 / Jiang Weihua namebrand watches; 江诗丹顿 表叔 / Vacheron Constantin uncle
  8. 只身挡坦克 / Tanks block alone
  9. 爆料不孝女 / Expose: unfilial daughter; 爆料朱熹后人 竟是政协委员 / Expose: Zhu Xi’s descendants, suddenly CPPCC committee members
  10. 人大附中择校费杨东平 / Renmin High School, school choice fees; Yang Dongping
  11. 奥数叫而不停 / Complaints about Math Olympiad have not ceased
  12. 帝都 实行宵禁 / Imperial Capital implements night curfew
  13. 11月5日至15日 出租车禁行 / Nov 5 to 15 rental cars banned; 表叔 陈应春 / Uncle Chen Yingchun
  14. 江泽民被控制 / Jiang Zemin has been controlled; 江系军委被撤 / Jiang withdraws from Military Commission
  15. 张蓓莉200万耳环 / Zhang Peili 2 million RMB earrings; 温家 戴梦得 / Wen [Jiabao] Diamond; 温家宝 27亿 / Wen Jiabao 2.7 billion [USD]; 影帝温家 / Actor Wen Jiabao; 温家 资产700亿 / Wen Jiabao assets 70 billion
  16. 网络封锁 / Internet blockade
  17. 维族 砍人 / Uyghurs stab people
  18. 和田 暴乱 / Hotan rebellion
  19. 万鄂湘亚视 / Wan Exiang, Asia Television Limited
  20. 李正源李刚 / Li Zhengyuan, Li Gang; 交警夏坤 / Traffic cop Xia Kun
  21. 64屠城 / June 4 massacre


<Interactive charts showing changes in Weibo keyword censorship (Jun - Aug 2013)>

Thanks to the excellent work being done by researchers and journalists at China Digital Times, GreatFire.org, and many others, there has never been more information about what is being censored online in China. What is less often discussed and written about, however, is when the censors withdraw keywords or topics from their censorship watchlists.

Read more about what data is represented (and misrepresented!) in these charts in my Citizen Lab post: Visualizing Changes in Censorship: Summarizing two months of Sina Weibo keyword monitoring with two interactive charts.

If for some reason the charts don’t load (either because of rate limiting by Google or other reasons), the JavaScript code used to create the charts can be viewed here and here. You can then either host the file on your own site to interact with the charts or copy/paste it into the Google Code Playground.
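For readers who want to compute this kind of change log from their own test results, here is a rough sketch of how block and unblock events could be extracted from repeated tests. It is not the code behind the charts; the data layout, the function name, and the dates and keywords in the toy history are assumptions made for the example.

```python
def status_changes(history):
    """Return (date, keyword, new_status) for every change between consecutive tests.

    history maps each keyword to a list of (iso_date, is_blocked) observations.
    """
    changes = []
    for keyword, observations in history.items():
        ordered = sorted(observations)  # ISO date strings sort chronologically
        for (_, previous), (date, current) in zip(ordered, ordered[1:]):
            if previous != current:
                changes.append((date, keyword, "blocked" if current else "unblocked"))
    return sorted(changes)


# Hypothetical observations; keywords and dates are placeholders.
toy_history = {
    "keyword_a": [("2013-06-15", True), ("2013-07-01", False), ("2013-07-15", False)],
    "keyword_b": [("2013-06-15", False), ("2013-07-01", False), ("2013-07-15", True)],
}

changes = status_changes(toy_history)
for date, keyword, new_status in changes:
    print(date, keyword, new_status)
```

A change is only dated to the test on which it was first observed, so the real switch happened at some point between that test and the previous one.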

Unique China Chats words blocked or unblocked on Sina Weibo (click on a part of a bar and then on “See all keywords” to view the keywords that fall into that category on that test date):

Unique China Chats keywords with changes in block status:



<480 keywords blocked from searching on Weibo as of Jun 29, 2013>

During the past month, I’ve been working as a summer research fellow at The Citizen Lab in Toronto. It’s been great not only to have time to dedicate to updating this blog and pushing forward collaborative projects with researchers I’ve been fortunate enough to meet over the past two years, but also to pitch in with all the amazing work being done here at this one-of-a-kind lab. Among the projects I’ve been helping out with is one pertaining to the lists of censorship and surveillance keywords in the Chinese chat clients TOM-Skype and Sina UC, which the team decrypted and then analyzed in collaboration with Jed Crandall and Jeff Knockel at the University of New Mexico.

Of course, my first desire was to take the keywords they extracted and to test them on Weibo. Below are 480 unique keywords which were blocked from searching on Sina Weibo as of June 29. I’ve written more about the other censorship games I’ve detected in this post over at The Citizen Lab’s blog. Among the things I discuss are the overlap of keywords between different Internet services in China as well as what drastic changes in the number of search results for keywords might mean.

A full spreadsheet of the data mentioned in the report can be viewed in this Google Fusion Table or downloaded in .csv format for further analysis by all you researchers reading along at home. I look forward to sharing other relevant work my colleagues and I get done at the Lab during the rest of the summer.



<Censoring a commemoration: what June 4-related search terms are blocked on Weibo today>

As citizens in China and around the world commemorate the twenty-fourth anniversary of the June 4th incident in Beijing’s Tiananmen Square, Internet censorship in China around this sensitive date has become expected and almost routine. As Tech in Asia notes, the censorship this year likely won’t be as intense as it was during the twentieth anniversary, when hundreds of sites went down for so-called “Internet maintenance,” and websites are considering more sophisticated ways of filtering out June 4-related posts. Even so, much overt censorship will still take place on sites behind the Great Firewall, including seemingly trivial steps like preventing users from inserting the candle emoticon into Sina Weibo posts.

Another way the social media site Sina Weibo censors content, alongside manual deletion of sensitive posts by human censors, is by blocking users from searching for specific keywords and instead returning a message that says no results can be displayed. Keyword blocking is a blunt tactic that often cuts off access to many legitimate posts, and it is sometimes ineffective as users switch to homophones or other code words, but it is still widely employed on the site. Below are seventy-one keywords (along with brief translations and notes) that are currently blocked from searching on Sina Weibo.

direct link

I performed this test by utilizing research by Jeffrey Knockel into words that trigger surveillance and censorship on Sina UC and Tom-Skype. I grabbed his list of known sensitive words related to June 4 on those chat clients and tested them on Sina Weibo on June 3, 12:00 PM EST. The notes and translations above were provided by The Citizen Lab (with additions and edits by me).
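The general shape of such a test can be sketched as follows. This is not the script I used: the search function is injected so that no real Weibo endpoint or API is assumed, and the notice string, function names, and keywords are all hypothetical placeholders.

```python
import time


def check_keywords(keywords, search, block_notice, delay_seconds=0.0):
    """Search each keyword and flag it as blocked if the notice text appears.

    `search` is any callable that takes a keyword and returns the page text.
    """
    results = {}
    for keyword in keywords:
        page_text = search(keyword)
        results[keyword] = block_notice in page_text
        if delay_seconds > 0:
            time.sleep(delay_seconds)  # pause between queries
    return results


# Placeholder for the "no results can be displayed" message (not the real text).
HYPOTHETICAL_NOTICE = "NO_RESULTS_CAN_BE_DISPLAYED"


def fake_search(keyword):
    """Stand-in for a real search request, so the sketch runs offline."""
    if keyword == "keyword_1":
        return "<html>" + HYPOTHETICAL_NOTICE + "</html>"
    return "<html>some results</html>"


results = check_keywords(["keyword_1", "keyword_2"], fake_search, HYPOTHETICAL_NOTICE)
blocked_now = sorted(k for k, is_blocked in results.items() if is_blocked)
print(blocked_now)
```

Recording the date and time of each run alongside the results matters, since the set of blocked keywords changes around sensitive dates.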

[cross-posted at The Citizen Lab]



<378 words that are blocked on Weibo as of March 13, 2012>

I decided to run a re-test of my initial list of blocked words this morning. Below you’ll find 378* keywords that are blocked as of March 13, 2012. (Note: these are words blocked on Sina Weibo; this is not a list of words blocked by the Chinese government. Please read this Disinfo article before re-using content in this post.)

direct link

Of the 1,300 mostly unique words I found to be unsearchable in my initial test in Nov/Dec 2011, 933 were subsequently unblocked some time between late January and early February 2012. But apparently that was an overreach, and as of this morning, 393 of those 933 have been re-blocked (including 五毛 [Fifty Cent Party], 轮奸 [gang rape/gangbang], and 梯恩梯 [TNT], among others). I want to double-check that some of the longer words are indeed unique (that is, verify which root words cause them to be blocked), so this list contains only words of four characters or fewer (though I noticed after the fact that some non-unique words remain; for instance, there are a few containing 八八). I added a few longer English words that I thought were of note, along with some others from another final Wikipedia list that I generated, giving us the above 378 words that are blocked as of this morning. Please note that these are terms that return an error message when you try to search for them on Weibo. As far as I know, you are free to post these words in a message. (Of course, there is the potential for censoring after the fact…**)
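The filtering step described above (keep words of four characters or fewer, then set aside words that merely contain a shorter blocked word) could be automated roughly like this. It is a sketch rather than the process I actually followed; the function name is invented, and apart from 八八, 五毛, and 梯恩梯 the sample words are made up for illustration.

```python
def split_by_root(blocked_words, max_len=4):
    """Separate short blocked words into likely roots and words containing a root."""
    short = {word for word in blocked_words if len(word) <= max_len}
    roots, derived = [], {}
    # Visit shorter words first so that possible roots are known before longer words.
    for word in sorted(short, key=lambda w: (len(w), w)):
        contained = [root for root in roots if root in word]
        if contained:
            derived[word] = contained  # possibly blocked only because of a root
        else:
            roots.append(word)
    return roots, derived


# Sample input; 一九八八 and the seven-character entry are invented examples.
sample = ["八八", "一九八八", "五毛", "梯恩梯", "六四天安门事件"]

roots, derived = split_by_root(sample)
print(roots)
print(derived)
```

A substring match only suggests that a longer word is non-unique; confirming which root actually triggers the block still requires testing the pieces separately.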

For more about this project and how the Chinese government persuades Internet companies to self-censor, you can read my article up at Waging Nonviolence.

*Update: Forgot a few numbers like 64, 八八, and 1989. I’ve appended them to the bottom, but also removed a number of non-unique words I spotted after the fact (I left a few of the more interesting ones in) so this list now comprises 343 words. This list is a filtered subset of the total 1,574 words I uncovered to be blocked in early-2012, which you can view/download.

**See this Carnegie Mellon study and browse WeiboScope by the Journalism and Media Studies Centre at the University of Hong Kong for more on Weibo posts deleted by censors.



<Search result logs and full list of banned words>

UPDATE 3/21/12: Because the initial Google translations provided for these words have been misinterpreted as their actual meanings, I’m taking down the full list of words I’ve uncovered thus far. If you still really want to see them, you can maybe do a Wayback search for this page. Otherwise, please read my blocked word posts, in which I’ve provided proper translations and context for why they are blocked on Weibo. The words sampled here are words blocked on Sina Weibo; this is not a list of words blocked by the Chinese government. The list also changes frequently, so the words posted here may have since been unblocked. This is not a list of ALL words blocked on Weibo, merely the ones I found in my searches. Please read this Disinfo article before re-using content in this post.

UPDATE 2/25/12: I finally finished searching through the 700,000 Chinese Wikipedia keywords last month and have found roughly 1000 words to be blocked (you can download/view a list of all 1,574 here).

Sample of blocked words

direct link | full post and summary

direct link | full post and analysis