For a while now I have been collecting the titles of papers accepted at top system security conferences. I keep them all in text format on my laptop, so that I can easily search for keywords using grep.

A few months ago I was on a plane flying back from the US and I thought it would be interesting to extract some statistics from the data I had collected. It sounded like a good way to spend a few hours (it turned out I was wrong by two orders of magnitude), so I started writing some Python code to parse the information and plot some graphs. In the end I spent a considerable amount of time on this task, so I thought it was worth putting the results online.

UPDATE

Since people started to point out some errors, I need to version this document.
So, here it is (major changes if I add new stats, minor if stats are modified, build number for typos and spelling mistakes):
CURRENT VERSION: 1.7.0

The Dataset

The dataset includes all papers published since 2005 in four conferences: ACM CCS, Usenix Security, IEEE Security & Privacy (Oakland), and NDSS. These conferences (commonly referred to as the Top-4) are evenly distributed over the year and tightly connected in a cycle, with the notification of one conference preceding the deadline of the next by only a few days. Whether this is good or bad (and I lean towards the latter) is not the topic of this post.

Instead, let's get to the important questions..

  • Is the dataset representative?
    I think so. CCS, Oakland, Usenix Security, and NDSS are the four top conferences in the system security area. I am pretty sure everyone in the field would agree with me. However, they are certainly not the only ones. Other conferences are good and have a very long history (e.g., ACSAC), or are top venues in other domains that still accept system security papers (e.g., OSDI and WWW). However, overall I think that the Top-4 are the best option to capture the research and the community in system security.
    Sure, some of these conferences (mainly CCS and Oakland) are main venues for the security community at large, including the crypto and applied crypto guys. This can introduce a bias in some of the results.. but I guess we need to live with it because I am not going to manually remove non-system papers from the dataset.

    So far I have covered the last decade (11 years to be precise) to have a decent historical view and enough data to plot some graphs. Why not the last 15? Well, I will probably extend the dataset when I have some spare time.

  • Is the dataset clean?
    I don't think there is such a thing as a clean dataset. I collected the information from the conference websites and, when affiliations were not available, by downloading and manually parsing the PDF of the papers. This process is certainly prone to cut-and-paste errors.

    Many authors have names that contain non-ASCII characters, often spelled differently in different papers. So I transliterated everything to the closest ASCII representation. But even in the ASCII world there were problems. Are "Xin Cheng Zhang", "Xin C. Zhang", and "Xin Zhang" the same person? (It looks like it.) Then there are the undecided: those who lost their middle name over the years or who spell their name differently in each paper (Srinivas or Srini, Chris or Christopher, Mike or Michael, Sal or Salvatore, Yoshi or Tadayoshi, etc.). Some people seem to have fun randomly combining non-ASCII characters, abbreviations, different spellings, and one or more middle names, ending up spelling their name differently every time. C'mon…
    I wrote a little script to detect these suspicious cases (a rough sketch of the idea is shown right after this list), and I manually fixed a few hundred of them. Did I catch them all? I am sure I didn't. And I am sure I made some mistakes as well, but I did the best I could in the time I had.
    Finally, the authors' affiliations were a giant mess. Simple things like "UC" vs "University of California" were easy to fix. Sometimes authors added the name of their lab or department, and I had to google it to find out whether or not it was a different institution. Companies made things even more complex. I tried to keep branches in different countries separate (e.g., IBM India separate from T.J. Watson). But sometimes this was very hard and I'm sure I made several mistakes.
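
For the curious, this is roughly the idea behind the duplicate-detection script. It is a minimal sketch, not my actual code: the normalization rules and the 0.75 threshold are arbitrary choices for illustration, and it uses the unidecode package for the ASCII transliteration.

```python
# Sketch of the near-duplicate name detection described above (illustrative only).
from itertools import combinations
from difflib import SequenceMatcher
from unidecode import unidecode  # transliterates non-ASCII characters to ASCII


def normalize(name):
    # Transliterate to ASCII, lowercase, and drop single-letter middle initials like "C."
    parts = unidecode(name).lower().replace('.', '').split()
    return ' '.join(p for p in parts if len(p) > 1)


def suspicious_pairs(names, threshold=0.75):
    """Yield pairs of names similar enough that they might be the same person."""
    norm = {n: normalize(n) for n in names}
    for a, b in combinations(names, 2):
        ratio = SequenceMatcher(None, norm[a], norm[b]).ratio()
        if ratio >= threshold:
            yield a, b, ratio


# With this threshold, all three pairs below get flagged for manual review.
for a, b, r in suspicious_pairs(["Xin Cheng Zhang", "Xin C. Zhang", "Xin Zhang"]):
    print(f"{a!r} ~ {b!r} ({r:.2f})")
```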

Long story short, there are inaccuracies in the data and bugs in the code.
Think about weather forecasts: people try very hard to get them right, and the results are not completely random, but I would not bet my career on them.


At the moment, the dataset includes 44 venues, for a total of 1912 papers written by 3373 authors from 541 affiliations.

There are a few things that struck me when I looked at these statistics, but, as in any large enough dataset, I am sure everyone can find enough evidence to support or refute any hypothesis.

Let's start with one of the main comments you hear over and over if you work in this area: the number of submissions to the Top-4 is increasing every year and therefore it is getting harder to get a paper accepted.

So, let's start by looking at whether this is true.

./submission_tot.png ./acceptance_rate.png

These figures clearly show two things. The first is that the number of submissions is indeed rapidly increasing, even faster than linearly. However, the graph on the right shows that the acceptance rate is not going down. Actually, the percentage of accepted papers has slightly increased, and it was easier to get a paper accepted in 2015 than it was ten years ago. However, this came at a price: all the Top-4 are growing in size and reducing the time slots for presentations. Moreover, increasing the size of the conferences is only a temporary solution and does not address the core of the problem. Adding one more top conference (as is now being tried with Euro S&P) could also be beneficial, but only if it distributes the load instead of competing for a spot in the loop.
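
Roughly, the computation behind these two plots looks like the following simplified sketch. The file name and columns (venues.csv with venue, year, submitted, accepted) are just placeholders; the submission counts come from the statistics published by the conferences themselves, not from my paper list.

```python
# Sketch: total submissions and acceptance rate per year (hypothetical input file).
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("venues.csv")  # one row per conference edition: venue, year, submitted, accepted
per_year = df.groupby("year")[["submitted", "accepted"]].sum()
per_year["acceptance_rate"] = per_year["accepted"] / per_year["submitted"]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
per_year["submitted"].plot(ax=ax1, title="Total submissions per year")
per_year["acceptance_rate"].plot(ax=ax2, title="Acceptance rate per year")
plt.tight_layout()
plt.savefig("submissions_and_acceptance_rate.png")
```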

As a next step, I wanted to understand who is submitting all those papers. The general feeling is that the system security community is growing. However, this could mean more groups working in system security, or that the main groups are increasing in size and therefore submitting more and more papers.

./new_auth.png ./active.png ./new_affl.png

The left graph confirms that the community is growing, with more and more people joining the field and publishing in the Top4. But students graduate and move on, so this could just be the result of the natural PhD cycle. So, in the second graph I plotted the number of "active" researchers, where by active I mean people who have published at least once (in the Top4) in the past two years. This confirms that the community is definitely increasing in size.
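
In case the definition is not clear, the "active" count per year boils down to something like this (a simplified sketch; the papers DataFrame and its column names are placeholders for my actual data):

```python
# Sketch: number of "active" researchers per year, i.e. authors with at least
# one Top4 paper in the current or previous year (window = 2).
import pandas as pd


def active_per_year(papers: pd.DataFrame, window: int = 2) -> pd.Series:
    """`papers` is assumed to have one row per (paper, author) with columns: year, author."""
    first, last = int(papers["year"].min()), int(papers["year"].max())
    counts = {}
    for y in range(first + window - 1, last + 1):
        recent = papers[papers["year"].between(y - window + 1, y)]
        counts[y] = recent["author"].nunique()
    return pd.Series(counts, name="active_authors")
```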

What I found more surprising is the plot on the right. Until 2010 the number of new institutions was decreasing, meaning that we were reaching a plateau in which new people were part of organizations already active in the area. However, after 2010 there was a rapid increase in the number of new organizations, showing that the community is also growing because more companies and universities are joining the field every year.

./authors_x_paper.png ./single_affl.png

More organizations also mean more collaborations. The average number of authors per paper is increasing (blue line on the left graph), while the number of papers written by a single institution is rapidly decreasing (right graph). The trophy for the largest number of authors goes to "Manufacturing Compromise: The Emergence of Exploit-as-a-Service" with 18 names from six different institutions (the red line on the left graph shows the maximum per year). At the other end of the spectrum, I counted 39 single-author papers in the dataset. In this small club, only three authors appear twice: Michael Goodrich and Ben Adida (both with only single-author papers) and Florian Kerschbaum.
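
The per-year lines in the left graph (and the single-institution count on the right) come from a simple aggregation along these lines (a sketch; the papers DataFrame and its column names are placeholders):

```python
# Sketch: per-year author statistics behind the two plots above.
import pandas as pd


def author_count_stats(papers: pd.DataFrame) -> pd.DataFrame:
    """`papers` is assumed to have one row per paper with columns: year, n_authors, n_affiliations."""
    return papers.groupby("year").agg(
        avg_authors=("n_authors", "mean"),   # blue line: average number of authors per paper
        max_authors=("n_authors", "max"),    # red line: largest author list of the year
        single_institution=("n_affiliations", lambda s: (s == 1).sum()),  # papers from a single institution
    )
```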

Yes, the system security community is growing fast.
New researchers and institutions join the field every year. This also translates into an increasing number of submissions to the top conferences.

Geolocation

./us_vs_world.png ./country_ratio.png

Now this is more interesting. Ten years ago the system security field was dominated by the US. However, since then the rest of the world (Europe in particular) has been growing, and international collaborations are now very common, accounting for more than 20% of the papers.

Another way to look at it is to check the last affiliation of every author in the list and get a total headcount per country. The pie chart on the right shows every country over the 1% mark. It looks like the USA alone still has around 63% of the researchers in system security (European Union, are you listening? This needs to change) and Germany holds a solid second place. Interestingly, both countries have good national funding and long Ph.D. studies (5 years and sometimes more). Coincidence?
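
For reference, the per-country shares in the pie chart are computed more or less like this (a sketch; the authors DataFrame and its country column are placeholders):

```python
# Sketch: fraction of researchers per country, based on each author's most recent affiliation.
import pandas as pd


def country_shares(authors: pd.DataFrame) -> pd.Series:
    """`authors` is assumed to have one row per author, with a 'country' column
    taken from that author's last known affiliation."""
    shares = authors["country"].value_counts(normalize=True)
    return shares[shares >= 0.01]  # keep only the countries over the 1% mark
```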

The United States is still leading, but the rest of the world is slowly catching up.
More importantly, more and more papers are the result of international collaborations.. and this is a good thing.

Program Committees


Nothing too surprising from the program committee memberships. To respond to the increasing number of submitted papers, all conferences adopted larger committees. From 2005 to 2015, NDSS and Oakland doubled in size, CCS tripled, and Usenix's technical committee almost quadrupled.

Overall, almost 600 people have served on the Top4 committees since 2005. Around 28% of them never published a paper in these venues in the same period. This could be good (different points of view, more people from industry, …) and bad (are they all able to judge the quality of a paper?) at the same time.

Another good thing is that the workload for the reviewers has remained more or less constant, as you can see from this figure:

./committees.png

If we assume an average of three reviews per paper (two for early rejects, four or more for the controversial ones), that works out to around 21 reviews per PC member. Quite a lot of work done (for free) for the community.
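
The back-of-the-envelope calculation is simply submissions × reviews-per-paper ÷ committee size; for example (the 450 and 65 below are made-up numbers, not actual counts from the dataset):

```python
# Sketch: estimated review load per PC member for one conference edition.
def reviews_per_member(submissions: int, pc_size: int, reviews_per_paper: float = 3) -> float:
    return submissions * reviews_per_paper / pc_size


# A hypothetical venue with 450 submissions and a 65-person committee:
print(round(reviews_per_member(450, 65)))  # -> 21
```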

For the moment, the Top4 have been able to cope with the increasing number of submissions by increasing the size of their technical committees and by accepting more papers. But is this sustainable in the long term? What else can we do?
More conferences? More parallel tracks? Sub-areas with separate committees like WWW?
Oakland 2016 is trying to discourage multiple submissions from the same authors.. but is this the real problem?

People and Affiliations


If you are curious about who the most influential researchers in the field are (according to paper count alone!), here is the top ten:


Rank | Name | Papers | Max | Chair | TPC | Slams | Venues | Top4 | Co-Authors | Avg | Last Affiliation(s)
1 | Christopher Kruegel | 53 | 4 | 1 | 19 | 3 | 32 | 4 | 83 | 4.87 | University of California - Santa Barbara
2 | Dawn Song | 49 | 4 | 0 | 6 | 2 | 31 | 4 | 101 | 4.49 | University of California - Berkeley
3 | Wenke Lee | 45 | 3 | 2 | 24 | 5 | 32 | 4 | 74 | 4.76 | Georgia Institute of Technology
4 | Giovanni Vigna | 41 | 4 | 2 | 12 | 2 | 26 | 4 | 64 | 4.95 | University of California - Santa Barbara
5 | XiaoFeng Wang | 38 | 5 | 0 | 11 | 2 | 25 | 4 | 76 | 5.21 | Indiana University - Bloomington
6 | Dan Boneh | 30 | 2 | 1 | 11 | 0 | 22 | 4 | 75 | 4.40 | Stanford University
7 | Engin Kirda | 28 | 3 | 1 | 17 | 0 | 23 | 4 | 46 | 4.61 | Northeastern University
7 | Michael K. Reiter | 28 | 4 | 0 | 18 | 0 | 20 | 4 | 44 | 3.79 | University of North Carolina - Chapel Hill
9 | David Brumley | 26 | 3 | 0 | 12 | 0 | 19 | 4 | 48 | 4.00 | Carnegie Mellon University
10 | Elaine Shi | 25 | 4 | 0 | 5 | 1 | 13 | 4 | 53 | 4.68 | University of Maryland - College Park ; Cornell University
10 | Fabian Monrose | 25 | 2 | 1 | 16 | 2 | 22 | 4 | 37 | 4.24 | University of North Carolina - Chapel Hill

The complete list (with sortable columns) that includes everyone with at least two publications is available here.

Legend:

  • max : max number of papers in the same venue

  • chairs : number of Top-4 venues for which the person was PC Chair

  • slams : number of times in which the author had a paper in all the Top-4 conferences in the same year.

  • tpc : number of technical program committee memberships

  • venues : total number of venues in which the author had a paper

  • top4 : in how many of the Top-4 the authors had a paper (1 to 4)

  • co-authors : total number of co-authors in Top-4 publications

  • avg : average number of authors per publication

Playing with the full table you can notice several interesting things.
Like the fact that the maximum number of papers an author got in a single conference is 5 (but more than 10 people got to 4!). Or that Dawn Song is the professor with the largest number of co-authors (101). Overall, 235 people published more than five times in the Top-4. As a comparison, the The System top 50 reports 53 people over this threshold in the System community. But that number was computed for 33 venues spanning many many years.. so it is hard to make a comparison.

Recently Oakland introduced some new rules to try to limit (or at least keep an eye on) the number of papers submitted by a single author. This is something that is often discussed by Steering Committees, probably because there is a general feeling that large groups flood conferences by resubmitting many papers over and over. I do not have the data to either support or disprove this point. However, I can plot the maximum number of papers a single author got into each venue over the years:

./max_papers_per_person.png

Maybe it means nothing, but it looks like top researchers always had many papers accepted (and therefore likely submitted) to top conferences. This is not a new (or increasing) trend.

Now, the top ten institutions:

Rank | Name | Papers | Coverage (%) | Size | Country
1 | Carnegie Mellon University | 141 | 95.5 | 137 | USA
2 | Microsoft Research - US | 118 | 86.4 | 88 | USA
3 | University of California - Berkeley | 114 | 95.5 | 111 | USA
4 | Georgia Institute of Technology | 81 | 86.4 | 71 | USA
5 | University of Illinois - Urbana Champaign | 66 | 77.3 | 66 | USA
6 | Stanford University | 65 | 75.0 | 82 | USA
7 | University of California - Santa Barbara | 63 | 79.5 | 51 | USA
8 | University of California - San Diego | 60 | 72.7 | 65 | USA
9 | University of Maryland - College Park | 56 | 56.8 | 59 | USA
10 | Columbia University | 49 | 61.4 | 48 | USA
10 | Indiana University - Bloomington | 49 | 59.1 | 39 | USA

(the complete list with sortable columns is available here)

Legend:

  • coverage : percentage of venues in which the institution had at least one paper

  • size : number of Top-4 authors affiliated to the institution

A few things worth noting:

  1. Every single entry in the top ten is from the US (the first institution outside the US is ETH in position #13)

  2. CMU and Berkeley had a paper in almost every single venue

  3. There is only one company in the top ten, alongside nine universities (three of them part of the University of California system). However, this is partly a consequence of the fact that I geographically split some large corporations into different branches. If you sum all the research centers together, you get enough publications to move IBM into the top ten, and Microsoft to first place!!

  4. None of the top companies doing research in system security is a cybersecurity company.

  5. CMU and Berkeley have the most people on the authors list and are the only ones above 100 (all MS researchers together would still be below CMU).

  6. Where are the cyber-security companies? (RSA is the first one, in position 35)
    Being a "system" community, it would be nice to see a mix of academia and industry. Is our community still too small? Are the companies only interested in BlackHat and Virus Bulletin? Or are Microsoft and Google our main security companies?
    It is sad, but I am afraid that in system security academia thinks companies don't do research, while companies think academic papers are just theoretical exercises and that professors are completely unaware of what happens in the real world and of what has previously been done outside the closed world of scientific conferences. I hope I am wrong.

Even though there are almost 600 unique affiliations in the dataset, the cumulative distribution function shows that a few institutions are basically running the show. In fact, the top 10% of institutions account for over 77% of the papers.

./institutions_cdf.png
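
The 77% figure comes from a cumulative-share computation along these lines (a sketch; the paper_affiliations DataFrame and its columns are placeholders, and a paper with several institutions is counted once for each of them):

```python
# Sketch: share of papers accounted for by the top 10% of institutions.
import numpy as np
import pandas as pd


def top_decile_share(paper_affiliations: pd.DataFrame) -> float:
    """`paper_affiliations` is assumed to have one row per (paper, institution)
    with columns: paper_id, institution."""
    counts = (paper_affiliations.groupby("institution")["paper_id"]
              .nunique()
              .sort_values(ascending=False))
    top_k = max(1, int(np.ceil(0.10 * len(counts))))
    return counts.iloc[:top_k].sum() / counts.sum()
```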

Groups and Collaborations


Another fun thing I can do is build a giant graph with all the authors as nodes and an edge whenever two researchers worked together (i.e., shared at least one paper). Unfortunately, the graph contains over 3K nodes and 11K edges, so it is not very easy to plot. What is interesting is that almost 85% of the authors are part of a giant connected component! In other words, you can move from co-author to co-author and reach almost everyone in the field. The most prolific author who is not connected to the giant component? There are only nine authors with more than three papers outside the giant component, and the group is led by Ralf Kuesters with eight.
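
Building the graph and measuring the giant component is straightforward with networkx (a sketch; the papers dictionary is a placeholder for my actual data):

```python
# Sketch: co-authorship graph and size of its giant connected component.
from itertools import combinations
import networkx as nx


def coauthor_graph(papers: dict) -> nx.Graph:
    """`papers` is assumed to map a paper id to its list of author names."""
    G = nx.Graph()
    for authors in papers.values():
        authors = sorted(set(authors))
        G.add_nodes_from(authors)
        G.add_edges_from(combinations(authors, 2))  # one edge per pair of co-authors
    return G


def giant_component_share(G: nx.Graph) -> float:
    """Fraction of authors that belong to the largest connected component."""
    giant = max(nx.connected_components(G), key=len)
    return len(giant) / G.number_of_nodes()
```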

The largest clique (i.e., the largest subgraph in which everyone is directly connected with everyone else) corresponds to the 18 authors of the "Manufacturing Compromise: The Emergence of Exploit-as-a-Service" paper. It is interesting that there is nothing larger than that.

If we want something more compact, we can reduce the graph by keeping only the authors who published a minimum number of papers. With a threshold of 9 papers we get a single connected graph with 114 nodes, 354 edges, and a diameter of 8. Still too crowded for a nice picture.

If we want the main groups to emerge from the graph, we need to follow a different approach to remove the noise. For instance, we can emphasize strong collaborations by limiting the analysis to pairs of authors with at least five papers in common (independently of the absolute number of papers they published), and then plot the groups with at least three individuals.
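
This filtering step can be expressed in a few lines (again a sketch, reusing the same placeholder papers dictionary as above):

```python
# Sketch: keep only "strong" collaborations (>= 5 shared papers) and drop
# groups smaller than 3 people.
from collections import Counter
from itertools import combinations
import networkx as nx


def strong_collaboration_graph(papers: dict, min_shared: int = 5, min_group: int = 3) -> nx.Graph:
    """`papers` is assumed to map a paper id to its list of author names."""
    shared = Counter()
    for authors in papers.values():
        for pair in combinations(sorted(set(authors)), 2):
            shared[pair] += 1
    G = nx.Graph()
    G.add_edges_from((a, b) for (a, b), n in shared.items() if n >= min_shared)
    keep = [c for c in nx.connected_components(G) if len(c) >= min_group]
    return G.subgraph(set().union(*keep)).copy()
```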

Finally, here you can see the main connections (big picture here) and clearly identify 19 clusters:

./main_groups.png

Still too big for your taste?
Ok, this is the same graph plotted with a threshold of 10 papers. It only has 15 nodes, but it shows the most productive collaborations between professors in our field.

./connected_graph_10.png

Before You Ask


  • I noticed a mistake in the authors or affiliations list
    Great, please let me know and I'll fix it and re-run the stats. You can find my email on my homepage.

  • Wait a second. Why didn't you use DBLP?
    TopResearcher uses DBLP to extract the list of all papers in several top conferences, including the Top4. However, Kruegel has 30-something papers in that list and over 50 in mine. The difference is probably due to misspelled names, but to be safe I preferred to re-parse the names myself.
    Moreover, I also wanted all the affiliations for each author, and that information was not available in DBLP.

  • It would be nice to have dynamically-generated pages and interactive plots
    Yes, I agree. What would be even nicer is to use a library to nicely display large customizable graphs. Something like yFiles would be fantastic for this purpose. But it is not free even for academia. So, unless you want to give me a license, this is not going to happen :(

  • Can you share your data/code?
    All the data I used is public: just check the programs on the conference homepages. I fixed some of the data itself, but I also applied transformations and normalizations in the code. However, the code is too ugly and fragmented to share at the moment. I will share it after I clean it up and I'm done adding stuff (I have a few more ideas in mind).

  • Will you keep this page up to date?
    Probably. It would be nice to go back to 2000 and then update it once a year after CCS publishes the accepted papers. Let’s see..

  • Do you think this can be used as a global cyber-security ranking?
    What? Absolutely not!
    This page only counts the number of publications in the Top4 in the past ~10 years. Sure, this is a measure of success in the field, but it says absolutely nothing about the quality and the impact of the work.

  • You know what would also be nice to check…
    Please, tell me. I love plotting graphs :D