This is the first major update of my previous post (which you can still find here) that reports some statistics about the top four System Security conferences.

CURRENT VERSION: 2.1.0
(major version changes when I add new stats or new venues, minor when stats are modified, and the build number increases when I fix typos or spelling mistakes)

WHAT’S NEW
  • More Data: the dataset now covers 17 years and contains ~40% more papers than in the previous edition.

  • The affiliations table now comes in four versions: worldwide, US-only, Europe-only, and rest of the world.

  • The top 10% of the affiliations now have a dedicated page, showing plenty of information and a heatmap of their historic publication activity
    (see this example for ETH Zurich)

  • The authors page now includes a new column with the years of activity

  • This page includes more graphs and some new statistics.

The dataset

The dataset includes all papers published since 2000 in four conferences: ACM CCS, Usenix Security, IEEE Security & Privacy (Oakland), and NDSS. These conferences (commonly referred to as the Top-4) are evenly distributed over the year and are tightly connected in a cycle, with the notification of one conference preceding the deadline of the next one by only a few days (I still have to write a post to describe why this may not be a good idea).

If you are reading this for the first time, here are some important points:

  • Is the dataset representative?
    I think so. CCS, Oakland, Usenix Security, and NDSS are the four top conferences in the system security area. I am pretty sure everyone in the field would agree with me. However, they are certainly not the only ones. Other conferences are also good and have a very long history (e.g., ACSAC), or are top venues in other domains that still accept system security papers (e.g., OSDI and WWW). Overall, though, I think the Top-4 are the best option to capture the research and the community in system security.
    Sure, some of these conferences (mainly CCS and Oakland) are main venues for the security community at large, including the crypto and applied crypto guys. This can introduce a bias in some of the results.. but I guess we need to live with it because I am not going to manually remove non-system papers from the dataset.

    The dataset contains the last 17 years of these conferences. I considered extending the dataset even further, but I am not convinced it is a good idea. Was NDSS even considered a top conference in the 90’s?

  • Is the dataset clean?
    I think there is no such thing as a clean dataset. I collected all the information from the conference websites and, when affiliations were not available, by downloading and manually parsing the PDFs of the papers. This process is certainly prone to cut-and-paste errors (both mine and those of the conference website maintainers). Moreover, I noticed that whoever prepared the program sometimes used the paper title from the submission, which was later changed (even substantially) in the camera ready.

    Many authors have a name that contains non-ASCII characters, often spelled differently in different papers. So I translated everything to the closest ASCII representation. But even in the ASCII world there were problems. Are "Xin Cheng Zhang", "Xin C. Zhang", and "Xin Zhang" the same person? (Looks like it.) Then there are the undecided: those who lost their middle name over the years or who spell their name differently in each paper (Srinivas or Srini, Chris or Christopher, Mike or Michael, Sal or Salvatore, Yoshi or Tadayoshi, etc.). Some people seem to have fun randomly combining non-ASCII characters, abbreviations, different spellings, and one or more middle names, ending up spelling their name differently every time. C'mon…
    To make things worse, spelling errors (like "Hebert" instead of "Herbert") are even harder to spot.

    I wrote a little script to detect all these suspicious cases (a sketch of the idea follows below), and I manually fixed a few hundred of them. Did I catch them all? I am sure I didn’t. And I am sure I made some mistakes as well, but I did the best I could in the time I had.
    Finally, the authors' affiliations were a giant mess. Simple things like "UC" vs "University of California" were easy to fix. Sometimes authors added the name of their lab or department, and I had to google it to find out whether or not it was a different institution. Companies made things even more complex: I tried to keep branches in different countries separate (e.g., IBM India separated from T.J. Watson), but sometimes this was very hard and I’m sure I made several mistakes.
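
For the curious, here is a minimal sketch of the kind of de-duplication script I am describing, assuming the author names have already been extracted into a plain list; the names, threshold, and matching rules below are purely illustrative, not the exact ones I used.

    import unicodedata
    from difflib import SequenceMatcher
    from itertools import combinations

    def to_ascii(name):
        """Map non-ASCII characters to their closest ASCII representation."""
        return (unicodedata.normalize("NFKD", name)
                .encode("ascii", "ignore")
                .decode("ascii"))

    def suspicious_pairs(names, threshold=0.9):
        """Yield pairs of names that look like they belong to the same person."""
        norm = {n: to_ascii(n).lower().strip() for n in names}
        for a, b in combinations(sorted(set(names)), 2):
            ratio = SequenceMatcher(None, norm[a], norm[b]).ratio()
            # Also flag "Xin C. Zhang" vs "Xin Zhang": same first and last token.
            ta, tb = norm[a].split(), norm[b].split()
            same_ends = ta and tb and ta[0] == tb[0] and ta[-1] == tb[-1]
            if ratio >= threshold or (same_ends and ta != tb):
                yield a, b, round(ratio, 2)

    authors = ["Xin Cheng Zhang", "Xin C. Zhang", "Xin Zhang", "Hebert Doe", "Herbert Doe"]
    for a, b, score in suspicious_pairs(authors):
        print(f"possible duplicate: {a!r} ~ {b!r} (similarity {score})")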

Long story short, there are inaccuracies in the data and bugs in the code.
Think about weather forecasts: people try very hard to get them right, and the results are not completely random, but I would not bet my career on them.

Community Size and Number of Submissions


At the moment, the dataset includes 68 venues, for a total of 2675 papers written by 4460 authors from 699 affiliations located in 51 different countries.

Let’s start by looking at the number of submissions. One of the main comments you hear over and over if you work in this area is that the number of submissions to the Top-4 is increasing every year and that it is therefore getting harder and harder to get a paper accepted.

So, let’s check whether this is true.

./submission_tot.png ./acceptance_rate.png

These figures clearly show two things. The first is that the number of submissions is indeed rapidly increasing - certainly more than linearly! The second is that, as the graph on the right shows, the acceptance rate is not going down: the chairs are doing an exceptional job of keeping the rate in an acceptable range (which I consider to be between 10 and 20 percent).

However, this came at a price: all the Top-4 are increasing in size and are reducing the time slots for presentations. Unfortunately, increasing the size of the conferences is only a temporary solution and does not address the core of the problem. Adding one more top conference (as they are now trying to do with Euro S&P) could also be beneficial, but only if it distributes the load instead of competing for a spot in the loop. Another solution could be to have multiple tracks (with separate committees) like WWW - but none of the security conferences seems to be going in this direction.

Next, let’s look at who is submitting all those papers. Here the general feeling is that the system security community is growing. However, this could mean more groups working in system security, or that the main groups are increasing in size and therefore submitting more and more papers.

./new_auth.png ./active.png ./new_affl.png

The left graph confirms that the community is growing, with more and more people joining the field and publishing in the Top4. But students graduate and move on, so this could just be the result of the natural PhD cycle. So, in the second graph I plotted the number of active researchers - i.e., those people who have published at least once (in the Top4) in the past two years. This confirms that the community is definitely increasing in size.
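
As a side note, here is a minimal sketch of how the active-researcher count can be computed, assuming the papers have already been loaded as (year, author list) tuples; the data structure and the toy example are illustrative only.

    from collections import defaultdict

    def active_per_year(papers, window=2):
        """Return {year: number of authors with a paper in the last `window` years}."""
        by_year = defaultdict(set)
        for year, authors in papers:
            by_year[year].update(authors)
        years = sorted(by_year)
        return {
            y: len(set().union(*(by_year[y - d] for d in range(window) if (y - d) in by_year)))
            for y in years
        }

    papers = [(2015, ["alice", "bob"]), (2016, ["bob", "carol"]), (2017, ["dave"])]
    print(active_per_year(papers))  # {2015: 2, 2016: 3, 2017: 3}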

What I found more surprising is the plot on the right. Until 2005 the number of new institutions was decreasing, meaning that we were reaching a plateau in which the new people were part of organizations already active in the area. Between 2005 and 2010 the number increased at a slow pace, but after 2010 we observed a rapid increase in the number of new organizations, showing that the community is growing also because more companies and universities are joining the circus every year.

./authors_x_paper.png ./single_affl.png

More organizations also means more collaborations. The average number of authors per paper is increasing (blue line in the left graph), while the number of papers written by a single institution is decreasing (right graph). The trophy for the largest number of authors goes to "Manufacturing Compromise: The Emergence of Exploit-as-a-Service" with 18 authors from six different institutions (the red line in the left graph shows the max per year). At the other end of the spectrum, I counted 97 single-author papers (3.6%) in the dataset. This small club is led by Niels Provos and Peter Gutmann with four single-author papers each.
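
For completeness, here is a small sketch of how the per-year statistics in the left graph (average and maximum number of authors per paper) and the single-author count can be derived; again, the (year, author list) representation and the example data are assumptions.

    from collections import defaultdict
    from statistics import mean

    def author_count_stats(papers):
        """papers: list of (year, [author, ...]). Returns {year: (avg authors, max authors)}."""
        counts = defaultdict(list)
        for year, authors in papers:
            counts[year].append(len(authors))
        return {y: (round(mean(c), 2), max(c)) for y, c in sorted(counts.items())}

    def single_author_papers(papers):
        """Papers written by exactly one author."""
        return [p for p in papers if len(p[1]) == 1]

    papers = [(2016, ["alice"]), (2016, ["bob", "carol", "dave"]), (2017, ["erin", "frank"])]
    print(author_count_stats(papers))         # {2016: (2, 3), 2017: (2, 2)}
    print(len(single_author_papers(papers)))  # 1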

Conclusion? Yes, the system security community is growing fast.
New researchers and institutions join the field every year. This also translates into an increasing number of submissions to the top conferences, which so far are responding by accepting more papers. This solution may not scale well in the long run.. let’s see.

A bit of Geography

./us_vs_world.png ./country_bars.png

Now this is more interesting. Ten years ago the system security field was dominated by the US. Since then, however, the rest of the world (Europe in particular) has been growing, and international collaborations are now very common, accounting for more than 20% of the papers.

Despite this, the plot on the right shows that - in terms of number of co-authored papers - the US is still very much ahead of everyone else.

The United States is still leading, but the rest of the world is slowly catching up.
More importantly, more and more papers are the result of international collaborations.. and this is a good thing.

Program Committees


To respond to the increasing number of submitted papers, all conferences adopted larger committees. Since 2000, NDSS has doubled in size, Oakland has tripled, and the CCS and Usenix technical committees have become more than five times larger.

Overall, 738 people served on the Top4 committees. Around 25% of them never published a paper in these venues in the same period. I guess this is good (different points of view, more people from industry, …) and bad (are they all able to judge the quality of a paper?) at the same time. On the other hand, 13% of the top-200 authors never served on any committee. Why not? Is no one calling them, or are they not interested? We need to find out..

./committees.png ./tpc_overlap.png

As a result of the increase in the size of the PCs, the workload for the reviewers has remained more or less constant, as shown in the left picture above. If we count an average of three reviews per paper (2 for early rejects, 4+ for the controversial papers), the average is around 21 reviews per PC member. Still quite a considerable amount of work done (for free) for the community. NDSS in particular had an almost constant increase, and it is now one of the most heavy-duty conferences in the circus (while it was by far the lowest in 2000).
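
To make the back-of-the-envelope estimate explicit: the per-member load is just the number of submissions times the reviews per paper, divided by the PC size. The numbers in the example below are illustrative, not taken from the dataset.

    def reviews_per_member(submissions, pc_size, reviews_per_paper=3):
        """Rough per-reviewer load: total reviews spread evenly over the PC."""
        return submissions * reviews_per_paper / pc_size

    # e.g., a hypothetical venue with 420 submissions and a 60-person PC
    print(reviews_per_member(submissions=420, pc_size=60))  # 21.0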

The right picture instead shows how each program committee changed over the years. With some ups and downs, the overlap is typically between 20% and 50% - which I believe is a good trade-off, with enough fresh blood and enough previous members to ensure some continuity in the reviews.
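
One plausible way to compute this overlap (the exact definition in the plot may differ) is the fraction of the current committee that already served the previous year, sketched below with made-up member names.

    def pc_overlap(pcs):
        """pcs: {year: set of PC member names}. Returns {year: overlap with previous year}."""
        years = sorted(pcs)
        return {
            y: len(pcs[y] & pcs[prev]) / len(pcs[y])
            for prev, y in zip(years, years[1:])
        }

    pcs = {2015: {"alice", "bob", "carol"}, 2016: {"bob", "carol", "dave", "erin"}}
    print(pc_overlap(pcs))  # {2016: 0.5}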

For the moment, the Top4 have been able to cope with the increasing number of submissions by increasing the size of their technical committees and by accepting more papers. But is this sustainable in the long term? What else can we do?
Oakland is still trying to discourage multiple submissions from the same authors.. but I’d love to have some data to see if this is a real problem. From 2017, they plan to move to a VLDB-like process. Check here for more information.

Authors


If you are curious about who the most influential researchers in the field are (just according to paper count!), here is the current top ten:


Rank  Name                 Papers  Active  Max  Chair  TPC  Slams  Venues  Top4  Co-Authors  Avg
1     Dawn Song                65   00-16    4      0    9      3      44     4         114  4.18
2     Christopher Kruegel      61   02-16    4      1   20      3      38     4          92  4.97
3     Wenke Lee                53   01-16    3      2   28      5      38     4          89  4.79
4     Giovanni Vigna           52   02-16    4      2   18      2      35     4          82  5.06
5     XiaoFeng Wang            46   03-16    5      0   13      2      29     4          88  5.35
6     Michael K. Reiter        44   01-16    4      1   26      0      30     4          63  3.59
7     Adrian Perrig            40   00-16    4      2   18      1      29     4          66  3.98
8     Vern Paxson              39   00-16    2      2   17      1      30     4          90  5.38
9     Dan Boneh                38   01-16    2      2   16      0      29     4          93  4.32
10    Michael Backes           37   03-16    5      1   15      1      25     4          65  3.72

The complete list (with sortable columns) that includes everyone with at least two publications is available here.

Legend:

  • active : the first and last year in which the author published a paper in the circus

  • max : max number of papers in the same venue

  • chairs : number of Top-4 venues for which the person was PC Chair

  • slams : number of times in which the author had a paper in all the Top-4 conferences in the same year.

  • tpc : number of technical program committee memberships

  • venues : total number of venues in which the author had a paper

  • top4 : in how many of the Top-4 the authors had a paper (1 to 4)

  • co-authors : total number of co-authors in Top-4 publications

  • avg : average number of authors per publication

By playing with the full table you can notice several interesting things.
Like the fact that the maximum number of papers an author got in a single conference is 5 (but 14 people got 4!). Dawn Song is leading the ranking and also the list for the highest number of co-authors (114). Wenke Lee served in 28 program committees and he is the professor with the highest number of slams. I count only 5% of women in the top-100 (we need more!!) and only 13% of researchers based in Europe (we need more!!).

Overall, 334 people published more than five papers in the Top-4. As a comparison, The System top 50 reports 53 people over this threshold in the systems community. That number was computed over 33 venues (this document covers 68).

./max_papers_per_person.png

This final graph shows that over the past 17 years, the maximum number of papers a researcher had in a single venue has been increasing. This could have several explanations.. but it is probably a consequence of the fact that successful groups are getting larger and larger (it is not rare to hear of groups with 15 or more Ph.D. students in security).

Affiliations


Now, let’s have a look at the top ten affiliations:

Rank  Name                                       Papers  Coverage  Size  Country
1     Carnegie Mellon University                    174     83.8%   162  USA
2     University of California - Berkeley           154     92.6%   129  USA
3     Microsoft Research - US                       138     69.1%   103  USA
4     Georgia Institute of Technology               102     64.7%    91  USA
5     Stanford University                           100     77.9%   110  USA
6     University of Illinois - Urbana Champaign      81     63.2%    86  USA
7     University of California - Santa Barbara       80     66.2%    70  USA
8     University of Maryland - College Park          78     51.5%    69  USA
9     University of California - San Diego           77     63.2%    79  USA
9     University of Michigan - Ann Arbor             77     58.8%    81  USA

(the complete list with sortable columns is available here)

Legend:

  • coverage : percentage of venues in which the institution had at least one paper

  • size : number of Top-4 authors affiliated with the institution

A few things worth noting:

  1. Every single entry in the top ten is from the US (the first institution outside the US is ETH in position #18)

  2. Berkeley had a paper in almost every single venue (over 90%)

  3. There is only one company in the top ten and nine universities (3 of them part of the University of California). However, this is a consequence of the fact that I geographically split some large corporations into different branches. If you sum all their research centers, you get enough publications to move IBM to 4th place and Microsoft to second.

  4. CMU is the institution with the highest number of authors.

  5. Where are the cyber-security companies? (RSA is first in position #31)
    Being a "system" community, it would be nice to see a mix of academia and industry. Is our community still too small? Are the companies only interested in BlackHat and Virus Bulletin? Or are Microsoft, IBM, and Google our main security companies?
    It is sad, but I am afraid that in system security academia thinks companies don’t do research, and companies think academic papers are only theoretical exercises and that professors are completely unaware of what happens in the real world and of what has been previously done outside the closed world of scientific conferences. I hope I am wrong.

Even though there are almost 700 unique affiliations in the dataset, the cumulative distribution function shows that a few institutions are basically running the show. In fact, the top 10% of the institutions account for almost 80% of the papers (just in case you were looking for another example of a Pareto distribution).

./institutions_cdf.png
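
This kind of concentration check takes only a few lines of code, assuming we already have the number of papers credited to each institution (collaborative papers count once per institution, so the shares are over paper credits rather than unique papers); the toy data below is made up.

    def top_share(paper_counts, fraction=0.10):
        """Fraction of paper credits held by the top `fraction` of institutions."""
        counts = sorted(paper_counts, reverse=True)
        k = max(1, int(len(counts) * fraction))
        return sum(counts[:k]) / sum(counts)

    # toy, heavily skewed data: 5 prolific institutions and 95 occasional ones
    paper_counts = [174, 154, 138, 102, 100] + [3] * 95
    print(round(top_share(paper_counts), 2))  # 0.72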

Groups and Collaborations


Another fun thing we can do with this data is build a giant graph with all the authors as nodes and with edges showing if two researchers had joint publications. Unfortunately, the graph contains over 4.4K nodes and almost 15K edges, so it is not very easy to plot.

What is interesting is the fact that almost 85% of the authors are part of a giant connected cluster! In other words, you can move from co-author to co-author and reach almost everyone in the field. There are only five people who published more than three papers and are not connected to this giant component (last year there were nine).

The largest clique (i.e., the largest subgraph in which everyone is directly connected with everyone else) corresponds to the 18 authors of the "Manufacturing Compromise: The Emergence of Exploit-as-a-Service" paper. Interestingly, there is nothing larger than that.

If we want something more compact, we can reduce the graph by keeping only the authors who published a minimum number of papers. With a threshold of nine papers, we obtain a single connected graph with 166 nodes, 640 edges, and a diameter of 7. Still too crowded for a nice picture.
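
The sketch below shows how this kind of analysis could be done with networkx (an assumption on my part; I am not claiming this is the exact tooling behind the plots): build the weighted co-authorship graph, extract the giant component, and keep only the strong collaborations. All names in the toy example are invented.

    from itertools import combinations
    import networkx as nx

    def coauthor_graph(papers):
        """papers: iterable of author lists, e.g. [["alice", "bob"], ...]."""
        G = nx.Graph()
        for authors in papers:
            for a, b in combinations(sorted(set(authors)), 2):
                weight = G[a][b]["weight"] + 1 if G.has_edge(a, b) else 1
                G.add_edge(a, b, weight=weight)
        return G

    def giant_component(G):
        """Largest connected component as a standalone graph."""
        return G.subgraph(max(nx.connected_components(G), key=len)).copy()

    def strong_collaborations(G, min_joint_papers):
        """Keep only edges representing at least `min_joint_papers` joint papers."""
        keep = [(a, b) for a, b, d in G.edges(data=True) if d["weight"] >= min_joint_papers]
        return G.edge_subgraph(keep).copy()

    papers = [["alice", "bob"], ["alice", "bob"], ["bob", "carol"], ["dave", "erin"]]
    G = coauthor_graph(papers)
    print(giant_component(G).number_of_nodes())                        # 3 (alice, bob, carol)
    print(list(strong_collaborations(G, min_joint_papers=2).edges()))  # [('alice', 'bob')]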

If we want the main groups to emerge from the graph, we need to follow a different approach to remove the noise. For instance, we can emphasize the strong collaborations by limiting the analysis to authors with at least five papers in common (and a total of 10 each). This results in 91 nodes, where you can start seeing some interesting clusters: (big picture here)

./main_groups.png

Still too big for your taste?
Ok, this is what you get when you plot only people with at least ten papers in common. The graph shows the most productive collaborations between professors in our field.

./connected_graph_10.png

FAQ


  • It would be great to integrate some Google Scholar information in this page
    Indeed. Getting the number of citations for each paper would open the door to plotting plenty of new interesting stats. If you know someone at Google who can let me do the query somehow.. please let me know.

  • I noticed a mistake in the authors or affiliations list
    Great, just send me an email and I’ll fix it and re-run the stats.

  • Wait a second. Why didn’t you use DBLP? TopResearcher uses DBLP to extract the list of all the papers in several top conferences, including the Top4. However, over the same number of years, DBLP reports fewer papers for many authors. The difference is probably due to wrong spellings, but to be safe I preferred to re-parse the names myself.
    Moreover, I also wanted all the affiliations for each author, and that information was not available in DBLP.

  • It would be nice to have dynamically-generated pages and interactive plots
    Yes, I agree. What would be even nicer is to use a library to nicely display large customizable graphs. Something like yFiles would be fantastic for this purpose. But it is not free even for academia. So, unless you want to give me a license, this is not going to happen :(

  • Can you share your data/code?
    Yes, my plan is to find some time to polish everything up and then put the code and data on github. However, keep in mind that I do this in my spare time, that my spare time is very limited, and that I prefer to use it to add new data or new analyses rather than to polish ugly code. So.. don’t hold your breath.

  • When will you update this page again?
    Next year after CCS publishes the accepted papers.

  • You know what it would be also nice to check…
    Please, tell me. I love plotting graphs :D