Sunday, August 16, 2015

Digital Discussions

Every so often I write about software ideas that I would love to see developed, but I'm pretty sure that I won't get the chance to work on them myself. This is another one of those posts, but this time I thought I would mix it up with some of the underlying analysis, since that is a rather poorly understood part of software development.

As the first stage in development, analysis is often where projects go horribly wrong. If you don't know what you are building or it's just a disorganized collection of "stuff" then the design and development stages are unlikely to get it back on the rails again. It's a trainwreck. All of the best code in the world is useless if it is the wrong code for the problem.

The problem I would like to explore is online discussions. Obviously we have a long history of discussing, arguing and debating our different views in the non-digital realm. Once software developers set upon moving that to computers they seemed to focus only on the contributors. That is, they made it easy to join an online discussion and take part, but not so easy to read it back later. We do have blogs with comments, email groups and some dedicated discussion sites, but each one of these is optimized for the writer.

Trying to follow an interesting discussion online is difficult because of the loose structuring and constant noise. There are some quiet, low traffic sites that I quite like, but generally once a discussion becomes popular, the ability to read it and learn from it gets drastically reduced. That seems backwards.

So, we want to keep the site relatively easy for the writer, but also to capture enough structure of the discussion that the readers can filter out the content to whatever depth they find useful. The key word here is 'structure'. We want to collect more meta-information around the discussion that would help people to navigate it afterwards.

The earliest online discussions were essentially just lists of text. This only works well for a small amount of text. Many sites moved to simple trees that tied the children to a parent so the reader could get some sense about how the conversation fragmented. Later people began adding ratings, so that group-designated noise could be ignored. These changes helped somewhat but did not go far enough.

Looking at how things evolved we see that there are at least two different aspects of the conversational structure. One is trying to connect responses back to their origin, and the other is trying to filter out poor quality content like trolls. Let's tackle that second problem first.

A good discussion generally involves people with expertise on a topic. There are always people with more relevant contributions, and for most readers that forms the core of the discussion. The problem is that to distinguish a good comment from a bad one requires intelligence and knowledge of the topic. Right now, that is something that only a human can do, not a computer. Moderators are sometimes used, but an active discussion can demand so much effort to moderate fully that it becomes impossible. What we would like to do is to tap into intelligence, but distribute it so that it does not become a burden.

When I am in a discussion with people, I usually have a definite opinion about who is currently forming the core. I'll accept that view as being quite common, so that gets us to the point where we might expect that many of the participants have this same opinion. If we mix in a bit of selectivity, then we can leverage this by having each discussion start with a very small number of contributors. One of the originators decides to invite several others to talk about a topic. As the discussion continues, any of that original group can add in new members. To balance out the potential for mistakes, several combined members can kick anyone out of the discussion.
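To make the invite and kick rules a little more concrete, here is a rough Python sketch. Everything in it is my own illustration rather than a finished design: the DiscussionGroup class, its method names and the kick_quorum threshold are all assumptions. In particular, I have guessed that "several combined members" means a fixed number of votes, and that any current member (not only the founders) may invite.

```python
from dataclasses import dataclass, field


@dataclass
class DiscussionGroup:
    """Invite-only set of contributors (hypothetical model)."""

    members: set[str]
    # Assumed reading of "several combined members": a fixed vote count.
    kick_quorum: int = 3
    # target -> set of members who have voted to remove that target
    kick_votes: dict[str, set[str]] = field(default_factory=dict)

    def invite(self, inviter: str, newcomer: str) -> None:
        """Any current member may add a new contributor (an assumption)."""
        if inviter not in self.members:
            raise PermissionError(f"{inviter} is not a member")
        self.members.add(newcomer)

    def vote_to_kick(self, voter: str, target: str) -> bool:
        """Record one vote; return True once the target has been removed."""
        if voter not in self.members or target not in self.members:
            raise PermissionError("voter and target must both be members")
        votes = self.kick_votes.setdefault(target, set())
        votes.add(voter)  # a set, so repeat votes by one person count once
        if len(votes) >= self.kick_quorum:
            self.members.discard(target)
            del self.kick_votes[target]
            return True
        return False
```

A group would be created with its handful of founders, grow through invite(), and shrink only when enough distinct members agree through vote_to_kick(). The right quorum is a policy question that the sketch does not try to answer.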

That's fine and all, but it still allows for a specific subgroup to hijack the conversation. Rather than prevent this, we just channel it into something more positive, so what we also need is a way for parallel discussions to exist. At any point, anyone can start a parallel discussion with a completely different group of people that intertwines with the original discussion. That hopefully reduces any incentives to derail the original conversation.

For example, a group of three well-known physicists starts talking about the viability of quantum computers. A parallel discussion gets formed by some PhD students and another one by just techies with an opinion. If the reader wants to see the collective discussion it can be presented as essentially one big tree or even a list. However, if they are just curious about the first two groups, they can filter out the other stuff. In a sense they can traverse as deeply into the parallel groups as they want if they're interested in how the topic is going to be perceived by the masses, or they could just focus on the meatier parts.

Obviously parallel groups need to attach to the contributions in their parent discussions, but there doesn't need to be a cap on depth. Groups could form off the original discussion, and other groups could form off of those, and so on. The people taking part in the discussion have formed themselves into a tree of sets of invited contributors, structured by how the contributors feel about each other, and implicitly about how they feel about the quality of their contributions.
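The tree of groups can be sketched in a few lines of Python. The Group class and its spawn() and walk() methods are hypothetical names of my own, and for brevity the sketch attaches a parallel group to its parent group as a whole rather than to one specific contribution. The reader's filter is simply a limit on how deep to walk.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Group:
    """One set of invited contributors in the tree (hypothetical model)."""

    name: str
    members: set[str]
    children: list["Group"] = field(default_factory=list)

    def spawn(self, name: str, members: set[str]) -> "Group":
        """Start a parallel discussion that hangs off this one."""
        child = Group(name, members)
        self.children.append(child)
        return child

    def walk(self, max_depth: int, depth: int = 0) -> Iterator[tuple[int, "Group"]]:
        """Yield (depth, group) pairs, going no deeper than max_depth."""
        yield depth, self
        if depth < max_depth:
            for child in self.children:
                yield from child.walk(max_depth, depth + 1)
```

For the quantum computer example, and assuming purely for illustration that the techies formed their group off the students' one, the physicists would be the root, the students would be spawned from it, and the techies from the students. Walking with max_depth=1 then shows only the first two groups, while a larger limit pulls in everything.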

It becomes easy to see that most people wouldn't accept invitations from trolls and that at the margins of the discussion irrelevant material can easily be ignored. But at the same time, nobody is explicitly precluded from joining in, they might just not get significant readership because of what they say.

This extra structure is a bit more work for the contributors, but it naturally fits in with how people are drawn in and out of conversations. It also controls the noise.

The other aspect is also about structure. The simple tree structures that currently exist are a crude means to connect the sub-threads of a discussion, but once again we are not collecting enough structural information to be really useful.

Most conversations start by either making some explicit points about a topic or asking a series of questions. The question/answer structure is fairly powerful, which is why we see it used so often for interviews. It binds together loose threads, without having to fill the space with a lot of transitions.

Given both the point/counter-point and question/answer relationships, what we would like to do is make them explicit. To do this we replace the unformatted blocks of text with these response types. That is, any entry in the discussion is either a point or a question and the writer has to indicate which.

Still, since the underlying type is text, people wouldn't be particularly disciplined in their contributions. We have to expect that. We have to build on that.

Thus a discussion would start with one person choosing their 'point' or 'question' and typing in their text. From there, any responder would highlight some 'subportion' of that text, choose point or question again and then type in their response.

Thus, much like Quora, the discussion could start with a question. But there might be multiple questions, and sub-points implied in the text. Each respondent could highlight a word, sentence, paragraph or whatever and then attach their points and questions to just that part. The discussion could also start with an essay. People could question the assertions or make a point about terminology usage. They could address three different issues independently, or just keep them all together in the same point. The writers would all find their own ways of utilizing this new structure.

Often in discussions people include both positive and negative references. So besides points and questions there would also be links. To further help, each of these three basic response types can be flagged as positive or negative. If someone wrote a couple of paragraphs that you mostly believe, but there was a flaw with some terminology, you could add in a negative point about that specific issue, while also adding an overall point about how you agree, and adding a link that substantiates a couple of their points.
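As a rough sketch of what one entry might look like as data, here is a small Python model. The names (Kind, Polarity, Entry, reply, highlighted) are mine and purely illustrative. Storing the highlight as a pair of character offsets into the parent's text is one assumed representation among several possible ones.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Kind(Enum):
    POINT = "point"
    QUESTION = "question"
    LINK = "link"


class Polarity(Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"


@dataclass(eq=False)  # identity equality avoids recursing through parent links
class Entry:
    """One typed contribution in a discussion (hypothetical model)."""

    author: str
    kind: Kind
    polarity: Polarity
    text: str
    parent: Optional["Entry"] = None
    # Character offsets (start, end) into parent.text; None means no highlight.
    span: Optional[tuple[int, int]] = None
    replies: list["Entry"] = field(default_factory=list)

    def highlighted(self) -> Optional[str]:
        """Return the portion of the parent's text this entry responds to."""
        if self.parent is None or self.span is None:
            return None
        start, end = self.span
        return self.parent.text[start:end]

    def reply(self, author: str, kind: Kind, polarity: Polarity,
              text: str, highlight: Optional[str] = None) -> "Entry":
        """Attach a typed response, optionally to a highlighted substring."""
        span = None
        if highlight is not None:
            start = self.text.find(highlight)  # first occurrence only
            if start < 0:
                raise ValueError("highlight does not occur in the parent text")
            span = (start, start + len(highlight))
        child = Entry(author, kind, polarity, text, parent=self, span=span)
        self.replies.append(child)
        return child
```

The two enum arguments correspond to the two choices the writer makes, and the highlight is optional so that a response can still address the entry as a whole.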

It's not that much more work than hitting reply and writing one big, hard to read, textual entry. Just highlight a portion of the original text, fill out two dropdowns and then go to town.

Being able to strip out all of the questions really sets the breadth of the conversation. Being able to see how the topic breaks into threads and knowing who is on which side sets the context. You might only want to see the positive points, or the negative ones. It offers a lot of depth for the reader to filter down quickly to get knowledge.
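Once each entry carries a type and a positive or negative flag, the reader's filters become very simple. The Python below is only a minimal sketch: the Entry class here is a deliberately stripped-down stand-in that uses plain strings for the two flags, and walk() and select() are hypothetical helper names.

```python
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass(eq=False)
class Entry:
    """Stripped-down stand-in for a typed discussion entry."""

    kind: str      # "point", "question" or "link"
    polarity: str  # "positive" or "negative"
    text: str
    replies: list["Entry"] = field(default_factory=list)


def walk(entry: Entry) -> Iterator[Entry]:
    """Visit an entry and everything below it, parents before replies."""
    yield entry
    for reply in entry.replies:
        yield from walk(reply)


def select(root: Entry, kind: Optional[str] = None,
           polarity: Optional[str] = None) -> list[Entry]:
    """Keep only entries matching the given flags; None means 'any'."""
    return [
        e for e in walk(root)
        if (kind is None or e.kind == kind)
        and (polarity is None or e.polarity == polarity)
    ]
```

Calling select(root, kind="question") strips out all of the questions in one pass, and select(root, kind="point", polarity="negative") gives just the objections.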

Getting back to the quantum computer example, if you are curious as to why some experts believe it is not possible, you can extract these points, even if there are whole sub-topics that you don't understand. You might zero in on a sub-discussion between students on the current definition for a part of the theory that is confusing. All of the extra structure helps the reader navigate what has been written.

The value of software comes mostly from the underlying data that is collected. It's faster to just collect the central independent points and dump them into a simple list, but it is far more useful to collect higher level structural relationships as well. There is however a danger with this approach. It is easy to just create new arbitrary response types, for example. But if these are fuzzy or overloaded, they might actually distort the structure and cause a new set of problems. Because of that, structural modelling needs to justify any sort of categorization or complex relationship. The right ones provide value, the wrong ones add noise.

From a data-structure perspective the above is most likely a graph with polymorphic nodes, although it might also be acyclic. What that means in practice is that care must be taken to ensure that the performance remains reasonable as the discussions grow. Obviously implementing an indexed list is easier and has no such concerns. Too often programmers choose that easy route at the expense of their users. In the earlier generations of software this was not surprising, but given that we've already written this stuff multiple times, the advancement of our industry requires that we start tackling the more complex problems. Sophisticated user functionality is difficult to implement well, but it is also a far more appropriate solution.
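One way to keep the reader's filters cheap as a discussion grows is to maintain a secondary index alongside the graph, so that asking for "all of the questions" does not mean walking every node. The Python below is a sketch under my own assumptions, not a performance-tested design: Node, Point, Question, Link and DiscussionGraph are all hypothetical names, and each node is limited to a single parent to keep things short.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """Base type for the polymorphic nodes in the graph."""

    node_id: int
    text: str


class Point(Node):
    """A statement made about the topic."""


class Question(Node):
    """A question raised about the topic."""


class Link(Node):
    """A reference to outside material."""


class DiscussionGraph:
    def __init__(self) -> None:
        self.nodes: dict[int, Node] = {}
        # parent id -> ids of its direct replies
        self.replies: dict[int, list[int]] = defaultdict(list)
        # node class -> ids of that class, updated on every insert
        self.by_type: dict[type, list[int]] = defaultdict(list)

    def add(self, node: Node, parent_id: Optional[int] = None) -> None:
        """Insert a node, optionally as a reply to an existing node."""
        if node.node_id in self.nodes:
            raise ValueError(f"duplicate node id {node.node_id}")
        if parent_id is not None and parent_id not in self.nodes:
            raise KeyError(f"unknown parent id {parent_id}")
        self.nodes[node.node_id] = node
        self.by_type[type(node)].append(node.node_id)
        if parent_id is not None:
            self.replies[parent_id].append(node.node_id)

    def of_type(self, node_type: type) -> list[Node]:
        """Look up nodes of one class via the index, without a traversal."""
        return [self.nodes[i] for i in self.by_type.get(node_type, [])]
```

Because a node can only attach to a parent that already exists, this particular sketch cannot form a cycle. Allowing an entry to respond to several others at once would need a more general edge structure.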

Putting it all together, this design looks very much like a mix between Quora, discussion sites, blogs with comments and if you add privacy, even email. It's a little more effort for the writers, but it now captures enough structural data that it is a huge improvement for the readers. Unfiltered, it is the usual chaos, but with minimal work you can utilize this new data to focus in on extracting significant knowledge.

The general context is that in software we tend to focus too heavily on the independent data points, easily missing the more valuable structural relationships connecting these points. Collecting data used to be expensive, which no doubt resulted in us cheating. Now that data costs have come way down, we need to revisit the ways we analyse problems and then model the data. It's not enough to just zero in on the trivially obvious aspects anymore. The next generation of systems should be ones where we pursue complex data models to achieve sophisticated functionality that goes deeper into solving the user's problems. Code is nice, data is useful, but really deep knowledge comes from the sophistication of the data model.