Ok, when did you first become aware of data management? No, it wasn’t in your database class in college. It wasn’t when you learned SQL or data modeling. It was back when you were a kid,
collecting baseball cards. You became fascinated by all the statistics. You had to figure out how to sort the cards. By year? By player? By manufacturer?
I found my son’s cards made excellent examples for normalization exercises when I was teaching data modeling some years ago, in spite of the fact that, well, ok, I don’t really know as much about
baseball as I am supposed to as a loyal American.
Recently, as I was trying to encourage my now grown son to work with me in the data modeling field, I tried to model the game itself. To his credit, my son was very good at pointing out where the
model was flawed and showed me how to fix it. What I found interesting was that the kinds of errors I made initially and the kinds of fixes he proposed were very representative of the process we
all go through in creating our first data model. Since the example isn’t anything like the usual commercial example, it seemed worthwhile to present the process as a kind of case study about the
modeling process itself.
The Baseball Card
We begin with baseball cards, as shown in Figure 1. Here you see that there are two kinds of cards: a batter card, here represented by the card for Boston Red Sox player Dwight Evans, and a pitcher
card, here represented by the card for New York Yankee Lee Guetterman. (Yes, these cards are from my son’s youth, back in 1989. And no, we haven’t kept up…)
Figure 2 shows the back of two different cards for the same players. The fronts, above, are from Topps, while the backs are from Donruss. The statistics for the two kinds of cards are different,
since one is evaluating the performance of a pitcher in getting batters out, and the other is evaluating the performance of a batter who is trying to hit the balls thrown by the pitcher.
For the batter, then, you have measurements of his performance at the plate for each year. For example, from the Donruss card:
Note that the definitions of these statistics are not always as simple as the above definitions would suggest. For example, a hit is only a “Hit” if there were no errors, and if the time at bat
was not a “fielder’s choice”.*
Note also that these are only the statistics captured by the Donruss brand card. The Topps card also has “Slugging percentage” and “games started”, but not “steal”. Others can be calculated
from these. For example, the number of singles = hits – doubles – triples – home runs.
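The singles derivation is plain arithmetic. Here is a minimal Python sketch of it; the class and field names are my own invention for illustration, and the numbers are made up rather than taken from any card.

```python
from dataclasses import dataclass


@dataclass
class BattingLine:
    """One season of batting counts; field names are illustrative only."""
    hits: int
    doubles: int
    triples: int
    home_runs: int

    @property
    def singles(self) -> int:
        # A single is any hit that is not a double, triple, or home run.
        return self.hits - self.doubles - self.triples - self.home_runs


# Invented numbers, used only to exercise the formula.
line = BattingLine(hits=150, doubles=30, triples=5, home_runs=20)
print(line.singles)  # 95
```

Because singles are derived, they need not be stored at all; they can be recomputed whenever the four underlying counts are known.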
So, let’s build a data model. By data model, we are talking about a conceptual data model, or entity/relationship model. It is a description of
the game of baseball. It is not a database design. That is, it is about organizing the data, not about creating something that will be easy to report on. Indeed, as we
will see, some of these statistics are really challenging to describe in data structure terms.
The Model – Version 1
A data model describes the things of significance to the organization, about which we wish to hold information. These are called “Entity Classes”. In the baseball card example, the first thing of
significance is, of course, a player, as is shown in Figure 3. Other entity classes of interest include the team that he is on and the position that he is the holder of.
That is, each player must be on one and only one team and may be holder of one and only one position.
Note that the above sentence was taken directly from the relationship names on the diagram. Other sentences that can be derived from the diagram are: “Each team may be composed of one or
more players”, and “Each position may be held by one or more players”. Relationships are named according to the rules shown in Figure 4. Relationship names are properly prepositions,
not verbs. It is the preposition that is the part of speech that describes relationships. Verbs (commonly used) describe activities, which are more properly represented on a different kind of
Attributes of a player (the information to be held about him) of course include “Player number”, “First name”, and “Last name”. In addition, the attribute “Throws handed indicator” can be
“Left handed” or “Right handed”. The attribute “Bats handed indicator” can be “Left handed”, “Right handed”, or “Switch”.
In an early draft of this model, the line next to player was solid, asserting that each player must be holder of exactly one position. This turns out not to be true. In some cases, a
player may not have a position permanently assigned. Hence, the model had to be corrected. The whole point of producing a data model, after all, is to create a set of assertions that can be
validated by subject matter experts. Our job is to be wrong, so we can learn what is right.
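To make the mandatory and optional relationships concrete, here is a rough Python sketch of the Version 1 entity classes. It is only one possible rendering: the class names, field names, and the sample player are hypothetical, and a conceptual model does not dictate any particular code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ThrowsHanded(Enum):
    LEFT = "Left handed"
    RIGHT = "Right handed"


class BatsHanded(Enum):
    LEFT = "Left handed"
    RIGHT = "Right handed"
    SWITCH = "Switch"


@dataclass
class Team:
    name: str


@dataclass
class Position:
    name: str


@dataclass
class Player:
    player_number: int
    first_name: str
    last_name: str
    throws: ThrowsHanded
    bats: BatsHanded
    team: Team                           # mandatory: "must be on" one team
    position: Optional[Position] = None  # optional: "may be holder of" a position


# A made-up player with no permanently assigned position.
utility_player = Player(7, "Pat", "Example", ThrowsHanded.RIGHT,
                        BatsHanded.SWITCH, Team("Example Team"))
```

The `Optional` type on `position` is the code equivalent of the dashed line on the diagram: the correction my son asked for changes one annotation, not the whole structure.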
In this case, there is actually more work to do. Take a look at the model in Figure 3 again. Is it really true that a player can play on only one team? Or that he can play only one
position? Especially over time? But if we allow more than one in each case, these will become “many-to-many” relationships that will not sit well with our system designers. In addition,
it is useful to understand exactly what is going on each time a player plays a position for a team. That, in fact, is where statistics will be collected. Figure 5 shows player membership, the fact
that a player was on a team, and potentially holding one position, at a particular time. That is, each occurrence of a player membership must be held by one player, with a team,
and as a player of a position.
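A small Python sketch may help show how player membership resolves the many-to-many relationships. The date fields and the helper function are assumptions of mine (the text says only “at a particular time”), and the players and teams are placeholders.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class PlayerMembership:
    player: str                      # stands in for a player entity
    team: str                        # stands in for a team entity
    position: Optional[str]          # optional, as in the model
    begin_date: date                 # assumed attribute
    end_date: Optional[date] = None  # assumed attribute; None means still current

    def active_on(self, day: date) -> bool:
        return self.begin_date <= day and (self.end_date is None or day <= self.end_date)


def teams_on(memberships: List[PlayerMembership], player: str, day: date) -> List[str]:
    """Teams the player belonged to on the given day."""
    return [m.team for m in memberships if m.player == player and m.active_on(day)]


# One player, two teams over time: two membership rows, no many-to-many link.
history = [
    PlayerMembership("Player A", "Team X", "Pitcher", date(1986, 4, 1), date(1987, 10, 1)),
    PlayerMembership("Player A", "Team Y", "Pitcher", date(1988, 4, 1)),
]
```

Each row records one fact: this player, on this team, possibly in this position, during this period. That is also the natural place to hang statistics.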
Now, let’s add the baseball card, as shown in Figure 6. At first glance, it appears that each baseball card must be a report on one and only one player, and must be made by one
and only one card manufacturer. (The front sides of the cards shown above are by Topps, while the back sides are by Donruss.) Specifically, each baseball card must be to describe one player membership.
The sub-type structure shows that each baseball card must be either a pitcher card or a batter card. That is, an occurrence of a batter card, by definition, is also an occurrence of a baseball card.
All attributes (“Card number”, “Year”) of baseball card are also attributes of pitcher card and of batter card. Similarly, all relationships with baseball card (“To describe player
membership” and “Made by” card manufacturer) also are relationships to pitcher card and batter card. The attributes specific to pitcher card and batter card, however, are only
attributes of those entity classes and none other.
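Sub-types map fairly naturally onto class inheritance. The sketch below is one way to picture it in Python; the two sub-type attributes are hypothetical stand-ins, and the manufacturer and membership are simplified to strings.

```python
from dataclasses import dataclass


@dataclass
class BaseballCard:
    card_number: int
    year: int
    manufacturer: str        # "made by" one card manufacturer
    player_membership: str   # "to describe" one player membership (placeholder)


@dataclass
class PitcherCard(BaseballCard):
    earned_run_average: float = 0.0  # hypothetical pitcher-only attribute


@dataclass
class BatterCard(BaseballCard):
    batting_average: float = 0.0     # hypothetical batter-only attribute


card = BatterCard(card_number=1, year=1989, manufacturer="Topps",
                  player_membership="membership-1", batting_average=0.3)

print(isinstance(card, BaseballCard))       # True: a batter card is a baseball card
print(hasattr(card, "earned_run_average"))  # False: pitcher attributes stay on pitcher cards
```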
This still isn’t right. If you look at the Lee Guetterman card in Figure 2, above, you’ll see that in fact he played on two different teams over the years, so the card cannot simply show one
player membership. It’s time for another entity class, this time called baseball card line, as shown in Figure 7. Note that each occurrence of baseball card line is identified both by the
“Playing year” (represented by the octothorpe (#) next to “Playing year”) and the baseball card it is part of (represented by the small line across the part of relationship
role). Thus, the 1988 statistics will appear as separate baseball card lines on both the 1988 and 1989 Topps cards and on the 1988 and 1989 Donruss cards.
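The two-part identifier can be sketched as a composite key. In this Python illustration the card identifier (manufacturer plus card year) is an assumption made for brevity; the point is only that a line is unique within its card and playing year.

```python
from typing import Dict, Tuple

CardId = Tuple[str, int]     # (manufacturer, card year): a hypothetical card identifier
LineId = Tuple[CardId, int]  # (card, playing year): the two-part identifier of a line

card_lines: Dict[LineId, dict] = {}


def add_card_line(card: CardId, playing_year: int, stats: dict) -> None:
    """Store a line, rejecting a second line for the same card and playing year."""
    line_id = (card, playing_year)
    if line_id in card_lines:
        raise ValueError(f"duplicate line {line_id}")
    card_lines[line_id] = stats


# The 1988 season appears as a separate line on each of four cards.
for manufacturer in ("Topps", "Donruss"):
    for card_year in (1988, 1989):
        add_card_line((manufacturer, card_year), 1988, {})

print(len(card_lines))  # 4
```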
You can see the statistics we described previously as attributes of each of the sub-types. Actually, that is not quite true: the attributes above are from the Topps card, not the Donruss card as
was previously described. As we can see, the two different manufacturers don’t capture quite the same statistics. In addition, the Major League Baseball web site lists
37 batting statistics and 38 pitching statistics. It may be true that most 11-year-olds are not interested in the number of times a batter is hit by a
pitch or his “on-base percentage”, but then, who knows? Having the statistics “hard-coded” as attributes of baseball card line is simply not practical.
This leads us to the version shown in Figure 8. In this, a statistic is something to be measured, like “Earned run average”, or “Number of games played”. The fact that a statistic is captured
on a particular baseball card line (either a pitcher card line or a batter card line) is a line statistic. A statistic, then, can appear on multiple baseball card lines (from multiple
manufacturers), or it doesn’t have to appear on any of them. Note also that an actual player statistic value is for a single player membership, regardless of the number of baseball cards it may appear on.
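As a rough illustration of this more generic structure, the Python sketch below treats a statistic as data rather than as a hard-coded attribute. The names, the string placeholders, and the stored value are all invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Statistic:
    name: str  # e.g. "Earned run average"


@dataclass(frozen=True)
class LineStatistic:
    card_line: str        # placeholder for a baseball card line
    statistic: Statistic  # the statistic printed on that line


# Player statistic values, keyed by (player membership, statistic name).
# Each value is stored once, however many cards print it.
values: Dict[Tuple[str, str], float] = {}


def printed_value(line_stat: LineStatistic, membership: str) -> Optional[float]:
    """Look up the single stored value behind a printed line statistic."""
    return values.get((membership, line_stat.statistic.name))


games = Statistic("Number of games played")
values[("membership-1", games.name)] = 100.0  # invented value

on_card_a = LineStatistic("card A, 1988 line", games)
on_card_b = LineStatistic("card B, 1988 line", games)
```

Adding a 38th pitching statistic now means adding a row, not changing the structure of baseball card line.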
The Model – Version 2
The problem with the above model is that it doesn’t tell us anything about the nature of the statistics. Some of them listed above were shown as “complex”, which means that they are not derived
from a simple manipulation of other statistics. An “Earned run”, for example, is a run scored by a player who got on base without benefit of an error. This means not only does it not count if an
error is made in fielding a ball that he hit, but if an error prevents an out that would have retired the side, no hits after that point are “earned”. This definition is not reflected in Figure
What is required is for us to model the game itself. This begins with Figure 9. In this, our player membership is on a team that is now shown to be located in a stadium. In this case stadium refers
to a home stadium. The Houston Astros play in Houston, Texas at Minute Maid Park, while the Texas Rangers play in Arlington, Texas at Ameriquest Field.
Normally, a game is played at one of these stadiums between exactly two teams, with the team located in that stadium by definition being the home team for the game, while the opposing team is the
away team for that game. That is, there is a business rule that if a team is the home team for a particular game, it must be located in the stadium that is the site of the game.
There are two kinds of exceptions to this. First of all, the All Star Game is played once a year between teams representing all teams in each league. In this case, neither team is located in one
stadium. Hence the dotted line on that relationship, although a business rule states that teams that are members of the leagues must be located at one stadium. As it
happens, one team’s stadium is chosen for the location of the game (a game is played at one stadium), and the team of the league that the team belongs to is designated the home team for that game.
The relationship that each game may be played at one stadium is for circumstances like this where you cannot assume that the game is played at the stadium that is the location of the team
that is the home team in the game.
Also, sometimes games are played outside the United States, in which case a decision is made as to which team is home team for the purposes of that game. Again, you cannot assert that the stadium
of that team is where the game is being played. Again, it is useful to be able to independently assert that the game is played at a (for example Tokyo) stadium.
Each game is composed of one or more half innings, where each half inning is the fact that a particular team is at bat. If the “Top/bottom indicator” is “Top” then the team which is
the away team for the game is at bat. If it is “Bottom”, then the home team is at bat.
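The rule for who is at bat is simple enough to state as a function. This is just a sketch in Python; the function name and the use of plain strings for teams are my own choices.

```python
def team_at_bat(top_bottom_indicator: str, home_team: str, away_team: str) -> str:
    """Return the team batting in a half inning, per the "Top/bottom indicator"."""
    if top_bottom_indicator == "Top":
        return away_team
    if top_bottom_indicator == "Bottom":
        return home_team
    raise ValueError(f"unknown indicator: {top_bottom_indicator!r}")


# Sample values only: in the bottom of an inning the home team bats.
print(team_at_bat("Bottom", home_team="Boston Red Sox", away_team="New York Yankees"))
```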
The game is played when a succession of players from the team that is at bat step up to “home plate” to attempt to hit a ball thrown by the pitcher. That is, one player is the batter for
one or more plate appearances, and each of these plate appearances is the occasion for one or more pitches by the player who is the pitcher for that pitch. Played position records
the fact that various of the players (typically those in the field) are assigned to play a particular position during that pitch. That is, each played position during a pitch must
be played by a player and must be the playing of a position. In addition, played position can record who the opposing pitcher is during that pitch. Note that the position
embodied in a particular played position during one pitch may not be the normal position played by a player as recorded in his player membership.
Now, what can happen as a result of each pitch?
- Swinging strike
- Looking strike
- Foul strike
- In Play
The plate appearance can have the following outcomes:
- Foul out
- Caught fair ball
- Ground out
- Home run
- Fielder’s choice
- Sacrifice fly
- Sacrifice bunt
- Hit by pitch
Figure 11 shows how these outcomes are represented in the model. A pitch outcome is from a single pitch (Strike, Ball, etc.). A plate appearance outcome is the overall result (Hit, Walk, Out,
etc.) of the player’s being up to bat. It is the result of a plate appearance. Each plate appearance results in a single plate appearance outcome.
In addition to the actual outcome of the pitch, other things will be going on around the field. Specifically, either someone may be put out, or a player may advance by one or more bases. Each field
event (a field out, a player advance, or an error) is one of these other happenings. The field event must be during a pitch (or immediately following one), and it must be by a
player. It also must be an example of one field event type, which redundantly expresses the sub-type structure of field event. (The first three field event types must be “Field out”,
“Player advance”, or “Error”.)
Specifically, “Player advance” is the super-type of the following field event types:
- Run scored
- Home stolen
- Advance to home on a hit
- Base stolen
- Movement to another base on a hit
The field event type “Field out” may be the super-type of the following field event types:
- Caught ball
- Force out of runner
- Caught stealing
- Tag out
A field event type “Error” has no sub-types, but an error may be the cause of a player advance.
A field event type “Multiple Play” is the super-type of the following field event types:
- Double play
- Triple play
In addition, a multiple play must be composed of one or more field outs.
A field event type “Other” may be the super-type of the following field event types:
- Wild pitch
- Passed ball
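The field event type hierarchy listed above can be held as data. The following Python sketch records each type’s super-type in a dictionary and walks up to the top of the hierarchy; the dictionary and function names are hypothetical.

```python
# Each field event type mapped to its super-type, as listed in the text.
SUPER_TYPE = {
    "Run scored": "Player advance",
    "Home stolen": "Player advance",
    "Advance to home on a hit": "Player advance",
    "Base stolen": "Player advance",
    "Movement to another base on a hit": "Player advance",
    "Caught ball": "Field out",
    "Force out of runner": "Field out",
    "Caught stealing": "Field out",
    "Tag out": "Field out",
    "Double play": "Multiple play",
    "Triple play": "Multiple play",
    "Wild pitch": "Other",
    "Passed ball": "Other",
}


def top_level_type(event_type: str) -> str:
    """Follow super-type links until a type with no super-type is reached."""
    while event_type in SUPER_TYPE:
        event_type = SUPER_TYPE[event_type]
    return event_type


print(top_level_type("Caught stealing"))  # Field out
print(top_level_type("Error"))            # Error (it has no super-type)
```

Holding the hierarchy as rows rather than as entity classes means a new kind of field event can be added without redrawing the model.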
Each field event must be by one player, either the player advancing or the player responsible for the field out, error, etc. In addition, other players may assist in the play. That is,
each field event may be helped by one or more assist roles, each of which is played by a player.
In recording the history of a game, it is also important to know the sequence of events, specifically in terms of the path of the ball. If the ball was hit to center field, then thrown to second
base, and then thrown to first base, these are three instances of ball travel which are part of a particular field event. In the case of a multiple play, the ball travel is recorded for
each component field out.
For example, if the ball is hit to the shortstop and thrown to first base for an out, the field event is a field out by the player who is playing first base, with the “Base” of the field out
being “First”. One instance of ball travel records that the ball went from “Shortstop” to “First Base”, and the player who is playing shortstop is the player in an assist role in that field
Now let’s return to that definition of “Earned run” discussed earlier. An earned run is a run for which the pitcher is held accountable, and shall be charged every time a runner scores after he
originally got on base without help from an error. That is, he got on base through a hit, a walk, he was hit by a pitch, or a fielder chose to tag or force out someone else on a different base.
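The first half of that definition can be sketched as a simple membership test. This Python fragment is deliberately partial: it checks only how the runner reached base, and it ignores the harder rule about errors that would have retired the side. The names are hypothetical.

```python
# Ways of reaching base that the text lists as "without help from an error".
CLEAN_WAYS_ON_BASE = {"Hit", "Walk", "Hit by pitch", "Fielder's choice"}


def run_may_be_earned(how_runner_reached_base: str) -> bool:
    """First half of the rule only: did the runner reach base cleanly?

    A fuller version would also need the error history of the half
    inning, which this sketch leaves out.
    """
    return how_runner_reached_base in CLEAN_WAYS_ON_BASE


print(run_may_be_earned("Walk"))   # True
print(run_may_be_earned("Error"))  # False
```

Even this small check needs to know how the runner got on base in an earlier plate appearance, which is exactly the kind of navigation the model has to support.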
4th game of the 2004 American League Playoffs: New York Yankees vs. Boston Red Sox
It was the bottom of the 14th inning in the fourth game of the playoffs. The Yankees had won three games already and were expecting to wrap it up tonight. Here’s what happened:
Esteban Loaiza pitches to Mark Bellhorn
- Pitch 1: ball 1
- Pitch 2: strike 1 (looking)
- Pitch 3: strike 2 (foul)
- Pitch 4: ball 2
- Pitch 5: strike 3 (swinging)
Esteban Loaiza pitches to Johnny Damon
- Pitch 1: ball 1
- Pitch 2: ball 2
- Pitch 3: strike 1 (looking)
- Pitch 4: ball 3
- Pitch 5: ball 4
Esteban Loaiza pitches to Orlando Cabrera
- Pitch 1: strike 1 (swinging)
- Pitch 2: strike 2 (foul)
- Pitch 3: ball 1
- Pitch 4: strike 3 (swinging)
Esteban Loaiza pitches to Manny Ramirez
- Pitch 1: ball 1
- Pitch 2: strike 1 (foul)
- Pitch 3: ball 2
- Pitch 4: strike 2 (looking)
- Pitch 5: foul
- Pitch 6: ball 3
- Pitch 7: ball 4
- Pitch 1: strike 1 (swinging)
- Pitch 2: ball 1
- Pitch 3: strike 2 (foul)
- Pitch 4: foul
- Pitch 5: foul
- Pitch 6: ball 2
- Pitch 7: foul
- Pitch 8: foul
- Pitch 9: foul
- Pitch 10: in play
Player Advance: Run Scored! by J Damon
NY Yankees 4, Boston 5
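The pitch sequences above can be replayed mechanically to get the count and the way each plate appearance ended. The Python sketch below does that under simplifying assumptions of mine: pitch outcomes are plain strings, and a foul with two strikes never adds a strike (the foul bunt exception is ignored).

```python
from typing import List, Optional, Tuple


def replay_count(pitches: List[str]) -> Tuple[int, int, Optional[str]]:
    """Return (balls, strikes, result) after replaying a list of pitch outcomes."""
    balls = strikes = 0
    for pitch in pitches:
        if pitch == "ball":
            balls += 1
            if balls == 4:
                return balls, strikes, "Walk"
        elif pitch in ("swinging", "looking"):
            strikes += 1
            if strikes == 3:
                return balls, strikes, "Strikeout"
        elif pitch == "foul":
            if strikes < 2:  # a foul counts as a strike only before strike two
                strikes += 1
        elif pitch == "in play":
            return balls, strikes, "In play"
        else:
            raise ValueError(f"unknown pitch outcome: {pitch!r}")
    return balls, strikes, None  # plate appearance not finished


# The seven pitches to Manny Ramirez, as listed above.
ramirez = ["ball", "foul", "ball", "looking", "foul", "ball", "ball"]
print(replay_count(ramirez))  # (4, 2, 'Walk')
```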
In another example, the Yankees were playing Minnesota. It was the top of the third inning and Derek Jeter was on first base:
Johan Santana pitches to Gary Sheffield
- Pitch 1: strike 1 (foul)
- Pitch 2: strike 2 (foul)
- Pitch 3: in play
Again With the Statistics
Now we can revisit the collection of statistics from the beginning of this paper. In Figure 13 we again see player statistic value of a statistic and for a player membership. Typically, each player
statistic value is an aggregate for a year, but the attributes “Begin date” and “End date” allow specification of any time period.
What we don’t see in the model is the navigation that is required to capture each statistic, although we do at least have a place to do the navigation:
- Games played – For a player, count the number of games (determined from plate appearance / half innings) played during the year.
- Batting average – For a player, count the number of plate appearance outcomes whose outcome is a hit, and divide by the number of plate appearances whose outcome is an
- Complete games – For a player who is player of the position “Pitcher”, count the number of games (through pitch / plate appearance / half inning) in which the player
was the pitcher for all pitches.
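Two of these navigations can be sketched in Python over a flat list of plate appearances. Several things here are assumptions of mine rather than statements of the model: the tuple layout, the outcome names, and the denominator, which uses the conventional at-bat rule (walks, hit by pitch, and sacrifices do not count as at bats). The sample data is invented.

```python
from typing import List, Tuple

# (game id, batter, plate appearance outcome): a hypothetical flattened view.
PlateAppearance = Tuple[str, str, str]

HIT_OUTCOMES = {"Single", "Double", "Triple", "Home run"}
# Assumed: outcomes conventionally excluded from at bats.
NOT_AT_BAT = {"Walk", "Hit by pitch", "Sacrifice fly", "Sacrifice bunt"}


def games_played(appearances: List[PlateAppearance], player: str) -> int:
    """Count distinct games in which the player had a plate appearance."""
    return len({game for game, batter, _ in appearances if batter == player})


def batting_average(appearances: List[PlateAppearance], player: str) -> float:
    """Hits divided by at bats; 0.0 when the player has no at bats."""
    outcomes = [o for _, batter, o in appearances if batter == player]
    at_bats = [o for o in outcomes if o not in NOT_AT_BAT]
    if not at_bats:
        return 0.0
    return sum(1 for o in at_bats if o in HIT_OUTCOMES) / len(at_bats)


sample = [
    ("G1", "Player A", "Single"),
    ("G1", "Player A", "Walk"),
    ("G1", "Player A", "Ground out"),
    ("G2", "Player A", "Home run"),
    ("G2", "Player A", "Ground out"),
    ("G2", "Player B", "Double"),
]
print(games_played(sample, "Player A"))     # 2
print(batting_average(sample, "Player A"))  # 0.5
```

Counting games through plate appearances alone would miss a player who only fielded or pitched, so a fuller version would also navigate through played position.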
Many of the other statistics are more complicated and are left to the reader as a homework assignment. All information necessary is contained in the model.
Figure 13 also shows team statistic value. This is the value of a statistic for the whole team in one game.
A Final Thought
Developing this model has been a true exercise in the difficulty of extracting information from subject-matter experts. While my experience with patterns has brought me to the point where doing a
model in a commercial environment is relatively easy, I was in unfamiliar territory here. As I stated above, my particular upbringing has left me (let’s see, what is the current PC term?)
disadvantaged, when it came to my understanding of the game. I knew the basics, of course, and I’ve been to several games around the country, but I never really understood what went into compiling
these statistics. The rules of the game are far more complex and subtle than I had ever imagined. It has taken several drafts for my son to clarify my thinking and my understanding to the degree
represented by this paper.
But the exercise has been exactly what is required to produce any data model in any field of endeavor.
Discussing baseball is much like working with computers. If you don’t know something, ask your kids.
*Fielder’s choice – A play made on a ground ball in which the fielder chooses to put out an advancing base runner, thus allowing
the batter to reach first base safely.