Posted on June 18, 2015
This is a guest blog post originally written for Fluxicon and reprinted here. The original post can be seen on the fluxicon blog.
Timestamps are core to any process mining effort. However, complex real-world datasets frequently present a range of challenges in analyzing and interpreting timestamp data. Sloppy system implementations often create a real mess for a data scientist looking to analyze timestamps within event logs. Fortunately, a few simple techniques can tackle most of the common challenges one will face when handling such datasets.
In this post I’ll discuss a few key points relating to timestamps and process mining datasets, including:
As a data type, timestamps present two distinct challenges:
To a computer time is a continuous series. Subdivisions of time like hours, weeks, months and years are formatted representations of time displayed for human users. Many computers base their understanding of time on so called Unix time, which is simply the number of seconds elapsed since the 1st of January 1970. To a computer using Unix time, the timestamp of 10:34:35pm UTC April 7, 2015 is 1428446075. While you will occasionally see timestamps recorded in Unix time, it’s more common for a more human-readable format to be used.
Converting from this human readable format back into something that computers understand is occasionally tricky. Applications like Disco are often quite good at identifying common timestamp formats and accurately ingesting the data. However, if you work with event logs you will soon come across a situation where you’ll need to ingest and/or combine timestamps containing unusual formats. Such situations may include:
The following scenario is typical of what a data scientist might find when attempting to complete process mining on a complex dataset. In this example we are assembling a process log by combining logs from multiple systems. One system resides in New York City and the other in Phoenix, Arizona. Both systems record event logs in the local time. Two sample timestamps appear as follows:
System in New York City: 10APR2015 23.12.17:54
System in Phoenix Arizona: 10APR2015 20.12.18:72
Such a situation presents a few headaches for a data scientist looking to use such timestamps. Particular issues of concern are:
You can see how this can all get quite complicated very quickly. In this example we may want to write a script that ingests both sets of logs and produces a combined event log for analysis (e.g., for import into Disco). Our primary challenge is to handle these timestamp entries.
Ideally all system admins would be good electronic citizens and run all their systems logging functions in UTC. Unfortunately, experience suggests that this is wishful thinking. However, with a bit of code it’s easy to quickly standardize this mess onto UTC and then move forward with any datetime analytics from a common and consistent reference point.
First we need to get the timestamps into a form recognized by our programming language. Most languages have some form of a ‘string to datetime’ function. Using such a function you provide a datetime string and format information to parse this string into its relevant datetime parts. In Python, one such function is strptime.
We start by using strptime to ingest these timestamp strings into a Python datetime format:
# WE IMPORT REQUIRED PYTHON MODULES (you may need to install these first) import pytz import datetime # WE IMPUT THE RAW TEXT FROM EACH TIMESTAMP ny_date_text="10APR2015 23.12.17:54" az_date_text="10APR2015 20.12.26:72" # WE CONVERT THE RAW TEXT INTO A NATIVE DATETIME # e.g., %d = day number and %S = seconds ny_date = datetime.datetime.strptime(ny_date_text, "%d%b%Y %H.%M.%S:%f") az_date = datetime.datetime.strptime(az_date_text, "%d%b%Y %H.%M.%S:%f") # WE CHECK THE OUTPUT, NOTE THAT FOR A NATIVE DATETIME NO TIMEZONE IS SPECIFIED print(ny_date) >>> 2015-04-10 23:12:17.540000
At this point we have the timestamp stored as a datetime value in Python; however, we still need to address the time zone issue. Currently our timestamps are stored as ‘native’ time, meaning that there is no time zone information stored. Next we will define a timezone for each timestamp and then convert them both to UTC:
# WE DEFINE THE TWO TIMZEONES FOR OUR DATATYPES # NOTE: ‘ARIZONA’ TIMEZONE IS ESSENTIALLY MOUNTAIN TIME WITHOUT DAYLIGHT SAVINGS TIME tz_eastern = pytz.timezone('US/Eastern') tz_mountain = pytz.timezone('US/Arizona') # WE CONVERT THE LOCAL TIMESTAMPS TO UTC ny_date_utc = tz_eastern.localize(ny_date, is_dst=True).astimezone(pytz.utc) az_date_utc = tz_mountain.localize(az_date, is_dst=False).astimezone(pytz.utc) # WE PRINT CHECK THE OUTPUT, NOTE THAT THE TIMEZONE OF +0 IS ALSO NOW RECORDED print(ny_date_utc) >>> 2015-04-11 03:12:17.540000+00:00 print(az_date_utc) >>> 2015-04-11 03:12:26.720000+00:00
Now we have both timestamps recorded in UTC. In this sample code we manually inputted the timestamps as text strings and then simply printed the results to a terminal screen. An example of a real-world application would be to leverage the functions above to read in raw data from a database for both logs, process the timestamps into UTC and then write the corrected log entries into a new table containing a combined event log. This combined log could then be subjected to further analytics.
With timestamps successfully imported, there are several useful time functions that can be used to further analyze the data. Among the most useful are time arithmetic functions that can be used to measure the difference between two timestamps or add/subtract a defined period of time to a timestamp.
As an example, let’s find the time difference between the two timestamps imported above:
# WE COMPARE THE DIFFERENCE IN TIME BETWEEN THE TWO TIMESTAMPS timeDiff = (az_date_utc - ny_date_utc) print(timeDiff) >>> 0:00:09.180000 The raw output here reads a time difference of 9 seconds and 18 milliseconds. Python can also represent this in rounded integer form for a specified time measurement. For example: # WE OUTPUT THE ABOVE AS AN INTEGER IN SECONDS print(timeDiff.seconds) >>> 9
This shows us that the time difference between the two timestamps is 9 seconds. Such functions can be useful for quickly calculating the duration of events in an event log. For example, the total duration of a process could be quickly calculated by comparing the difference between the earliest and latest timestamp for a case within a dataset.
These date arithmetic functions can also be used to add or subtract defined periods of time to a timestamp. Such functions can be useful when manually adding events to an event log. For example, the event log may record the start time of an automated process, but not the end time. We may know that the step in question takes 147 seconds to complete (or this length may be recorded in a separate log). We can generate a timestamp for the end of the step by adding 147 seconds to the timestamp for the start of the step:
# WE ADD 147 SECONDS TO OUR TIMESTAMP AND THEN OUTPUT THE NEW RESULT az_date_utc_end = az_date_utc + datetime.timedelta(seconds=147) print(az_date_utc_end) >>> 2015-04-11 03:14:53.720000+00:00
Having the data cleaned up and ready for analysis is clearly important, but equally important is understanding what data you have and what it means. Particularly for data sets that have a global geographic scope, it is crucial to first determine how timestamps have been represented in the data. Relative to timestamps in your event logs some key questions you should be asking are:
While this piece was hardly an exhaustive look at programmatically handling timestamps, hopefully you’ve been able to see how some simple code is able to deal with the more common challenges faced by a data scientist working with timestamp data. By combining the concepts described above with a database it is possible to write an automated script to quickly ingest a range of complex event logs from different systems and output one standardized log in UTC. From there, the process mining opportunities are endless.