Scholarly Paper

Application of Natural Language Processing and Text Mining to Identify Patterns in Construction-Defect Litigation Cases

Downloaded from ascelibrary.org by University Of Auckland on 08/25/19. Copyright ASCE. For personal use only; all rights reserved.

Yashovardhan Jallan, S.M.ASCE 1; Elizabeth Brogan, P.E., A.M.ASCE 2; Baabak Ashuri, Ph.D., M.ASCE 3; and Caroline M. Clevenger, Ph.D., P.E., M.ASCE 4

Abstract: Construction-defect litigation has recently surged across the United States. Disputes arise for a variety of reasons and result in a range of negative impacts on construction projects, such as increased cost, delay, profit loss, and inconvenience. Although the majority of these disputes settle out of court, a public trail of legal records exists. Previous research has generally been limited to exploring a small subset of such cases because of restricted access to records and data. This ongoing research automates systematic exploration of construction-defect lawsuits in the public domain by using modern computational capabilities of natural language processing and text mining to conduct a comprehensive survey of legal cases over the last 10 years. The approach of this research is to use coded text mining to automatically identify and analyze thousands of publicly available construction-defect cases. To perform such research, the authors developed a program that crawls the national legal database, LexisNexis. Key contributions include the development of a model that can find the frequencies of keywords in the cases and apply a statistical algorithm called Latent Dirichlet Allocation (LDA) to identify important topics and themes in order to classify the case data. The research demonstrates new methods for exploring publicly available construction-defect cases. Major challenges are identified and discussed.
As exploratory research, the findings are intended to inform and motivate future study, which may lead to identification of broad-based trends in construction-defect litigation. DOI: 10.1061/(ASCE)LA.1943-4170.0000308. © 2019 American Society of Civil Engineers.

1 Ph.D. Student, School of Civil and Environmental Engineering, Georgia Institute of Technology, Atlanta, GA 30332. ORCID: https://orcid.org/0000-0002-2076-0133. Email: [email protected]
2 Ph.D. Student, Construction Engineering and Management, Dept. of Civil Engineering, Univ. of Colorado Denver, 1200 Larimer, Denver, CO 80204. Email: [email protected]
3 Associate Professor, School of Building Construction and School of Civil and Environmental Engineering, Georgia Institute of Technology, Atlanta, GA 30332. Email: [email protected]
4 Associate Professor, Construction Engineering and Management, Dept. of Civil Engineering, Univ. of Colorado Denver, 1200 Larimer, Denver, CO 80204 (corresponding author). ORCID: https://orcid.org/0000-0003-2265-8447. Email: [email protected]
Note. This manuscript was submitted on October 31, 2018; approved on January 14, 2019; published online on July 17, 2019. Discussion period open until December 17, 2019; separate discussions must be submitted for individual papers. This paper is part of the Journal of Legal Affairs and Dispute Resolution in Engineering and Construction, © ASCE, ISSN 1943-4162.

Introduction

The construction industry has witnessed an increase in the number of litigation cases arising from construction defects over the last few years (Noble-Allgire 2009; Grosskopf et al. 2008; Aalberts 2005). This increase has been caused by a surging number of construction projects and a rise in aggressive litigation. In some instances, the plaintiff is actively incentivized, either with a financial motive or as insurance against possible future claims, to search for any and all defects, irrespective of the seriousness of related damage. Increased construction-defect litigation, however, is generally expensive for all stakeholders and leads to higher insurance costs.

The goal of this research is to build a text-mining tool to identify patterns within the increased construction-defect litigation by examining the language of public legal filings. This research builds on previous research documenting emerging patterns in construction-defect litigation (Brogan et al. 2018). In the previous research, the 10 most commonly cited construction defects were determined based on manual content analysis of construction cases and review of specific reports that identified the defects. Large-scale manual text and content analysis, however, is time-consuming, error prone, and subject to data bias according to the available sample of cases. This research provides a novel approach to address these challenges in the form of a pilot implementation of natural language processing and text mining to identify commonly occurring patterns and themes surrounding construction-defect litigation.

The authors identified the national legal database, LexisNexis, as a massive source of data consisting of summaries of thousands of prior construction-defect litigation cases. This research has developed a pilot tool to crawl several hundred recent construction litigation cases and generate keywords and topics to facilitate the content-analysis procedure and perform a cursory exploration of the construction litigation landscape. The authors do not claim that the current implementation of artificial intelligence is sufficiently calibrated or nuanced to meaningfully analyze construction law. Rather, the goal of this research is to test whether text mining and natural language processing may provide a powerful and scalable new method capable of identifying broad patterns and generating broad summaries of publicly available construction litigation data.
Literature Review and Research Objectives

This research effort grew out of earlier research. Brogan et al. (2018) conducted a detailed study of 41 cases representative of construction-defect litigation in Colorado and surrounding regions. These 41 cases occurred in the period of 2015–2017, and the claimants were seeking at least $1 million to address the alleged construction-defect issues. Researchers with legal (Juris Doctor) training completed a thorough manual content analysis of plaintiffs’ expert reports, which contained detailed descriptions of individual construction defects and repair plans to remedy the issues. They identified 55 types of defects across the 41 cases. Of these 55 defects, 10 were found to be the most common, occurring in at least 50% of the cases reviewed. Additionally, these 55 issues were broadly classified by the researchers into seven major groups, namely (1) structural issues; (2) civil issues; (3) building envelope issues; (4) roof issues; (5) deck/balcony/porch issues; (6) fire protection issues; and (7) miscellaneous issues.

J. Leg. Aff. Dispute Resolut. Eng. Constr., 2019, 11(4): 04519024

The present research extends that previous work by incorporating modern computing capabilities and enhancements of large-scale text analysis, data science, and natural language processing to create an automated content-analysis tool that can identify the incidence of similar construction defects across several hundred publicly available court summaries.

Text-mining and natural language processing techniques are gaining popularity to assist lawyers and legal firms in the review of legal cases. Dragoni et al. (2016) developed an approach to extract syntax-based rules from legal text and a logic-based extraction of dependencies between chunks of text.
Several technology companies are creating products with capabilities such as automated contract review, legal-data research, and systems to predict the outcome of a case, among others. There are dedicated websites that post news and developments in the field of artificial intelligence applied to law. Courses teaching quantitative methods with applications in law are also being offered, both on college campuses and through massive open online courses (MOOCs). Dedicated open-source libraries like LexNLP in Python, developed by Bommarito et al. (2018), have been built to extract information for legal and regulatory use. These advances can be attributed to the ability of such methods to uncover information that cannot be obtained through manual analysis.

Researchers have also made strides implementing similar techniques within the construction industry. Some of the earliest work in this niche was done by Caldas and Soibelman (2003), who created an automated hierarchical document classification for managing construction documents. Tixier et al. (2015) developed a content-analysis tool for implementation in construction safety to identify the causes and outcomes from injury reports. Zhang and Ashuri (2018) mined building information model (BIM) log files to measure design productivity. Mahfouz et al. (2018) conducted a study to identify the latent legal knowledge in differing site condition (DSC) litigation cases. These research efforts have found varying levels of success with the implementation of modern computational tools within the construction industry. Although the premise of these implementations is similar, i.e., utilize large digital data sets and leverage natural language processing, data science, and text analytics to find actionable insights that can solve current and future problems, this research is a novel application of such techniques to identify patterns within construction-defect litigation.
The objective of this research is to provide a pilot implementation that leverages natural language processing, data science, and text analytics on a large-scale digital database to identify patterns in construction-defect litigation and document related opportunities and challenges.

Data Source

The data source for this research was Nexis Uni, the student version of the LexisNexis legal database (hereafter called LexisNexis), which is a source of summaries for publicly available law cases. The limitations of the database are significant. Namely, LexisNexis does not provide full documentation for cases settled out of court or decided through arbitration. Additionally, the cases in LexisNexis are limited to ones that have been appealed. Current search methods rely on keyword search. Lastly, the database includes case summaries only and lacks technical detail.

The team used the keyword “construction defect” for the searches, which also returned cases outside the targeted construction-defect litigation content. For example, the cases included a product liability suit for “defect in [manufacturing] construction” (Lyles v. Medtronic Sofamor Danek). Another case involved a yacht insurance policy because the damage did not fall within the latent-defect exception (Ardente v. Std. Fire Ins. Co.).

Research Approach

To obtain a broad data source from LexisNexis, both state and federal cases were included and refined with the keyword “construction defect.” Construction-defect cases are most frequently heard at the state level. However, construction cases are heard in federal court when there is party diversity (parties located in different states) or when they involve federal requirements such as the Americans with Disabilities Act. A total of 1,498 cases met these parameters. Using these cases, the authors implemented code in Python, an open-source programming language, to read the digital documents.
The methodology can be divided into two broad techniques: (1) a supervised approach of frequency analysis for the 41 terms identified as relevant by subject-matter experts, and (2) an unsupervised approach of implementing Latent Dirichlet Allocation (LDA), a natural language processing algorithm used for topic modeling.

Preprocessing and Data Cleaning

Any implementation of natural language processing or text mining requires a series of preprocessing and text-cleaning steps to provide text data that can be read by the programming tool consistently and correctly. The preprocessing steps were as follows:
1. Convert all the data into lowercase to ensure consistency.
2. Remove all punctuation, digits, and symbols like asterisks, pound signs, currency symbols, and percentage symbols, among others.
3. Remove all proper nouns.
4. Remove any words with a length of two characters or fewer.
5. Remove a host of stop-words (and, or, not, although, and but, among others).
The preprocessing steps are intended to retain meaningful data and eliminate a significant portion of confounding information.

Frequency Analysis of Important Keywords

The first analysis was a frequency analysis of predetermined keywords to help verify whether the data set had similarities to the manual analysis. The intent of frequency analysis is to use keyword frequencies in the text as a proxy for issue relevance. The authors readily acknowledge that applying such an oversimplified technique to complex and nuanced legal cases and issues is problematic and can in no way replace expert technical analysis. Nevertheless, this research sought to test whether consequential patterns in keyword frequencies could be identified.

Table 1. Sample of frequency analysis for individual cases sorted by incidence of unique keywords in descending order

Case  Total words  Total occurrence    Unique  Grade  Slope  Foundation  Flatwork  Differential  Concrete  Asphalt  Drainage  Drain
      in analysis       of keywords  keywords
   1        5,305               165        25      1      3           2         4             0        13        1         4      1
   2        8,415               365        23      1      1           7        11             0        50       23         5      1
   3        3,931                50        22      4      2           4         2             3         2        1         3      5
   4        7,969               177        20      0      0           1         0             0         2        0         0      1
   5        6,053               180        16      1      0           0         0             0         7        0         1      0
   6        2,621                84        16      0      0           2         1             0         1        0        19      1
   7        3,841                81        16      0      0           8         0             0         1        0         7      2
   8        4,917                65        15      0      0           6         0             0         1        0         0      4
   9        2,096                36        15      6      0           2         0             0         0        0         2      1
  10        1,963                68        15      2      0           3         0             0         0        0         0      0
  11        3,234               123        15      1      4           0         0             0         0        0         0      1
  12        2,134                71        15      0      0           1         0             0         1        4         0      0
  13        6,423                86        14      1      5           7         0             0         0        0        18      1
  14        2,592                72        14      0      0           7         0             0         1        0         4      0
  15        4,949                81        14      1      5           6         0             0         0        0        16      1
  16        3,635               133        14     10      9          44         0             1         4        0        11      0
  17        2,276                71        14      0      2           7         0             0         1        0         2      3
  18        3,871                62        13      0      2          18         0             0         3        0        10      0
  19        7,457               151        13      0      2           0         0             0         0        1         1      0
  20        3,535               139        13      1      0          22         0             0        12        0        13     31

Based on previous research, the authors identified a list of 41 words with increased frequency in the plaintiff expert reports (Brogan et al. 2018). These 41 words are asphalt, brick, clearance, concrete, differential, downspout, drain, drainage, exterior insulation and finish system (EIFS), expansion, fastener, flashing, flatwork, foundation, galvanized, grade, grading, gutter, handrail, lapping, masonry, membrane, moisture, precast, roof, roofing, self-adhering membrane (SAM), screed, scupper, siding, sill, slope, structural, stucco, surface, walls, water, waterproofing, weep, window, and weather-resistant barrier (WRB). These words represent keywords identified as important by experts in previous analysis of case files. The authors wrote a program using these keywords (or abbreviations) to read through the 1,498 cases, apply the preprocessing and data-cleaning techniques described in the previous section, and calculate the individual frequencies of the words across the cases.
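The preprocessing steps and the per-case keyword count described above can be sketched in Python, the language the authors report using. This is a minimal sketch and not the authors' program: the function names, the short stop-word list, the nine-word keyword subset, and the caller-supplied `proper_nouns` set (a stand-in for proper-noun detection, which would normally need a part-of-speech tagger) are all assumptions made for illustration.

```python
import re
from collections import Counter

# Illustrative subset of the 41 expert keywords (assumption: single-word forms only).
KEYWORDS = {"concrete", "drain", "drainage", "foundation", "grade",
            "roof", "slope", "water", "window"}

# Tiny illustrative stop-word list; a real run would use a much fuller list.
STOP_WORDS = {"and", "or", "not", "although", "but", "the", "was", "that", "with", "for"}


def preprocess(text, proper_nouns=frozenset()):
    """Apply the five cleaning steps and return a list of tokens.

    `proper_nouns` is a hypothetical, caller-supplied set of lowercase names;
    the sketch does not detect proper nouns on its own.
    """
    text = text.lower()                        # step 1: lowercase
    text = re.sub(r"[^a-z\s]", " ", text)      # step 2: drop punctuation, digits, symbols
    return [
        token for token in text.split()
        if token not in proper_nouns           # step 3: remove proper nouns
        and len(token) > 2                     # step 4: remove words of length <= 2
        and token not in STOP_WORDS            # step 5: remove stop-words
    ]


def keyword_profile(text, proper_nouns=frozenset()):
    """Return the per-case quantities reported in Table 1 for one case text."""
    tokens = preprocess(text, proper_nouns)
    counts = Counter(token for token in tokens if token in KEYWORDS)
    return {
        "total_words": len(tokens),
        "keyword_occurrences": sum(counts.values()),
        "unique_keywords": len(counts),
        "counts": dict(counts),
    }


sample = ("The foundation cracked, and water entered through the window. "
          "Smith Homes poured concrete; water pooled at 5% slope.")
profile = keyword_profile(sample, proper_nouns={"smith", "homes"})
print(profile["total_words"], profile["keyword_occurrences"], profile["unique_keywords"])
# prints: 11 6 5
```

Sorting the resulting profiles by `unique_keywords` in descending order would give the ordering used in Table 1. Multi-word terms such as EIFS or WRB would need their abbreviations added to the keyword set, as the authors note.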
A sample (20 cases and 9 keywords) of the highest frequency results is presented in Table 1. Based on the results obtained, the authors identified 366 cases containing five or more keywords. Conversely, 1,132 cases contained fewer than five keywords, which may indicate that either the keywords were poorly chosen (too broad or too specific), or the data set does not contain the type of information required to complete meaningful analysis. The keywords concrete, window, water, roof, foundation, and structural were the six most commonly occurring words in this frequency analysis. Each occurred more than 1,000 times in the data set, with water being the highest at 3,864 occurrences. The very specific words that would be used to describe specific construction defects on a project, like screed, EIFS, WRB, walls, and galvanized, were observed to have extremely low frequencies in the analysis.

In general, results of the analysis revealed relatively low frequency of keywords. With such analysis, it is difficult to precisely determine the source and/or extent of potential shortcomings and limitations. However, potential limitations include the following:
• Frequency is a poor proxy for measuring severity or significance of a construction defect or issue. In particular, the experts manually reviewing the cases highlighted that interpretation in addition to frequency is critical. For example, words such as settlement, joint, and binding could appear throughout construction-defect legal cases with very different (technical or legal) meaning depending on the context and could render the frequency data irrelevant.
• LexisNexis text files lack necessary technical descriptions of construction defects for issue identification.
• Expert analysis is necessary for meaningful interpretation of technical details, which may be superficially presented in the case summaries.
• Wording related to the technical details of construction defects varies to such an extent (e.g., grading versus slope or soil versus dirt) that individual keyword selection is problematic and/or misleading.

Nevertheless, and despite such potentially significant limitations, initial pilot frequency analysis is compelling and suggests that further and more sophisticated exploration of such techniques is merited and may, in fact, lead to useful methods for identifying broad-based patterns related to construction litigation.

Unsupervised Approach Implementing Latent Dirichlet Allocation

Unsupervised learning on text data involves applying algorithms on a data set that has not been manually labeled, classified, or categorized by the user. Instead, trends and patterns in the input data are found based on the inherent properties and characteristics of the data, rather than potential biases introduced by the user. The second analysis of this study was the application of an unsupervised approach to 59 cases, a subset of the 1,498 cases, that originated from the State of Colorado in the period 2000–2017. The subset was again generated through keyword search of LexisNexis and was intended to contain cases similar to the type of cases previously analyzed manually (Brogan et al. 2018). However, the cases were not identical to those previously analyzed, given that many of the cases previously analyzed were not publicly available through LexisNexis because many had not gone to court or been appealed.

For analysis, the authors implemented a statistical modeling technique called Latent Dirichlet Allocation, which is commonly used in topic modeling for natural language processing applications. In LDA, each document is viewed as a mixture of topics, and the algorithm assigns each document a set of topics. In this research, the goal was to develop a pilot implementation of LDA for a construction-defect litigation data set and observe the topics/themes that were produced by the algorithm in an unsupervised manner. The actual implementation was done in Python using libraries such as Natural Language Toolkit (NLTK), Gensim for developing the model, and pyLDAvis for visualizing the results. Based on a coherence measure (Röder and Hinneburg 2015) that quantifies the strength of the topics obtained, a total of 14 topics were generated for the input data as shown in Fig. 1.

Fig. 1. Representative screenshot of topics identified by LDA on 59 construction defect cases in the State of Colorado.

Each topic produced by LDA is a distribution of words with different weights, and usually the top 5–10 words by weight can be used to describe the overarching theme captured by the topic. Table 2 presents the top five words associated with each of these topics. Next, individual cases are attributed to the topics obtained, as shown in Fig. 2. Each case can be classified under one or more topics. For example, case number 1 was found to correspond to Topic 13. The interpretation here is that the main themes in the case can be captured by the words in Topic 13, which are duty, subcontractor, home, loss, and negligence. This procedure can essentially decompose a large set of cases into a number of topics, and then each case can be classified under one or more topics based on its content.

One of the desired outcomes of the unsupervised section was to find a correlation or similarity between Brogan et al.’s (2018) manually collected data and the unsupervised section data. However, there was not a strong similarity in the results.
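As a rough illustration of how LDA turns tokenized cases into topics and then attributes cases to those topics, the sketch below fits a small model by collapsed Gibbs sampling using only the Python standard library. It is not the authors' implementation: they report using Gensim, which fits LDA with a different inference procedure (online variational Bayes). The function names, the hyperparameters `alpha` and `beta`, the toy documents, and the 0.3 attribution threshold are assumptions chosen for illustration, and choosing the number of topics by coherence is not shown.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling on tokenized documents.

    Returns (doc_topic, topic_words): per-document topic proportions and,
    for each topic, a Counter of how often each word was assigned to it.
    """
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    doc_topic_counts = [[0] * num_topics for _ in docs]
    topic_words = [Counter() for _ in range(num_topics)]
    topic_totals = [0] * num_topics
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            z = rng.randrange(num_topics)
            doc_assignments.append(z)
            doc_topic_counts[d][z] += 1
            topic_words[z][word] += 1
            topic_totals[z] += 1
        assignments.append(doc_assignments)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                z = assignments[d][i]
                # Take the token out of the counts before resampling it.
                doc_topic_counts[d][z] -= 1
                topic_words[z][word] -= 1
                topic_totals[z] -= 1
                # p(topic k) is proportional to
                # (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta).
                weights = [
                    (doc_topic_counts[d][k] + alpha)
                    * (topic_words[k][word] + beta)
                    / (topic_totals[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic_counts[d][z] += 1
                topic_words[z][word] += 1
                topic_totals[z] += 1

    doc_topic = [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in counts]
        for doc, counts in zip(docs, doc_topic_counts)
    ]
    return doc_topic, topic_words


def top_words(topic_counter, n=5):
    """Highest-count words of one topic (words with zero count are skipped)."""
    return [word for word, count in topic_counter.most_common(n) if count > 0]


def assign_topics(proportions, threshold=0.3):
    """1-based topic numbers for one case: every topic at or above the
    (hypothetical) threshold, always including the dominant topic."""
    dominant = max(range(len(proportions)), key=proportions.__getitem__)
    chosen = {k for k, p in enumerate(proportions) if p >= threshold} | {dominant}
    return sorted(k + 1 for k in chosen)


# Toy stand-in for preprocessed case summaries.
toy_cases = [
    ["roof", "leak", "water", "roof"],
    ["duty", "negligence", "subcontractor"],
    ["water", "roof", "leak"],
    ["negligence", "duty", "duty"],
]
doc_topic, topic_words = fit_lda(toy_cases, num_topics=2, iters=50)
for number, proportions in enumerate(doc_topic, start=1):
    print("case", number, "-> topics", assign_topics(proportions))
```

Listing `top_words` for each topic corresponds to the top-five-word summaries the authors report, and `assign_topics` mirrors the step in which each case is classified under one or more topics.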
Table 2. Top five words for each topic obtained after running the LDA analysis

Topic  Words
1      Contractor, duty, owner, residential, completed
2      Company, concrete, insured, cost, damage
3      Damage, home, period, insured, occurrence
4      Commissioner, authority, copy, regulation, response
5      Repose, improvement, tolling, stat, rev
6      Representation, preclusion, developer, related, substantially
7      Homeowner, lot, home, condition, city
8      Seller, duty, district, home, loss
9      Deed, tax, commercial, security, operation
10     Damage, city, cost, duty, negligence
11     Association, unit, owner, power, condominium
12     Payment, defense, period, indemnification, cost
13     Duty, subcontractor, home, loss, negligence
14     Damage, cost, district, home, final
Note: rev = revised; and stat = statute.

Fig. 2. Representative sample results with the attribution of the topics for each analyzed case.

Several of the topics (e.g., Topics 4, 5, 8, 9, and 12) also appear not to be relevant to or representative of construction-defect litigation. However, analysis was limited. Therefore, it is difficult to determine all sources of potential shortcomings or limitations within the approach. In general, however, potential limitations include the following:
• The data available on LexisNexis are limited compared with the entire job file. Specifically, the information on LexisNexis is a summary of the actual engineering design and construction issues. Additional expert reports exist that are not available on LexisNexis, which clearly describe each construction defect. The expert reports were utilized in the manual collection completed by Brogan et al. (2018). Thus, the data source was different, and results may not be comparable.
• Due to the limitations described in the data available on LexisNexis, the topics were too generic and did not have sufficient detail to show trends that might be expected between residential construction defects and industrial improvements.
• Expert analysis is necessary for meaningful interpretation of case summaries.
• Wording related to the technical details of construction defects varies to such an extent that keyword topic selection is problematic or misleading.

Despite such potentially significant limitations, initial pilot LDA analysis is compelling and suggests that further and more sophisticated exploration of such techniques is merited, particularly with increased involvement of experts. Finally, the visual graphs may be valuable tools to illustrate and increase understanding of the trends in construction-defect litigation.

Conclusions

This research developed and piloted an implementation of construction-defect litigation analysis using natural language processing and text mining to generate and support an automated content-analysis and classification tool. A two-part methodology was adopted. The first step is to identify the frequencies of the keywords in the body of these cases. This step is essential to understanding whether the input data are appropriate for applying more sophisticated text-mining algorithms. The second step is to apply Latent Dirichlet Allocation, a statistical topic modeling algorithm, to produce the overarching topics/themes underlying the input data set. The LDA model produces topics that consist of keywords that are deemed able to capture the essence of that topic. Using these topics, cases can then be attributed to the topics based on how closely they align.

Limitations

Several potentially significant challenges were identified with this research. Namely, in the absence of expert engineering reports, LexisNexis may not provide a sufficient data source for either frequency analysis or LDA. Typically, expert engineering reports are not publicly available. However, expert engineering reports contain in-depth details about the defects occurring in the case being litigated and provide insight into the engineering and construction aspect. Brogan et al. (2018) conducted manual content analysis using expert engineering reports. The LexisNexis database may not be sufficiently comprehensive to identify patterns that emerge in the technical details of construction defects. In summary, pilot implementation of artificial intelligence was not able to provide similar results to manual expert content analysis. Nevertheless, the main contribution of the research is to identify and highlight challenges and opportunities associated with applying automated techniques to publicly available databases for construction defects.

References

List of Cases

Ardente v. Std. Fire Ins. Co., 744 F.3d 815, US District Court for the District of Rhode Island (2014).
Lyles v. Medtronic Sofamor Danek, USA, Inc., 871 F.3d 305, US Court of Appeals, Fifth Circuit (2017).

Works Cited

Aalberts, R. 2005. “To sue or not to sue: The past, present and future of construction defect litigation in Nevada.” Nevada Law J. 5 (3): 684–703.
Bommarito, M. J., D. M. Katz, and E. Detterman. 2018. “LexNLP: Natural language processing and information extraction for legal and regulatory texts.” Accessed June 6, 2018. https://ssrn.com/abstract=3192101.
Brogan, E., W. McConnell, and C. M. Clevenger. 2018. “Emerging patterns in construction defect litigation: Survey of construction cases.” J. Leg. Aff. Dispute Resolut. Eng. Constr. 10 (4): 03718003. https://doi.org/10.1061/(ASCE)LA.1943-4170.0000277.
Caldas, C. H., and L. Soibelman. 2003. “Automating hierarchical document classification for construction management information systems.” Autom. Constr. 12 (4): 395–406. https://doi.org/10.1016/S0926-5805(03)00004-9.
Dragoni, M., S. Villata, W. Rizzi, and G. Governatori. 2016. “Combining NLP approaches for rule extraction from legal documents.” In Proc., 1st Workshop on Mining and Reasoning with Legal texts. Sophia Antipolis, France: HAL.
Grosskopf, K. R., P. Oppenheim, and T. Brennan. 2008. “Preventing defect claims in hot, humid climates.” ASHRAE J. 50 (7): 40–52.
Mahfouz, T., A. Kandil, and S. Davlyatov. 2018. “Identification of latent legal knowledge in differing site condition (DSC) litigations.” Autom. Constr. 94 (Oct): 104–111. https://doi.org/10.1016/j.autcon.2018.06.011.
Noble-Allgire, A. 2009. “Notice and opportunity to repair construction defects: An imperfect response to a perfect storm.” Real Property Trust Estate Law J. 43 (4): 729–796.
Röder, A. B., and A. Hinneburg. 2015. “Exploring the space of topic coherence measures.” In Proc., 8th ACM Int. Conf. on Web Search and Data Mining. New York: Association for Computing Machinery.
Tixier, A. J. P., M. R. Hallowell, B. Rajagopalan, and D. Bowman. 2015. “Automated content analysis for construction safety: A natural language processing system to extract precursors and outcomes from unstructured injury reports.” Autom. Constr. 62 (Feb): 45–56. https://doi.org/10.1016/j.autcon.2015.11.001.
Zhang, L., and B. Ashuri. 2018. “BIM log mining: Discovering social networks.” Autom. Constr. 91 (Jul): 31–43. https://doi.org/10.1016/j.autcon.2018.03.009.