Cadence Solutions was engaged by the Content Management unit within the Technology Department of a public sector organization (“Organization” or “PSO”) to examine the functionality and suitability of Microsoft Syntex as it relates to document classification within the PSO’s SharePoint Online (SPO) environment. Specifically, PSO desired to test whether the machine teaching capabilities of Syntex would be sufficient to auto-apply a Functional Classification Taxonomy (FCT) metadata element to a document.
The FCT is an information management (IM) tool to help users functionally classify records. Functional classification categorizes records based on the business activities that created them. The FCT is a functional classification scheme that groups all PSO business activities into a hierarchy of functions (top level) and activities (bottom level). Users must functionally classify a record in SPO with a combination of a function and one of its activities (a functional class), such as FIN - Budgeting and Forecasting.
Cadence Solutions was specifically engaged by the Department to perform the configuration work, provide training to the Department, and provide this report as a perspective on the project. Cadence Solutions performed this work pro bono using trial licenses provided directly to the Organization through its enterprise agreement with Microsoft.
Cadence Solutions was provided a set of sample documents fitting each of the five (5) selected FCT terms. These sample documents were stored on the SharePoint site established for the project in the Organization’s test environment. Cadence Solutions’ technical team then established the connection to Syntex and trained five (5) models.
At the closure of Cadence Solutions’ configuration effort, the Organization was left with five (5) trained Syntex document classification models which were able to apply the FCT to test documents.
The following report captures the highlights of this project and suggests future areas of investigation for the Department.
What is Microsoft Syntex?
Microsoft Syntex is a set of tools integrated in the Microsoft 365 environment providing a wide set of features related to unstructured content. While the brand is much broader than the scope of this project, it is valuable to consider the breadth of the features available. This project focused on a subset of the available features, but for clarity, we have copied the broad descriptions from Microsoft in the following paragraphs as an introduction to the Syntex platform as a whole.
Microsoft Syntex is Content AI integrated in the flow of work. It puts people at the center, with content seamlessly integrated into collaboration and workflows, turning content from a cost into an advantage. Syntex automatically reads, tags, and indexes high volumes of content and connects it where it’s needed—in search, in applications, and as reusable knowledge. It manages your content throughout its lifecycle with robust analytics, security, and automated retention.
Whether you’re focused on customer transactions, processing invoices, writing a contract that requires a signature, or struggling to understand the flood of unstructured content, Content AI with Syntex can help.
In the current state, much of the classification of documents is manual. For net new documents, this creates extra work for the user to ensure a document is classified appropriately at the time of upload. Gaps in training or general human error raise the risk of misclassification, which has downstream impacts on retention policy and on overall compliance with record-keeping legislation. As for existing documents, there are likely documents currently in the SharePoint environment that are misclassified or not classified with an FCT at all. Additionally, migrating existing content from other repositories into SharePoint requires substantial manual effort to ensure an appropriate FCT is applied to all in-scope documents. The gaps in this process are a significant barrier to enforcing appropriate content management and carry significant compliance risk. In summary, the manual and error-prone process of classifying documents with an appropriate FCT is a risk to the Organization, preventing an effective and efficient transition to unified digital record keeping in SharePoint.
How was Syntex applied?
Using the Document Processing features of Syntex, we trained five (5) Document Processing* models in Syntex to apply a defined FCT based on similarity to training documents within each model. *An ‘Unstructured Document Processing Model’ is a machine teaching model that takes in sample documents, with human intervention, to train the model to recognize key phrases, which it then uses to apply a classification.
Five Models trained using Syntex
The following are the five models and their respective FCTs; each model is registered in Syntex as a Content Type of the same name:

GOV - Strategic Reporting
Director’s Orders – Consumer Protection (FCT: MOC - Enforcement)
RM - Strategic Risk Assessment
Mineral Assessment Reports (FCT: NR - Mineral Rights Administration)
GOV - Legislation
Syntex model classification and FCT extraction process
There are two main processes to link the models with the FCT. The first is to classify a document as a model (content type) using the Syntex classifier. After the document is classified as a model, the next step is to extract the FCT value from the model.
The process of classifying the models and mapping them to the FCT involved the following steps:
Configure Syntex Model Classifier
We created five models from the Syntex Content Center using the Unstructured Document Processing Model, also known as the document teaching method. Syntex models have a 1:1 mapping to a SharePoint Content Type, so a unique Content Type was created for each model.
Once the models were created, we trained each model’s classifier using the training documents that were provided. The training leveraged explanations provided by the user to teach the model to correctly identify positive and negative examples. The explanations for each model defined key phrases and/or proximities to teach the model to associate those terms with the model.
Once training was complete, we tested the model classifier to validate that documents were being classified correctly per our training.
Configure Syntex Model Extractor
After the model classification was fully configured, we proceeded to the next step: extracting the FCT. Since the FCT provided by the PSO is stored in the SharePoint Term Store, we needed an extractor that would extract a value from the model to associate with the FCT.
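The two-step flow described above (classify a document to a model/content type, then map that model to an FCT term) can be sketched as follows. This is an illustrative sketch only, not the Syntex API: the function names, phrase lists, and the `MODEL_TO_FCT` mapping are hypothetical stand-ins for what Syntex does internally.

```python
# Hypothetical mapping from a Syntex model (content type) to its FCT term.
MODEL_TO_FCT = {
    "Director's Orders - Consumer Protection": "MOC - Enforcement",
    "Mineral Assessment Reports": "NR - Mineral Rights Administration",
}

def classify(document_text, models):
    """Step 1: pick the model (content type) whose key phrases best match."""
    scores = {name: sum(phrase in document_text for phrase in phrases)
              for name, phrases in models.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def extract_fct(model_name):
    """Step 2: map the classified model to its FCT term in the Term Store."""
    return MODEL_TO_FCT.get(model_name)

# Toy phrase lists standing in for trained model explanations.
models = {
    "Director's Orders - Consumer Protection": ["director's order", "consumer"],
    "Mineral Assessment Reports": ["mineral", "assessment report"],
}
doc = "This mineral exploration assessment report covers claim block 12."
model = classify(doc, models)   # "Mineral Assessment Reports"
fct = extract_fct(model)        # "NR - Mineral Rights Administration"
```

In the real configuration, the classification step is driven by the trained explanations and the extraction step pulls a managed-metadata value from the Term Store; the sketch only mirrors the shape of that pipeline.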
Test Models on a Document Library
We tested the model classification/extraction on the SPO site and derived the following statistical results:
Classification Accuracy: 97%
Confusion Matrix: 97%
Average Confidence Score: 99.45%
Syntex model test results
Model Configuration Result
The following is a summary of the results of the models’ configuration in the FM1 (Syntex) Content Center:

[Table: per-model Classifier Accuracy (/100) and FCT Accuracy (%), including the Director’s Orders – Consumer Protection and Mineral Assessment Reports models]
Classifier Accuracy: determined by the result of the Syntex model classifier’s training. Classifiers for most models had an accuracy of 100, and the average across all models was 98.8. Lower classifier accuracy was correlated with models that lacked a standard format.
FCT Accuracy: determined by the result of the Syntex model FCT extractor’s training. The FCT extractor’s accuracy during training was lower, averaging 88.2%. Lower FCT extraction rates were likewise correlated with models covering multiple unstructured formats.
The FCT extractor’s accuracy for Environmental Assessment was improved in the actual extraction process by implementing Method B for FCT extraction. Method B does not rely on the FCT accuracy to extract the FCT value.
Overall, the configuration of the model surpassed the original expectation.
Model Classification/Extraction Result
The following is the model classification/extraction result of Syntex model tests on the SPO site:
Legend: (-) = Negative, (+) = Positive; false negatives and false positives are marked separately.

[Table: per-model classification/extraction test results, including the Director’s Orders – Consumer Protection and Mineral Assessment Reports models]
The following statistics summarize the Syntex model test results on the SPO site:

Classification Accuracy: number of correct records / total number of document samples, i.e., (True Positives + True Negatives) / Total Samples
Precision: Expected True Positives / (Actual True Positives + Actual False Positives)
Average Confidence Score: average Syntex confidence score based on the model explanations
Using statistical measures such as precision and accuracy, we can analyze the observed data from the Syntex models to derive meaningful output. Classification accuracy is the proportion of documents the model labelled correctly (true positives plus true negatives over all samples), while precision describes how many of the documents flagged as positive actually were positive (true positives over true positives plus false positives). Ideally, precision should be high (close to 1); a precision of 1 is achieved only when there are no false positives. The effectiveness of a model in making predictions can be evaluated using a confusion matrix, which accounts for the number of false positives and false negatives. The average confidence scores can also be used to assess the reliability of the model’s predictions. Overall, the statistical measures suggest that the model is effective in classifying documents.
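The measures discussed above reduce to simple arithmetic on the confusion-matrix counts. The sketch below uses an illustrative set of counts, not the actual project data:

```python
def accuracy(tp, tn, fp, fn):
    """(True Positives + True Negatives) / total samples."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """True Positives / (True Positives + False Positives);
    equals 1 only when there are no false positives."""
    return tp / (tp + fp)

# Hypothetical confusion-matrix counts for 100 test documents.
tp, tn, fp, fn = 95, 2, 2, 1
acc = accuracy(tp, tn, fp, fn)   # (95 + 2) / 100 = 0.97
prec = precision(tp, fp)         # 95 / 97 ≈ 0.979
```

With counts like these, a 97% accuracy and a precision near 0.98 fall out directly, which matches the shape of the figures reported above.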
The following items summarize the successful results experienced by the team during this project:
The trained models were able to successfully apply an FCT to documents stored in SharePoint Document Libraries.
Models with a templated structure were more easily classified and yielded more accurate FCT extraction.
Extraction leveraging the assignment of a default FCT value (Extraction Method B) was able to extract FCT values from documents without any templatized structure. This method does not depend on the explanations or the accuracy of the FCT extractor: when the extractor fails to find a value, the default value is assigned instead. It also does not rely on synonyms of the terms in the Term Store. Overall, Method B appears to be a solid method when there is a 1:1 mapping between the model (content type) and the extracted value (FCT).
Syntex reported a classification accuracy of 97%, confusion matrix of 97%, and precision of 0.98. This means that the model was effective in classifying the documents.
Model classifiers’ explanations are best kept simple unless stress testing reveals a need for refinement.
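The default-value fallback of Extraction Method B noted above amounts to a simple rule: because each model maps 1:1 to an FCT term, a default can be assigned whenever the extractor comes back empty. The mapping and function names below are hypothetical illustrations, not Syntex internals:

```python
# Hypothetical 1:1 default mapping from model (content type) to FCT term.
DEFAULT_FCT = {
    "Mineral Assessment Reports": "NR - Mineral Rights Administration",
}

def apply_fct(model_name, extracted_value):
    """Use the extractor's value if present; otherwise fall back
    to the model's default FCT (Extraction Method B)."""
    if extracted_value:
        return extracted_value
    return DEFAULT_FCT.get(model_name)

# The extractor found nothing in an unstructured document:
fct = apply_fct("Mineral Assessment Reports", None)
# fct == "NR - Mineral Rights Administration"
```

This is why Method B sidesteps the extractor’s training accuracy entirely: the classification alone determines the FCT whenever extraction fails.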
Cadence Solutions Opinion on the suitability of Syntex
As it relates to the initial question of whether Syntex can auto-apply an FCT classification onto a document based on its contents, we found that Syntex is able to do so successfully. Additionally, the reported FCT extraction results indicate that the approach of training the models was very successful.
FAQ from Project Team
The following questions were asked by the project team, with responses in-line:
What are the machine learning components of the Document Understanding Model to refer to it as Content AI?
Document understanding is based on an AI approach called machine teaching; Microsoft’s machine teaching overview describes it in more detail.
How are negative examples used in the Document Understanding Model?
Both positive and negative examples are used to train the model. For instance, if you are training a model to identify contracts, you may want to add service agreements as negative samples. These are documents that may look like contracts but are not contracts. The machine teaching overview above explains this in detail.
If one negative example is identified, would similar negative examples be identified without changing the Explanation?
Yes, the idea is to teach Syntex with negative samples. When evaluating documents, if Syntex encounters documents similar to the negative samples, they will not be treated as positive matches. The machine teaching overview above explains this as well.

How is the confidence score calculated?

The confidence score is driven by the explanations: phrase lists, regular expressions, and proximity. The better these are set up to accurately identify the unstructured data, the better the confidence score.
Is there a limit of how many Syntex models we can apply to a single Document Library? Are there any other limitations?
One library can have more than one model deployed and files will be evaluated against every model. Whichever model produces the highest confidence score will be assigned to the document. For now, there is no defined upper limit of how many models you can apply.
About Cadence Solutions
Jordan Uytterhagen founded Cadence Solutions after starting on the client side of the table. His mandate has been to help organizations struggling with digital transformation implement projects without losing their trust and confidence. Our solutions include automation of human resources, finance, accounts payable, contract management, document capture, drawing and records management, as well as managed services. Cadence Solutions has proven, time and again, that our clients’ projects will be successful because we are authentic, with unmatched experience.