Had a doozy of an issue the other day.  A SharePoint farm that had been chugging along with no changes suddenly started having some weird issues.  Users could open, view, and edit documents, but as soon as they attempted to save a document or upload a new one, things started to go bad.  If they were using Windows Explorer, they received the error: “The specified network name is no longer available”

[Screenshot: Windows Explorer error]

If they were using the GUI, the upload form hung for a while and eventually reverted to “The Page Cannot be Displayed”

At the same time, we were getting reports of users in other areas of the farm seeing very slow response times within SharePoint.  What was really confusing was that the issue was happening in just a single site collection in the farm.

Errors Received

Windows Event Log

We were receiving a number of errors besides those at the end-user level.  The server event log indicated our app pool was crashing.  The entry was actually logged as a warning (to me, if an app pool is crashing, it should be an error) with the message:

A process serving application pool ‘SharePoint Web Apps’ suffered a fatal communication error with the Windows Process Activation Service. The process id was ‘6292’. The data field contains the error number.

[Screenshot: event log warning]

In the multiple-WFE environment, the crash was bouncing back and forth between the two servers, which showed the load balancer was doing its job.  It also explained why people were seeing slow responses.  Each time the app pool failed, it had to restart and then reload the SharePoint environment (like you see after an IIS reset).
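If you want to see how often the app pool is failing on a given WFE, a minimal sketch along these lines can pull the matching warnings out of the System log. It assumes the warnings are logged by the `Microsoft-Windows-WAS` provider (the usual source for this message on IIS 7 and later), and the 24-hour window is just an arbitrary choice for illustration. Run it on each WFE in turn.

```powershell
# Sketch only: list recent WAS warnings about worker process communication failures.
# Level 3 = Warning. The 24-hour window is an assumption; adjust to your incident.
$filter = @{
    LogName      = 'System'
    ProviderName = 'Microsoft-Windows-WAS'
    Level        = 3
    StartTime    = (Get-Date).AddHours(-24)
}

# -ErrorAction SilentlyContinue keeps the script quiet when no events match.
Get-WinEvent -FilterHashtable $filter -ErrorAction SilentlyContinue |
    Where-Object { $_.Message -like '*fatal communication error*' } |
    Sort-Object TimeCreated |
    Select-Object TimeCreated, MachineName, Id, Message |
    Format-List
```

Lining up the timestamps from both servers makes the back-and-forth pattern easy to see.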

ULS Logs

The ULS logs were something else.  In this particular environment our logs usually range from 5 MB to 40 MB in size for a 30-minute period.  When I ran a one-minute log export using “Merge-SPLogFile”, the exported file was 1.3 GB.  Nothing screamed error at me; however, a couple of things stood out.

06/21/2017 10:30:38.00        w3wp.exe (0x112C)        0x1E58        SharePoint Foundation        Performance        naqx        Monitorable        Potentially excessive number of SPRequest objects (16) currently unreleased on thread 46.  Ensure that this object or its parent (such as an SPWeb or SPSite) is being properly disposed. This object is holding on to a separate native heap. Allocation Id for this object: {C3DC973B-90B4-4974-A33D-A5A05A722DF7} Stack trace of current allocation:    at Microsoft.SharePoint.SPGlobal.CreateSPRequestAndSetIdentity(SPSite site, String name, Boolean bNotGlobalAdminCode, String strUrl, Boolean bNotAddToContext, Byte[] UserToken, SPAppPrincipalToken appPrincipalToken, String userName, Boolean bIgnoreTokenTimeout, Boolean bAsAnonymous)     at Microsoft.SharePoint.SPWeb.InitializeSPRequest()     at Microsoft.SharePoint.SPWeb.EnsureSPRequest()     at Microsoft.SharePoint.SPSite.OpenWeb(String strUrl, Int32 mondoHint)     at Microsoft.SharePoint.SPSite.OpenWeb(Guid gWebId, Int32 mondoHint)     at Microsoft.SharePoint.SPSite.OpenWeb(Guid gWebId)….

06/21/2017 10:31:21.09        w3wp.exe (0x112C)        0x1E58        SharePoint Foundation        General        8m90        Medium        1045 heaps created, above warning threshold of 128. Check for excessive SPWeb or SPSite usage.        a8dafd9d-9faa-70d5-b0e7-8c1711386713

So this screamed of some custom code (which we do have) running that is not disposing of the SPSite or SPWeb objects properly.  Why it suddenly became a problem, I don’t know.  We did have security patches applied to the servers over the weekend.  I didn’t think they were a likely cause, as the environment had been used for a day and a half afterwards with no issues.  We backed out the patches anyway, but that had no effect on the issue.  What was also confusing was that the same issue was occurring in our Pre-Prod environment.  The silver lining was that I could now do some real troubleshooting without affecting working sites or production data.

I finally tracked down the issue to an event receiver we have running in our environment.  The project sites all share the same structure, and it was decided that code would be used to enforce this structure.  To that end, event receivers were built to ensure that folders at certain levels (library root, root + 1 level, and root + 2 levels) could not be deleted and that no files or folders could be added at those levels.  I took a guess that these event receivers were causing the issues.  Using PowerShell, I removed the event receivers from an affected library.  In case you need this for something else, the code to remove a list event receiver is:
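What follows is a minimal sketch of that kind of removal script, reconstructed under assumptions rather than copied from the original listing. The site collection URL, library title, and receiver class name are all hypothetical placeholders. It walks every subsite of one site collection and deletes any list event receiver whose `Class` matches.

```powershell
# Sketch only: remove list event receivers by class name across all subsites.
# Run from the SharePoint Management Shell, or load the snap-in as below.
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# Hypothetical values; replace with your own.
$siteUrl       = "http://sharepoint.example.com/sites/projects"
$libraryTitles = @("Documents")
$receiverClass = "Contoso.Projects.FolderStructureReceiver"

$site = Get-SPSite $siteUrl
try {
    foreach ($web in $site.AllWebs) {
        try {
            foreach ($title in $libraryTitles) {
                # TryGetList returns $null instead of throwing when the list is absent.
                $list = $web.Lists.TryGetList($title)
                if ($null -eq $list) { continue }

                # Snapshot the matches first: deleting while enumerating
                # the live EventReceivers collection would fail.
                $toRemove = @($list.EventReceivers |
                    Where-Object { $_.Class -eq $receiverClass })

                foreach ($receiver in $toRemove) {
                    Write-Host "Removing $($receiver.Type) from $($web.Url) / $title"
                    $receiver.Delete()
                }
            }
        }
        finally {
            # Each SPWeb handed out by AllWebs must be disposed explicitly.
            $web.Dispose()
        }
    }
}
finally {
    $site.Dispose()
}
```

Try it against a single test library in a non-production environment first, since the deletion cannot be undone without re-attaching the receivers.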

In the above code (which removes the event receivers from ALL specified libraries in ALL subsites), I used the event receiver class to find the items I wanted to remove.  You can also use .Name and .Assembly if you wish.  I used Class simply because no names were given when the sites were created and the receivers attached.  With the event receivers removed, users were able to upload and save documents again.  So I had indeed found the culprit.  Now to determine why.
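If you are not sure which property to match on, it can help to list what is attached to a library first. This is a small sketch, assuming a hypothetical subsite URL and library title, that prints the `Type`, `Name`, `Class`, and `Assembly` of each receiver.

```powershell
# Sketch only: inspect the event receivers attached to one library.
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# Hypothetical subsite URL and library title.
$web = Get-SPWeb "http://sharepoint.example.com/sites/projects/alpha"
try {
    $list = $web.Lists.TryGetList("Documents")
    if ($null -ne $list) {
        $list.EventReceivers |
            Select-Object Type, Name, Class, Assembly |
            Format-Table -AutoSize
    }
}
finally {
    $web.Dispose()
}
```

An empty Name column here is what you would expect when receivers were attached without names, which is why matching on Class is the safer choice in that case.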

I’ll cover the review of the code and the final determination of the cause of the issue in Part 2 of this series.

 

Thanks for reading!