[bug #49014] Zombies in parallel builds with pselect code

Robert Morell
URL:
  <http://savannah.gnu.org/bugs/?49014>

                 Summary: Zombies in parallel builds with pselect code
                 Project: make
            Submitted by: joerg
            Submitted on: Tue 06 Sep 2016 12:38:41 PM GMT
                Severity: 3 - Normal
              Item Group: Bug
                  Status: None
                 Privacy: Public
             Assigned to: None
             Open/Closed: Open
         Discussion Lock: Any
       Component Version: 4.2.1
        Operating System: POSIX-Based
           Fixed Release: None
           Triage Status: None

    _______________________________________________________

Details:

When using the new pselect based job server logic, I regularly see hanging
pkgsrc bulk builds on NetBSD 7.0, both i386 and amd64. Most common victims are
huge packages like libreoffice. The symptoms are gmake processes with exactly
one zombie child per concurrent job. Since I disabled pselect by overriding the
configure test, the problem has disappeared.




    _______________________________________________________

Reply to this item at:

  <http://savannah.gnu.org/bugs/?49014>

_______________________________________________
  Message sent via/by Savannah
  http://savannah.gnu.org/


_______________________________________________
Bug-make mailing list
[hidden email]
https://lists.gnu.org/mailman/listinfo/bug-make
Robert Morell
Follow-up Comment #1, bug #49014 (project make):

I've had a couple of hung makes in the last couple of months where I've
noticed a prominent number of zombies.  I see my from-git make from
2016-06-11, reporting itself as 4.2.1, has HAVE_PSELECT=1 in config.status and
imports that symbol from glibc.  The hang has always been deep in a big build,
I think after I've suspended and resumed the build with SIGSTOP and SIGCONT to
a large process group.  Neither time did I find anything tractable; I feared it
might be a kernel bug in:

Linux swiftboat 3.2.0-4-amd64 #1 SMP Debian 3.2.65-1+deb7u2 x86_64 GNU/Linux

Guess I should try running strace on the daunting number of make processes
next time and looking at the signal masking in /proc.  Sorry for the lame
me-too.

    _______________________________________________________

Robert Morell
Follow-up Comment #2, bug #49014 (project make):

Without pselect, I semi-reliably get errors like:

gmake[3]: *** duping jobs pipe: Bad file descriptor.  Stop.

Variants with ENOENT etc. exist. According to

https://secure.freshbsd.org/commit/freebsd-ports/r421562

this happens on FreeBSD with pselect as well.

    _______________________________________________________

Robert Morell
Follow-up Comment #3, bug #49014 (project make):

Happened to me again.  All the zombies were preceded in the process tree by a
make blocked on the read of job_fds[0] in the HAVE_PSELECT jobserver_acquire.
I had, once again, sent the whole process group SIGSTOP and SIGCONT at some
point.

    _______________________________________________________

Robert Morell
Follow-up Comment #4, bug #49014 (project make):

I took a look through the code.  I could do a more comprehensive review but
at first glance it looks like everything is OK: if PSELECT is enabled then we
block SIGCHLD right at the start of the make process and we never unblock it
except as a parameter to pselect().
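
A minimal sketch of that pattern, for illustration only. It is not make's
actual code: the helper name reap_child_via_pselect, the throwaway pipe, and
the 5-second safety timeout are all assumptions. SIGCHLD is blocked before the
child is created and is unblocked only through the mask handed to pselect(),
so the signal can be delivered only while the process is waiting.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static volatile sig_atomic_t got_sigchld = 0;

static void on_sigchld(int sig)
{
    (void)sig;
    got_sigchld = 1;
}

/* Hypothetical helper: returns 1 if pselect() was interrupted by SIGCHLD
   and the child was reaped, 0 otherwise, -1 on setup failure. */
int reap_child_via_pselect(void)
{
    struct sigaction sa = {0};
    sigset_t block, orig, during_wait;
    struct timespec timeout;
    fd_set readfds;
    int fds[2], r, saved_errno, result = 0;
    pid_t pid;

    got_sigchld = 0;
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGCHLD, &sa, NULL) < 0)
        return -1;

    /* Block SIGCHLD before any child exists. */
    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    if (sigprocmask(SIG_BLOCK, &block, &orig) < 0)
        return -1;

    /* Mask used only while waiting: the original mask minus SIGCHLD. */
    during_wait = orig;
    sigdelset(&during_wait, SIGCHLD);

    if (pipe(fds) == 0) {
        pid = fork();
        if (pid == 0)
            _exit(0);                 /* child exits at once */
        if (pid > 0) {
            /* Nothing is ever written to the pipe, so only the signal
               (or the safety timeout) can end this wait. */
            FD_ZERO(&readfds);
            FD_SET(fds[0], &readfds);
            timeout.tv_sec = 5;
            timeout.tv_nsec = 0;
            r = pselect(fds[0] + 1, &readfds, NULL, NULL,
                        &timeout, &during_wait);
            saved_errno = errno;

            /* Reap the child whatever happened above. */
            if (waitpid(pid, NULL, 0) == pid
                && r < 0 && saved_errno == EINTR && got_sigchld)
                result = 1;
        } else {
            result = -1;
        }
        close(fds[0]);
        close(fds[1]);
    } else {
        result = -1;
    }

    sigprocmask(SIG_SETMASK, &orig, NULL);
    return result;
}
```

Calling reap_child_via_pselect() from a small main() should return 1 on a
system whose pselect() applies the mask atomically.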

However, I think that pselect() in the *BSD systems is not POSIX-compliant:
according to the FreeBSD 5.0 man pages:

_The pselect() function is implemented in the C library as a wrapper around
select()._

It is not possible to correctly implement pselect() in the C library using
select(); it MUST have kernel system call support.  Otherwise there is a very
real race condition that cannot be closed.  I'm not sure that there's any way
to detect pselect() without kernel syscall support from autoconf, so the best
I can do (assuming that my diagnosis is correct) would be to detect the target
platform and disable pselect() support on BSD targets.  If you elect to
discuss this issue on any BSD mailing lists please CC me.

Regarding the Linux issue with SIGSTOP and SIGCONT, I can't explain that.  I
hate to suggest a kernel bug because 99.9% of the time you end up with egg on
your face, but I do wonder if there's an issue with SIGSTOP/SIGCONT and
pselect() where it's losing a signal somewhere.  Also I suppose Linux 3.2 is
pretty old at this point; perhaps someone on the kernel lists might have a
thought.

    _______________________________________________________

Robert Morell
Follow-up Comment #5, bug #49014 (project make):

(1) I'm pretty sure that pselect can be implemented correctly in userland.
(2) It doesn't matter as the original bug report was from NetBSD, where
pselect is a system call.

    _______________________________________________________

Robert Morell
Follow-up Comment #6, bug #49014 (project make):

It can't be implemented in userland.  What pselect() does is unblock the
signal, call select, then have the signal blocked again on return.  The point
of using pselect() is that the signal unblock/block must be atomic with the
select() system call.  Otherwise between the instant you unblock the signal
and invoke select(), the signal could arrive and you miss it.  Ditto for after
select() returns and before the signal is blocked again.  There's no way to
avoid that from userland.  The Linux man page for pselect has a good
explanation, as does this LWN article: https://lwn.net/Articles/176911/
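
The window described above can be made visible with a small sketch. It is
only an illustration under stated assumptions: racy_pselect and
wait_was_interrupted are hypothetical names, the signal is simulated with
raise() while blocked (standing in for a child that exits just before the
unblock), and the wait uses a 200 ms timeout with no file descriptors. In the
emulation the handler runs as soon as the signal is unblocked, before select()
is entered, so select() sleeps through its whole timeout. The real pselect()
is interrupted instead.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <time.h>

static volatile sig_atomic_t handler_ran = 0;

static void note_signal(int sig)
{
    (void)sig;
    handler_ran = 1;
}

/* Deliberately racy emulation: three separate steps instead of one
   atomic system call.  Waits on no descriptors, timeout only. */
static int racy_pselect(const struct timespec *ts, const sigset_t *mask)
{
    struct timeval tv;
    sigset_t saved;
    int r, e;

    tv.tv_sec = ts->tv_sec;
    tv.tv_usec = ts->tv_nsec / 1000;
    sigprocmask(SIG_SETMASK, mask, &saved); /* pending signal handled here */
    r = select(0, NULL, NULL, NULL, &tv);   /* so this call never sees it */
    e = errno;
    sigprocmask(SIG_SETMASK, &saved, NULL);
    errno = e;
    return r;
}

/* Returns 1 if the wait itself was interrupted (EINTR), 0 if it slept
   through the timeout, -1 on setup failure. */
int wait_was_interrupted(int use_real_pselect)
{
    struct sigaction sa = {0}, old_sa;
    sigset_t block, orig, during_wait;
    struct timespec ts;
    int r, e;

    handler_ran = 0;
    sa.sa_handler = note_signal;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGCHLD, &sa, &old_sa) < 0)
        return -1;

    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    if (sigprocmask(SIG_BLOCK, &block, &orig) < 0)
        return -1;
    during_wait = orig;
    sigdelset(&during_wait, SIGCHLD);

    raise(SIGCHLD);               /* blocked, so it stays pending */

    ts.tv_sec = 0;
    ts.tv_nsec = 200 * 1000 * 1000;
    r = use_real_pselect
        ? pselect(0, NULL, NULL, NULL, &ts, &during_wait)
        : racy_pselect(&ts, &during_wait);
    e = errno;

    sigprocmask(SIG_SETMASK, &orig, NULL);
    sigaction(SIGCHLD, &old_sa, NULL);
    if (!handler_ran)
        return -1;
    return (r < 0 && e == EINTR) ? 1 : 0;
}
```

With a kernel-backed pselect(), wait_was_interrupted(1) should return 1 while
wait_was_interrupted(0) returns 0: the handler runs in both cases, but only
the atomic call is woken by it.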

I looked at the NetBSD man page but it didn't say clearly one way or the other
whether it used a system call or not.  If you've checked and verified it uses
a system call that's good.

I have to say I still don't see what the bug could be.  Stripping out all the
ifdefs etc. the handling of SIGCHLD is very simple when HAVE_PSELECT is set so
it's hard to know what might be wrong.  I think we may have to involve the
NetBSD devs to talk about how pselect() works there.

    _______________________________________________________

Robert Morell
Follow-up Comment #7, bug #49014 (project make):

It's a red herring. FreeBSD has been providing a system call since at least
version 8.1 and 9.0. A correct userland implementation can be done on top of
other primitives like ppoll or kqueue, but that's not relevant here.

The common element for all the reports so far is that the file descriptor for
the jobs pipe got closed and something else is reusing the FD. This is much
more noticeable when pselect is disabled, it seems.

    _______________________________________________________

Robert Morell
Follow-up Comment #8, bug #49014 (project make):

> Also I suppose Linux 3.2 is pretty old at this point

I'm now on:

Linux swiftboat 3.16.0-4-amd64 #1 SMP Debian 3.16.39-1 (2016-12-30) x86_64
GNU/Linux

... and still seeing the same behavior as I reported in comment #3, several
times now.  That too is no spring chicken, it's true.  I wonder if it would be
worth my while trying to reproduce it on something with fewer feet in the
grave:

Linux balance 4.9.0-1-amd64 #1 SMP Debian 4.9.2-2 (2017-01-12) x86_64
GNU/Linux

    _______________________________________________________

Robert Morell
Follow-up Comment #9, bug #49014 (project make):

This issue also affects Make 4.2.1 on Linux 4.11.0-gentoo. I had a parallel
build of Firefox stall in a make process with two unreaped children. Manually
sending a SIGCHLD to the zombies' parent (make) process got everything moving
again.

    _______________________________________________________

Robert Morell
Follow-up Comment #10, bug #49014 (project make):

Matt's workaround worked for my latest recurrence.  I wonder if it gives us a
clue as to the cause?  I hadn't thought to try it and it's helpful not to have
to kill the build, so thanks.

    _______________________________________________________

Robert Morell
Follow-up Comment #11, bug #49014 (project make):

SIGCHLD doesn't always unstick the make process, though. It worked for me when
I had a make with two zombie children, but it didn't work on another make that
had only one zombie child.

I think I'll try comment #0's suggestion of overriding the HAVE_PSELECT
autoconf test so I can go back to the tried and true job control code.

    _______________________________________________________

Robert Morell
Follow-up Comment #12, bug #49014 (project make):

A fix for bug #51159 has been pushed to the Git repository.

I have a strong suspicion that this is the same problem we're seeing here and
that fix will also fix this issue.

Knowing the problem I thought I might be able to conjure a scenario to force a
repro case for this bug on my system, and indeed I could easily construct
scenarios where the read token was stolen, but I failed to reproduce this bug
(make always made progress eventually and completed; there were no outstanding
zombies) so I can't be sure.
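
The "read token was stolen" scenario can be sketched as follows. This is an
illustration under assumptions, not make's code: it assumes the jobs pipe
carries one-byte tokens, simulates the competing process with an extra read()
in the same process, and marks the read end O_NONBLOCK so the demo cannot
hang. Whether commit b552b0525198 takes that approach is not stated in this
thread. stolen_token_demo is a hypothetical name.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Returns 1 if the final read() reports "would block" instead of
   hanging, 0 if it behaved otherwise, -1 on setup failure. */
int stolen_token_demo(void)
{
    int fds[2], flags, ready, result = -1;
    char token = '+', buf;
    fd_set readfds;
    struct timespec no_wait;
    ssize_t n;

    if (pipe(fds) < 0)
        return -1;

    /* Assumption: non-blocking read end, so an empty pipe yields EAGAIN. */
    flags = fcntl(fds[0], F_GETFL);
    if (flags >= 0
        && fcntl(fds[0], F_SETFL, flags | O_NONBLOCK) == 0
        && write(fds[1], &token, 1) == 1) {   /* one token available */

        /* Step 1: the wait reports that a token can be read. */
        FD_ZERO(&readfds);
        FD_SET(fds[0], &readfds);
        no_wait.tv_sec = 0;
        no_wait.tv_nsec = 0;
        ready = pselect(fds[0] + 1, &readfds, NULL, NULL, &no_wait, NULL);

        /* Step 2: a competitor takes the token first (simulated here). */
        if (ready == 1 && read(fds[0], &buf, 1) == 1) {
            /* Step 3: our own read now finds the pipe empty.  On a
               blocking descriptor this call would not return. */
            n = read(fds[0], &buf, 1);
            result = (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
                     ? 1 : 0;
        }
    }

    close(fds[0]);
    close(fds[1]);
    return result;
}
```

A process stuck in a blocking read() at step 3 while SIGCHLD is blocked would
never reap its children, which would match the zombies reported in this bug;
that link is the suspicion voiced above, not something this sketch proves.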

However, if people who do see this can apply the change in Git SHA
b552b0525198 to their versions and retry with pselect() enabled I would
appreciate it.

    _______________________________________________________

Robert Morell
Follow-up Comment #13, bug #49014 (project make):

I've applied b552b0525198 to Make 4.2.1 on my Gentoo system (and have
re-enabled HAVE_PSELECT). I will report back if I experience another
deadlock.

Thanks for the (probable) fix!

    _______________________________________________________

Robert Morell
Follow-up Comment #14, bug #49014 (project make):

Any updates on this?  Have you seen the pselect() zombie issue re-appear after
applying the patch?

Thanks for testing!

    _______________________________________________________

Robert Morell
Follow-up Comment #15, bug #49014 (project make):

I've been running with the git head as of your 2017-06-04 request.  The
problem hasn't recurred for me.  I don't think that means as much as it would
have a few months ago because I've had less cause to run my

    pgid=$(find-pid-of-root-of-big-recursive-make-job)
    while pid-still-going $pgid; do
        kill -STOP -$pgid; sleep 1
        kill -CONT -$pgid; sleep 1
    done

script recently.  Still, looks hopeful.

    _______________________________________________________

Robert Morell
Follow-up Comment #16, bug #49014 (project make):

I have built several large software suites (including KDE Frameworks and
Plasma Workspaces) since applying this patch, and I have not seen the zombie
problem recur.

    _______________________________________________________

Robert Morell
Update of bug #49014 (project make):

                  Status:                    None => Duplicate              
             Open/Closed:                    Open => Closed                

    _______________________________________________________

Follow-up Comment #17:

I'm going to close this as resolved by the fix for bug #51159.  If this is
seen again please add a comment.

Thanks for all the testing and investigation!
