[bug #58979] Recursive make using jobserver hangs at completion

[bug #58979] Recursive make using jobserver hangs at completion

David Boyce-5
URL:
  <https://savannah.gnu.org/bugs/?58979>

                 Summary: Recursive make using jobserver hangs at completion
                 Project: make
            Submitted by: None
            Submitted on: Tue 18 Aug 2020 10:07:20 PM UTC
                Severity: 3 - Normal
              Item Group: Bug
                  Status: None
                 Privacy: Public
             Assigned to: None
             Open/Closed: Open
         Discussion Lock: Any
       Component Version: 4.3
        Operating System: POSIX-Based
           Fixed Release: None
           Triage Status: None

    _______________________________________________________

Details:

Hello,

I updated from make 3.81 to 4.3 and started getting a hang when make is
exiting after completion.  It appears to be blocked on a read (jobserver pipe,
perhaps??).  strace shows that the process is blocked on a read of fd 4.

Any help would be greatly appreciated.

-Dave Hefner

Here's a snippet from the file I attached:

Using host libthread_db library "/lib64/libthread_db.so.1".
0x0000003834cdb7f0 in __read_nocancel () from /lib64/libc.so.6
(gdb) bt
#0  0x0000003834cdb7f0 in __read_nocancel () from /lib64/libc.so.6
#1  0x00000000004275e5 in jobserver_acquire_all () at src/posixos.c:207
#2  0x0000000000418e51 in clean_jobserver (status=0) at src/main.c:3436
#3  clean_jobserver (status=0) at src/main.c:3411
#4  0x0000000000418f2b in die (status=status@entry=0) at src/main.c:3480
#5  0x0000000000408e37 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at src/main.c:2613
(gdb) info threads
  Id   Target Id                                Frame
* 1    Thread 0x7fe7571c9700 (LWP 10262) "make" 0x0000003834cdb7f0 in __read_nocancel () from /lib64/libc.so.6


Here are my platform details:

[root@spiderman-00 L1]# make --version
GNU Make 4.3
Built for x86_64-pc-linux-gnu
Copyright (C) 1988-2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
[root@spiderman-00 L1]# uname -a
Linux spiderman-00 2.6.32-573.el6.x86_64 #1 SMP Thu Jul 23 15:44:03 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
[root@spiderman-00 L1]# cat /etc/redhat-release
CentOS release 6.7 (Final)
[root@spiderman-00 L1]#




    _______________________________________________________

File Attachments:


-------------------------------------------------------
Date: Tue 18 Aug 2020 10:07:20 PM UTC  Name: makeHang.txt  Size: 4KiB   By:
None
Debugger output and other details
<http://savannah.gnu.org/bugs/download.php?file_id=49687>

    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?58979>

_______________________________________________
  Message sent via Savannah
  https://savannah.gnu.org/



David Boyce-5
Follow-up Comment #1, bug #58979 (project make):

Hm.  Linux 2.6 is very old and I assume your libc is similarly old.  As an
experiment I suggest re-running make's configure with the flag
--disable-posix-spawn and rebuilding make, then seeing if that version of make
works better.

Otherwise it will be tricky to debug this: clearly the top-level makefile
thinks that there are more outstanding jobs to be run, but they are
(presumably) all dead which means something took a job token and didn't return
it before exiting.

Does this happen always, or is it intermittent?

    _______________________________________________________


David Boyce-5
Follow-up Comment #2, bug #58979 (project make):

I've run it several times.  It failed every time.

I configured with --disable-posix-spawn before I reported the issue.  All of
the data I attached was generated with it disabled.

If it would help, I suppose that I can revert to 3.81 and see what happens.

Thanks for the help!

-Dave Hefner

    _______________________________________________________


David Boyce-5
Follow-up Comment #3, bug #58979 (project make):

David, can you please attach the makefiles which reproduce the issue?

    _______________________________________________________


David Boyce-5
Follow-up Comment #4, bug #58979 (project make):

Well, that's very good news (that it fails every time), in terms of tracking
it down.

The fact that it fails every time for you and doesn't fail for anyone else (I
should mention that at $DAYJOB I build software of significant size on CentOS
6.5, 6.6, 6.7 systems (as well as many other versions) with parallelism
enabled with no problems) leads me to suspect something specific to your
environment.

The first thing you should check for is any make warnings printed about the
jobserver during your build.  If you see any, they can be an important clue.

The way that the jobserver works is that it uses pipes shared between
recursive invocations.  These pipes are intended to be used only by GNU make.
However, if some other process happens to read data from these pipes, or write
data to these pipes, then it will corrupt the jobserver's idea of how many
outstanding jobs exist.
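
To make the token accounting concrete, here is a minimal sketch in Python of a
pipe-based token pool. It is a toy model, not GNU make's actual code: the class
and method names are hypothetical, and it assumes that a -jN build preloads
N-1 one-byte tokens because the top-level make keeps one job slot for itself.
It shows how a single token consumed by a process that never writes it back
leaves the pool short, which is the condition under which a final "acquire all
tokens" read would block.

```python
import os


class JobServer:
    """Toy model of a pipe-based token pool (not GNU make's real code)."""

    def __init__(self, jobs):
        # Assumption: N-1 tokens are preloaded; the parent owns one slot.
        self.tokens = jobs - 1
        self.rfd, self.wfd = os.pipe()
        os.write(self.wfd, b"+" * self.tokens)

    def acquire(self):
        """Take one token; blocks while the pipe is empty."""
        return os.read(self.rfd, 1)

    def release(self, token=b"+"):
        """Return one token to the pool."""
        os.write(self.wfd, token)

    def count_available(self):
        """Count the tokens in the pipe without blocking, then restore them."""
        os.set_blocking(self.rfd, False)
        taken = 0
        try:
            while True:
                try:
                    if not os.read(self.rfd, 1):
                        break
                except BlockingIOError:
                    break  # pipe is empty
                taken += 1
        finally:
            os.set_blocking(self.rfd, True)
        os.write(self.wfd, b"+" * taken)
        return taken


js = JobServer(jobs=6)

# A well-behaved job: takes a token, then gives it back.
js.acquire()
js.release()
balanced = js.count_available()

# A stray process reads a token and exits without returning it.
js.acquire()
missing = js.count_available()

print(f"expected {js.tokens}, balanced run has {balanced}, "
      f"after stray read {missing}")
```

In the second case a blocking loop that tries to read back all js.tokens bytes
would wait forever on the last one, much like the read in the backtrace from
the original report.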

GNU make will close these pipes for any process it starts which it does not
think is a make process.  It uses the standard method for deciding this:
either the variable $(MAKE) (or ${MAKE}) is referenced in the recipe, or the
recipe is prefixed with a "+".

This is the same algorithm that "make -n" uses to decide what to run and what
to not run.
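
As a rough illustration of that rule, the following Python sketch classifies
recipe lines. It is a simplification under stated assumptions, not make's real
parser: the function name is hypothetical, it looks only at the literal text
of one recipe line, and it ignores variable expansion (a variable that expands
to something containing $(MAKE) is not detected here).

```python
def is_treated_as_recursive(recipe_line):
    """Return True if the line would keep the jobserver and run under -n.

    Simplified rule from the discussion: the line either has a "+" among
    its leading prefix characters or mentions $(MAKE) / ${MAKE}.
    """
    stripped = recipe_line.lstrip()

    # Recipe prefix characters (@, -, +) may appear in any order before
    # the command itself, optionally separated by whitespace.
    end = 0
    while end < len(stripped) and stripped[end] in "@-+ \t":
        end += 1
    prefix = stripped[:end]

    return "+" in prefix or "$(MAKE)" in stripped or "${MAKE}" in stripped


for line in ["\t$(MAKE) -C sub all",
             "\tcd sub && ${MAKE} all",
             "\t+./generate.sh",
             "\t@-+touch stamp",
             "\tcc -c a.c -o a.o",
             "\t@echo a + b"]:
    print(repr(line), is_treated_as_recursive(line))
```

Any line for which this returns True is a candidate for closer inspection,
because the command it starts inherits the jobserver pipe.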

So... one option would be to run "make -n" and see if you still get the hang
behavior.  Hopefully you will!  If so, then you should examine carefully the
programs that are invoked during "make -n" (the ones actually invoked, not
just printed out!) and see if you can detect any of them doing weird things
with file descriptors that they don't own or which were passed to them.

Examine your recipes and look for ones that use the "+" prefix, and try
removing that from those recipes and see if it helps.

Examine recipes that invoke sub-makes using $(MAKE) or ${MAKE} and see if
those variables appear as part of more complex recipe lines that might do
something fancy with file descriptors; see if you can move the sub-make
invocation onto a separate line in the recipe, so that only the actual
sub-make is invoked under "make -n" and nothing else.

    _______________________________________________________


David Boyce-5
Follow-up Comment #5, bug #58979 (project make):

I will work on gathering the make files.  It will take time that I don't have
right now.

I do have some more data about this problem.  I run this build system on many
different machines.  The machines have varying numbers of cores, processor
speeds, etc.

On some machines, make failures are rare or non-existent.  The machine that
consistently fails is somewhat unique, in that it has 6 cores.  It may be
total coincidence.  But I wonder if there's a race condition somewhere.  Some
race conditions are sensitive to processor timing, system load, etc.

I also tried running the build with no -j argument.  It takes a very long
time, but it succeeds on the machine that consistently fails with -j6.  As I
mentioned, we were not seeing a problem running 3.81.

Cheers,
Dave Hefner

processor       : 5
vendor_id       : GenuineIntel
cpu family      : 6
model           : 63
model name      : Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz
stepping        : 2
microcode       : 43
cpu MHz         : 1200.000
cache size      : 15360 KB
physical id     : 0
siblings        : 6
core id         : 5
cpu cores       : 6
apicid          : 10
initial apicid  : 10
fpu             : yes
fpu_exception   : yes
cpuid level     : 15
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology
nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2
ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt
tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm arat epb xsaveopt pln
pts dts tpr_shadow vnmi flexpriority ept vpid fsgsbase bmi1 avx2 smep bmi2
erms invpcid
bogomips        : 3795.59
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:


    _______________________________________________________


David Boyce-5
Follow-up Comment #6, bug #58979 (project make):

I definitely am not saying there's not a bug in GNU make.  However, note that
traditional race conditions aren't common for GNU make because (a) it's
single-threaded and (b) the interaction between processes is mediated by
writing/reading a single byte to a pipe, which is managed by the kernel (which
presumably does not have race conditions!)

The only possible area where races can occur (barring a bug in the kernel) is
during signal handling.  This could be happening, as this is not a trivial
area, but just to note that I personally use GNU make 4.3 with varying levels
of -j on Linux systems of all different versions and hardware of all types, on
large builds MANY times a day, and I've not seen these issues.  So a bug in
GNU make may exist but it must be extremely difficult to hit.

Just to be up-front, I personally will likely not have time to reconstruct and
debug a complex makefile environment (particularly if the problem is very
intermittent/doesn't happen for me).  Perhaps Dmitry will have more time for
this.

If you do have a repro case it may be more feasible for us to suggest ways for
you to debug the environment that fails.  I've already suggested one way
forward, using -n.  I'd be interested to know if that had any results.

Other options likely involve modifying the GNU make code.  For example we
could point you to the places in the code where we read a byte from the
jobserver pipe and write a byte to the jobserver pipe.  If you print a debug
note every time that happens we may discover something interesting.
Particularly if we see a matched set of read/write but there are still bytes
missing, then it's clear something else is reading the pipe that shouldn't be.
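
If such debug notes were added, the matching could be checked mechanically.
The sketch below assumes a purely hypothetical log format, one line per event
of the form "<pid> acquire" or "<pid> release"; GNU make does not emit this
format itself, so the instrumentation would have to be written to produce it.
The function name is likewise made up for illustration.

```python
from collections import Counter


def unbalanced_makes(log_lines):
    """Return {pid: net tokens held} for pids whose reads and writes differ.

    Hypothetical log format: "<pid> acquire" or "<pid> release" per line.
    Lines that do not match are ignored.
    """
    balance = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) != 2:
            continue
        pid, event = parts
        if event == "acquire":
            balance[pid] += 1
        elif event == "release":
            balance[pid] -= 1
    return {pid: held for pid, held in balance.items() if held != 0}


sample = [
    "100 acquire",
    "100 release",
    "101 acquire",
    "102 acquire",
    "102 release",
]
print(unbalanced_makes(sample))
```

An empty result combined with a pipe that is still short of tokens would point
at a non-make process reading the pipe; a non-empty result would point at the
listed make processes instead.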

    _______________________________________________________


David Boyce-5
Follow-up Comment #7, bug #58979 (project make):


Thanks!

I've attached my makefiles.  I have some additional data:

1. Per your suggestion, I ran with '-n.'  It revealed nothing interesting.

2. I captured the output from the entire build process and scanned it for make
warnings/errors.  There were none.  I intentionally whacked a file and
verified that I got the "Waiting for unfinished..." message.

3. I modified my automation scripts to run without '-j', then let it run on a
few dozen machines of varying distro, architecture, core counts, virtual
machines, etc.  I got no make failures on any machine.  (Unfortunately, I
can't really perform the inverse test.  This is an automated test lab, and I
can't bring it down, at least intentionally.  :-)

I like your idea of pointing me at a place where I can insert a test/message
in the code.  I haven't looked at the code at all.  But if there's a way to
build it with debug support, or instrumentation, I can easily do that.

Thanks for your help!

-Dave Hefner


(file #49695)
    _______________________________________________________

Additional Item Attachment:

File name: makefiles.tar.gz               Size:10 KB
    <https://file.savannah.gnu.org/file/makefiles.tar.gz?file_id=49695>



    _______________________________________________________


David Boyce-5
Follow-up Comment #8, bug #58979 (project make):

> I've attached my makefiles.

I guess a clarification is needed.

The attached makefiles are a part of a bigger system. The other part is
missing. It is not possible to reproduce the issue with the attached makefiles
for anyone who is missing the other part.

Can you write the smallest possible makefile (or a set of makefiles), such
that even people who don't have your full build environment can run to
reproduce the issue?


> Per your suggestion, I ran with '-n.'  It revealed nothing interesting.

Did it hang or not?

    _______________________________________________________


David Boyce-5
Follow-up Comment #9, bug #58979 (project make):


[comment #8:]
> > I've attached my makefiles.
>
> i guess, a clarification is needed.
>
> The attached makefiles are a part of a bigger system. The other part is
> missing. It is not possible to reproduce the issue with the attached makefiles
> for anyone who is missing the other part.
>
> Can you write the smallest possible makefile (or a set of makefiles), such
> that even people who don't have your full build environment can run to
> reproduce the issue?

I will try to do that.

> > Per your suggestion, I ran with '-n.'  It revealed nothing interesting.
>
> Did it hang or not?

No. It did not hang.

I designed this build system many years ago.  I believe we were running 3.80
at the time.  I confess that I have spent very little/no time tracking changes
to GNU make.  Mea culpa.  I have adjusted for some of the breaking changes,
but I have never put '+' on any recipes.  When is that needed?



    _______________________________________________________


David Boyce-5
Follow-up Comment #10, bug #58979 (project make):

> No. It did not hang.

-n causes make to run only recursive commands, and the hang does not reproduce
with -n.  That makes us suspect the other (non-recursive) commands.

There are at least the following debugging options.

Remove recipes one by one until the hang is gone.

or

1. Add logging to jobserver_setup to print the pipe fd.
2. Add a sleep in makefile at the very beginning to give you time to run
auditctl.
3. Run make and see which fds are allocated for the pipe.
4. Run auditctl to see all processes which open your pipe, write to your pipe,
read from your pipe.

or

1. Add logging to jobserver_setup to print the pipe fd.
2. Add a sleep in makefile at the very beginning.
3. Run make and see which pid it has.
4. Look up the pipe id in /proc/<make pid>/fd/<pipefd>. It'll look like
$ ls -l /proc/92678/fd/5
lr-x------ 1 dgoncharov who 64 Aug 25 17:46 /proc/92678/fd/5 -> pipe:[97436149]
Note the pipe id, in this case 97436149.
5. Run lsof | grep <pipe id> repeatedly in a loop and redirect the output to a
file. It'll look like
$ lsof | grep -- '97436149 pipe'
make       92678     dgoncharov    5r     FIFO                0,8       0t0 97436149 pipe
sleep      92679     dgoncharov    1w     FIFO                0,8       0t0 97436149 pipe
sleep      92679     dgoncharov    5r     FIFO                0,8       0t0 97436149 pipe
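
The /proc lookup in steps 4 and 5 can also be scripted. The following Python
sketch is one possible way to do it on Linux, offered as an alternative to the
lsof loop rather than as part of the suggestion above; the function names are
hypothetical. It reads the same /proc/<pid>/fd symlinks that ls -l shows and
reports every (pid, fd) pair that currently points at a given pipe id.
Processes you lack permission to inspect are silently skipped, so run it as
the same user as the build or as root.

```python
import os
import re

# Symlink targets for pipes look like "pipe:[97436149]".
PIPE_LINK = re.compile(r"^pipe:\[(\d+)\]$")


def parse_pipe_id(link_target):
    """Extract the numeric pipe id from a /proc fd symlink target."""
    match = PIPE_LINK.match(link_target)
    return int(match.group(1)) if match else None


def pipe_id_of(pid, fd):
    """Pipe id behind one descriptor of one process, or None if not a pipe."""
    return parse_pipe_id(os.readlink(f"/proc/{pid}/fd/{fd}"))


def holders_of_pipe(pipe_id):
    """List (pid, fd) pairs that currently have the given pipe open."""
    holders = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        fd_dir = f"/proc/{entry}/fd"
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited or permission denied
        for fd in fds:
            try:
                target = os.readlink(f"{fd_dir}/{fd}")
            except OSError:
                continue  # descriptor closed while scanning
            if parse_pipe_id(target) == pipe_id:
                holders.append((int(entry), int(fd)))
    return holders
```

For the example above, pipe_id_of(92678, 5) followed by holders_of_pipe() on
the result, called in a loop with a short sleep and logged to a file, would
give roughly the same information as the repeated lsof runs. Like lsof, it
samples the state and can miss very short-lived processes.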

    _______________________________________________________


David Boyce-5
Follow-up Comment #11, bug #58979 (project make):

> but I have never put '+' on any recipes.  When is that needed?

'+' causes make to keep the jobserver pipe fds open when it execs that
command, and also to run the command regardless of -n, -t, -q.

    _______________________________________________________
