eZ Community » Forums » Developer » Binary file search index creation...
expandshrink

Binary file search index creation debugging (3.7.4)

Binary file search index creation debugging (3.7.4)

Monday 23 October 2006 7:41:12 pm - 14 replies

I have seen an error attempting to follow the article:
<i>http://ez.no/layout/set/printarti...multiple_binary_file_types</i>

<b>Settings: Binary File : binaryfile.ini.append.php</b>

# cat settings/override/binaryfile.ini.append.php

# Here you can add handlers for new datatypes.

[HandlerSettings]

MetaDataExtractor[text/plain]=plaintext

MetaDataExtractor[application/pdf]=binaryfile

MetaDataExtractor[application/msword]=binaryfile

MetaDataExtractor[application/vnd.ms-excel]=binaryfile

MetaDataExtractor[application/vnd.ms-powerpoint]=binaryfile

# The full path to your log file (used for debugging/testing)

[BinaryFileHandlerSettings]

LogFile=var/log/index.log

Using eZ publish 3.7.4

Any suggestion or insight into following search index update <i>fatal</i> errors?

//kracker
<i>tv : theme : scrubs</i>

<b>Attempt #1</b>

# /usr/local/php4.4.4/bin/php -d memory_limit=456M -C update/common/scripts/updatesearchindex.php -s example_com-dev --logfiles --verbose --sql --db-user=db --db-password=db --db-database=example_com_dev_376

Using siteaccess example_com-dev for search index update
Starting object re-indexing
Number of objects to index: 5467
...................................................................... 1.280409731114%
...................................................................... 2.5608194622279%
...................................................................... 3.8412291933419%
...................................................................... 5.1216389244558%
...................................................................... 6.4020486555698%
...................................................................... 7.6824583866837%
...................................................................... 8.9628681177977%
...................................................................... 10.243277848912%
...................................................................... 11.523687580026%
...................................................................... 12.80409731114%
...................................................................... 14.084507042254%
...................................................................... 15.364916773367%
...................................................................... 16.645326504481%
...................................................................... 17.925736235595%
...................................................................... 19.206145966709%
...................................................................... 20.486555697823%
...................................................................... 21.766965428937%
...................................................................... 23.047375160051%
...................................................................... 24.327784891165%
...................................................................... 25.608194622279%
...................................................................... 26.888604353393%
...................................................................... 28.169014084507%
...................................................................... 29.449423815621%
...................................................................... 30.729833546735%
...................................................................... 32.010243277849%
...................................................................... 33.290653008963%
...................................................................... 34.571062740077%
...................................................................... 35.851472471191%
...................................................................... 37.131882202305%
...................................................................... 38.412291933419%
...................................................................... 39.692701664533%
...................................................................... 40.973111395647%
...................................................................... 42.253521126761%
...................................................................... 43.533930857875%
...................................................................... 44.814340588988%
...................................................................... 46.094750320102%
...................................................................... 47.375160051216%
...................................................................... 48.65556978233%
...................................................................... 49.935979513444%
...................................................................... 51.216389244558%
...................................................................... 52.496798975672%
...................................................................... 53.777208706786%
...................................................................... 55.0576184379%
...................................................................... 56.338028169014%
...................................................................... 57.618437900128%
...................................................................... 58.898847631242%
...................................................................... 60.179257362356%
...................................................................... 61.45966709347%
...................................................................... 62.740076824584%
...................................................................... 64.020486555698%
...................................................................... 65.300896286812%
...................................................................... 66.581306017926%
...................................................................... 67.86171574904%
...................................................................... 69.142125480154%
...................................................................... 70.422535211268%
...................................................................... 71.702944942382%
...................................................................... 72.983354673496%
...................................................................... 74.263764404609%
...................................................................... 75.544174135723%
...................................................................... 76.824583866837%
...................................................................... 78.104993597951%
...................................................................... 79.385403329065%
...................................................................... 80.665813060179%
...................................................................... 81.946222791293%
...................................................................... 83.226632522407%
...................................................................... 84.507042253521%
...................................................................... 85.787451984635%
...................................................................... 87.067861715749%
...............PHP Fatal error:  Call to a member function on a non-object in /web/dev/example.com/ezpublish-3.7.6/kernel/search/plugins/ezsearchengine/ezsearchengine.php on line 73

Fatal error: Call to a member function on a non-object in /web/dev/example.com/ezpublish-3.7.6/kernel/search/plugins/ezsearchengine/ezsearchengine.php on line 73

Fatal error: eZ publish did not finish its request
The execution of eZ publish was abruptly ended, the debug output is present below.

# date
Mon Oct 23 10:05:02 CDT 2006


<b>Attempt #2</b>

# /usr/local/php4.4.4/bin/php -d memory_limit=456M -C update/common/scripts/updatesearchindex.php -s example_com-dev --logfiles --verbose --sql --db-user=db --db-password=db --db-database=example_com_dev_376
Using siteaccess example_com-dev for search index update
Starting object re-indexing
Number of objects to index: 5467
...................................................................... 1.280409731114%
...................................................................... 2.5608194622279%
...................................................................... 3.8412291933419%
...................................................................... 5.1216389244558%
...................................................................... 6.4020486555698%
...................................................................... 7.6824583866837%
...................................................................... 8.9628681177977%
...................................................................... 10.243277848912%
...................................................................... 11.523687580026%
...................................................................... 12.80409731114%
...................................................................... 14.084507042254%
...................................................................... 15.364916773367%
...................................................................... 16.645326504481%
...................................................................... 17.925736235595%
...................................................................... 19.206145966709%
...................................................................... 20.486555697823%
...................................................................... 21.766965428937%
...................................................................... 23.047375160051%
...................................................................... 24.327784891165%
...................................................................... 25.608194622279%
...................................................................... 26.888604353393%
...................................................................... 28.169014084507%
...................................................................... 29.449423815621%
...................................................................... 30.729833546735%
...................................................................... 32.010243277849%
...................................................................... 33.290653008963%
...................................................................... 34.571062740077%
...................................................................... 35.851472471191%
...................................................................... 37.131882202305%
...................................................................... 38.412291933419%
...................................................................... 39.692701664533%
...................................................................... 40.973111395647%
...................................................................... 42.253521126761%
...................................................................... 43.533930857875%
...................................................................... 44.814340588988%
...................................................................... 46.094750320102%
...................................................................... 47.375160051216%
...................................................................... 48.65556978233%
...................................................................... 49.935979513444%
.....................................................................Segmentation fault

# date
Mon Oct 23 09:05:02 CDT 2006

Modified on Tuesday 24 October 2006 7:33:46 am by // kracker

Tuesday 24 October 2006 7:41:51 am

Might anyone have a suggestion on where I might need to change setting to enable further debug information to deduce the cause of the error?

The end result .. crash (of unclear cause) seems very similar to the one described in the article.

<i>"In addition, we found that both pstotext and xpdf caused the same problem with very large PDF files - the SQL INSERT statement got too large (this is related to the eZ publish indexer, not the parsing tools) and would crash the indexer, resulting in no further content being indexed. Addressing that, in addition to handling multiple binary types in one file, led us to write our own custom parsing plugin."</i>

//kracker
<i>tv : robot.chicken</i>

Tuesday 24 October 2006 7:06:42 pm

<b>Questions</b>

1. <i>The eZ publish 'var/' size question</i> - Does the process segfault because because of the 3500 MB of PDF, XLS, PPT, DOC files in this installation's 'var/' directory?

2. <i>The eZ publish 'version' question</i> - Does the process segfault because I'm trying to use 3.7.4 instead the very latest 3.8.x ?

3. <i>The eZ publish 'var/' file character contents question</i> - Does the process segfault because of the character content?

//kracker
<i>KMK - Koast II Koast - I've Had It</i>

Modified on Tuesday 24 October 2006 7:18:24 pm by // kracker

Tuesday 24 October 2006 11:19:51 pm

Hi Cracker

Now you have a a nut to crack blunk.gif Emoticon

(referring to your last post)
1: no
2: no
3: maybe, but unlikely

Try to adapt the updatesearch index script as I've done for the lucene plugin: this will give you a clue where the crash happened (in terms of object to index):

Dowload the file

http://pubsvn.ez.no/community/tru.../bin/php/updatesearchindexlucene.php

Look for the line

print ("Indexing " . $innerNode->attribute( 'name' ) . " with node id " . $innerNode->attribute( 'node_id' ) . " and url_alias " . $innerNode->attribute( 'url_alias' ) . $endl );

And add this before the eZSearch::addObject( $object ); line of your update script or a copy of it).

I actually identified even corrupt objects/nodes in particular sites with this update script when using the lucene plugin (which calls every node placement and attribute of objects: it is also a unique debugging tool for corrupted databases indeed happy.gif Emoticon)

hth

Paul

Thursday 26 October 2006 5:45:48 am

<i>@Paul</i>

Heh (nice quote), as always I thank you for the very helpful direction.

<i>@Everyone</i>

After taking a closer look into the eZ publish error log, <i>var/log/error.log</i> I found the following error which gave a description and the final query.

<i>"Query error: MySQL server has gone away. Query: INSERT INTO ......."</i>

<i>http://www.phpmyvisites.us/faq/me...ysql-configuration-51.html</i>
<i>http://www.oooforum.org/forum/viewtopic.phtml?p=180203</i>
<i>http://dev.mysql.com/doc/refman/4.1/en/gone-away.html</i>
<i>http://dev.mysql.com/doc/refman/4...rver-system-variables.html</i>

Yet seemingly even after changing a popular mysql configuration setting the problem persists. I added the setting, max_allowed_packet = 750M.

Cheers,
//kracker

<i>http://www.imdb.com/title/tt0105977/</i>

Thursday 02 November 2006 4:13:10 am

Hi kracker,

What a coincidence, just today Ive been doing some tests and it seems i'm having the same problem.

Im trying to upload a trimmed text version of the Php manual which is ~750K to a test installation of eZ 3.7.4.

1. If i try to upload without DelayedIndexing, i get a blank page and nothing gets indexed.
2. After enabling delayed indexing, the file uploads correctly.
3. If i run 'php runcronjobs.php' it goes to the new object, but stays there forever.

Running cronjobs/unpublish.php

Running cronjobs/rssimport.php

Running cronjobs/indexcontent.php
Starting processing pending search engine modifications
        Indexing object ID #94

I tried using the longer 'update/common/scripts/updatesearchindex.php' and i get:

Using siteaccess shop_admin for search index update
Starting object re-indexing
Number of objects to index: 34
..................Segmentation fault

Yet, when I check, it DOES manage to index at least some part of the file (i can search succsessfully for words and phrases that appear even at the end of the document)

I'm sure we'll find the solution somehow. By the way, the article on indexing_multiple_binary_file_types is very good.

Regards,

Felipe

I tried changing the mysql conf:
set-variable=max_allowed_packet=150M

But it seems to make no difference.

Modified on Thursday 02 November 2006 7:23:11 am by Felipe Jaramillo

Thursday 02 November 2006 7:40:18 am

<i>@Felipe</i>

Hey Felipe! Thanks for testing and reporting back your findings on this subject.

I really do appreciate it. I'm certain others have had these kinds of problems in the past and certainly in the future. Now's the time to weigh in and talk about it.

I've narrowed the problem a bit between eZ -> DB (MySQL) and most of my test results were similar to the ones you report.

I am working now on taking a different, new look at this very subject again here in the near future. I will have to report back my findings and end-game solution.

The answers are here right now, looking at us clearly in the eye, waiting to be seen ... so naturally I'm looking towards the solution in front of me.

<i>Cheers,
//kracker

/me says on irc://freenode.net #ezpublish, `the documentation defines the solution ....`</i>

KMK - In God We Trust (feat. Rude Boy)</i>

Modified on Thursday 02 November 2006 7:47:25 am by // kracker

Thursday 02 November 2006 4:46:20 pm

Hi kracker,

After further tests I can confirm the same behaviour in 3.7.5 on Windows XP.

I don't really see the answer that close. Seems that we'd have to look into the indexing code more carefully.

I did manage to get the html file indexing working using a modified copy of the word indexer and using "lynx --dump" as the command to execute.

Regards,

Felipe

Tuesday 07 November 2006 1:30:22 am

<i>It's a bug ... from a certain point of view ...</i>
I'll update again once I get just a little closer.

I found today, randomly browsing an issue entry which read to me surprisingly similar. <i>http://issues.ez.no/IssueView.php?Id=9300&activeItem=2</i>
I need to test this further first to see what if any results it yields.

A question passed my way today. Would I rather have a slower more reliable search (avoid bug via code) or fix the real reason for the bug (in php). What would you chose to solve?

//kracker
<i>intristicly - phone calls from the past unannounced ...</i>
<i>ps. damn, references are money makers!</i>

Tuesday 07 November 2006 1:47:22 am

Well...

The results were more or less the same, I expected it to die before 90% and it last failed around 83%, this time it completed 85%.

The following is the history of the attempt with <i>xdebug</i> enabled.

# /web/ini/bin/build_ez_search_index_for_example_com.sh
Using siteaccess example-dev for search index update
Starting object re-indexing
{eZSearchEngine: Cleaning up search data}
Number of objects to index: 5465
...................................................................... 1.2808783165599%
...................................................................... 2.5617566331199%
...................................................................... 3.8426349496798%

<snip />
...................................................................... 80.695333943275%
...................................................................... 81.976212259835%
...................................................................... 83.257090576395%
...................................................................... 84.537968892955%
...................................................................... 85.818847209515%
............................
Notice: Uninitialized string offset:  0 in /web/dev/example_com/ezpublish-3.8.5/lib/ezxml/classes/ezxml.php on line 178

Call Stack:
    0.0009      55024   1. {main}() /web/dev/example_com/ezpublish-3.8.5/update/common/scripts/updatesearchindex.php:0
 1256.9776   12000000   2. ezsearch::addobject() /web/dev/example_com/ezpublish-3.8.5/update/common/scripts/updatesearchindex.php:165
 1256.9778   12000048   3. ezsearchengine->addobject() /web/dev/example_com/ezpublish-3.8.5/kernel/classes/ezsearch.php:87
 1256.9960   12025624   4. ezcontentobjectattribute->metadata() /web/dev/example_com/ezpublish-3.8.5/kernel/search/plugins/ezsearchengine/ezsearchengine.php:85
 1256.9961   12025720   5. ezxmltexttype->metadata() /web/dev/example_com/ezpublish-3.8.5/kernel/classes/ezcontentobjectattribute.php:1134
 1256.9963   12026520   6. ezxml->domtree() /web/dev/example_com/ezpublish-3.8.5/kernel/classes/datatypes/ezxmltext/ezxmltexttype.php:344


Notice: Uninitialized string offset:  -1 in /web/dev/example_com/ezpublish-3.8.5/lib/ezxml/classes/ezxml.php on line 256

Call Stack:
    0.0009      55024   1. {main}() /web/dev/example_com/ezpublish-3.8.5/update/common/scripts/updatesearchindex.php:0
 1256.9776   12000000   2. ezsearch::addobject() /web/dev/example_com/ezpublish-3.8.5/update/common/scripts/updatesearchindex.php:165
 1256.9778   12000048   3. ezsearchengine->addobject() /web/dev/example_com/ezpublish-3.8.5/kernel/classes/ezsearch.php:87
 1256.9960   12025624   4. ezcontentobjectattribute->metadata() /web/dev/example_com/ezpublish-3.8.5/kernel/search/plugins/ezsearchengine/ezsearchengine.php:85
 1256.9961   12025720   5. ezxmltexttype->metadata() /web/dev/example_com/ezpublish-3.8.5/kernel/classes/ezcontentobjectattribute.php:1134
 1256.9963   12026520   6. ezxml->domtree() /web/dev/example_com/ezpublish-3.8.5/kernel/classes/datatypes/ezxmltext/ezxmltexttype.php:344


Notice: Uninitialized string offset:  -1 in /web/dev/example_com/ezpublish-3.8.5/lib/ezxml/classes/ezxml.php on line 342

Call Stack:
    0.0009      55024   1. {main}() /web/dev/example_com/ezpublish-3.8.5/update/common/scripts/updatesearchindex.php:0
 1256.9776   12000000   2. ezsearch::addobject() /web/dev/example_com/ezpublish-3.8.5/update/common/scripts/updatesearchindex.php:165
 1256.9778   12000048   3. ezsearchengine->addobject() /web/dev/example_com/ezpublish-3.8.5/kernel/classes/ezsearch.php:87
 1256.9960   12025624   4. ezcontentobjectattribute->metadata() /web/dev/example_com/ezpublish-3.8.5/kernel/search/plugins/ezsearchengine/ezsearchengine.php:85
 1256.9961   12025720   5. ezxmltexttype->metadata() /web/dev/example_com/ezpublish-3.8.5/kernel/classes/ezcontentobjectattribute.php:1134
 1256.9963   12026520   6. ezxml->domtree() /web/dev/example_com/ezpublish-3.8.5/kernel/classes/datatypes/ezxmltext/ezxmltexttype.php:344

.......................................... 87.099725526075%
.............
Fatal error: Call to a member function on a non-object in /web/dev/example_com/ezpublish-3.8.5/kernel/search/plugins/ezsearchengine/ezsearchengine.php on line 74

Call Stack:
    0.0009      55024   1. {main}() /web/dev/example_com/ezpublish-3.8.5/update/common/scripts/updatesearchindex.php:0
 1266.4872   12292752   2. ezsearch::addobject() /web/dev/example_com/ezpublish-3.8.5/update/common/scripts/updatesearchindex.php:165
 1266.4873   12292800   3. ezsearchengine->addobject() /web/dev/example_com/ezpublish-3.8.5/kernel/classes/ezsearch.php:87


Fatal error: eZ publish did not finish its request
The execution of eZ publish was abruptly ended, the debug output is present below.

//kracker
<i>if you don't use xdebug...
then you don't know ...</i>

Modified on Tuesday 07 November 2006 2:56:58 am by // kracker

Tuesday 07 November 2006 11:51:11 pm

Ive looked into the search engine indexing code and found two things which may cause the problem:

\kernel\search\plugins\ezsearchengine

LINE 105:                $wordArray = split( " ", $text );

This splits the text all at once. In my experience this takes a long, long time. I guess maybe it would be better to do it in different parts.

The place where it breaks is when words are indexed in 1000 word groups. i think it goes up to 127.000 words before breaking:

LINE 129:

 $db =& eZDB::instance();
        $db->begin();
        for( $arrayCount = 0; $arrayCount < $wordCount; $arrayCount += 1000 )
        {
            $placement = $this->indexWords( $contentObject, array_slice( $indexArray, $arrayCount, 1000 ), $wordIDArray, $placement );
        }
        $db->commit();

I did try putting the db->begin() and db->commit() inside the for loop, to favour several transactions instead of a huge one, but it didn't help.

On windows, PHP crashes after reaching the 127.000th word.

I upload the complete PHP manual in html which is about 12 MB and then do indexing through the shell.

It's interesting to see that the text gets stripped off the html tags:

Line 94:
$text = eZSearchEngine::normalizeText( strip_tags(  $metaDataPart['text'] ), true );

So this way, you don't need to do a 'lynx --dump' to index html files as they will get stripped of html tags anyway. Therefore to index html, one could use the plaintext plugin:

settings/override/binaryfile.ini.append.php

[HandlerSettings]
MetaDataExtractor[text/html]=plaintext

Hope this helps a bit.

Regards,

Felipe

Modified on Tuesday 07 November 2006 11:56:17 pm by Felipe Jaramillo

Wednesday 08 November 2006 12:59:07 am

Felipe:

I did a test with the PHP-manual like you did. I get a segmentation fault in PHP as you did. This is caused by a reference count overflow in PHP. You can see the description here: http://ilia.ws/archives/5-Top-10-ways-to-crash-PHP.html

I have been provided with a patch for PHP which tells me exactly which variable that is causing the reference count overflow and results in PHP segfaulting. I am currently looking into getting help from PHP-core-developers to interpret the result, and possibly making a modification to the script that avoids the problem.

Friday 10 November 2006 12:08:02 am

Hi Kristian,

Thanks for your input. I look forward to the results of your research as this is an essential feature in eZ.

Felipe

Wednesday 15 November 2006 5:33:46 pm

Update on the subject:
http://lists.ez.no/pipermail/sdk-public/2006-November/002463.html

Saturday 14 April 2007 6:27:08 pm

Greetings Felipe!

Brookins Consulting has recently released a new eZ Publish extension which provides scripts which will successfully index your eZ publish content objects without errors via an alternative replacement for the stock eZ publish search index update shell scripts.

Project, <i>http://projects.ez.no/updatesearchindex</i>
Release News, <i>http://projects.ez.no/updatesearc...x_extension_0_0_1_released</i>
Download extension, <i>http://projects.ez.no/content/dow...tesearchindex-0_0_1.tar.gz</i>

Feel free to ask further questions about this solution within the project forum, <i>http://projects.ez.no/updatesearc...tesearchindex_forum#msg451</i>

expandshrink

You must be logged in to post messages in this topic!

36 542 Users on board!

Forums menu

Proudly Developed with from