GemStone

Bug Note Report for 6.5

Closed bugs:

Created Summary Ver Status Bugnote title Bugnote description Workaround
09/02/10 Server swallows exception during cq execution 6.0 closed Failure message with CQ execution is not reported at client. The client does not report the cause of a failure that happens during CQ execution, even though the server log shows the actual error message. The user has to look in the server log to see the error message.
08/11/10 socket-lease-time has no impact due to typo p2p.idleConnectionTime[out] 6.0 closed Connections continue to be closed even when socket-lease-time="0" Connections continue to be closed even when socket-lease-time="0"
08/05/10 primary HA region queues are not balanced closed Primary client subscription queues are not balanced While client subscription queues were balanced fairly across servers, primaries were not properly balanced, causing performance problems. Now, primary queues are much more balanced.
08/05/10 Unexpected ServerOperationException caused by CacheClosedException closed PutAll partial result behavior A partial result will not return a ServerOperationException caused by a CacheClosedException to the user application. Instead, the user application will get the CancelException directly.
07/19/10 destroying region hung waiting for replies on vm waiting for destroy lock closed Hang creating a region with concurrent region destroy If a member is creating a region while other members are doing a distributed destroy of the same region, that member could hang while creating the region in rare cases.
06/30/10 Gateway shutdown hang: GemFireCache.close -> GemFireCache.stopServers => GatewayImpl.stop => PoolImpl.acquireConnection 6.0 closed Hang closing a gateway during network partition If a gateway is closed before the gateway has established a connection to the remote side, closing the gateway may hang if a network partition occurs.
06/28/10 NullPointerException for DistributionManager.getChannelId 6.0 closed NullPointerException thrown by DistributedSystem.connect() It is remotely possible that DistributedSystem.connect() will throw a NullPointerException in DistributionManager.getChannelId(). This can happen when enable-network-partition-detection has been enabled and the connection attempt succeeds but is immediately disconnected by a network partition event.
06/22/10 member hangs in DistributedSystem.connect() [ClientGmsImpl.findInitialMembers] during network partition 6.0 closed Hang during DistributedSystem.connect() It is remotely possible for DistributedSystem.connect() to hang in ClientGmsImpl.findInitialMembers(). Thread dumps will show another thread named "UDP ucast receiver" blocked in PingWaiter.getPossibleCoordinator(). This can happen if the system is attempting to connect to a locator that was running on a machine that crashed during the connection attempt.
06/14/10 DiskAccessException while creating diskStore caused by java.io.IOException: Input/output error closed Input/output error when creating a disk store on NFS mount When persisting to an NFS mount on Red Hat 5, we have occasionally observed this error when creating the persistent store: java.io.IOException: Input/output error.
05/07/10 Interaction between registerInterest and eviction produces incorrect number of entries in region closed Incorrect number of entries in client region after registerInterest If a client region is configured with eviction, the eviction stats can be inaccurate after a call to registerInterest with InterestResultPolicy.KEYS_VALUES. This will result in evicting the wrong number of entries.
05/04/10 when a cached object changes from serialized to deserialized its size is not updated closed ObjectSizer not consulted when deserializing objects When using memory-size-based eviction, an object sizer can be provided to ensure that GemFire accurately calculates the memory usage of each object. This object sizer is not being consulted in certain cases when GemFire has the serialized form available for the object and then later deserializes it. Instead, GemFire remembers the serialized size. This can lead to inaccuracy in when GemFire performs memory-based eviction.
04/29/10 Distributed deadlock when gateway startup is concurrent with ops and conserve-sockets=true closed Distributed deadlock when gateway startup is concurrent with ops and conserve-sockets=true In rare circumstances, startup of a gateway could hang when cache operations are concurrently occurring. This can only happen if the gemfire property conserve-sockets=true is set. This race condition has been fixed.
04/28/10 transportUdp: peer hangs in Flow control (replenishments) processing 6.0 closed hang in FC flow control protocol It is possible under heavy load with disable-tcp=true for the system to lose messages. This sometimes manifests as a hang in the com.gemstone.org.jgroups.protocols.FC protocol. The problem is caused by flaws in UDP message dispatching.
04/26/10 entry operations hang in waitForReplies from surviving side when network dropped (network partition tests) 6.0 closed operations hang waiting for replies from crashed machines with enable-network-partition-detection=true and IBM JVM Using the IBM 1.5 JVM we have found that invoking Thread.isAlive() or Thread.isDead() on a thread that is reading on a socket connected to a machine that has crashed can hang. This causes operations to block until the OS keepalive timeout expires. We have removed these checks when running in an IBM JVM when network-partition-detection is enabled.
04/20/10 hang in BucketAdvisor.releasePrimaryLock waiting for replies from member that was previously shutdown closed Hang closing a partitioned region If a member crashes while another member is closing a partitioned region or closing a cache containing a partitioned region, the member doing the close may hang while performing the close operation in rare cases.
04/20/10 While starting JMX Agent, there should be a way to configure the RMI Connector Server port. 6.0 closed Well-defined ports should be configurable for the Agent. Additional properties are now available to define well-known ports: (1) rmi-server-port: The port on which the RMI Connector Server should start. (2) membership-port-range: The allowed range of UDP ports for use in forming a unique membership identifier. This range is given as two numbers separated by a minus sign. (3) tcp-port: The TCP/IP port number to use in the agent's distributed system. These properties are useful for starting the agent behind a firewall.
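The "two numbers separated by a minus sign" format of membership-port-range can be illustrated with a small parser. This is only a sketch of how such a value might be validated, not the agent's actual parsing code; the class name `PortRange` and the example range 40000-40100 are assumptions made for illustration.

```java
// Hypothetical helper, not part of GemFire: parses a "<low>-<high>" port range.
final class PortRange {
    final int low;
    final int high;

    private PortRange(int low, int high) {
        this.low = low;
        this.high = high;
    }

    // Accepts text such as "40000-40100"; rejects malformed or inverted ranges.
    static PortRange parse(String text) {
        String[] parts = text.trim().split("-");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected <low>-<high> but got: " + text);
        }
        // NumberFormatException (an IllegalArgumentException) is thrown for non-numeric parts.
        int low = Integer.parseInt(parts[0].trim());
        int high = Integer.parseInt(parts[1].trim());
        if (low < 0 || high > 65535 || low > high) {
            throw new IllegalArgumentException("Invalid port range: " + text);
        }
        return new PortRange(low, high);
    }

    // True when the given port lies inside the range, bounds included.
    boolean contains(int port) {
        return port >= low && port <= high;
    }

    public static void main(String[] args) {
        PortRange range = PortRange.parse("40000-40100"); // illustrative values only
        System.out.println(range.low + " to " + range.high);
        System.out.println(range.contains(40050));
    }
}
```

A firewall rule set would then need to open every UDP port in the parsed range, plus the configured rmi-server-port and tcp-port.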
04/19/10 giiWhileMultiplePublishing fails when 3 of 10 (replicated) members do not have the entire keySet 6.0 closed message loss with disable-tcp=true It is possible for UDP messages to be lost under heavy load. This is caused by faults in the UDP unicast dispatching code.
04/16/10 Constraints on valid characters for use in region names should include OQL query string constraints 6.0 closed Querying on region with special characters Queries that refer to a region whose name contains special characters (which are allowed in a region name) were not supported. Support is added in 6.5.
03/31/10 shutDownAllMembers() appears to disconnect admin vm closed ShutDownAll assumptions ShutDownAll shuts down only members that have a cache. Locators and admin members are not shut down.
03/29/10 NPE generated in DataSerializer.readClass if getContextClassLoader returns null 6.0 closed NPE in DataSerializer when using GemFire as an OSGi bundle When GemFire is used as an OSGi bundle, an NPE is thrown (visible only in fine-level logs).
03/22/10 Assertion thrown from RegionAdvisor.getBucket() during bucket recovery 6.0 closed Assertion error during bucket recovery After a new member joins, an assertion error could be thrown when we try to restore the redundant copy for a colocated partitioned region.
03/17/10 async disk region leaks memory 5.7 closed Memory leak with async persistence or overflow With a region configured with asynchronous persistence or overflow, the disk region may create and retain many byte buffers while getting an initial image from another peer. After this point the byte buffers are not released, resulting in excessive memory usage.
03/12/10 ConcurrentModificationException thrown while iterating over DistributedRegion.getHeapThresholdReachedMembers HashMap 6.0 closed ConcurrentModificationException in DistributedRegion.getHeapThresholdReachedMembers() A ConcurrentModificationException may be thrown when a remote member exceeds Critical memory threshold.
03/10/10 tests with concurrent region (region create, region destroy) operations fail with OOME 6.0 closed EventTracker memory leak A small flaw in Region destruction causes the cache to retain references to EventTracker objects that should otherwise be discarded. EventTracker objects record information about which events have been applied to a cache Region. It could impact an application that has a high thread count across the distributed system and which performs a lot of Region destruction operations.
03/06/10 Need to remove the use of BlowFishJ from GemFire closed GemFire uses BlowFishJ GemFire no longer uses BlowFishJ; it has been replaced by the JDK-supported Blowfish algorithm.
02/22/10 test hangs while creating cache with ipv6 closed Hang creating a connection while admin console is running If there is an admin console running that is receiving alerts from the gemfire members, and a newly created VM can't connect to the admin console within p2p.handshakeTimeoutMS (60 seconds), the member in trouble could hang during the DistributedSystem.connect call.
02/11/10 Support to include keys as part of the CQ result set. closed CQ Results to include keys. When a CQ is executed with the "executeWithInitialResults" option, the result set returned does not contain the keys. This makes it harder to correlate the result set with the CQ events generated later, because each CQ event includes the key on which the update happened.
02/09/10 Shutdown timeout with Distributed system shutdown hook waiting for responses to UpdateAttributes requests (from departed members) 6.0 closed hang during shutdown waiting for responses from departed members It is possible for the product to hang during shutdown, issuing a warning message that it has not received responses to a message from members that have shut down. This is caused by early termination of notification of membership changes in some parts of the product.
02/09/10 unexpected afterRemoteRegionCrash event refers to vm that should be healthy closed gemfire deadlocks and is kicked out of distributed system It is possible for gemfire to hang while attempting to send an alert to a member that is no longer there. The code sending the alert holds a lock that prevents the member from being able to respond to failure-detection probes or membership changes.
01/25/10 ClassCastException running an OQL query closed ClassCastException running an OQL query This is fixed in gemfire57_hotfix and is ported to GemFire 6.5.
01/25/10 hang creating region when peer logs that "Peer has disappeared from view" 6.0 closed hang attempting to connect to departed member If a new member happens to reuse the peer-to-peer port number of a recently departed member it is possible that the product will hang trying to communicate with the departed member after logging "Peer has disappeared from view". This is due to a bookkeeping error in membership management.
01/15/10 Managed Resources related to regions are not removed even after the region is destroyed/removed/lost. 6.0 closed Clean up managed resources in Agent created for regions in the Cache Managed resources are created in Agent for regions in the cache in a member of a distributed system. These are now removed when a region gets destroyed. Also there are four new notifications available for JMX clients through JMX on the MBeans - SystemMember and CacheVm. The notifications are: (1)gemfire.distributedsystem.cache.created - Creation of a cache on a member (2)gemfire.distributedsystem.cache.closed - Closure of a cache on a member (3)gemfire.distributedsystem.cache.region.created - Creation of a region in a cache on a member (4)gemfire.distributedsystem.cache.region.lost - Removal of a region from a cache on a member
01/12/10 Data consistency between CQ Result Set and the region data. 6.0 closed Data consistency between CQ Result Set and the region data. When a CQ is executed using the executeWithInitialResults option, the CQ can miss events that are applied while the result set is being sent to the client. This is fixed in 6.5 by queuing events that occur during CQ execution on the client and replaying them once the CQ is completely initialized. NOTE: A change may already be reflected in the result set and still be delivered to the CQ listener, resulting in a duplicate event. The client application needs to manage such duplicates (either ignoring the event or applying it again to the result set).
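One way a client application might manage the duplicate events mentioned in the note above is to apply events idempotently to a keyed copy of the result set. The sketch below uses only JDK collections; `CqResultCache` and its methods are hypothetical names rather than GemFire API, and it assumes both the initial results and the later events carry a key.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical client-side holder for a CQ result set, keyed by entry key.
final class CqResultCache {
    private final Map<String, String> results = new HashMap<>();

    // Called once per entry of the initial result set.
    void seed(String key, String value) {
        results.put(key, value);
    }

    // Applies an update event. Returns false when the event is a duplicate,
    // that is, when the result set already reflects the same value.
    boolean applyUpdate(String key, String newValue) {
        String current = results.get(key);
        if (newValue.equals(current)) {
            return false; // already reflected in the result set, nothing to do
        }
        results.put(key, newValue);
        return true;
    }

    String get(String key) {
        return results.get(key);
    }

    public static void main(String[] args) {
        CqResultCache cache = new CqResultCache();
        cache.seed("order-1", "SHIPPED");                               // initial result already has the change
        System.out.println(cache.applyUpdate("order-1", "SHIPPED"));    // replayed duplicate event
        System.out.println(cache.applyUpdate("order-1", "DELIVERED"));  // genuinely new event
    }
}
```

Because re-applying the same value leaves the map unchanged, the application gets the same final state whether or not the replayed event duplicates the initial result.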
11/24/09 Shutdown hang with ConcurrentModificationException thrown from LogWriterImpl.cleanUpThreadGroups during InternalDistributedSystem disconnect 6.0 closed DistributedSystem disconnect throws ConcurrentModificationException During shutdown it is possible for DistributedSystem.disconnect() to throw a ConcurrentModificationException. This can happen if an administrative member is disconnecting at the same time. The exception is thrown from LogWriterImpl.cleanUpThreadGroups().
11/24/09 SystemConnectException: Unable to become coordinator of existing group because no view responses were received 6.0 closed locator startup fails When authorization is used or enable-network-partition-detection is enabled it is possible for locator startup to fail with the message "Unable to become coordinator of existing group because no view responses were received".
11/18/09 peer PR member misses destroy (while performing bucket gii) during rebalancing 6.0 closed Missing CQ event when bucket re-balance in progress This is a missing-event issue. It was first seen in the eventFilterOpt branch and is fixed in the 6.5 release.
11/11/09 Getting a server's PR entry from a client doesn't update its lastAccessedTime closed lastAccessedTime on an entry does not reflect when the entry was accessed last from any client in the system This is a trade-off that is unlikely ever to be changed. In order to scale gets, we allow gets to be satisfied from primary or secondary data stores. The lastAccessedTime is maintained locally on the store. So, due to load balancing, it is possible that key X has been fetched on a secondary recently but has idle-timed-out on the primary. We do ensure that when an entry expires on a primary, it is removed from the entire system.
11/11/09 createCQfetchInitialResult fails, Caused by: NPE from CqService.executeCq() 6.0 closed NPE with CQ Execution Reported when CQ is executed. One cause of this bug was unsynchronized code that establishes the identity of a client based on its first connection's port. Fixed in GemFire 6.5 release.
11/05/09 Unexpected replies processed in bridge servers 6.0 closed warning messages in logs about unexpected replies When using Delta, if one of the members has a region with DataPolicy EMPTY, the following warning message is logged "Received reply from member <memberId> but was not expecting one."
10/28/09 locator fails to start with GemFireConfigException closed locator fails to start with GemFireConfigException If the system property gemfire.locators is used to configure the locators setting and the property doesn't include the locator being started, startup will fail with a GemFireConfigException {{{ com.gemstone.gemfire.GemFireConfigException: Unable to contact a Locator service. Operation either timed out or Locator does not exist. Configured list of locators is "[frodo:15964]". at com.gemstone.org.jgroups.protocols.TCPGOSSIP.sendGetMembersRequest(TCPGOSSIP.java:183) at com.gemstone.org.jgroups.protocols.PingSender.run(PingSender.java:82) at java.lang.Thread.run(Thread.java:619) }}} As a workaround, make sure that the gemfire.locators property includes the locator being started.
10/13/09 ClassNotFoundException when DataSerializer attempts to deserialize an object array that has an array component type closed Deserializing a multidimensional array fails If you use DataSerializer to serialize an array whose component type is itself an array, GemFire will throw a ClassNotFoundException when deserializing the array.
10/05/09 GII recipient could incorrectly ignore an event because it is marked as a possible duplicate closed Crash while creating a replicate region could result in a lost update In rare cases an update may be lost if one cache server is creating a replicated region and another cache server with the same region crashes while applying the update from a client. After the crash, the cache server that just created the region may miss the update.
09/28/09 closing cache hangs waiting for replies from vm making no attempt to respond closed hang during shutdown with disable-tcp=true It is possible for the product to hang during shutdown when disable-tcp=true. This is caused by faults in the UDP unicast dispatching code.
09/25/09 Client id is not random enough (getting duplicates) closed Duplicate client cache ID It is possible for two client caches to use the same membership ID, causing servers to become confused and mis-deliver events. The caches must be running on the same machine for this to happen.
09/21/09 Eviction is not evicting the least recently used entries for normal regions 6.0 closed Entry other than the least recently used was evicted Eviction does not always evict the least recently used entry.
09/21/09 Hang with mix of gets and puts using same key with Partitioned region closed Hang with concurrent operations on the same key with statistics enabled In rare cases, concurrent operations on the same key in partitioned region can result in a hang if statistics are enabled.
09/17/09 InternalGemFireError: Assert thrown from partitioned.DestroyMessage during PR invalidate region 6.0 closed invalidateRegion() is not supported for PartitionedRegions PartitionedRegion now supports invalidateRegion() operation.
09/14/09 EnforceUniqueHostStorageAllocation flag prevents moving a bucket between two VMs on the same host 6.0 closed gemfire.EnforceUniqueHostStorageAllocation setting has an unintended impact on partitioned region rebalancing Setting gemfire.EnforceUniqueHostStorageAllocation prevents buckets from moving from one VM to another on the same host during a rebalance operation.
09/14/09 Enabling both eviction and expiration in a partitioned region leaves entries in the cache. 6.0 closed Partition Region eviction may prevent entries from expiring In prior versions of GemFire, entries would not get expired on partition region secondaries. This would occur if eviction of an entry in a partition region primary occurred before expiration, and the eviction action was "LOCAL_DESTROY".
09/14/09 Transactional entry-create in region destroyed within same transaction is unexpectedly processed by CacheListener and TransactionListener. 6.0 closed Transactional load does not cause conflict A load done to satisfy a get operation does not cause a CommitConflictException even though the same entry is modified by another thread.
09/14/09 Missing primary detected after member forcefully disconnected from DS (underlying InternalGemFireError: Trying to clear a bucket region that was not destroyed) 6.0 closed Redundancy not satisfied after network partition If network partition detection is enabled, in rare cases gemfire can fail to restore redundancy after the partition.
09/08/09 Memory leak of EntryExpiryTasks in BucketRegion.pendingSecondaryExpires closed Memory leak in partition region secondaries When an entry in a partition region secondary is destroyed, the expiration task associated with the entry is not released until the secondary switched to being the primary.
09/08/09 accessor vms hang in waitForPrimaryMember after a dataStore is forcefully disconnected from the DS 6.0 closed hang caused by alert listener notification It is possible for GemFire to deadlock trying to notify an admin member of an alert. Thread dumps will show a thread in ManagerLogWriter.notifyAlertListeners() with other threads waiting to lock the membership view.
09/07/09 JMX operation SystemMemberCache.getRegionSnapshot fails completely if creating snapshot for even one of the regions fails. 6.0 closed Occurrence of an exception in admin agent while retrieving region information would prevent the retrieval of region information for other regions on the member. The admin agent logs failures encountered while retrieving information about regions in a cache, and continues with the retrieval of information of the other regions on the member. In versions of GemFire Enterprise prior to 6.5, this failure would prevent the admin agent from retrieving information about all regions present in a cache. This behavior was most commonly seen when invoking the SystemMemberCache.getRegionSnapshot MBean operation.
08/30/09 hasDelta/toDelta are invoked on the client side even if Delta Propagation property is turned off 6.0 closed Client sends delta even if delta-propagation=false in the distributed system Clients have no knowledge of whether delta-propagation is turned on or off on the server, and attempt to send deltas during updates. This does not cause any data errors. The server handles the incoming delta bytes and does not propagate the update as a delta.
08/27/09 Cacheserver ignores log-file property from gemfire.properties file 6.0 closed Cacheserver script ignores log-file property When starting a cache server using the cacheserver script, the log-file property in a gemfire.properties was being ignored. Now, the search order for the log-file property is: 1. command line arg 2. gemfire.properties 3. cacheserver.log default.
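The search order described above (command line argument, then gemfire.properties, then the cacheserver.log default) amounts to a first-match-wins lookup. The following sketch shows that precedence using plain JDK types; `LogFileResolver` is a hypothetical name and this is not the cacheserver script's actual implementation.

```java
import java.util.Properties;

// Hypothetical illustration of the log-file search order; not GemFire code.
final class LogFileResolver {
    static final String DEFAULT_LOG_FILE = "cacheserver.log";

    // commandLineValue is the log-file value given on the command line, or null if absent.
    // gemfireProperties holds the contents of gemfire.properties.
    static String resolve(String commandLineValue, Properties gemfireProperties) {
        if (commandLineValue != null && !commandLineValue.isEmpty()) {
            return commandLineValue;                  // 1. command line arg
        }
        String fromFile = gemfireProperties.getProperty("log-file");
        if (fromFile != null && !fromFile.isEmpty()) {
            return fromFile;                          // 2. gemfire.properties
        }
        return DEFAULT_LOG_FILE;                      // 3. default
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("log-file", "server1.log"); // illustrative file name
        System.out.println(resolve(null, props));
        System.out.println(resolve("cli.log", props));
        System.out.println(resolve(null, new Properties()));
    }
}
```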
08/27/09 New EventTrackers are not tracked properly by the ExpiryTask 6.0 closed A memory leak involving event trackers. The cache uses event trackers to ensure that we can detect duplicates coming in from a single thread (events that may have been retransmitted due to primary servers going down). These trackers are supposed to expire after a specified idle timeout period. In 6.0, the expiration task was not removing these event trackers, leading to a memory leak. This is an issue for long-running systems where publishing threads keep changing over the lifetime of the system. This has been addressed in 6.5.
08/26/09 JMX Agent error reading mcast-port property 6.0 closed Leading and trailing whitespace in property values would prevent a cache server or agent process from starting. Preceding or trailing spaces in the values in the gemfire.properties or the agent's properties files could result in exception preventing the process from getting launched. Now all values are trimmed of leading & trailing white spaces.
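The failure mode is easy to reproduce with the JDK alone: `java.util.Properties.load` keeps trailing whitespace in a value, so a numeric property such as mcast-port fails to parse unless it is trimmed first. The sketch below shows the trimming step in isolation; `TrimmedProperties` is a hypothetical helper, not the code GemFire uses, and the port value is illustrative.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper: loads properties and trims whitespace around every value.
final class TrimmedProperties {
    static Properties loadTrimmed(String text) {
        Properties raw = new Properties();
        try {
            raw.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Properties trimmed = new Properties();
        for (String name : raw.stringPropertyNames()) {
            // Properties.load preserves trailing spaces in values, so trim explicitly.
            trimmed.setProperty(name.trim(), raw.getProperty(name).trim());
        }
        return trimmed;
    }

    public static void main(String[] args) {
        String fileContents = "mcast-port=10334   \n"; // note the trailing spaces
        Properties props = loadTrimmed(fileContents);
        // Without trimming, Integer.parseInt would throw NumberFormatException here.
        System.out.println(Integer.parseInt(props.getProperty("mcast-port")));
    }
}
```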
08/21/09 DataSerializer.register throws the wrong exception 6.0 closed DataSerializer.register throws incorrect exception type If the id specified for a DataSerializer type clashes with that of a type already registered with the data serialization framework, GFE throws an IllegalArgumentException instead of an IllegalStateException as documented. The exception message, though, correctly describes the reason for the exception and names the class that is already registered.
08/19/09 PartitionedRegion#getEntry can access an entry before it is created closed Early escape of Region.Entry from CacheWriter It was possible for the cache writer to get a reference to a Region.Entry before it was initialized. A call to getEntry now returns null.
08/18/09 Possible infinite loop in GrantorRequestProcessor.startElderCall closed Hang while closing a global region In rare conditions, closing a global region could result in a hang. This may cause other members to hang trying to lock entries while updating them.
08/17/09 Entries are lost in PartitionedRegions by cycling dataStore VMs 6.0 closed During HA event, destroy operation failed with EntryNotFoundException When a destroy operation is done on a PartitionedRegion and the primary member for that key crashes, an EntryNotFoundException may be thrown.
08/11/09 Iterating on PR local data invokes PartitionResolver closed Improvements to partition resolver PartitionResolver is now invoked only once per operation. Iterating over local data does not invoke the resolver. Iterators from peer accessors do not invoke the resolver in the accessor.
08/11/09 Region javadoc for putAll states it is unsupported on PR 5.7 closed Region javadoc for putAll states it is unsupported on PartitionedRegions The javadoc for putAll states: {{{ throws UnsupportedOperationException If the region is a partitioned region }}} This is a mistake; putAll has been supported on all region types since the GemFire 5.7 release. Customers using GemFire 5.7 or later are encouraged to use putAll on partitioned regions.
08/06/09 Rebalancing colocated regions moves fewer buckets than expected 6.0 closed Rebalancing colocated regions moves less data than expected Due to a bug in the rebalancing algorithm, GemFire does not move data during a rebalance even though it appears there is space for the data. This bug only appears when using colocated regions. GemFire is erroneously comparing the total size of data to be moved for all of the colocated regions with the local-max-memory setting of each individual region. If the total amount of data is greater than the remaining capacity of the region, GemFire will not move the data. Increase the local-max-memory of all of the regions.
08/04/09 GemFire cannot serialize a String whose logical length is < 0xFFFF, but whose utf-8 encoded length is > 0xFFFF closed GemFire cannot serialize a String whose logical length is < 0xFFFF, but whose utf-8 encoded length is > 0xFFFF If you have a string with some multibyte characters that is less than 0xFFFF characters long, but will be more than 0xFFFF bytes when serialized using UTF, a UTFDataFormatException is thrown when serializing the string with GemFire.
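The arithmetic behind the rebalancing note can be made concrete with a small model. The sketch below is one reading of the description, not GemFire's rebalancing code: `buggyCanMove` compares the combined size of all colocated data against each single region's remaining capacity, while `perRegionCanMove` shows the comparison one would expect. All names and numbers are assumptions for illustration.

```java
// Hypothetical model of the capacity check described in the note; not GemFire code.
final class ColocatedCapacityCheck {
    // Flawed check: total bytes for ALL colocated regions vs. each region's own capacity.
    static boolean buggyCanMove(long[] bytesToMovePerRegion, long[] remainingCapacityPerRegion) {
        long total = 0;
        for (long bytes : bytesToMovePerRegion) {
            total += bytes;
        }
        for (long remaining : remainingCapacityPerRegion) {
            if (total > remaining) {
                return false;
            }
        }
        return true;
    }

    // Expected check: each region's data vs. that same region's remaining capacity.
    static boolean perRegionCanMove(long[] bytesToMovePerRegion, long[] remainingCapacityPerRegion) {
        for (int i = 0; i < bytesToMovePerRegion.length; i++) {
            if (bytesToMovePerRegion[i] > remainingCapacityPerRegion[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long[] toMove = {60 * mb, 60 * mb};        // two colocated regions, 60 MB each
        long[] remaining = {100 * mb, 100 * mb};   // 100 MB free in each region
        System.out.println(perRegionCanMove(toMove, remaining)); // each 60 MB fits in 100 MB
        System.out.println(buggyCanMove(toMove, remaining));     // 120 MB total exceeds 100 MB
        long[] raised = {200 * mb, 200 * mb};      // after increasing local-max-memory
        System.out.println(buggyCanMove(toMove, raised));        // 120 MB total now fits
    }
}
```

Under this model, raising local-max-memory on every colocated region lets the combined total pass the flawed comparison, which is consistent with the workaround given in the note.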
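The same character-count versus byte-count mismatch can be seen with the JDK's own `DataOutputStream.writeUTF`, which rejects any string whose encoded form exceeds 65535 bytes. The sketch below demonstrates the JDK behavior only; it does not show GemFire's serialization code, and `UtfLengthDemo` is a hypothetical class name.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Demonstrates a string that is short in characters but long in encoded bytes.
final class UtfLengthDemo {
    static String repeat(char c, int count) {
        char[] chars = new char[count];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Returns false when writeUTF rejects the string as too long once encoded.
    static boolean writeUtfSucceeds(String s) {
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(s);
            return true;
        } catch (UTFDataFormatException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String s = repeat('\u20AC', 30000);      // euro sign: three bytes per character
        System.out.println(s.length());          // 30000, below 0xFFFF
        System.out.println(utf8Length(s));       // 90000, above 0xFFFF
        System.out.println(writeUtfSucceeds(s)); // false
    }
}
```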
07/29/09 IllegalArgumentException thrown if multiple regions configured using same EvictionAttributes 5.0 closed IllegalArgumentException thrown if multiple regions configured using same EvictionAttributes If a single instance of EvictionAttributes was shared among multiple region creations, an IllegalArgumentException was thrown. This is now fixed.
07/24/09 socket-buffer-size can not exceed 16,777,215 5.7 closed GemFire API allows socket-buffer-size to be configured to values greater than Java allows. Setting the "socket-buffer-size" to a value greater than 16,777,215 will trigger an exception: {{{ java.lang.IllegalStateException: tcp message exceeded max size of 16,777,215 }}} Do not set the "socket-buffer-size" to a value greater than 16,777,215.
07/17/09 MulticastSocket.setInterface call fails on Windows Server 2008 closed GemFire cannot create a multicast socket on Windows Server 2008, Windows Vista, or Windows 7 Due to complications related to JGroups bug JGRP-777, GemFire throws an exception with the root cause stating "An operation was attempted on something that is not a socket" when configured to use multicast for membership discovery on Windows Server 2008. {{{ Caused by: java.net.SocketException: An operation was attempted on something that is not a socket at java.net.PlainDatagramSocketImpl.socketSetOption(Native Method) at java.net.PlainDatagramSocketImpl.setOption(PlainDatagramSocketImpl.java:299) at java.net.MulticastSocket.setInterface(MulticastSocket.java:420) at com.gemstone.org.jgroups.protocols.UDP.createSockets(UDP.java:631) at com.gemstone.org.jgroups.protocols.UDP.start(UDP.java:502) at com.gemstone.org.jgroups.stack.Protocol.handleSpecialDownEvent(Protocol.java:874) ... 78 more }}} Use locators instead of multicast for discovery.
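An application that builds its configuration programmatically could guard against this limit before passing the value on. The sketch below is a hypothetical pre-check, not a GemFire API; the only fact taken from the note is the 16,777,215 ceiling, which equals 2^24 - 1.

```java
// Hypothetical pre-check for a socket-buffer-size value; not part of GemFire.
final class SocketBufferSize {
    // Ceiling quoted in the note: 16,777,215 bytes, which is 2^24 - 1.
    static final int MAX_SOCKET_BUFFER_SIZE = 16_777_215;

    // Returns the value unchanged when it is within the limit, otherwise throws.
    static int validate(int requested) {
        if (requested > MAX_SOCKET_BUFFER_SIZE) {
            throw new IllegalArgumentException(
                "socket-buffer-size " + requested + " exceeds " + MAX_SOCKET_BUFFER_SIZE);
        }
        return requested;
    }

    public static void main(String[] args) {
        System.out.println(validate(1_048_576)); // 1 MB, an illustrative acceptable value
        try {
            validate(MAX_SOCKET_BUFFER_SIZE + 1);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```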
07/16/09 In RemoteGfManagerAgent, exceptions occurred while connecting to the DS and handling joined members should be handled properly. 5.8 closed Before failing to connect in distributed system due to missing license information on a member, the Agent should try every member of a distributed system If there is more than one member running and the agent fails to retrieve license information for the distributed system from the first member, the agent tries the next member. In addition, failure to retrieve the license information from one of the members is now logged at both the member and the agent.
07/13/09 Incorrect mbean descriptor in JMX AdminAgent 6.0 closed Incorrect descriptor JMX operation SystemMember.manageStat removed Removed non-existing operation descriptor manageStat that was described for SystemMember MBean.
06/26/09 Locators fail to start on Windows in Pure Java Mode 5.7 closed Locators fail to start on Windows in Pure Java Mode A locator cannot be started in pure Java mode by using the following command line: gemfire start-locator -port=8888 The locator.log has the following message: '" true true "' is not a valid IP address for this machine. Use the following command to work around the issue by specifying values for the bind address, hostname for clients, and log file. gemfire start-locator -address=%bindaddr% -hostname-for-clients=locator_%bindaddr% -Dgemfire.log-file=%logfile% where bindaddr is a suitable bind address for the machine and logfile is any filename other than "locator.log"
06/16/09 CQ doesn't send update events in case of eviction (overflow to disk). closed CQ Events with update on evicted value When an update happens on a region entry whose value has been written to disk, the CQ applies the query condition only to the new value; because the old value is not available in that case, the query condition is not applied to it. The issue is seen only if the event was not cached before. Fixed in 6.5.
06/12/09 Events due to eviction on PR are not firing closed CacheListener Events due to eviction on PartitionedRegions do not get invoked This bug impacts a region configured to be a PartitionedRegion with a listener and eviction. The expected behavior is that a listener would invoke the void afterDestroy(EntryEvent e) method whenever an entry was evicted from the cache. While eviction does take place, the listener event is not triggered. All other listener events do behave correctly though. Use Distributed Regions with a manual partitioning scheme.
06/11/09 region-time-to-live and region-idle-time have not been implemented for PR closed 'region-time-to-live' and 'region-idle-time' attributes have no effect on Partitioned Regions Distributed regions support 'region-time-to-live' and 'region-idle-time' expiration attributes for their entries. These expiration attributes are not supported in partitioned regions and are ignored.
06/03/09 HAClientQueues (not persistent) do not get deleted when client disconnects 5.7 closed Client queues on server may cause the server to lock up or run out of memory In cases where a client disconnected soon after connecting to a server, the client's queue was not cleaned up. If this happened frequently, these queues would cause the server to run out of memory, or the queues would fill up with events, causing the server to lock up while trying to insert events into the queue. This has been fixed in GemFire 6.1.
06/02/09 A RuntimeException from a user's toData method causes a hang 6.0 closed A RuntimeException from a user's toData method can cause a distributed member to hang If a runtime exception is thrown from the toData method of a user's DataSerializable object while doing a distributed put, GemFire will become hung. Code toData methods defensively to catch RuntimeException and handle it in an alternate way.
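A defensively coded toData method might look like the sketch below. To stay self-contained it uses only `java.io.DataOutput` and does not implement GemFire's DataSerializable interface; the class `DefensivePayload`, its fields, and the choice of an empty string as the fallback are all assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;

// Hypothetical payload class with a toData method shaped like a DataSerializable one.
final class DefensivePayload {
    private final String name;
    private final List<String> tags; // may be null, which would normally cause an NPE

    DefensivePayload(String name, List<String> tags) {
        this.name = name;
        this.tags = tags;
    }

    public void toData(DataOutput out) throws IOException {
        // Do the risky computation first, before any bytes are written,
        // so a failure cannot leave a half-written record on the stream.
        String joinedTags;
        try {
            joinedTags = String.join(",", tags); // throws NullPointerException if tags is null
        } catch (RuntimeException e) {
            joinedTags = ""; // alternate handling: fall back to an empty value
        }
        out.writeUTF(name);
        out.writeUTF(joinedTags);
    }

    // Convenience wrapper used to exercise toData without GemFire.
    static byte[] serialize(DefensivePayload payload) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            payload.toData(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // A null tag list no longer escapes toData as a RuntimeException.
        System.out.println(serialize(new DefensivePayload("a", null)).length);
    }
}
```

Whether a fallback value, a logged error, or a translated exception is the right alternate handling depends on the application; the point of the sketch is only that the RuntimeException does not propagate out of toData.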
06/02/09 Using multiple GII providers with a persistent region can resurrect destroyed entries 6.0 closed Using multiple GII providers with a persistent region can resurrect destroyed entries When more than one member with the 'provider' attribute set to true is present, a new member coming up does a union GII from all of the providers in addition to what is on disk. The result is that if there are entries on disk which have been destroyed in the providers, the new member will resurrect those destroyed entries.
05/27/09 Disk recovery fails if using -Duser.language=ja 6.0 closed Disk Regions do not function correctly if the locale's language is "ja", such as when -Duser.language=ja Due to an error in how the filename's prefix is handled by the localization code, GemFire will fail to find a disk persistence file even if it exists at the path specified by the user's configuration. The code works correctly for all user languages except Japanese ("ja"). Setting the Java system property user.language to English on the command line will avoid this problem: java -Duser.language=en ...
05/27/09 gateways are limited to 10G of persistence/overflow 6.0 closed gateways are limited to 10G of persistence/overflow In 6.0 gateways were changed to no longer roll oplogs. Gateways always have a single directory whose dir-size is the default of 10G. Note that dir-size only applies to oplogs but that is all a gateway has now since it never rolls. Once the oplogs on a gateway reach 10G the next write will fail with an out of disk space error.
05/21/09 Partitioned Region expiration does not distribute events 5.8 closed Destroy and invalidate events not sent to clients or cache listeners in a partitioned region When a Region with DataPolicy.PARTITION is configured with Eviction enabled, and with EvictionAction set to either DESTROY or INVALIDATE, an AFTER_DESTROY or AFTER_INVALIDATE event is not sent to cache clients or CacheListeners.
05/21/09 HeapLRU with ObjectSizer will expose CachedDeserializable instances to user code 6.0 closed Configuring HeapLRU with an ObjectSizer exposes CachedDeserializable instances to application code If you configure a HeapLRU and an ObjectSizer for it, GemFire will mistakenly pass instances of its internal CachedDeserializable class to the customer's implementation of ObjectSizer.sizeof(Object). Customers can work around this bug by adding the following code to any implementation of ObjectSizer. {{{
import com.gemstone.gemfire.cache.util.ObjectSizer;
import com.gemstone.gemfire.internal.cache.lru.Sizeable;

public class MyObjectSizer implements ObjectSizer {
  public int sizeof(Object o) {
    // GemFire-internal wrappers report their own size.
    if (o instanceof Sizeable) {
      return ((Sizeable)o).getSizeInBytes();
    }
    // customer's sizeof code goes here
    return 0; // placeholder so the class compiles
  }
}
}}}
05/20/09 Registering a function on a Java client changes the behavior when executing an instance of the function 6.0 closed Incorrect function may be executed in Execution.execute(Function f) API In prior versions of GemFire, the Execution.execute(Function f) API resulted in the execution of a function other than the one supplied as a parameter if the ID of this instance matched that of a function already registered on the server. The registered function was executed instead.
05/20/09 LIFO Eviction APIs should not be visible to customers 5.7 closed LIFO Eviction APIs should not be part of the public API The following methods and constants were unintentionally exposed as part of the GemFire API. They are not intended for customer use and should be considered strongly deprecated. {{{ Package: com.gemstone.gemfire.cache EvictionAttributes#createLIFOEntryAttributes EvictionAttributes#createLIFOMemoryAttributes EvictionAlgorithm.LIFO_ENTRY EvictionAlgorithm.LIFO_MEMORY EvictionAlgorithm#isLIFOEntry EvictionAlgorithm#isLIFOMemory EvictionAlgorithm#isLIFO }}} Do not write code that makes use of these methods or constants.
05/06/09 FunctionService.onServers() does not execute on all servers but on the servers the pool is currently connected to 6.0 closed FunctionService.onServers() API may not execute on all servers in a pool In prior versions of GemFire, the FunctionService.onServers('poolName') API did not ensure that the function was executed on all servers configured in the pool. At the time the function execution was initiated, the pool might not have had an active connection to one or more of its servers. GemFire 6.1 fixes this and ensures that connections to all servers configured in the pool are active. If an attempt to create a connection fails, the function execution fails.
05/05/09 ResultsBag fromData() throws NPE. 6.0 closed NPE in ResultsBag fromData() This happened when ResultsBag.fromData() was called. This is fixed in 6.5 and has also been ported to the gemfire601_maint branch.
05/04/09 During start up, a process may try to connect to other processes even after it knew that those processes were gone closed DistributedSystem attempts to connect to members that have left It is possible that the DistributedSystem will become confused and attempt to connect to members that have left the system while it was starting up. When this happens you will see the departed members admitted into membership in "P2P message reader" threads. This happens when the departing members see the new member and connect to it, causing them to be "surprise members" to the new process.
04/28/09 oplog rolling fails reading with Bad file descriptor 6.0 closed Oplog roller fails with "Bad file descriptor" If oplog rolling is enabled and overflow to disk is configured then a small race condition exists in which the roller may fail causing the region to be closed. The following is an example failure: {{{ [info 2009/04/28 10:52:22.800 PDT <main> tid=0x1] Closing oplog early since it is empty. It is for region /myReg and has oplog#22 [error 2009/04/28 10:52:22.800 PDT <OplogRoller /myReg for oplog 22> tid=0xf] A DiskAccessException has occurred while writing to the disk for region /myReg. The region will be closed. com.gemstone.gemfire.cache.DiskAccessException: For Region: /myReg: Failed reading from "/export/jade1b1/users/darrel/gfbuild/BACKUP_myReg_22". oplogID = 22 Offset being read=10,300,824 Current Oplog Size=10,400,832 Actual File Size =10,400,832 IS ASYNCH MODE =false IS ASYNCH WRITER ALIVE=false, caused by java.io.IOException: Bad file descriptor at com.gemstone.gemfire.internal.cache.Oplog.basicGetForRoller(Oplog.java:3727) at com.gemstone.gemfire.internal.cache.Oplog.getBytesAndBitsForSwitchingEntry(Oplog.java:2356) at com.gemstone.gemfire.internal.cache.ComplexDiskRegion$OplogRoller.rollBackup(ComplexDiskRegion.java:919) at com.gemstone.gemfire.internal.cache.ComplexDiskRegion$OplogRoller.roll(ComplexDiskRegion.java:1157) at com.gemstone.gemfire.internal.cache.ComplexDiskRegion$OplogRoller.run(ComplexDiskRegion.java:1215) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:619) Caused by: java.io.IOException: Bad file descriptor at java.io.RandomAccessFile.seek(Native Method) at com.gemstone.gemfire.internal.cache.Oplog.basicGetForRoller(Oplog.java:3694) }}} Setting the system property "gemfire.disk.KEEP_EMPTY_OPLOGS" to "true" will prevent this bug.
04/24/09 HeapLRUStatistics.heapUsage does not represent the amount of heap currently in use (in bytes) 6.0 closed HeapLRUStatistics.heapUsage stat removed The HeapLRUStatistics.heapUsage stat has been removed. Refer to the ResourceManager stats instead.
04/23/09 socket/thread leak with conserve-sockets=false closed Thread and Socket leak when conserve-sockets=false When configured with conserve-sockets=false, GemFire may accumulate idle threads that have names similar to this: P2P message reader for ent(42524):2331/2296 SHARED=true ORDERED=false UID=1371 These threads and the sockets they are reading from are created to transmit message replies. They may accumulate if they were created for sole use by a particular thread and that thread no longer exists.
04/22/09 PR expiration with localDestroy fails with InternalGemFireError closed Using localDestroy as the expiration action for a PR throws InternalGemFireError Setting the expiration action of localDestroy on a partitioned region causes an InternalGemFireError to be logged. No expiration happens. Starting with version 6.0, setting the expiration action to localDestroy will throw an error on region creation. Use the destroy action instead. Don't use localDestroy; use destroy instead. This expires all copies of the entry.
04/08/09 RegionMembershipListener.initialMembers is not invoked when added using AttributesMutator closed A RegionMembershipListener added after a Region is created does not have its initialMembers() method invoked If you add a RegionMembershipListener cache listener to a Region after the Region has been created, the listener will never have its initialMembers() method invoked. Only listeners added through cache.xml or through RegionAttributes at the time the Region is created will have their initialMembers() method invoked.
04/07/09 gii receives no response from source vm 6.0 closed hang creating region with disable-tcp set to true A bug in the startup code in the fragmentation protocol used for UDP messaging was found to cause a hang in region creation when the distributed system property disable-tcp is set to true. The hang is caused by a race condition that causes the member that is creating the region to ignore a message from a member that has been selected to send the contents of the region.
04/06/09 InternalGemFireException: While calling refresh() causedBy: javax.management.InstanceNotFoundException 6.0 closed InternalGemFireException received when invoking SystemMemberCache.getRegion(..) JMX API on the AdminAgent on IBM J9 JVM This is caused by a known issue in the IBM JVM. It may not occur consistently. The solution is to turn off JIT compilation for RegionStatisticsResponse.create(). Turn off JIT compilation for com.gemstone.gemfire.internal.admin.remote.RegionStatisticsResponse.create()
04/03/09 There is no error given when we try starting the agent specifying an incorrect path for its property-file. 6.0 closed Admin agent would silently apply default properties if it could not find its properties file. Admin agent used to silently apply default properties if it could not find its properties file. Now the agent adds a log entry when it applies default values for its configuration properties. The logged string is: "Using default configuration because property file was not found".
04/01/09 JMX Agent startup fails with ipv6 enabled 6.0 closed JMX agent fails to start when using IPv6 This problem occurs when using the default rmi-bind-address, "localhost", and IPv6 on a machine where a call to java.net.InetAddress.getLocalHost() returns an IPv6 link-local address. This is primarily a Windows issue because the IPv6 implementation requires a link-local address to also be created when configuring a machine to support IPv6, and the order in which these are created varies from machine to machine. This error will manifest as an AgentImpl$StartupException. {{{ A quick synopsis of the stack is provided below: com.gemstone.gemfire.admin.jmx.internal.AgentImpl$StartupException: Failed to start RMI service at com.gemstone.gemfire.admin.jmx.internal.AgentImpl.startRMIConnectorServer(AgentImpl.java:1141) at com.gemstone.gemfire.admin.jmx.internal.AgentImpl.start(AgentImpl.java:263) at hydra.AgentHelper.startAgent(AgentHelper.java:129) at admin.AdminTest.startAgentTask(AdminTest.java:120) ... Caused by: java.io.IOException: Cannot bind to URL [rmi://:26120/jmxconnector]: javax.naming.NoPermissionException [Root exception is java.rmi.ServerException: RemoteException occurred in server thread; nested exception is: java.rmi.AccessException: Registry.Registry.bind disallowed; origin /fe80:0:0:0:21a:a0ff:fe27:ddbe is non-local host] ... Caused by: javax.naming.NoPermissionException [Root exception is java.rmi.ServerException: RemoteException occurred in server thread; nested exception is: ... Caused by: java.rmi.ServerException: RemoteException occurred in server thread; nested exception is: java.rmi.AccessException: Registry.Registry.bind disallowed; origin /fe80:0:0:0:21a:a0ff:fe27:ddbe is non-local host ...
Caused by: java.rmi.AccessException: Registry.Registry.bind disallowed; origin /fe80:0:0:0:21a:a0ff:fe27:ddbe is non-local host }}} Specify an RMI bind address using the rmi-bind-address property: ./agent start rmi-bind-address=<ipv6 address> or in a gemfire.properties file rmi-bind-address=<ipv6 address> Second workaround: Edit the Windows hosts file, usually located in c:\WINDOWS\system32\drivers\etc\hosts, to map a literal address to the hostname. Note that entries are required for both IPv4 and IPv6 on machines that support both protocols (even for non-GemFire use). Create two entries: [ipv4 literal] [fully qualified host] [optional short hostname] [ipv6 literal] [fully qualified host] [optional short hostname] Example: 15.168.12.81 mymachine.gemstone.com mymachine fdf0:7c6f:eda8:9449::19 mymachine.gemstone.com mymachine
03/30/09 Distribution Locator Properties section in GFE SysAdminGuide might be confusing 6.0 closed Sys Admin Guide has incorrect Distribution Locator syntax System Administrator’s Guide -> chapter 8 -> section 'Distribution Locator Properties': The table of properties and the example below it describe the properties required to use locators incorrectly. The locators property should be configured as: locators=host1[port1],host2[port2]
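To make the corrected `locators=host1[port1],host2[port2]` syntax concrete, here is a small illustrative parser. It is a sketch written for this note, not GemFire's own parsing code: the class name `LocatorSyntaxSketch`, the `parse` method, and the host names and port numbers in `main` are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class LocatorSyntaxSketch {
    // Splits "host1[port1],host2[port2]" into "host:port" strings and
    // rejects entries that do not follow the host[port] form.
    static List<String> parse(String locators) {
        List<String> result = new ArrayList<>();
        for (String entry : locators.split(",")) {
            String trimmed = entry.trim();
            int open = trimmed.indexOf('[');
            int close = trimmed.indexOf(']');
            boolean wellFormed = open > 0
                    && close == trimmed.length() - 1
                    && close >= open + 2;
            if (!wellFormed) {
                throw new IllegalArgumentException(
                        "Expected host[port] but got: " + trimmed);
            }
            String host = trimmed.substring(0, open);
            int port = Integer.parseInt(trimmed.substring(open + 1, close));
            result.add(host + ":" + port);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical hosts and ports, for illustration only.
        System.out.println(parse("host1[10334],host2[10335]"));
    }
}
```

An entry written in another style, such as `host1:10334`, is rejected by this sketch with an IllegalArgumentException.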
03/27/09 java.net.SocketException: Address family not supported by protocol family: bind encountered while starting bridge server 6.0 closed java.net.SocketException: Address family not supported by protocol family: bind encountered while starting bridge server When starting a GemFire cache server under Microsoft Windows, GemFire throws an exception when it tries to bind a server socket to an IPv6 address. {{{ java.net.SocketException: Address family not supported by protocol family: bind at sun.nio.ch.Net.bind(Native Method) at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:119) at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:59) at com.gemstone.gemfire.internal.cache.tier.sockets.AcceptorImpl.<init>(AcceptorImpl.java:336) at com.gemstone.gemfire.internal.cache.BridgeServerImpl.start(BridgeServerImpl.java:276) }}} This is caused by a JVM bug, #6230761, that causes Java "New I/O" sockets to not work with IPv6 on Microsoft Windows machines. GemFire 6.0 detects this condition and automatically sets max-threads to zero after issuing this warning: {{{ Ignoring max-threads setting and using zero instead due to Java bug 6230761: NIO does not work with IPv6 on Windows. See GemFire bug #40472 }}} To work around this problem, disable the thread pool in the GemFire server by setting max-threads to zero.
03/25/09 Suspect string DiskAccessException caused by ArrayIndexOutOfBoundsException 6.0 closed com.gemstone.gemfire.cache.DiskAccessException thrown when using persistent regions Previous versions of GemFire (6.0 and earlier) would occasionally throw an ArrayIndexOutOfBoundsException wrapped as a DiskAccessException. This came out of the JDBM code that was used in conjunction with transaction logging in the persistence layer. The use of JDBM has been completely removed in 6.5.
03/23/09 PartitionedRegion ops hang in waitForPrimary member after NPE thrown from BucketAdvisor.sendProfileUpdate() 6.0 closed NullPointerException from Thread.holdsLock with JRockit With the JRockit VM, we have on rare occasions seen NullPointerExceptions from the java.lang.Thread.holdsLock method.
03/12/09 PR eviction to disk degrades with number of buckets closed 6.5 oplog new design The new oplog design in 6.5 resolved this issue. All the buckets now share the same oplog file.
03/10/09 Hang while creating region during StateFlushOperation.flush 6.0 closed hang creating a region with scope distributed-no-ack and using disable-tcp=true It is possible for GemFire to hang when attempting to create a Region if the distributed system property "disable-tcp" is set to true and the distribution scope of the region is "distributed-no-ack".
03/04/09 SIGSEGV in CacheClientProxy with SUN JRE 1.6.0_10 closed SIGSEGV in CacheClientProxy with SUN JRE 1.6.0_10 A SIGSEGV occurred in CacheClientProxy with SUN JRE 1.6.0_10. This was observed with 6.0; 6.5 uses a later version of the JDK.
02/26/09 NullPointerException in CacheClientProxy.processMessage closed NullPointerException in cache server during spike in data operations In very rare instances, a cache server would encounter a NullPointerException due to a race.
02/16/09 If roller is active at the time of region.close it can end up writing a dummy byte & thus lose the original value closed Closing a persistent region results in a missing value In rare cases, closing a persistent region can lead to a single value in the persistent data being lost.
02/15/09 Test fails with Timeout during netsearch/netload/netwrite (IllegalMonitorStateException during pushing message ) 6.0 closed IllegalMonitorStateException exceptions with JDK 1.6 If you encounter IllegalMonitorStateExceptions while using GemFire with Sun's implementation of JDK 1.6, we advise using the VM option {{{ -XX:+UseHeavyMonitors }}}
02/09/09 BridgeServer with SELECTOR enabled shutdown timeout 6.0 closed GemFire hangs during attempt to close the cache Running on Microsoft Windows with the JRockit JVM, we have seen GemFire hang when an attempt is made to close the cache in a server VM. The hung thread will have a stack similar to this: {{{ -- Blocked trying to get lock: java/lang/Object@0x048F4128[thin lock] at jrockit/vm/Threads.sleep(I)V(Native Method) at jrockit/vm/Locks.waitForThinRelease(Locks.java:1209)[optimized] at jrockit/vm/Locks.monitorEnterSecondStageHard(Locks.java:1342)[optimized] at jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1259)[optimized] at jrockit/vm/Locks.monitorEnter(Locks.java:2439)[optimized] at sun/nio/ch/WindowsSelectorImpl.wakeup(WindowsSelectorImpl.java:75) at com/gemstone/gemfire/internal/cache/tier/sockets/AcceptorImpl.close(AcceptorImpl.java:1548) ^-- Holding lock: java/lang/Object@0x048F3EB8[thin lock] at com/gemstone/gemfire/internal/cache/BridgeServerImpl.stop(BridgeServerImpl.java:351) ^-- Holding lock: com/gemstone/gemfire/internal/cache/BridgeServerImpl@0x04F329A8[thin lock] at com/gemstone/gemfire/internal/cache/GemFireCache.stopServers(GemFireCache.java:1118) ^-- Holding lock: java/lang/Object@0x04981AD0[thin lock] at com/gemstone/gemfire/internal/cache/GemFireCache.close(GemFireCache.java:913) ^-- Holding lock: java/lang/Class@0x0436A180[recursive] at com/gemstone/gemfire/internal/cache/GemFireCache.close(GemFireCache.java:793) }}} This is due to a flaw in JRockit's implementation of NIO socket selectors. GemFire v6.0 detects the use of JRockit on Windows and disables the use of NIO socket selectors after issuing this warning: Ignoring max-threads setting and using zero instead due to JRockit NIO bugs. See GemFire bug #40198
02/09/09 executeCqOnRedundantsAndPrimary throws CQException "Failed to execute the CQ ... Error from last server: Primary discovery failed" 6.0 closed Error while executing CQ. This was happening due to multiple threads accessing the same CQ. This is fixed in GemFire 6.0.
02/04/09 Hang in MapInterfaceTest.testBlockGlobalScopeInSingleVM 6.0 closed Distributed lock requests fail to time out Lock requests may fail to time out under certain conditions. A thread requesting a distributed lock may continue waiting beyond the configured lock-timeout or specified waitTimeMillis. This should be a temporary condition: the thread will eventually either acquire the lock after waiting longer than it should, or time out later than it should. The most likely cause is lock requests, or Global Region puts, initiated while locking is suspended or while the Global Region is initializing (get initial image) in any member of the distributed system.
02/02/09 Serialization types should be registerable via cache.xml declaration 5.7 closed DataSerializable types have to be programmatically registered with the GemFire server cluster In prior versions of GemFire, users were required to register types programmatically by defining a static initializer block on each VM that supplied the type of the class being registered. Starting with GemFire 6.0, types can be defined declaratively in the cache.xml file using the following syntax. {{{ <serialization-registration> <serializer> <class-name>com.gemstone.util.MySerializer</class-name> </serializer> <instantiator id="101"> <class-name>com.gemstone.util.DateTest</class-name> </instantiator> <instantiator id="102"> <class-name>com.gemstone.util.IndexMap</class-name> </instantiator> </serialization-registration> }}}
01/28/09 lastModifiedTime from an empty region is 0 5.7 closed Expiration is broken when actions originate on a region with DataPolicy.EMPTY Prior to 6.0, the lastModifiedTime (used for calculating expiration time for an entry in a region) was being set to 0 if the entry was modified from a VM that had the region with a data policy set to DataPolicy.EMPTY, causing incorrect expiration behavior for the entry. In 6.0, the lastModifiedTime is propagated from the accessing node and applied correctly across the system.
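The effect of a zero lastModifiedTime can be illustrated with simple arithmetic. The sketch below assumes, purely for illustration, that an entry expires once the current time reaches lastModifiedTime plus the configured timeout; `ExpirationArithmeticSketch`, `isExpired`, and the sample timestamps are hypothetical and do not reproduce GemFire's expiration code.

```java
final class ExpirationArithmeticSketch {
    // Assumed rule: an entry is expired once now >= lastModified + timeout.
    static boolean isExpired(long lastModifiedMillis, long timeoutMillis, long nowMillis) {
        return nowMillis >= lastModifiedMillis + timeoutMillis;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long timeout = 60_000L; // hypothetical 60-second time-to-live

        // Entry modified one second ago: still within its lifetime.
        System.out.println("recent entry expired: " + isExpired(now - 1_000L, timeout, now));

        // Entry whose lastModifiedTime was recorded as 0: under the assumed
        // rule it looks decades old and is treated as already expired.
        System.out.println("zero-timestamp entry expired: " + isExpired(0L, timeout, now));
    }
}
```

Under this assumption, an entry updated from a member with DataPolicy.EMPTY would appear to have been modified at the epoch, which is one way the incorrect expiration behavior described above could arise.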
01/24/09 OOME in parReg/parRegCreateDestroy 6.0 closed Server could run out of memory during rebalancing In prior versions of GemFire, creation and destruction of Partitioned Regions could eventually lead to the server running out of memory. This was most likely to occur during intensive re-balancing operations on the partitioned regions. This has been fixed in GemFire 6.1
01/23/09 CacheClientProxy stats leak 5.7 closed Garbage CacheClientProxy stats building up on the server Killing clients isn't cleaning up the CacheClientProxy stats for that client on the server side. Over time, these stat objects take up memory and CPU.
01/20/09 JMX Agent startup should ignore any gemfire.properties present in the path 6.0 closed Conflicting properties in gemfire.properties and agent's properties file could prevent the admin agent from functioning properly The agent now uses only the properties listed in its own properties file (default name: agent.properties or specified through property-file=<my agent's property filename>) and ignores the gemfire.properties file that may exist in either of: (1) The current directory, or (2) user home directory, or (3) the class path.
01/16/09 Assertion error while creating bucket in region.(Test:parReg/event/concParRegEvent.conf) 6.0 closed InternalGemFireError thrown when putting a value into a partitioned region When calling Region.put(Object) on a Partitioned Region, it is possible that the region will throw an InternalGemFireError stating "Did not finish sending image, but region, cache, and DS are alive." This is caused by a faulty termination check in one of GemFire's data replication algorithms.
01/07/09 EnforceUniqueHostStorageAllocation allows bucket copies on the same host 6.0 closed Two copies of a bucket on the same host with EnforceUniqueHostStorageAllocation There is a small window where setting the EnforceUniqueHostStorageAllocation flag fails to prevent two copies of a bucket from ending up on the same host. This can occur when a rebalance operation is performed simultaneously with the first update to the bucket.
12/18/08 New vm unable to contact locator 6.0 closed GemFireConfigException states that no Locators could be contacted A GemFireConfigException with the text {{{ Unable to contact a Locator service. Operation either timed out or Locator does not exist. Configured list of locators is }}} (followed by a list of the configured locators) may be thrown when the locators were up and reported the VM correctly contacting them. The problem is caused by a race condition between two threads in JGroups startup code.
12/17/08 hang creating region when peer logs that "Peer has disappeared from view" closed Hang creating region when peer logs that "Peer has disappeared from view" A vm logs that it did not receive all of the expected startup responses within 15 seconds, and then hangs trying to create a Region. Another vm logged that it failed to send a Startup response to the hung vm because it had "disappeared from view". The hang is caused by a race condition in the other vm that caused it to incorrectly shun the new vm.
12/17/08 ConcurrentModificationException during shutdown 6.0 closed ConcurrentModificationException thrown by DistributedSystem.disconnect() Under rare circumstances, it is possible for DistributedSystem.disconnect() to throw a ConcurrentModificationException. The property disable-tcp must be set to true for this to happen, and another vm must be starting up concurrently.
12/15/08 primary balancing after VM recycled not yet implemented 6.0 closed Primary buckets not balanced after recovery If a member hosting a partitioned region crashes and is subsequently restarted, it will not receive any primary buckets. This can lead to an imbalance in load across the members.
11/26/08 Hang in JChannel.disconnect() closed Hang in DistributedSystem.disconnect() waiting for JGroups to disconnect In very rare circumstances, the DistributedSystem.disconnect() method may hang trying to shut down the JGroups membership stack. This is due to a defect in the JGroups Promise class, and has been fixed in GemFire v6.0
11/07/08 IllegalThreadStateException thrown by JGroups JChannel when network dropped during DistributedSystem.connect() 6.0 closed IllegalThreadStateException thrown during DistributedSystem.connect() When attempting to connect to GemFire with DistributedSystem.connect(), in rare circumstances the method may throw an IllegalThreadStateException. We have observed this happening when enable-network-partition-detection is enabled in the distributed system properties and a network partition occurs during the connection attempt.
11/03/08 Queues are filling up and not draining in WAN tests closed WAN Gateways May Not Initialize Correctly There is a race condition when starting a gateway that may cause a primary gateway to never process any incoming events. This can be confirmed by identifying messages in the logs indicating that the gateway queues are not draining. Stop and restart the gateway.
10/29/08 BucketAdvisor fails assertion in Loner because of DummyExecutor 5.7 closed Partitioned Regions are not supported for loner members Loner members (a GemFire connection defined by mcast-port of zero and no locators) should not use Partitioned Regions. Use a Local Region instead. Versions of GemFire prior to 6.0 may throw unexpected InternalGemFireErrors if attempting to use a Partitioned Region in a Loner, especially with redundancy > 0. GemFire 6.0 will allow this, but it's not a practical configuration except for testing purposes. {{{ Assertion error creating bucket in region com.gemstone.gemfire.InternalGemFireError: Attempting to sendProfileUpdate while synchronized may result in deadlock at com.gemstone.gemfire.internal.Assert.throwError(Assert.java:75) at com.gemstone.gemfire.internal.Assert.assertTrue(Assert.java:93) at com.gemstone.gemfire.internal.cache.BucketAdvisor.sendProfileUpdate(BucketAdvisor.java:808) at com.gemstone.gemfire.internal.cache.BucketAdvisor.acquiredPrimaryLock(BucketAdvisor.java:579) at com.gemstone.gemfire.internal.cache.BucketAdvisor$VolunteeringDelegate.doVolunteerForPrimary(BucketAdvisor.java:1443) at com.gemstone.gemfire.internal.cache.BucketAdvisor$5.run(BucketAdvisor.java:1398) at com.gemstone.gemfire.internal.cache.BucketAdvisor$6.run(BucketAdvisor.java:1645) at com.gemstone.gemfire.distributed.internal.LonerDistributionManager$DummyExecutor.execute(LonerDistributionManager.java:441) at com.gemstone.gemfire.internal.cache.BucketAdvisor$VolunteeringDelegate.execute(BucketAdvisor.java:1600) at com.gemstone.gemfire.internal.cache.BucketAdvisor$VolunteeringDelegate.volunteerForPrimary(BucketAdvisor.java:1396) at com.gemstone.gemfire.internal.cache.BucketAdvisor.volunteerForPrimary(BucketAdvisor.java:541) }}} Loners should use Local Regions. Partitioned Regions should only be used by a distributed system of two or more members.
10/27/08 JVM version issue for AIX 5.7 closed AIX JVM 1.6 version issue The BlockingHARegionJUnitTest fails on this JVM for two reasons: 1) the test became very slow, and 30 seconds is not enough to feed 20000 entries, although the 1.5 JVM and the newer 1.6 JVM manage it; 2) the total region size exceeds 20000. We set the region capacity to 10000; it should only contain up to 20000 entries. The 1.5 JVM and the newer 1.6 JVM do not have this problem. The root cause is a defect in the AIX JVM version used, which is: java version "1.6.0-internal" Java(TM) SE Runtime Environment (build pap3260-20070819_01) IBM J9 VM (build 2.4, J2RE 1.6.0 IBM J9 2.4 AIX ppc-32 jvmap3260-20070817_13537 (JIT enabled) J9VM - 20070817_013537_bHdSMR JIT - dev_20070817_1300 GC - 20070815_AA) The older, working 1.5 JVM is: java version "1.5.0" Java(TM) 2 Runtime Environment, Standard Edition (build pap32dev-20071008 (SR6)) IBM J9 VM (build 2.3, J2RE 1.5.0 IBM J9 2.3 AIX ppc-32 j9vmap3223-20071007 (JIT enabled) J9VM - 20071004_14218_bHdSMR JIT - 20070820_1846ifx1_r8 GC - 200708_10) JCL - 20071008 The newer, working 1.6 JVM is: java version "1.6.0" Java(TM) SE Runtime Environment (build pap3260sr2-20080818_01(SR2)) IBM J9 VM (build 2.4, J2RE 1.6.0 IBM J9 2.4 AIX ppc-32 jvmap3260-20080816_22093 (JIT enabled, AOT enabled) J9VM - 20080816_022093_bHdSMr JIT - r9_20080721_1330ifx2 GC - 20080724_AA) JCL - 20080808_02
10/22/08 load may be invoked more than once for a single get closed Load may be invoked more than once for a single get In versions prior to 6.0, if the loader returned null, it would get invoked a second time. Starting with 6.0, a return value of null is considered a successful invocation of the loader. The public javadocs on load now state this: {{{ @return the value supplied for this key, or null if no value can be supplied. A local loader will always be invoked if one exists. Otherwise one remote loader is invoked. Returning <code>null</code> causes {@link Region#get(Object, Object)} to return <code>null</code>. }}}
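The 6.0 behavior, in which a null return counts as a completed load, can be modeled with a toy cache. This is a sketch for illustration only: `SingleLoadSketch` and its `loaderCalls` counter are hypothetical, and a plain `java.util.function.Function` stands in for a cache loader rather than GemFire's CacheLoader interface.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class SingleLoadSketch {
    private final Map<String, String> entries = new HashMap<>();
    private final Function<String, String> loader;
    int loaderCalls = 0; // counts loader invocations, for demonstration

    SingleLoadSketch(Function<String, String> loader) {
        this.loader = loader;
    }

    String get(String key) {
        String cached = entries.get(key);
        if (cached != null) {
            return cached;
        }
        loaderCalls++;
        String loaded = loader.apply(key);
        if (loaded != null) {
            entries.put(key, loaded);
        }
        // A null result is returned as-is; the loader is not retried.
        return loaded;
    }

    public static void main(String[] args) {
        SingleLoadSketch cache = new SingleLoadSketch(key -> null);
        System.out.println("value: " + cache.get("missing"));
        System.out.println("loader calls: " + cache.loaderCalls);
    }
}
```

A single `get` therefore triggers exactly one loader call in this model, whether or not the loader can supply a value.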
10/17/08 unnecessary credential verification being performed every 10 seconds 5.5 closed Unnecessary credential verification being performed every 10 seconds GemFire periodically retransmits membership information to all members of the distributed system. There is a flaw in the product that currently causes re-verification of security credentials when this happens. The retransmission period is based on the member-timeout setting of the distributed system and is currently set at twice the member-timeout interval.
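The relationship between member-timeout and the retransmission period stated above can be written as a one-line calculation. In this sketch the class and method names are hypothetical, and the 5000 ms member-timeout is an assumed value chosen because it yields the 10-second interval named in the bug title.

```java
final class RetransmissionPeriodSketch {
    // Per the note above: the period is twice the member-timeout interval.
    static long retransmissionPeriodMillis(long memberTimeoutMillis) {
        return 2 * memberTimeoutMillis;
    }

    public static void main(String[] args) {
        long memberTimeoutMillis = 5_000L; // assumed setting, in milliseconds
        long periodMillis = retransmissionPeriodMillis(memberTimeoutMillis);
        System.out.println("credentials re-verified every "
                + (periodMillis / 1000) + " seconds");
    }
}
```

With a different member-timeout setting, the unnecessary re-verification would occur at a correspondingly different interval.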
10/16/08 Test hangs in CqService.closeNonDurableClientCqs() during shutdown 6.0 closed resolved JVM issue Same bug as 40490, 39130, 40243. It has been identified as a JVM issue that is fixed in 1.6.0_14 and later: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6699669 Confirmed with Dick: we have advised customers to use 1.6.0.17, so the problem will not be seen.
09/26/08 New member incorrectly shunned 5.7 closed New member is incorrectly shunned by other members of the distributed system When a new member starts up and attempts to connect to the distributed system, it may hang trying to create tcp/ip connections to existing members of the system. This can happen if the new member uses the same UDP membership port as a recently departed member on the same machine. GemFire uses this UDP port and host address to identify members of the distributed system. When a member leaves the distributed system, it is shunned for a short period of time to prevent inappropriate communications from taking place. If this UDP port is reused, as can happen on some operating systems (Windows) more easily than others (*nix), the new member that is reusing the port will be incorrectly shunned by other members. Restart the application
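The mechanism described above, where members are identified by host address and UDP port and a departed member is shunned for a short time, can be modeled as follows. This is a simplified sketch and not GemFire's membership code: `ShunListSketch`, its methods, the host names, the ports, and the 5-second shun period are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class ShunListSketch {
    // Maps "host:udpPort" to the time (in millis) until which it is shunned.
    private final Map<String, Long> shunnedUntil = new HashMap<>();
    private final long shunPeriodMillis;

    ShunListSketch(long shunPeriodMillis) {
        this.shunPeriodMillis = shunPeriodMillis;
    }

    void memberLeft(String host, int udpPort, long nowMillis) {
        shunnedUntil.put(host + ":" + udpPort, nowMillis + shunPeriodMillis);
    }

    boolean isShunned(String host, int udpPort, long nowMillis) {
        Long until = shunnedUntil.get(host + ":" + udpPort);
        return until != null && nowMillis < until;
    }

    public static void main(String[] args) {
        ShunListSketch shunList = new ShunListSketch(5_000L);
        shunList.memberLeft("hostA", 4000, 0L);

        // A new process that reuses the same host and port one second later
        // is indistinguishable from the departed member in this model.
        System.out.println("reused port shunned: " + shunList.isShunned("hostA", 4000, 1_000L));
        System.out.println("fresh port shunned: " + shunList.isShunned("hostA", 4001, 1_000L));
    }
}
```

Because the identity carries nothing beyond host and port in this model, a restarted process that happens to reuse the port inherits the shunned status until the period lapses, which matches the symptom and the restart workaround described in the note.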
09/26/08 Client throws an exception if it encounters UNDEFINED in query results 5.7 closed Client exception with UNDEFINED value in query results This could happen when a compiled select encountered a null or undefined value. This is fixed in the GemFire 6.0 release.
09/22/08 losingSide VM does not process afterRegionDestroyed (FORCED_DISCONNECT) event and hangs in destroy operation after networkPartition 5.7 closed Network partition with a gateway enabled can result in hang If a network partition occurs in a site with a gateway, the gateway member may hang trying to process events.
09/19/08 bloom-vm failure with ServerConnectivityException: Pool unexpected socket timed out on client 5.7 closed Unexpected socket timed out on client with 1.6.0_5, 1.6.0_7 Due to a bug in the JDK, using Sun JDK 1.6.0_5 or 1.6.0_7 and configuring the bridge server's max-threads setting to something other than 0 can result in the client seeing this error: "ServerConnectivityException: Pool unexpected socket timed out on client" Set this system property to true to work around the issue, or upgrade to a later JDK: -DCacheServer.NIO_SELECTOR_WORKAROUND=true
09/18/08 Updates can be lost with WAN Gateway failover in mlRioWithConflation 5.7 closed Updates can be lost during WAN Gateway failover when conflation is enabled With conflation enabled on a WAN gateway, if the primary gateway fails on the sending side, there is a small window where an event that occurs on the sending side can fail to be transmitted to the receiving side.
09/10/08 Need API for localPut on client closed Client side localPut API support. After further discussions, and given that we plan to simplify our region interfaces in the future to allow client only operations using the same API set that we have today, we decided to shelve this feature request.
09/09/08 getInitialImage misses a concurrent operation 5.7 closed New replicate region inconsistent with other replicates when transactions are being performed When transactions are being performed on a replicate region and another cache creates a new replicate of the region, the new replicate may miss operations performed in the transaction. There is no workaround. This bug is fixed in GemFire v6.0
07/29/08 getInitialImage test fails when multiple VMs miss a create event 5.7 closed Multicast may deliver no-ack events out of order When using multicast for message distribution with Regions having distributed-no-ack scope, operations may be applied out of order in other VMs. This is caused by a race condition between the multicast and unicast reader threads when multicast retransmissions are performed. Use distributed-ack scope, or do not use multicast for distribution.
07/28/08 Java-level deadlock in InternalDistributedSystem.disconnect 5.7 closed Java-level deadlock in InternalDistributedSystem.disconnect While rare, it is possible to encounter a Java-level deadlock while calling DistributedSystem.disconnect().
07/26/08 JMX agent command line doesn't start agent closed JMX agent command line fails silently The JMX agent launcher does not correctly detect and report problems in starting the agent. For instance, if one of the TCP/IP ports is in use by another process, the agent will not start the service on that port but will launch without reporting any problems. Examine the agent.log file to see if there were any problems in launching the agent.
07/23/08 NPE thrown from IndexCreationMessage.operateOnPartitionedRegion 5.7 closed NullPointerException may occur when creating an index on a partitioned region A NullPointerException may sometimes occur when an index is being created on a partitioned region and a separate thread is removing the same index at approximately the same time. If this occurs, the NullPointerException can be safely ignored.
07/23/08 Reinitializing vms get tangled up trying to create indexes 5.7 closed Hang with index creation on partition region. An index creation could cause a deadlock between threads in two different VMs of the distributed system that host the same partitioned region, because the synchronization code locked the same object while processing the request and the response between the VMs.
07/22/08 async writer thread will cause puts to lock forever if it exits closed Puts could hang when using asynchronous disk persistence In versions prior to 6.0, puts would hang if the disk persistence mechanism encountered an I/O error that caused the disk writer thread to exit prematurely. This has been addressed in 6.0: the disk writer thread no longer exits prematurely on encountering an error. Instead, it logs an exception, which causes the region to be closed and allows other threads and members to continue.
07/01/08 Two VMs using different mcast addresses still discover each other 5.7 closed Distributed systems with different multicast addresses find each other and join the same group Due to the way Linux interprets RFC 1112, multicast sockets using the same port will receive datagrams from each other even if using different multicast addresses. {{{ See these links for more information: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=231899 http://bugs.sun.com/bugdatabase/view_bug.do;:YfiG?bug_id=4701650 http://www.uwsg.iu.edu/hypermail/linux/net/0211.1/0003.html }}} Make sure to select different multicast ports for different distributed systems to keep them isolated from one another.
06/25/08 The OQL TO_DATE function does not support minutes properly 5.5 closed OQL_TO_DATE function incorrectly processed the MM formatting token In versions prior to 6.0, the OQL engine does not distinguish between the formatting strings for month and minutes (MM and mm respectively). In 6.0, this has been addressed.
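The month/minute distinction in the entry above matches the pattern letters of java.text.SimpleDateFormat, where MM is the month and mm is the minute. The following is a minimal stand-alone sketch using only the JDK, on the assumption that TO_DATE format strings follow SimpleDateFormat conventions. The class name DateTokens, the helper field(), and the sample timestamp are illustrative and not part of GemFire.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Locale;

public class DateTokens {
    // Parses text with the given pattern and returns one Calendar field of the result.
    // Hypothetical helper for illustration only.
    public static int field(String pattern, String text, int calendarField) {
        try {
            SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.US);
            Calendar calendar = Calendar.getInstance(Locale.US);
            calendar.setTime(format.parse(text));
            return calendar.get(calendarField);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Cannot parse '" + text + "' with " + pattern, e);
        }
    }

    public static void main(String[] args) {
        String pattern = "yyyy-MM-dd HH:mm"; // MM = month, mm = minute
        String text = "2008-06-25 14:37";
        // Calendar.MONTH is zero-based, so add 1 to get the calendar month.
        System.out.println("month  = " + (field(pattern, text, Calendar.MONTH) + 1));
        System.out.println("minute = " + field(pattern, text, Calendar.MINUTE));
    }
}
```

An engine that treats the two tokens as the same letter cannot tell the 06 from the 37 in this input, which is the behavior the note describes for versions prior to 6.0.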
05/29/08 Redundant buckets should always be on different host when possible 5.5 closed Redundant copies of data should always be on different hosts when possible GemFire tries to locate redundant copies of data on different physical hosts to protect the system from process failure as well as machine failure. In situations where multiple hosts are not available, redundant copies may be colocated on the same machine, protecting the system against process failure but not machine failure.
05/23/08 Instantiators are not sent from server to client when client connects. 5.5 closed Clients did not receive instantiators already registered on the server Instantiators enable optimization of the deserialization of DataSerializable types. In prior versions of GemFire, a client connecting to a server may not always receive the instantiators already registered on the server. In GemFire 6.1, these registered instantiators are sent by the server to the client during the connection setup.
04/25/08 Conflicting transaction can proceed if both the transaction manager and grantor crash 5.5.1 closed Conflicting transaction can proceed if the transaction manager crashes while distributing the commit If the transaction manager (the member performing the transaction) crashes while transaction participants are in the process of applying the commit, then it's possible for a new transaction to begin and commit with key conflicts that are not detected. There is a workaround for application members. This bug can be prevented by adding a method call in members with regions that are involved in transactions. After creating the GemFire Cache, make this call: com.gemstone.gemfire.internal.cache.locks.TXLockService.createDTLS(); This only needs to be done once for any Cache instance.
04/15/08 Installer throws FileNotFoundException when it is run from a directory with spaces 5.5.0 closed GemFire Installation fails with java.util.zip.ZipException when run from a directory with spaces in it The GemFire installer does not correctly handle spaces in the name of the directory which contains the installer itself. Note this is literally the directory that the installer is in, not the directory the user selected as the destination. The failure comes with a stack trace like this: The system cannot find the path specified Exception in thread "main" java.util.zip.ZipException: The system cannot find the path specified at java.util.zip.ZipFile.open(Native Method) at java.util.zip.ZipFile.<init>(ZipFile.java:203) at java.util.zip.ZipFile.<init>(ZipFile.java:234) at ZipSelfExtractor.extract(ZipSelfExtractor.java:99) at ZipSelfExtractor.main(ZipSelfExtractor.java:34) Move the Gemfire installer jar into a directory without spaces and rerun it.
04/11/08 thin clients get unexpected nulls from bridge server 5.5 closed gets begin to return null for keys that are known to be in the cache The symptom is that a client connected to a bridge server is doing puts and gets, and after about 90 seconds all gets start returning null even though data for those keys has already been put into the cache. 90 seconds is roughly how long it takes HotSpot to begin optimizing the ConnectionProxyImpl class in GemFire, at which point the problem manifests. First verify that this isn't simply a case of your eviction policy causing your data to be evicted before you do a get. The cause is a JVM optimization in Sun's 1.6.0_4 JRE and later versions. To identify this bug, start your JVM with -Xint to force the VM to run in interpreted mode (no HotSpot compilation will occur). You should then be able to perform a series of puts and gets for a sustained period (5-10 minutes ought to do it) without getting errant nulls back as values. Use the 1.6.0_3 JRE or earlier; the optimization is not present in these JRE versions. Alternatively, use a .hotspot file to prevent compilation of the problematic method. See Sun's documentation for more detailed information on using a file to control HotSpot compilation. Add this to your java command line: -XX:CompileCommandFile=someFile.txt Then inside someFile.txt add this single line: exclude com/gemstone/gemfire/internal/cache/tier/sockets/ConnectionImpl getObject
04/11/08 Missing CQ event (no HA) closed Missing CQ event during GII This could happen when events are destroyed while secondary buckets are being created: the key may not be included in the GII, and if that secondary later becomes primary and the event is re-routed, CQ processing does not find the value and fails. In 6.5, a change was made so that events are tracked/flushed during GII.
04/01/08 DistributedCacheOperation changes made on dev51 branch break replicate consistency 5.1 closed Possible inconsistency in regions with DataPolicy.REPLICATE When a new Region with DataPolicy.REPLICATE is created, it is possible that it will miss concurrent updates being applied to other Regions having the same data-policy. The window of time in which this can occur is minuscule, but it has been observed to happen in at least one of the v5.5 regression tests. The bug is fixed in GemFire 5.5. No workaround.
03/31/08 RegionMembershipListener doesn't work for PR 6.0 closed Partitioned Regions do not fire RegionMembershipListener events If a RegionMembershipListener is added to a Partitioned Region, the following methods do not fire for the listener: initialMembers, afterRemoteRegionCreate, afterRemoteRegionDeparture, afterRemoteRegionCrash.
03/27/08 Hitachi: HAClientQueue tries to participate in transaction, fails. 5.1 closed NullPointerException received during transaction commit on servers The configurations that could produce this exception are: 1) A client either a) registers interest in a region or b) creates a continuous query (aka CQ) with the region name in the query, both of which require the client property establishCallbackConnection=true 2) A server, to which the previously mentioned client is connected, performs an operation in a transaction that matches a) a region the client is interested in and b) matches the interest or CQ conditions the client has expressed. 3) The above transaction commits (versus rollback). The transaction can be initiated as a JTA transaction or a GemFire transaction. If the above configuration is met, the thread committing the transaction will receive a NullPointerException with a stack similar to the following: [severe 2008/03/25 16:39:49.471 PDS <Thread-4> nid=0x5f1ba8] CacheClientProxy[identity(client1(:loner):1:6364ecbb:ClientName1,connection=2); port=4623; primary=true]: Exception occurred while attempting to add message to queue java.lang.NullPointerException at com.gemstone.gemfire.internal.jta.TransactionImpl.registerSynchronization(TransactionImpl.java:197) at com.gemstone.gemfire.internal.cache.LocalRegion.getJTAEnlistedTX(LocalRegion.java:5173) at com.gemstone.gemfire.internal.cache.LocalRegion.put(LocalRegion.java:1098) at com.gemstone.gemfire.internal.cache.AbstractRegion.put(AbstractRegion.java:188) at com.gemstone.gemfire.internal.cache.ha.HARegionQueue.put(HARegionQueue.java:386) at com.gemstone.gemfire.internal.cache.tier.sockets.CacheClientProxy$MessageDispatcher.enqueueMessage(CacheClientProxy.java:1724) at com.gemstone.gemfire.internal.cache.tier.sockets.CacheClientProxy.processMessage(CacheClientProxy.java:674) at com.gemstone.gemfire.internal.cache.tier.sockets.CacheClientNotifier.deliver(CacheClientNotifier.java:693) at 
com.gemstone.gemfire.internal.cache.tier.sockets.CacheClientNotifier.notifyClients(CacheClientNotifier.java:376) at com.gemstone.gemfire.internal.cache.BridgeServerImpl.notifyClients(BridgeServerImpl.java:257) at com.gemstone.gemfire.internal.cache.LocalRegion.notifyBridgeClients(LocalRegion.java:3750) at com.gemstone.gemfire.internal.cache.LocalRegion.invokePutCallbacks(LocalRegion.java:3716) If this occurs, the transaction will have been partially applied to the local heap. It will not, however, have been distributed to other VMs that would have received the transaction updates. The cause of this failure is the internal usage of regions to deliver to the client interest and continuous query data, particularly in the face of server failures (aka highly available or HA). Avoid transactions on a Bridge Server.
02/29/08 Many EOF errors in cache.tier.sockets.HandShake 5.5 closed Servers and clients may report each other's failures incorrectly If a server crashes unexpectedly, a client that it was connected to may report the failure in a number of misleading ways, including indicating a corrupted message stream from that process. Likewise, if a client crashes unexpectedly, a server that it was connected to may report the failure in a number of misleading ways, including indicating a corrupted message stream from that process. Ignore these messages in the log.
02/28/08 memberCrashed is invoked when a new endpoint is added to a BridgeWriter gridDev branch closed memberCrashed is invoked when a new endpoint is added to a BridgeWriter When BridgeWriter.addEndPoint() is invoked to add a new endpoint to the bridge writer the memberCrashed method is invoked on the BridgeMembershipListener with the new endpoint even if the new endpoint is actually available. It should invoke the memberJoined method as soon as the endpoint is actually live.
01/31/08 jmx test failure: could not getDistributedSystem during initialize -- Caused by java.io.IOException: Failed to retrieve RMIServer stub: javax.naming.NameNotFoundException: jmxconnector 5.1 closed GemFire agent command fails with RMI Naming errors on Windows Server with IPv6 enabled See the Sun Microsystems bug on this issue: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6301779 An excerpt from that URL: "The problem is caused by the fact that Java does not handle IPv6 link-local addresses correctly. The reason this problem is only seen on amd64 is to do with the IPv6 default setup on Windows 2003 Server - it maps link-local addresses to interfaces so that a call to InetAddress.getAllByName() on W2003S will return link-local addresses. (No link-local addresses are returned on XP)." In GemFire this problem manifests as either a SocketBindException or a java.io.IOException: Failed to retrieve RMIServer stub: javax.naming.NameNotFoundException: jmxconnector. Workaround: use JRE 1.6.0 or higher, or add a line like this to C:\WINDOWS\system32\drivers\etc\hosts: fdf0:76cf::affd:9449:18 yourname.gemstone.com yourname where "fdf0:76cf::affd:9449:18" is a global IPv6 address for the machine named "yourname". You only need to add this to the hosts file on the machine "yourname". You do not have to add an entry for "yourname" to each machine on your network.
01/30/08 PartitionedRegion tests can hang with threads in gemfire/internal/util/IdentityHash.index() [ IBM VM ] 5.1.0.4 closed entry operations on PartitionedRegions can hang in IdentityHash.index() during DataSerializer.write() with IBM 1.5 VM With the IBM 1.5.0 VM, entry operations on PartitionedRegions can hang in IdentityHash.index() during DataSerializer.write(). This is extremely rare and is a suspected JIT issue with the IBM VM.
01/28/08 Vestigial instances of Timer prevent WAR undeploy 5.0 closed gemfire.jar does not correctly undeploy from an EJB server Once an EJB application server is connected to a distributed system, it may not be able to correctly undeploy gemfire.jar. If possible, try to configure your application server so that it does not attempt to undeploy the GemFire application.
01/20/08 Issue in CountDownLatch.await while creating disk region in diskRegionRecoveryAfterVmCrash.conf test SVN revision 18405 closed Cache member hangs while creating a region Under certain high availability conditions, a cache member may hang while attempting to recover a region from another member that has crashed. In order for this to happen, a cache member needs to be creating a local copy of a region at the same time that another cache member crashes. Kill and restart the hung cache member.
01/14/08 user defined DataSerializer instances need client server support 5.0, 5.1 closed Newly registered DataSerializer not recognized on cache server and clients Registration of a DataSerializer on a node with GemFire's data serialization framework was only propagated to other peer servers. It did not get propagated to clients. If the registration was done on a client, it was not sent to the cache servers. Registrations are now propagated to all cache servers and clients.
01/11/08 Transactions encapsulating multiple regions fail LRU eviction on recipient members 5.1 closed Transactions that include multiple regions cause LRU problems in remote caches This problem occurs when a GemFire transaction includes many Regions, like this: txmgr = cache.getCacheTransactionManager(); txmgr.begin(); region1.put("a", "one"); region2.put("b", "two"); region3.put("c", "three"); txmgr.commit(); and two or more of the regions have LRU eviction configured in VMs that are remote to the VM where the transaction originates. In this scenario, the LRU mechanism in the remote VMs does not consistently evict the proper number of entries. The problem does not affect eviction in the VM where the transaction originates. Only include a single region in a transaction or only have one region be configured with LRU behavior.
01/10/08 DLockTokens objects are not removed when the lock is released 5.1 closed DistributedLockService does not remove resources for tracking locks The DistributedLockService does not free up resources related to tracking locks. This also affects Global Regions, Partitioned Regions, and Gateway Hubs. Calls to DistributedLockService.freeResources(Object) do nothing, thus introducing a memory leak for each distributed lock that is acquired. The only workaround is to destroy the DistributedLockService. Destroying the DistributedLockService frees up all memory used to track locks. DistributedLockServices that are explicitly created and used must be destroyed to free up resources for all locks. For Global Regions, the Global region itself must be locally destroyed to free up all locking resources created for each key. For Partitioned Regions, the Cache must be closed to free up locking resources. For Gateway Hubs, the DistributedSystem must be disconnected to free up locking resources.
01/07/08 AssertionError: InitialImageOperation$RequestImageMessage ... Did not finish sending message, but didn't throw RegionDestroyed or CacheClosedException SVN revision 18253 closed Failed initial image creation may throw AssertionError If you close your cache while initializing the data in a distributed region, you may end up with a faulty AssertionError in the system logs. Ignore this assertion error. It is harmless.
11/25/07 PR regions do deserialization on remote bucket during get causing NoClassDefFoundError 5.1 closed Partitioned region puts throw NoClassDefFoundError on remote partitioned region members if the value class is not on the classpath A partitioned region put will fail with NoClassDefFoundError if the value Object's class is not on the classpath of every member that configures data storage for that partitioned region. The only members that should require the class are those that need the value in object form (for example the member that actually does a get to read the value or the member with a CacheListener that calls getNewValue). Add the value Object's class to all members that define the partitioned region.
11/21/07 memory leak when conserve-sockets false 5.0, 5.1 closed conserve-sockets=false may run out of sockets It is possible to see a member run out of sockets when using conserve-sockets=false. This can be caused by threads that own their own sockets having a short lifetime and new threads being created quickly that also own their own sockets. Call DistributedSystem.releaseThreadsSockets before a thread's life comes to an end. This can be done from a finally block on the thread's run method.
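The workaround in the entry above can be sketched as a wrapper that runs a thread's work and always performs a release step in a finally block. To keep the example compilable without gemfire.jar, the release step is passed in as a plain Runnable. In a real member it would be the DistributedSystem.releaseThreadsSockets call named in the note. The class name SocketReleasingTask is hypothetical.

```java
public class SocketReleasingTask implements Runnable {
    private final Runnable work;
    private final Runnable releaseSockets; // stand-in for the GemFire release call

    public SocketReleasingTask(Runnable work, Runnable releaseSockets) {
        this.work = work;
        this.releaseSockets = releaseSockets;
    }

    @Override
    public void run() {
        try {
            work.run();
        } finally {
            // Runs whether the work returns normally or throws,
            // so a short-lived thread never exits while still owning sockets.
            releaseSockets.run();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable release = () -> System.out.println("sockets released"); // assumption: placeholder only
        Thread thread = new Thread(
                new SocketReleasingTask(() -> System.out.println("working"), release));
        thread.start();
        thread.join();
    }
}
```

Wrapping the work this way keeps the release step in one place instead of relying on every thread body to remember it.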
11/01/07 OutOfMemoryError Causes Distributed System Failure 5.1 closed Improper handling of instances of Java VirtualMachineError When a Java virtual machine sends an instance of VirtualMachineError to a thread, it has indicated that it has broken the fundamental programming contract and can no longer be trusted. The most common instance of this is OutOfMemoryError, which will be sent to one Thread somewhere in the JVM. All other Threads are effectively suspended at their next attempt to allocate memory until either a) enough memory becomes available, or b) the original thread that was signaled disappears. In prior versions, GemFire did not properly handle VirtualMachineErrors. This improper handling manifested in numerous bugs in the system. GemFire now has a cooperative mechanism by which a cache member can reliably recuse itself from the distributed system when a VirtualMachineError occurs. Notice, however, that in order for this to be reliable, your applications must also correctly trap and signal VirtualMachineError when they are thrown. See the Javadocs for SystemFailure for details on this new API.
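The "trap and signal" requirement in the entry above can be illustrated with a small guard around application work. This is a sketch under stated assumptions, not the GemFire API: the signalling step is passed in as a java.util.function.Consumer, and in a real member it would be the call documented in the SystemFailure Javadocs. The class name VmErrorTrap and the simulated OutOfMemoryError are illustrative only.

```java
import java.util.function.Consumer;

public class VmErrorTrap {
    // Runs the work; if a VirtualMachineError surfaces, signals it and rethrows.
    public static void runGuarded(Runnable work, Consumer<Error> signalFailure) {
        try {
            work.run();
        } catch (VirtualMachineError err) {
            // Signal first and do as little else as possible: the JVM may be
            // unable to allocate memory at this point.
            signalFailure.accept(err);
            throw err; // never swallow the error
        }
    }

    public static void main(String[] args) {
        try {
            runGuarded(
                    () -> { throw new OutOfMemoryError("simulated for illustration"); },
                    err -> System.err.println("signalling failure: " + err));
        } catch (OutOfMemoryError expected) {
            System.err.println("error propagated to caller");
        }
    }
}
```

The important design point is that broad catch blocks in application code should not absorb VirtualMachineError silently, because the member then cannot remove itself from the distributed system.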
10/25/07 data-policy="partition" is insufficient, <partition-attributes/> is required to create PR 5.1 closed Partition region creation requires a partition-attributes element or a PartitionAttributes setting in the API Setting the region data-policy to 'PARTITION' should cause a region to be created as a partitioned region, but it doesn't. The data-policy setting is accurately reported, but this setting does not cause the region to partition its data. For the region to be created as a partitioned region, the region attributes must have a partition-attributes element in the cache.xml or a PartitionAttributes setting through the API. You do not need to set any non-default partition attributes settings, just use the partition attributes. In the xml, add a partition-attributes element to the definition of the region, even if the element is empty. In the API set the partition attributes through the region AttributesFactory setPartitionAttributes method, even if you just pass it a default PartitionAttributes instance.
10/05/07 throughput decreases as number of buckets increases GFD closed Partitioned Region read and write throughput decreases as the number of buckets increases For a given partitioned region, the larger the value for the totalNumBuckets attribute (setTotalNumBuckets), the smaller the throughput for create and get operations. During testing with 100 VMs participating in the partitioned region, 50 which store data, 50 which do not (setLocalMaxMemory to 0), the most dramatic change occurred when the totalNumBuckets attribute exceeded 499 buckets. Use fewer than 499 buckets; however, only testing will truly indicate the proper values.
10/03/07 Installation Paths with spaces will cause the Native Client msi to error on some systems 5.1 GA closed Installation paths with spaces prevent the Native Client from installing correctly While installing the Native Client on Microsoft Windows, if a path is specified that contains spaces, for example "C:\Program Files\GemStone\GemFire", the msi installer that is invoked from setupWin32_gf51.exe will fail causing a dialog box that details the msi command line syntax to appear. After dismissing this dialog the installation will continue and appear to have succeeded. The Gemfire installation itself is OK, but the Native Client installation is not: only the native_client.msi and a few html files are installed for the Native Client. Uninstall the product to clean up the system from the failed install. Then reinstall the product into a path without spaces.
10/02/07 partitioned region buckets are not balanced 5.1 closed Partitioned Region data storage is skewed When quickly loading data into a partitioned region, the number of buckets from one data store to the next may vary as much as 100%. Due to the seemingly random allocation of buckets, this requires that all VMs for the partitioned region have up to two times the required memory for actual storage. Increasing the maximum number of buckets exaggerates the problem. There are two ways to potentially work around this problem: 1) Artificially slow the rate of data loaded into a partitioned region. 2) Use the PartitionedRegionStats bucketCount statistic to detect imbalance from VM to VM; for each VM with the worst imbalance, introduce a new VM to the partitioned region and then shut down the offending VM.
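The second workaround in the entry above depends on comparing bucket counts across VMs. A minimal sketch of that comparison follows. It assumes the per-VM values of the PartitionedRegionStats bucketCount statistic have already been collected into a map; how they are gathered is outside the sketch. The class name BucketBalance, the member names, and the counts are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BucketBalance {
    // Returns the member that hosts the most buckets.
    public static String mostLoaded(Map<String, Integer> bucketCounts) {
        if (bucketCounts.isEmpty()) {
            throw new IllegalArgumentException("no members");
        }
        String worst = null;
        int max = Integer.MIN_VALUE;
        for (Map.Entry<String, Integer> entry : bucketCounts.entrySet()) {
            if (entry.getValue() > max) {
                max = entry.getValue();
                worst = entry.getKey();
            }
        }
        return worst;
    }

    // Ratio of the largest to the smallest bucket count; 1.0 means perfectly balanced.
    public static double skew(Map<String, Integer> bucketCounts) {
        if (bucketCounts.isEmpty()) {
            throw new IllegalArgumentException("no members");
        }
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int count : bucketCounts.values()) {
            min = Math.min(min, count);
            max = Math.max(max, count);
        }
        return min == 0 ? Double.POSITIVE_INFINITY : (double) max / min;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("vmA", 40); // hypothetical sample values
        counts.put("vmB", 20);
        counts.put("vmC", 25);
        System.out.println("most loaded member: " + mostLoaded(counts));
        System.out.println("max/min skew: " + skew(counts));
    }
}
```

A skew near 2.0 corresponds to the "as much as 100%" variation described in the note. The threshold at which a VM should be replaced is a judgment call that only testing can settle.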
10/01/07 member wrongly evicted by failure-detection does not recognize membership changes 5.1 closed Member that is kicked out of the distributed system may not realize it and continue to operate, eventually causing hangs If you are using the gemfire.useFD or gemfire.FD_TIMEOUT system properties to select the alternative GemFire UDP-heartbeat failure detection mechanism, a member can be forcibly disconnected from the distributed system if it does not respond quickly enough to "are you alive" messages. The member-timeout and gemfire.FD_TIMEOUT settings control this disconnect timeout. In version 5.1.x of GemFire, the disconnected member does not realize that it has been kicked out of the system and continues to try to operate. Eventually other members may hang. We have only observed this with the alternate failure detection mechanism and only under significant CPU load. However, setting a short member-timeout period may exacerbate the problem and cause it to happen more easily. Set a reasonably long member-timeout period when using gemfire.useFD, or set the timeout period with the deprecated gemfire.FD_TIMEOUT system property.
10/01/07 Region recovery from disk fails with "DiskAccessException: Failed loading keys from <diskReg dirs>, Caused by: java.io.EOFException" 5.1 (gfecq_branch) closed Region recovery from disk fails with "java.lang.Error: CRITICAL: page header magic for block *** not OK 0" When switching out files for repair, this exception may disrupt recovery from disk. Switching is done when a JDBM exception has been encountered at least once already.
09/25/07 Region destroy/close does not close LRUStatistics 5.0, 5.1 closed Eviction regions with short life span have unexpected memory and cpu consumption Region close, localDestroyRegion, and destroyRegion on a region with eviction configured will not close the LRUStatistics object. If a large number of region destroys are done, this can cause the statistic sampler to consume an entire CPU and the unclosed statistic object to consume around 100 bytes of memory. To prevent the memory leak avoid giving your LRU regions a short life span. This can be done by using region clear instead of doing a destroy/create. To prevent the CPU consumption, you can disable statistic sampling.
09/24/07 Unusually high number of eviction failures in trunk build 132 5.1 closed LRU Region eviction may happen early or late The LRU limit is not strictly complied with when doing evictions. Evictions might be done slightly early (causing less space to be used than was specified) or slightly late (causing more space to be used than was specified). You can set -Dgemfire.STRIPED_STATS_DISABLED=true to get the older version of statistics that causes strict compliance to the eviction limit.
09/23/07 Hang in waitForRegionCreateEvent of newly restarted VM during shutdown 5.1 closed Hang during Cache close In a client/server high-availability test that repeatedly destroyed and created Regions and Caches in multiple VMs, we experienced a hang in a server VM. The server was in the process of exiting, and the GemFire shutdown hook was attempting to close the Cache. A stack dump (kill -QUIT) showed the hung thread was waiting on initialization of a Region, but no other threads were involved with creating a Region. "vm_3_thr_3_bridge1_hs20c_11833" daemon prio=1 tid=0x085ab338 nid=0x2d24 in Object.wait() [0x5f0ce000..0x5f0ce5f0] at java.lang.Object.wait(Native Method) - waiting on <0x58cc26d8> (a com.gemstone.bp.edu.emory.mathcs.backport.java.util.concurrent.CountDownLatch) at java.lang.Object.wait(Object.java:432) at com.gemstone.bp.edu.emory.mathcs.backport.java.util.concurrent.TimeUnit.timedWait(TimeUnit.java:364) at com.gemstone.bp.edu.emory.mathcs.backport.java.util.concurrent.CountDownLatch.await(CountDownLatch.java:234) - locked <0x58cc26d8> (a com.gemstone.bp.edu.emory.mathcs.backport.java.util.concurrent.CountDownLatch) at com.gemstone.gemfire.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:53) at com.gemstone.gemfire.internal.cache.LocalRegion.waitOnInitialization(LocalRegion.java:3029) at com.gemstone.gemfire.internal.cache.LocalRegion.waitForRegionCreateEvent(LocalRegion.java:1633) at com.gemstone.gemfire.internal.cache.LocalRegion.dispatchEvent(LocalRegion.java:5290) at com.gemstone.gemfire.internal.cache.LocalRegion.dispatchListenerEvent(LocalRegion.java:4240) at com.gemstone.gemfire.internal.cache.LocalRegion.sendPendingRegionDestroyEvents(LocalRegion.java:4476) at com.gemstone.gemfire.internal.cache.LocalRegion.basicDestroyRegion(LocalRegion.java:3868) at com.gemstone.gemfire.internal.cache.DistributedRegion.basicDestroyRegion(DistributedRegion.java:1250) at
com.gemstone.gemfire.internal.cache.LocalRegion.handleCacheClose(LocalRegion.java:4515) at com.gemstone.gemfire.internal.cache.DistributedRegion.handleCacheClose(DistributedRegion.java:1700) at com.gemstone.gemfire.internal.cache.GemFireCache.close(GemFireCache.java:581) - locked <0x470b5ef8> (a java.lang.Class) - locked <0x4b1cfe68> (a com.gemstone.gemfire.internal.cache.GemFireCache) at com.gemstone.gemfire.distributed.internal.InternalDistributedSystem.doDisconnects(InternalDistributedSystem.java:773) at com.gemstone.gemfire.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:904) at com.gemstone.gemfire.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:668) at com.gemstone.gemfire.distributed.DistributedSystem.disconnect(DistributedSystem.java:960) at hydra.RemoteTestModule$2.run(RemoteTestModule.java:372) No workaround
09/21/07 sudden heap growth in multicast smoke performance test 5.0 closed Multicast retransmissions cause a slow memory leak When using distribution scopes of DISTRIBUTED_ACK or GLOBAL with multicast-enabled=true, it is possible (though unlikely) that a VM will experience a memory leak. The leak is caused by multicast retransmission logic and can cause the VM to run out of heap space. Change your configuration to use TCP instead of multicast
09/18/07 local scope persistent regions do not allow register interest 5.0 closed CacheWriterException thrown from registerInterest on local persistent replicates When the registerInterest method is called on a region with local scope and persistence enabled it will always throw a CacheWriterException with the message "Interest registration not supported on replicated regions".
09/16/07 Assertion: Commit data for TXLockId not found; expected values not distributed to all peers 5.1 closed Severe log messages indicating transaction failures A VM configured with conserve-sockets=false that originates a transaction may cause severe log messages in a receiving VM similar to the following: Uncaught exception processing CommitProcessForLockIdMessage@17373340 lockId=TXLockId: newton(18461):40211/45363-2 java.lang.AssertionError: Commit data for TXLockId: TXLockId: newton(18461):4021 An indicator of the problem on the sending VM is the occurrence of warning messages starting with the text: "Attempting TCP/IP reconnect to" Regardless of the conserve-sockets setting, this failure should not occur when the transaction contains only Scope.DISTRIBUTED_NO_ACK regions. Avoid mixing transactions and conserve-sockets=false in the same VM.
09/05/07 PR put fails with AssertionError 5.1 closed Calling getRegion on RegionExistsException returns partially initialized region. If you create root regions, catch RegionExistsException, and then call the getRegion method on the RegionExistsException, the region returned may not yet be initialized. The workaround is to wait for initialization before you use the region returned by getRegion(): import com.gemstone.gemfire.internal.cache.LocalRegion; catch (RegionExistsException ex) { LocalRegion lr = (LocalRegion)ex.getRegion(); lr.waitOnInitialization(); /* it is now ok to use the region returned by getRegion */ }
09/05/07 DistributedSystem.connect() fails to return existing system 5.0 closed DistributedSystem.connect() fails to return existing system Calling DistributedSystem.connect() can result in the exception java.lang.IllegalStateException: A connection to a distributed system already exists in this VM. It has the following configuration: followed by the configuration. This bug is caused by the mcast-flow-control setting not being properly handled when comparing the properties passed to the connect method (or provided in gemfire.properties) with the properties already held in existing system(s). No workaround except to remove the mcast-flow-control setting from the properties. This bug is fixed in GemFire v5.1.
08/31/07 split-brain in partitioned region: same partitioned region with multiple prId identifiers 6.0 closed Split brain in partitioned regions There is a rare race condition that can occur in assigning an internal identifier to a partitioned region. The condition causes the system to assign more than one identifier to a single partitioned region, with some processes using one identifier and some using another. Because of this, the processes with one identifier do not recognize operations performed on the Region by the processes using the other identifier and vice-versa. We have not been able to isolate the cause of this race condition. It occurs very rarely and appears to happen when many processes attempt to initialize at the same time. We have added a distributed consistency check that verifies that the correct internal identifier is being used. If the consistency check fails, you will see a warning message in one of two forms: node(processID)memberID is using PRID 1 for regionName but this process maps that PRID to 2 node(processId)memberID is using PRID 1 for regionName but this process is using PRID 2
08/15/07 ArrayIndexOutOfBoundsException when log-disk-space-limit is set 5.0.1 closed ArrayIndexOutOfBoundsException when log rolling enabled and a log-disk-space-limit configured When log rolling is enabled and a log-disk-space-limit is configured then the code that checks the disk space limit may throw an ArrayIndexOutOfBoundsException. An example stack follows: Caused by: java.lang.ArrayIndexOutOfBoundsException: 3 at com.gemstone.gemfire.internal.ManagerLogWriter.checkDiskSpace(ManagerLogWriter.java:440) at com.gemstone.gemfire.internal.ManagerLogWriter.checkDiskSpace(ManagerLogWriter.java:452) at com.gemstone.gemfire.internal.ManagerLogWriter.switchLogs(ManagerLogWriter.java:213) at com.gemstone.gemfire.internal.ManagerLogWriter.rollLog(ManagerLogWriter.java:457) at com.gemstone.gemfire.internal.ManagerLogWriter.put(ManagerLogWriter.java:496) at com. Since rolling logs also leaks file descriptors you should disable log rolling in 5.0.1 by setting log-file-size-limit to zero. If you are willing to live with the file descriptor leak then you can work around this ArrayIndexOutOfBoundsException by setting log-disk-space-limit to zero.
07/25/07 put from client into PR region fails with IMQException returned from cacheserver pr Feb branch closed IMQException while doing put in PR. A put from a client into a PR region fails with an IMQException returned from the cacheserver. Reason: For a query to work correctly, it has to have a real Java object (POJO) to work with. This poses an interesting situation for any kind of remote query, that is, a query sent from one VM to another. The issue arises when the remote VM, whose object storage may be in serialized form (true for Partitioned Regions as well as Cache/Bridge Servers), needs to de-serialize the stored object into a POJO. If the class can't be de-serialized, then the query fails. So the user needs to know the steps to allow for successful de-serialization to avoid the problem described in this bug.
06/21/07 For bucket id 26, expected 2 members in primary list, but found 3 prFeb07 closed Partitioned Region meta data may contain incorrect information after VM failures For a given Partitioned Region, if participating VMs have failed either through network problems, hardware failures, or software crashes, Partitioned Region meta data for a given participant may contain incorrect information for one or more buckets. The result of such incorrect information is potentially slower access to the information in that bucket. The higher the redundantCopies setting the greater the potential to become incorrect. The redundantCopies setting 0 does not suffer from this issue.
06/20/07 non-zero log-file-size-limit causes file descriptors to leak 5.0 closed Non-zero log-file-size-limit causes file descriptors to leak Configuring GemFire to roll log files by specifying a log-file-size-limit other than 0 can result in a leak of a file descriptor every time GemFire rolls the log file.
06/14/07 CacheTransactionManager can refer to a closed DistributedSystem all closed Cache Transaction Manager may refer to closed distributed system If you close your distributed system and then create a new one, your cache transaction manager may attempt to use the old (closed) distributed system. Transactions may fail or erroneously appear to succeed. If you use transactions, do not close your distributed system after creating it. Exit the JVM if you need to create a new distributed system.
06/05/07 poor get performance for partition region prFeb07_branch closed Partitioned Region get performance degraded Performing a get() on a Partitioned Region is 3x worse than release 5.0.1.
04/25/07 GemFire transaction svc doesn't do proper write-write conflict detection 5.0.1 closed write-write conflicts not always detected If a key is read in one transaction, another transaction modifies the key, and finally the first transaction modifies the key, the conflict is not detected and the transaction is committed.
04/10/07 Java-level deadlock in InternalDistributedLockService.checkLockGrantorInfo leads to stuck lock and hung message reader thread 5.0 closed Java deadlock in DistributedLockService can lead to stuck lock and hung message reader Destroying a DistributedLockService while there are pending lock requests still active can result in those pending locks becoming stuck and unavailable system-wide until the VM that requested such a lock disconnects. In addition, the VM may quit processing messages sent by the member from which it was acquiring the lock remotely. This affects all features that use DLS. For example, Global Regions must lock the key in order to put or destroy the cache entry. Any calls to do so for a key that has a stuck lock would then hang until the VM that caused the problem disconnects from the system. In general, when you close or destroy a feature that uses a DistributedLockService, then that DistributedLockService is destroyed. The workaround is to destroy the DLS when there are no threads actively trying to acquire locks.
03/24/07 GemFireCache.close is not thread safe 5.0.1 closed GemFireCache.close is not thread safe If one thread attempts to create a new cache while another thread is closing the old cache, one or more static resources may be nulled out, left in an unknown or incorrect state, or never cleaned up. Use the same thread to close the old cache and create the new cache.
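One way to follow the "same thread" workaround above is to route every cache close and create step through a single dedicated thread. The following is a minimal sketch under that assumption; `CacheLifecycle` and `run` are hypothetical names, and the `Supplier` passed in stands in for whatever code the application uses to close or create its cache.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

// Hypothetical helper: serializes cache lifecycle steps on one thread.
final class CacheLifecycle {
    // A single-thread executor guarantees that submitted steps never overlap
    // and that they all execute on the same thread.
    private final ExecutorService lifecycleThread = Executors.newSingleThreadExecutor();

    // Runs one lifecycle step (for example "close the old cache" or
    // "create the new cache") and waits for it to finish.
    <T> T run(Supplier<T> step) {
        Callable<T> task = step::get;
        try {
            return lifecycleThread.submit(task).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for lifecycle step", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("lifecycle step failed", e.getCause());
        }
    }

    // Lets the lifecycle thread exit once no more steps are expected.
    void shutdown() {
        lifecycleThread.shutdown();
    }
}
```

Any thread may call `run`, but the close and the subsequent create are executed one after the other on the executor's thread.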
03/20/07 PR-HA test hangs while waiting to connect to killed VM 5.0.1 closed System deadlock during conditions of extreme membership volatility Under certain conditions with volatile membership changes (cache members departing under busy conditions), there is a potential for system deadlock. The confused cache member will have a message similar to the following in its logs: [warning 2007/03/19 22:14:47.635 PDT gemfire3_huey_22603 <vm_7_thr_9_client3_huey_22603> nid=0x1a] Error sending message to huey(22596):56886/48525 (will reattempt): java.net.ConnectException: Connection refused The best solution is to avoid conditions of extreme membership volatility (cache members arriving and departing with great frequency). If this condition is detected in a running system, the deadlock can be safely broken by killing the hung cache member.
03/16/07 StateFlushOperation may hang with Global scoped regions 5.0.1 closed New replicate in region with global scope can cause system hang If a region has global scope, it is possible for a new replicate to cause a hang in the distributed system. Operations on regions with global scope are not performed in token mode but are put in the waiting thread pool until the region they're modifying is done with getInitialImage. StateFlushOperation will invade other VMs and wait for these messages to finish being processed before allowing the getInitialImage to complete. Not applicable.
03/14/07 BridgeClient receives BridgeWriterException: InterruptedException on region.get() with server in the process of shutting down (due to InterruptedException/shutdown in progress issues) 5.0.1 closed Cache member shutdown is not reliable Under certain circumstances, especially if there are outstanding operations in a cache member, there is a possibility that the cache member will hang (not completely exit) during shutdown processing. If a cache member does not completely exit, it is safe to directly kill its process using operating system tools (kill -9 in Solaris or Linux, or the task manager in Windows).
03/02/07 Query shortcut on Region doesn't use index 5.0.1 closed Region.query shortcut method does not use Indexes The query shortcut method in the Region interface does not make use of indexing. Also the QueryService Query instances do not use indexing if the region is passed in as a parameter to the query. Use Query instances obtained from the QueryService and reference regions by full path rather than by passing them in as parameters.
03/02/07 NPE reported from GrantorRequestProcessor.startElderCall() 5.0.1 closed NPE reported from DLockRequestProcessor The NullPointerException is caused by an assertion error. Lock grants that arrive after the lock service is destroyed must be released to prevent a stuck lock. This NPE causes the associated lock to remain stuck until the VM's Distributed System connection closes. This is the error output to the logs: [severe 2007/03/02 12:45:43.536 PST gemfire5_newton_24123 nid=0x75407bb0] Uncaught exception processing DLockRequestProcessor.DLockResponseMessage responding GRANT; serviceName=Partitioned Region Lock Service; objectName=#partitionedRegion; responseCode=0; keyIfFailed=null; leaseExpireTime=9223372036854775807; processorId=807; lockId=807 java.lang.NullPointerException at com.gemstone.gemfire.distributed.internal.locks.GrantorRequestProcessor.startElderCall(GrantorRequestProcessor.java:209)
02/27/07 List of departed members grows without bound inside of VMs 5.0 closed Frequent cache membership changes use memory, degrade performance When a system member leaves a distributed system, the remaining system members place it in a departed member list. This list is never cleared and thus grows without bound. The list uses a certain amount of extra memory, but--more importantly--as successive members depart the distributed system, the amount of processing time associated with handling the departures increases. If the membership of your distributed system is rather stable (a small number of departures), no workaround is required. If, however, your configuration requires a large number of cache members to join and depart, you need to restart any long-lived cache members on a periodic basis to prevent performance degradation or possibly even memory exhaustion.
02/16/07 ValueConstraint will causes all objects to be deserialized 5.0 closed Setting a value constraint for a region's values causes all objects to be deserialized The ValueConstraint region attribute allows you to declare the class of all the values for a region. But if you specify a constraint, then every value in the region must be deserialized to check the constraint. none/not applicable
02/16/07 Slow gateway shutdown can leave cache open 5.0 closed Slow gateway shutdown can produce CacheExistsException If you close and reopen a cache that has a gateway, on rare occasions this produces a CacheExistsException. Stop the VM and restart it.
02/14/07 PartitionedRegionException: registerPartitionedRegion: /PartitionedRegion_9 caught exception dumpPRId:prIdToPR Map@18550851: caused by java.lang.InternalError: Got RegionExistsException 5.01 closed Creation of a PartitionedRegion may fail with exception "registerPartitionedRegion: /PartitionedRegion_9 caught exception dumpPRId:prIdToPR" During concurrent creation and destruction of a partitioned region with a specific name, it is possible for a PartitionedRegionException to be thrown during createRegion with the message "registerPartitionedRegion: /PartitionedRegion_9 caught exception dumpPRId:prIdToPR". Catch this exception and re-create the region.
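The "catch this exception and re-create the region" workaround above can be sketched as a small bounded-retry helper. This is an illustration only: `RegionCreateRetry` and `retry` are hypothetical names, the `Supplier` stands in for the application's own region creation call, and `RuntimeException` is used as a stand-in for the PartitionedRegionException named in the note.

```java
import java.util.function.Supplier;

// Hypothetical helper: retries an action a bounded number of times.
final class RegionCreateRetry {
    private RegionCreateRetry() {}

    // Calls action until it succeeds or maxAttempts is reached, pausing
    // pauseMillis between attempts. The last failure is rethrown.
    static <T> T retry(Supplier<T> action, int maxAttempts, long pauseMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(pauseMillis);
                    } catch (InterruptedException ie) {
                        // Preserve the interrupt and give up early.
                        Thread.currentThread().interrupt();
                        throw e;
                    }
                }
            }
        }
        throw last;
    }
}
```

In an application, the supplier would wrap the region creation call, and the catch clause would be narrowed to the specific exception being worked around so that unrelated failures are not retried.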
02/08/07 Bridge hangs on close waiting for a GrantorRequest response from a member that has departed the DS 5.01 closed Cache server shutdown can cause a system-wide hang On rare occasions, a cache server can experience a problem during shutdown that causes a system-wide hang. This situation happens when the server tries to shut down while it is waiting on a response from another member that has left the distributed system. The server logs a message of this type: [severe ... ] While pushing message <message> to <recipients> com.gemstone.gemfire.ThreadInterruptedException: sleep interrupted Caused by: sleep interrupted, caused by java.lang.InterruptedException: sleep interrupted This problem does not cause data corruption, and the distributed system will restart successfully. Kill your processes and restart all your system members according to your usual procedures. none/not applicable
02/07/07 Multiple ServerMonitors with same-named endpoints can cause recursive endpoint died/recovered cycle 5.0.1 closed Multiple server definitions with the same name and port can cause a client to enter an endless loop This problem only affects clients running on very fast systems. On fast systems, if any two instances of BridgeLoader or BridgeWriter define the same server name and port pair, a loss of server connection can send the client's server health monitor into an endless loop. The health monitor maintains the client's live and dead server lists. When the client enters into this loop, it appears as if the servers are going up and down. Define each server name and port pair exactly once for any client VM. This means that the BridgeLoader and BridgeWriter for a single region must use different names for the same server endpoint. It also means that you mustn't create multiple instances of a single BridgeWriter or BridgeLoader definition. Starting with version 4.3, the API automatically manages reuse of the same loader and writer instances when the definitions are the same, so no explicit action is required on your part. This example shows how to avoid defining the BridgeLoader and BridgeWriter with the same name and port pairs: Properties writerProps = new Properties(); writerProps.setProperty("endpoints", "serverWA=localhost:44441,serverWB=localhost:44442"); BridgeWriter bWriter = new BridgeWriter(); bWriter.init(writerProps); Properties loaderProps = new Properties(); loaderProps.setProperty("endpoints", "serverLA=localhost:44441,serverLB=localhost:44442"); BridgeLoader bLoader = new BridgeLoader(); bLoader.init(loaderProps); This problem is fixed in version 5.0.1.
01/31/07 A bridge client putting an empty byte array causes a server NullPointerException 5.0.1 closed Empty byte[ ] causes exception in client/server topology In a client/server topology, you can't put an empty byte[] into the cache as a value. You can have an empty byte[] key. A client attempting to put an empty byte[] into the cache causes the following exception on the server: [java] [warning 2007/01/31 13:27:22.285 PST "server" <ServerConnection 0.0.0.0/0.0.0.0:44444 Thread 12> nid=0x1e4a47e] Server connection from [identity(bishop(:loner):1:0d64f66b,connection=2); port=52631]: Unexpected Exception [java] java.lang.NullPointerException [java] at com.gemstone.gemfire.internal.cache.tier.sockets.ServerConnection.run(ServerConnection.java:632) . . . One workaround would be to store a one-element byte[] as the value.
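One possible way to apply the "one-element byte[]" workaround above without losing the distinction between an empty value and a real one-byte value is to frame every value with a leading marker byte. This framing scheme is an assumption made for illustration and is not part of the product; `EmptyValueCodec` and its methods are hypothetical names.

```java
import java.util.Arrays;

// Hypothetical codec: every stored value gets one leading marker byte,
// so an empty payload is stored as a one-element array.
final class EmptyValueCodec {
    private static final byte MARKER = 0;

    private EmptyValueCodec() {}

    // Returns the payload prefixed with the marker byte.
    static byte[] encode(byte[] payload) {
        byte[] framed = new byte[payload.length + 1];
        framed[0] = MARKER;
        System.arraycopy(payload, 0, framed, 1, payload.length);
        return framed;
    }

    // Strips the marker byte and returns the original payload.
    static byte[] decode(byte[] framed) {
        if (framed.length == 0) {
            throw new IllegalArgumentException("framed value must contain the marker byte");
        }
        return Arrays.copyOfRange(framed, 1, framed.length);
    }
}
```

The client would call `encode` before a put and `decode` after a get; every reader and writer of the region has to agree on the framing.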
01/31/07 hang during shutdown in TimeScheduler 5.0 closed Hang during DistributedSystem disconnect in TimeScheduler Under very rare circumstances, the DistributedSystem in GemFire may hang when it is being disconnected. Symptoms of this problem are: (1) The thread that is disconnecting the DistributedSystem will be in com.gemstone.org.jgroups.util.TimeScheduler.stop. (2) A thread named TimeScheduler.Thread will be in this state: Object.wait() [0xffffffff5bbff000..0xffffffff5bbff828] at java.lang.Object.wait(Native Method) - waiting on <0xffffffff6558e7e0> (a com.gemstone.org.jgroups.util.TimeScheduler$TaskQueue) The hang will not affect other processes. The hung VM should be terminated manually. If you encounter this defect, please contact GemStone Technical Support. No workaround
01/18/07 SystemConnectException: Received no connection acknowledgements from any one of the 1 senior cache members, but both members have each other in their view 5.0 closed Timeout receiving startup responses When a cache member joins an existing distributed system, it must receive an acknowledgment from at least one senior member of the system. If it fails to receive a response in a timely manner, the cache member's startup will fail with a message similar to this: Received no connection acknowledgments from any of the 1 senior cache members: This is usually an indicator of a grossly overloaded system that will not perform satisfactorily in a production environment. If it is not possible to reconfigure your system to allow cache members to respond more quickly, tune the system property DistributionManager.STARTUP_TIMEOUT which controls the amount of time a cache member waits for replies. The default value is 15000 ms (15 seconds), and raising this value may alleviate this symptom.
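As a rough illustration of the tuning described above, the sketch below raises the timeout through a JVM system property before the application connects. The property name and the 15000 ms default come from the note; the helper class, the 30000 ms value used in the usage note, and the assumption that the property must be set before the first connect are illustrative only.

```java
// Hypothetical helper: raises the startup timeout system property.
final class StartupTimeoutTuning {
    // Property name and default as stated in the bug note.
    static final String PROPERTY = "DistributionManager.STARTUP_TIMEOUT";
    static final int DEFAULT_MILLIS = 15000;

    private StartupTimeoutTuning() {}

    // Sets the property to the larger of its current value and desiredMillis,
    // and returns the value now in effect. Never lowers an existing setting.
    static int raiseStartupTimeout(int desiredMillis) {
        int current = Integer.getInteger(PROPERTY, DEFAULT_MILLIS);
        int effective = Math.max(current, desiredMillis);
        System.setProperty(PROPERTY, Integer.toString(effective));
        return effective;
    }
}
```

Calling `StartupTimeoutTuning.raiseStartupTimeout(30000)` early in `main` has the same effect as passing `-DDistributionManager.STARTUP_TIMEOUT=30000` on the java command line.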
01/15/07 Unable to start a cacheserver on Win2003 64-bit edition 5.0 closed GemFire batch files do not execute correctly on 64-bit versions of Windows The origin of this problem is that DOS evaluates variables as it reads the line, so set PATH=someString (x86);%PATH% is expanded to set PATH=someString (x86); ACTUAL VALUE OF THE PATH VARIABLE. Because of the parentheses, the expression is further expanded into two separate commands, like this: set PATH=someString (x86); ACTUAL VALUE OF THE PATH VARIABLE. The first command executes correctly and the second causes an error. For GemFire Enterprise, a path containing parentheses "()" breaks the setenv.bat script, leaving the PATH without the gemfire.dll. This forces the GemFire application into Pure Java mode. This problem is not limited to 64-bit Windows; it is just more reproducible there because WOW64 replaces paths like c:\windows\system32 with "C:\windows\system32 (x86)", causing more errors than might be caused by regularly specified paths. Avoid references to paths such as "C:\Program Files" and "C:\windows\system32" in your PATH environment variable.
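To find PATH entries that could trigger the problem described above, a small check can list every entry containing a parenthesis. This is a minimal sketch; `PathParenCheck` and `entriesWithParentheses` are hypothetical names, and the check only reports entries, it does not change the environment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical check: reports PATH entries that contain "(" or ")".
final class PathParenCheck {
    private PathParenCheck() {}

    // Splits pathValue on the literal separator and returns, in order,
    // the entries that contain a parenthesis.
    static List<String> entriesWithParentheses(String pathValue, String separator) {
        List<String> risky = new ArrayList<>();
        if (pathValue == null || pathValue.isEmpty()) {
            return risky;
        }
        for (String entry : pathValue.split(Pattern.quote(separator))) {
            if (entry.indexOf('(') >= 0 || entry.indexOf(')') >= 0) {
                risky.add(entry);
            }
        }
        return risky;
    }
}
```

A typical call would be `PathParenCheck.entriesWithParentheses(System.getenv("PATH"), java.io.File.pathSeparator)`, run on the host before starting the GemFire batch files.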
01/12/07 getElderState hangs waiting for reply from remote VM which appears hung in getGrantorForRemoteElderRecovery 5.0 closed A VM departure with multiple global regions or lock services can cause a system-wide hang If you have more than one global region or more than one instance of DistributedLockService in your distributed system, on rare occasions a VM departure can cause a system-wide hang. The hang affects all VMs that use either the DistributedLockService or any features that rely on the DistributedLockService, such as global regions, partitioned regions, and transactions. none/not applicable
01/11/07 Clients can not use registerInterest on regions with DataPolicy EMPTY 5.0 closed IllegalStateException when client calls registerInterest on a region with data-policy of empty If a bridge client tries to register interest on a region whose data policy is empty, the call returns an IllegalStateException saying 'No mirror type corresponds to data policy "EMPTY".' This error message refers to the deprecated mirror-type region attribute, which has been subsumed by the data-policy attribute. The fundamental bug in this case is that the product does not allow you to register interest in a region with data policy set to empty. There is no workaround in this version of the product. If you need to use an empty data policy and register interest in a client region, upgrade to GemFire Enterprise version 5.0.1.
12/18/06 Unable to install GFE 5.0 on Windows Vista 5.0 closed Unable to install GFE 5.0 on unsupported platform The GemFire Enterprise installer provides product installation only for the supported platforms. Generally, to install and try the product on an unsupported platform, you should contact the GemStone technical support for a .zip file. If you want to install on Windows Vista, you can install on an XP machine and then copy the product tree to the Vista machine.
12/11/06 hang in parRegCreateDestroy waiting for replies 5.0RC1 closed Heavily loaded systems may cause membership failure If a TCP/IP connection between two cache members is disrupted by extremely heavy system loading, it is possible for one or more members of the distributed system to incorrectly assume that a peer has departed the system. This leads to an inconsistent accounting--between cache members--of the currently active members of the system. This in turn can lead to cache corruption or system deadlocks. The level of loading required to generate this type of failure is huge. For instance, one test case in-house had a CPU load of 40 (many cache members on a single underpowered host) running for 15 minutes before this failure reproduced. Users should be careful to monitor processor utilization on the hosts running GemFire cache members and to avoid extreme overloading.
12/05/06 Inconsistent PR data, too many bucket owners 5.0RC1 closed IO Exceptions can cause data loss when Partitioned Region redundantCopies=0 This problem occurs when there are no redundant copies in a PartitionedRegion. Under some failure conditions during communication it is possible for data loss to occur. These include the following failure types: [warning ... ] Ran out of thread owned resources so switching to conserve-sockets=true. Because: com.gemstone.gemfire.internal.tcp.ConnectExceptions: Could not connect to: somehost(15188):2243/2165 Causes: {java.io.IOException: An existing connection was forcibly closed by the remote host} [warning ... ] Failed sending {com.gemstone.gemfire.internal.cache.UpdateOperation$UpdateMessage(region path='/__PRRoot/__Bucket2NodeRegion_#partitionedRegion'; sender=somehost(16924):2225/2162; callbackArg=null; processorId=0; op=CREATE; appliedOperation=false; earlyAck=false; directAck=true; lastModified=0101010101010; key=105; newValue=null; valueIsSerialized=true)} to member {somehost(11188):2210/2160} with stub {tcp:///192.168.1.1:2160} who is now considered to have crashed because: com.gemstone.gemfire.internal.tcp.ConnectionException: Not connected to tcp:///192.168.1.1:2160 [warning ... ] Error sending message to somehost(16924):2225/2162: java.io.IOException: An established connection was aborted by the software in your host machine Configure your application to allow for data loss such that the storage of record can be accessed via a CacheLoader. Restart all members reporting such warnings in addition to those members referred to in the warning messages.
11/30/06 Internal PartitionedRegionException is thrown from public API 5.0 closed Internal PartitionedRegionException is thrown from public API Some partitioned region operations throw product internal exceptions, such as com.gemstone.gemfire.internal.cache.PartitionedRegionException. Typically these exceptions indicate internal problems with the product. If they do occur, please contact support with the exception, all associated logs and statistic files.
11/20/06 inconsistent bucket stores in partitioned region with redundancy=1 5.0 closed Inconsistent bucket stores in partitioned region with redundancy=1 There is a race condition in the propagation of entry operations in partitioned regions that can cause inconsistent data, resulting in the order of operations being mixed. For any given entry operations at any given time, ensure that there is only one writing thread. One way to accomplish this is to use the DLock system to order operations.
11/09/06 JMX tests fail with OOM with 3.0.2 libraries for MX4J (and 1.4.2 JRE) 5.0 closed JMX Agent unstable in GemFire version 5.0 The JMX Agent is unstable in GemFire 5.0. GemFire 5.0 uses MX4J 3.0.1 (for both JDK 1.4 and 1.5) which has serious bugs causing OutOfMemory errors. The main errors that you might see occur during method invocation on MBeans that are hosted in the GemFire JMX agent. The errors are java.lang.OutOfMemoryErrors wrapped inside javax.management.MBeanExceptions. The 5.0 agent should not be used in production systems, but may be used for development or testing purposes. There is no suitable workaround in 5.0. We recommend upgrading to version 5.0.1 to resolve this problem. The 5.0.1 version of GemFire uses JDK 1.5 JMX for the 1.5 JDK and MX4J 2.0.1 for the 1.4 JDK.
10/12/06 Uncaught InterruptedException in ServerConnection thread (ThreadPoolExecutor) 5.0 closed Uncaught InterruptedException in ServerConnection thread (ThreadPoolExecutor) During bridge server shutdown, the ServerConnection ThreadPoolExecutor may log a message of this type: [severe 2006/10/11 23:38:59.214 PDT gfserver1 nid=0x9c1a1bb0] Uncaught exception in thread java.lang.InterruptedException ... This is a small bug in shutdown handling that has no negative effect on VM health or behavior. You can safely ignore the message.
10/05/06 Restarted VM fails to createVMRegion due to PartitionedRegionException: Could not get Partitioned Region from Id 2 5.0 closed Creation of a PartitionedRegion may fail with exception "Could not get Partitioned Region from Id" During creation of a Partitioned Region, an identifier is created. There are conditions under which the identifier creation/discovery process fails for a given VM. This failure causes a PartitionedRegionException to be thrown during Region creation. Typically the cause of such a failure is related to distributed race conditions. Catch the exception and retry the creation operation.
09/07/06 Unexpected keys found in partitionedRegion (region size is greater than expected) 5.0 closed Unexpected keys found in partitioned region (region entry count is greater than expected) This kind of data inconsistency can happen when concurrent destroy and create/put operations are performed on an entry by multiple threads. The threads can be in any number of VMs. Either use redundantCopies=0, or if that is not possible, prevent concurrent entry operations (put, invalidate, destroy) on a per-entry basis. If the writing operations can be limited to a single VM, use synchronization to coordinate threads in that VM. If the writing threads must be distributed among multiple VMs, use the DLock system to coordinate entry write operations.
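For the single-VM case described above, per-entry coordination can be sketched with lock striping: each key maps to one of a fixed set of monitors, so writes to the same entry are serialized while writes to different entries mostly proceed in parallel. This is an illustration only; `PerEntryWriter` is a hypothetical name and a plain `Map` stands in for the region. It does not cover writers in other VMs, for which the note recommends the DLock system.

```java
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical wrapper: serializes write operations per entry within one VM.
final class PerEntryWriter<K, V> {
    private final Map<K, V> region;   // stand-in for the region being written
    private final Object[] stripes;   // one monitor per stripe

    PerEntryWriter(Map<K, V> region, int stripeCount) {
        if (stripeCount < 1) {
            throw new IllegalArgumentException("stripeCount must be at least 1");
        }
        this.region = region;
        this.stripes = new Object[stripeCount];
        for (int i = 0; i < stripeCount; i++) {
            stripes[i] = new Object();
        }
    }

    // The same key always maps to the same monitor.
    private Object stripeFor(K key) {
        return stripes[Math.floorMod(key.hashCode(), stripes.length)];
    }

    void put(K key, V value) {
        synchronized (stripeFor(key)) {
            region.put(key, value);
        }
    }

    void destroy(K key) {
        synchronized (stripeFor(key)) {
            region.remove(key);
        }
    }

    // Read-modify-write on one entry; a null result removes the entry.
    V update(K key, UnaryOperator<V> fn) {
        synchronized (stripeFor(key)) {
            V next = fn.apply(region.get(key));
            if (next == null) {
                region.remove(key);
            } else {
                region.put(key, next);
            }
            return next;
        }
    }
}
```

The wrapper only helps if every writing thread in the VM goes through it; a thread that writes to the region directly bypasses the coordination.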

Bugs with workaround:

Created Summary Ver Status Bugnote title Bugnote description Workaround
08/05/10 primary queues leak thread identifiers assigned clients that cycle many threads can cause server memory leak Long-lived subscription-enabled clients that cycle many threads cause a memory leak on the server that contains their primary queue. Reuse threads on the client if possible.
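The "reuse threads on the client" advice above can be illustrated with a fixed-size thread pool: many tasks are executed, but only a bounded set of threads ever performs them. This is a minimal sketch; `ReusedClientThreads` is a hypothetical name and the task body is a placeholder for the client's cache operations.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical demonstration: runs many tasks on a bounded set of threads.
final class ReusedClientThreads {
    private ReusedClientThreads() {}

    // Runs taskCount tasks on a pool of poolSize threads and returns how many
    // distinct threads actually executed them (never more than poolSize).
    static int distinctThreadsUsed(int poolSize, int taskCount) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        Set<String> threadNames = ConcurrentHashMap.newKeySet();
        try {
            for (int i = 0; i < taskCount; i++) {
                // Placeholder for a client operation such as a put.
                pool.execute(() -> threadNames.add(Thread.currentThread().getName()));
            }
        } finally {
            pool.shutdown();
        }
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for tasks", e);
        }
        return threadNames.size();
    }
}
```

In a long-lived client, the pool would be created once and kept for the life of the process rather than created per batch as in this demonstration.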
05/25/10 member fails to connect to DS with SystemConnectException (locator reports missing ACKs from member + member already present) 6.0 assigned SystemConnectException thrown during connect after coordinator reports missing view acknowledgements Under highly volatile membership conditions we have seen cases where a new member was unable to join the distributed system when it ought to have been able to. The membership coordinator reports missing view acknowledgements from one or more processes, including the new member. If the coordinator accepted the new member and sent out a new membership view, it is reasonable to expect that the new member would recognize that it had joined. It is also reasonable to expect that the new member would acknowledge the new membership view. Reattempt connecting to the distributed system.
07/21/09 Getting java.lang.NoClassDefFoundError: javax/activation/DataSource in the agent.log file. 6.0 assigned For Java 1.5, Email alerts in GemFire Agent requires activation.jar from JavaBeans Activation Framework The GemFire Agent requires activation.jar to be in its classpath to send email alerts. This happens only with JRE version 1.5 and older. The jar is distributed by SUN as part of the JavaBeans Activation Framework. It can be downloaded from the SUN website at http://java.sun.com/javase/technologies/desktop/javabeans/jaf/downloads/index.html The JavaBeans Activation Framework is distributed as part of the Java SE 6 release. For Java 1.5, add the JavaBeans Activation Framework jars to the classpath before starting the GemFire Agent.
04/21/09 Unable to allocate an available vm to host a bucket when enforcing unique hosts 6.0 assigned The EnforceUniqueHostStorageAllocation feature requires no two systems share IpAddresses Using the EnforceUniqueHostStorageAllocation feature requires that no two systems hosting members in a DistributedSystem share the same IpAddress. This is true even if the network adapter is in a "DOWN" state. The exceptions to this rule are the loopback address and the "is any" address (aka 127.0.0.1 and 0.0.0.0 respectively). The symptom when two members do share an IpAddress and the EnforceUniqueHostStorageAllocation system property is set to "true" is a message in the logs similar to the following: system.log: [warning 2009/04/21 10:00:41.290 PDT gemfire1_10503 <thread 1> tid=0x79] Unable to find sufficient members to host a bucket in the partitioned region. Region name = /partitionedRegion Current number of available data stores: 10 number successfully allocated = 3 number needed = 4 Data stores available: [ptestg(13629):58399/50210, lewis(10584):42395/52373, ptestg(13632):58401/50211, ptesth(8852):57714/32881, king(10497):37041/62411, lewis(10582):42398/52374, king(10501):37037/62412, ptesth(8850):57715/32882, king(10499):37039/62407, king(10503):37044/62414] Data stores successfully allocated: [king(10497):37041/62411, lewis(10582):42398/52374, ptesth(8850):57715/32882] Consider starting another member Remove duplicate IP addresses.
03/20/09 GemFire clients on disconnected system hang instead of being forcefully disconnected as expected 6.0 assigned Network Partition detection may fail with physical network disruption Network Partition detection may fail when the network cable is pulled. Enabling loopback causes messages sent by a GemFire process to itself to be delivered through a queue so that they are not lost if network hardware fails. This allows GemFire to detect the loss of connectivity and react accordingly. In versions prior to GemFire 6.0, run with the property p2p.ENABLE_LOOPBACK=true. In GemFire 6.0 and later, this property is enabled automatically. In all cases, the GemFire JAR and the JRE must be available to the VMs running on each host for network partition detection to function properly.
12/25/08 NoSubscriptionServersAvailableException: Could not initialize a primary queue on startup. No queue servers available. 6.0 assigned NoSubscriptionServersAvailableException while creating a client with security On some platforms, calling getCredentials on the provided PKCSAuthInit template can be slow the first time it is called. This can cause a timeout on the server while creating a connection, resulting in a NoSubscriptionServersAvailableException on the client. Set the system property BridgeServer.acceptTimeout to a higher value. The default is 9900 milliseconds.
09/03/08 hangs with JRockit 1.6.0_3 with threads waiting for locks that don't appear to be held assigned Threads hang while blocking for synchronization in JRockit On Java SE 6 versions of JRockit JVM, one or more threads appear to hang while blocking for a synchronization that is not held by any other thread. We have found that this problem can be avoided by disabling lazyUnlocking using: -XXlazyUnlocking:enable=false According to the JRockit documentation: "In R27.5 lazy unlocking is enabled by default in Java SE 6 versions of JRockit JVM on all platforms except IA64 and with all garbage collection modes except the deterministic garbage collection mode." Disabling JRockit's lazyUnlocking seems to prevent these hangs.
08/07/08 Admin API connection shows up as a non-Admin member 3.5 assigned Admin API connection shows up as a non-Admin member When a dedicated Admin connection is created using AdminDistributedSystemFactory, the member will appear as a non-Admin member to all members of the distributed system. Calls to AdminDistributedSystem.getSystemMemberApplications will include the local Admin member itself. AdminDistributedSystemFactory.setEnableAdministrationOnly(boolean) can be used to set it for Administration only which will ensure that AdminDistributedSystem.getSystemMemberApplications does not return the dedicated Admin connection.
06/24/08 Hydra timeout using global regions with vm waiting for dlock assigned Lease expiration causes locking to hang Lease expiration can cause all other lock requests on the DistributedLockService to hang. Global Region operations may hang for the same reasons. Use -1 for lock-lease to prevent lease expiration.
04/16/08 Heap LRU EvictionAttributes EvictorThread uses early escape Region reference 5.5 assigned Heap LRU EvictionAttributes EvictorThread may cause severe errors in logs Configuring a Region with one of the EvictionAttributes.createLRUHeapAttributes methods (aka Heap LRU Eviction) may produce one or more severe errors in the logs for the "Evictor Thread". The issue is that the Evictor Thread uses the Region before it is fully constructed (see http://www.ibm.com/developerworks/java/library/j-jtp0618.html ). If this occurs, the Heap LRU will never evict entries for the Region while the Region has no activity, even though it holds data that could be evicted from the heap (essentially pinning that data in the Region). Destroy the Region and then re-create it. Another option is to make sure the Region always has some entry activity, which ensures LRU eviction.
01/21/08 NotSerializableException can block cache access with global or d-ack scope 5.1 assigned NotSerializableException can block cache access if it occurs in a region with global or d-ack scope If the application tries to put an instance that isn't serializable into the cache, the application will block/hang and not recover if the region scope is global or d-ack. Add checks before any put or create operations that the object in question is an instance of java.io.Serializable.
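The check recommended in the 01/21/08 note above could be sketched as follows. This is an illustration rather than a GemFire API: SerializableGuard and its methods are hypothetical names. Beyond the instanceof test the note asks for, the sketch also performs a trial serialization with standard Java serialization. That second step is an added assumption, meant to catch objects that implement java.io.Serializable but hold a non-serializable field.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical helper, not a GemFire API.
class SerializableGuard {

    // True if the value is Serializable and a trial write with Java serialization succeeds.
    // A null value is reported as not serializable.
    static boolean canSerialize(Object value) {
        if (!(value instanceof Serializable)) {
            return false;
        }
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(value);
            return true;
        } catch (IOException e) {
            // Includes NotSerializableException raised for a nested field or element.
            return false;
        }
    }

    // Returns the value unchanged, or throws before the caller reaches put or create.
    static <T> T requireSerializable(T value) {
        if (!canSerialize(value)) {
            String type = (value == null) ? "null" : value.getClass().getName();
            throw new IllegalArgumentException("Value is not serializable: " + type);
        }
        return value;
    }
}
```

A caller would wrap the value at the call site, for example region.put(key, SerializableGuard.requireSerializable(value)), so that the failure surfaces as an exception in the calling thread. The trial write costs one extra serialization per operation, so the instanceof test alone may be preferable on hot paths.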
07/13/07 DLockService is not completely Interrupt safe 5.1 assigned Interrupting threads using DistributedLockService causes other members to hang or generate large log files Some indications that this problem has occurred include statements in the log such as: "Grantor is still initializing" "Grantor creation was aborted but grantor was not destroyed" If these appear in the log, then a thread was interrupted while using the DistributedLockService and the member must be disconnected from the DistributedSystem. Other members may actually hang and possibly produce very large log files. Disconnecting this member from the DistributedSystem will allow other members to continue working without any further problems. Do not interrupt any thread that may be using the DistributedLockService API. Use waitTimeMillis to specify how long the lock request will wait. The thread will not continue to wait after the request times out. Disconnecting from the DistributedSystem will cause any waiting threads to return.
04/07/08 Gateway uses P2P reader thread to distribute and wait for ack causing deadlock 5.5 deferred Member hosting GatewayHub may deadlock if performing cache operations or hosting more than one GatewayHub The GatewayHub thread that distributes gateway events is the same thread that reads in messages. This provides guaranteed redundancy with secondary backups, but can result in deadlock if either member tries to perform cache operations or to host more than one GatewayHub. 1) Use -Dgemfire.gateway-queue-no-ack=true 2) Host only one GatewayHub in any given member and dedicate that member to hosting the GatewayHub. It should not perform cache operations or do anything other than feed the gateway.
08/11/10 DistributedSystem disconnect hang after NPE reported by VERIFY_SUSPECT.stop() 6.0 verifying Hang during shutdown In rare circumstances we have seen tests hang during shutdown after throwing a NullPointerException in VERIFY_SUSPECT.stop(). Killing the process that threw the exception will resolve the hang.
03/08/09 FileNotFoundException is logged for /tmp/agent.ser while running the Agent test. 6.0 verifying Failure to persist updated agent configuration causes FileNotFoundException Failure to persist agent configuration information causes the following warning to be logged without terminating the agent: "Encountered a java.io.FileNotFoundException while saving StatAlertDefinitions." All changes to the configuration are lost. An attribute 'canPersistStatAlertDefs' for AdminDistributedSystem MBean indicates whether the information could be persisted or not. Verify that the current working directory, or the directory given by the -dir option, has full write permissions for the user launching the agent. A boolean attribute 'canPersistStatAlertDefs' for AdminDistributedSystem MBean indicates whether the working directory has full write permissions for the user launching the agent.
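For the 03/08/09 note above, the write-permission check can be run before launching the agent. A minimal sketch, assuming the check is made as the same user that will launch the agent; AgentDirCheck is a hypothetical name and the code uses only the standard java.nio.file API, not any GemFire class.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical pre-launch check, not a GemFire API.
class AgentDirCheck {

    // True if the path is an existing directory that the current user can write to.
    static boolean isWritableDirectory(Path dir) {
        return Files.isDirectory(dir) && Files.isWritable(dir);
    }
}
```

Pass the agent's working directory, or the directory given to the -dir option. A false result indicates that saving the agent configuration is likely to fail with the FileNotFoundException described in the note.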
02/12/07 Bridge client region.put() completes without exception, but entry value is not updated at the server 5.0 verifying Server's entry value is not updated although client region.put completes without exception This happens when operations are performed out of order on the server. The problem arises from this sequence of events: 1. A client attempts to put value X one or more times, but each attempt times out. 2. Each failed attempt "orphans" a thread on the server. 3. The client picks a new connection (and its associated server thread) and continues to perform its sequential updates (X+1, X+2, ... X+n). 4. The orphan threads are eventually scheduled and successfully perform the put with value X, overwriting the previous values (X+1 or X+2 or X+n). Disable timeout behavior for the BridgeLoader, BridgeWriter, or BridgeClient by setting its "readTimeout" parameter to zero. This causes all Region operations supported by the client to block until the server has finished with the operation, preserving client ordering. The "retryAttempts" configuration will still be used when there are communication failures with the server or when the server cache closes in the midst of the operation.
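The sequence of events in the 02/12/07 note above can be replayed deterministically. The sketch below is a single-threaded model only: OrphanedPutDemo is a hypothetical name, there is no real client, server, or timeout involved, and the server-side entry is modeled as a plain int.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical single-threaded model of the race; no GemFire classes are used.
class OrphanedPutDemo {

    // Replays the four steps and returns the value left on the "server".
    static int replay(int x, int laterUpdates) {
        int serverValue = 0;
        Deque<Integer> orphanedPuts = new ArrayDeque<>();

        // Steps 1 and 2: the put of X times out on the client, but the server
        // thread that holds it is still pending.
        orphanedPuts.add(x);

        // Step 3: the client switches connections and applies X+1 .. X+n in order.
        for (int i = 1; i <= laterUpdates; i++) {
            serverValue = x + i;
        }

        // Step 4: the orphaned thread is finally scheduled and writes the stale X.
        while (!orphanedPuts.isEmpty()) {
            serverValue = orphanedPuts.poll();
        }
        return serverValue;
    }
}
```

With x = 10 and three later updates, the client's last successful put wrote 13, yet replay returns 10. Setting "readTimeout" to zero corresponds to never taking steps 1 and 2: the client blocks until each put completes, so no put is left orphaned.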